
How can a Python crawler fetch web page data efficiently?

Building a Python web crawler is a great project that brings together several important skills: networking, data parsing, and data storage.


Let's break down the process from the absolute basics to a more advanced, production-ready crawler.

What is a Web Crawler?

A web crawler (or spider) is a bot that systematically browses the web, typically for the purpose of web indexing. It does this by:

  1. Starting with a list of initial URLs (a "seed" list).
  2. Fetching the content of a URL.
  3. Parsing the content to find new URLs.
  4. Adding these new URLs to a list of pages to visit.
  5. Repeating the process until it has visited all desired pages or a stopping condition is met.

The Core Concepts

  1. Fetching (HTTP Requests): Getting the HTML content of a web page. We use libraries like requests.
  2. Parsing (HTML Parsing): Navigating the HTML structure to extract the data we want or find links. We use libraries like BeautifulSoup4 or lxml.
  3. Politeness & Respect: A good crawler is polite to the website it's crawling.
    • Rate Limiting: Don't send too many requests in a short period. This can overload the server and get your IP address blocked.
    • Check robots.txt: Most websites publish a file at www.example.com/robots.txt that tells bots which parts of the site they may not access. Always respect it (a minimal check is sketched just after this list).
    • Identify Yourself: Use a User-Agent in your request headers to identify your bot.
  4. Storage: Saving the scraped data. This can be as simple as printing to the console or saving to a file (CSV, JSON). For larger projects, you'd use a database.
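
For the robots.txt rule above, Python's standard library already includes a parser. Here is a minimal sketch using urllib.robotparser; the user-agent string is just an example and should match whatever your crawler sends in its request headers.

from urllib.robotparser import RobotFileParser
from urllib.parse import urljoin

def is_allowed(url, user_agent='MySimpleCrawler/1.0'):
    """Return True if the site's robots.txt permits this user agent to fetch the URL."""
    robots_url = urljoin(url, '/robots.txt')
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # download and parse robots.txt
    return parser.can_fetch(user_agent, url)

print(is_allowed('http://quotes.toscrape.com/'))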

The Tools You'll Need

You'll need a few Python libraries. Install them using pip:

pip install requests
pip install beautifulsoup4
pip install lxml  # A faster parser, optional but recommended

Example 1: A Simple Crawler (One Page)

This crawler will fetch a single page, parse it, and print all the links it finds.

import requests
from bs4 import BeautifulSoup
# The URL of the page we want to crawl
url = 'http://quotes.toscrape.com/'
# 1. Fetch the page content
try:
    response = requests.get(url)
    # Raise an exception if the request was unsuccessful (e.g., 404, 500)
    response.raise_for_status() 
except requests.exceptions.RequestException as e:
    print(f"Error fetching the URL: {e}")
    exit()
# 2. Parse the HTML content
soup = BeautifulSoup(response.text, 'lxml')
# 3. Find all the <a> tags (which define links)
links = soup.find_all('a')
# 4. Print the links
print(f"Found {len(links)} links on {url}:\n")
for link in links:
    # Get the 'href' attribute, which contains the URL
    href = link.get('href')
    # Get the visible text of the link
    text = link.text.strip()
    # Avoid printing empty or irrelevant links
    if href and text:
        print(f"Text: {text} -> URL: {href}")

Example 2: A Multi-Page Crawler (Following Links)

This more advanced crawler follows the links it finds to discover new pages. To keep it from running forever, it tracks visited URLs in a set and works through a queue of pages still to visit.

Warning: A link-following crawler can easily get out of control. For this example, we'll cap both the crawl depth and the total number of pages visited.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
def crawl(start_url, max_pages=10, max_depth=2):
    """
    A breadth-first web crawler that stays on the start URL's domain.
    :param start_url: The initial URL to start crawling from.
    :param max_pages: The maximum number of pages to crawl.
    :param max_depth: The maximum depth to crawl from the start_url.
    """
    # Use a set to store visited URLs to avoid duplicates and infinite loops
    visited_urls = set()
    # Use a list to manage the crawling queue (URL and its depth)
    urls_to_visit = [(start_url, 0)] # (url, current_depth)
    while urls_to_visit and len(visited_urls) < max_pages:
        current_url, current_depth = urls_to_visit.pop(0)
        # Skip if we've already visited this URL or exceeded depth
        if current_url in visited_urls or current_depth > max_depth:
            continue
        print(f"Crawling (Depth {current_depth}): {current_url}")
        visited_urls.add(current_url)
        try:
            # Set a User-Agent to identify our bot
            headers = {'User-Agent': 'MySimpleCrawler/1.0'}
            response = requests.get(current_url, headers=headers, timeout=5)
            response.raise_for_status()
        except requests.exceptions.RequestException as e:
            print(f"Failed to fetch {current_url}: {e}")
            continue
        # Parse the page content
        soup = BeautifulSoup(response.text, 'lxml')
        # Find all links on the page if we haven't reached max depth
        if current_depth < max_depth:
            for link_tag in soup.find_all('a', href=True):
                href = link_tag['href']
                # Make the URL absolute (e.g., /page -> http://domain.com/page)
                absolute_url = urljoin(current_url, href)
                # Strip fragments and query strings (e.g. #section, ?page=2)
                parsed = urlparse(absolute_url)
                absolute_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
                # Add the new URL to our queue if it's from the same domain
                if urlparse(absolute_url).netloc == urlparse(start_url).netloc:
                    urls_to_visit.append((absolute_url, current_depth + 1))
    print("\nCrawling finished.")
    print(f"Total pages visited: {len(visited_urls)}")
# --- Run the crawler ---
if __name__ == "__main__":
    start_url = 'http://quotes.toscrape.com/'
    crawl(start_url, max_pages=15, max_depth=2)
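
One design note, not part of the original code: urls_to_visit.pop(0) on a plain list is O(n), so for large crawls the usual choice is collections.deque, which pops from the left in O(1). A minimal illustration:

from collections import deque

urls_to_visit = deque([('http://quotes.toscrape.com/', 0)])
urls_to_visit.append(('http://quotes.toscrape.com/page/2/', 1))

# popleft() is the O(1) equivalent of list.pop(0)
current_url, current_depth = urls_to_visit.popleft()
print(current_url, current_depth)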

How to Improve and Scale Up

The multi-page example works, but for real-world use you need more robust solutions.

Politeness: Adding Delays

Don't hammer the server. Add a time.sleep() between requests.

import time
# ... inside the crawl loop from Example 2 ...
try:
    response = requests.get(current_url, headers=headers, timeout=5)
    response.raise_for_status()
    # Be polite! Wait a second before the next request.
    time.sleep(1)
except requests.exceptions.RequestException as e:
    print(f"Failed to fetch {current_url}: {e}")
    continue

Data Storage: Saving to a CSV

Let's modify the crawler to scrape quotes and save them to a CSV file.

import csv
import requests
from bs4 import BeautifulSoup
def scrape_quotes_to_csv(url, output_file='quotes.csv'):
    """Scrapes quotes from a single page and saves them to a CSV file."""
    try:
        response = requests.get(url, headers={'User-Agent': 'MyQuoteScraper/1.0'})
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return
    soup = BeautifulSoup(response.text, 'lxml')
    quotes = soup.find_all('div', class_='quote')
    with open(output_file, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        # Write header only if the file is new (optional)
        # if f.tell() == 0:
        #     writer.writerow(['Quote', 'Author'])
        for quote in quotes:
            text = quote.find('span', class_='text').get_text(strip=True)
            author = quote.find('small', class_='author').get_text(strip=True)
            writer.writerow([text, author])
            print(f"Scraped: '{text}' - {author}")
# --- Run the scraper ---
if __name__ == "__main__":
    target_url = 'http://quotes.toscrape.com/'
    scrape_quotes_to_csv(target_url)
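
As noted earlier, larger projects usually store results in a database rather than a flat file. Here is a minimal sketch using the standard-library sqlite3 module; the database file name and table schema are only examples, not part of the original scraper.

import sqlite3

def save_quotes(rows, db_file='quotes.db'):
    """Insert (text, author) pairs into a local SQLite database."""
    conn = sqlite3.connect(db_file)
    conn.execute('CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT)')
    conn.executemany('INSERT INTO quotes VALUES (?, ?)', rows)
    conn.commit()
    conn.close()

save_quotes([('"An example quote."', 'An Author')])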

Advanced: Using Scrapy Framework

For large-scale crawling, Scrapy is the industry-standard Python framework. It's powerful, fast, and handles many complexities for you.

  • Asynchronous Requests: Scrapy is asynchronous, so it can have many requests in flight at once instead of waiting for each response before sending the next, which makes it much faster.
  • Built-in Data Pipelines: It has a robust pipeline for processing scraped data (cleaning, validating, storing in databases).
  • Built-in Throttling & Politeness: New projects obey robots.txt by default (ROBOTSTXT_OBEY), and the AutoThrottle extension can adapt the request rate to the server (see the settings excerpt after the spider example).
  • Extensible: It's easy to add custom middlewares and extensions.

Scrapy Project Example:

  1. Install Scrapy: pip install scrapy
  2. Create a project: scrapy startproject myproject
  3. Generate a spider: cd myproject then scrapy genspider quotes quotes.toscrape.com

Here's what the quotes_spider.py might look like:

# myproject/spiders/quotes_spider.py
import scrapy
class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    # allowed_domains = ['quotes.toscrape.com'] # Optional: restricts crawling to this domain
    start_urls = ['http://quotes.toscrape.com/']
    def parse(self, response):
        # This method is called for each response
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Follow the "Next" link to the next page
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

To run this spider and save the output to a JSON file: scrapy crawl quotes -o quotes.json
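
The politeness behaviour above is controlled from myproject/settings.py. The names below are standard Scrapy settings; the values are only suggestions.

# myproject/settings.py (excerpt)
ROBOTSTXT_OBEY = True           # on by default in generated projects
DOWNLOAD_DELAY = 1              # seconds between requests to the same site
AUTOTHROTTLE_ENABLED = True     # adapt the delay to the server's response times
USER_AGENT = 'MyQuoteScraper/1.0'
FEED_EXPORT_ENCODING = 'utf-8'  # keep non-ASCII text readable in exported files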


Ethical and Legal Considerations

  • robots.txt: Always check this file first. It's the first rule of web scraping.
  • Terms of Service (ToS): Some websites explicitly forbid scraping in their ToS. Violating this can have legal consequences.
  • Public vs. Private Data: Only scrape data that is publicly available. Do not attempt to scrape private or user-generated content without permission.
  • Be Gentle: Use delays, respect robots.txt, and identify your bot. A well-behaved crawler has no noticeable impact on the site it visits.