
How can a Python crawler fetch web page data efficiently?

Building a Python web crawler is a great project that brings together several important skills: networking, data parsing, and data storage.


Let's break down the process from the absolute basics to a more advanced, production-ready crawler.

What is a Web Crawler?

A web crawler (or spider) is a bot that systematically browses the web, typically for the purpose of web indexing. It does this by:

  1. Starting with a list of initial URLs (a "seed" list).
  2. Fetching the content of a URL.
  3. Parsing the content to find new URLs.
  4. Adding these new URLs to a list of pages to visit.
  5. Repeating the process until it has visited all desired pages or a stopping condition is met.

The Core Concepts

  1. Fetching (HTTP Requests): Getting the HTML content of a web page. We use libraries like requests.
  2. Parsing (HTML Parsing): Navigating the HTML structure to extract the data we want or find links. We use libraries like BeautifulSoup4 or lxml.
  3. Politeness & Respect: A good crawler is polite to the website it's crawling.
    • Rate Limiting: Don't send too many requests in a short period. This can overload the server and get your IP address blocked.
    • Check robots.txt: Most websites publish a file at www.example.com/robots.txt that tells bots which parts of the site they may not access. Always respect it (a minimal check is sketched just after this list).
    • Identify Yourself: Use a User-Agent in your request headers to identify your bot.
  4. Storage: Saving the scraped data. This can be as simple as printing to the console or saving to a file (CSV, JSON). For larger projects, you'd use a database.
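
For the robots.txt rule above, Python's standard library already includes a parser. Here is a minimal sketch using urllib.robotparser; the user-agent string is just an example and should match whatever your crawler sends in its request headers.

from urllib.robotparser import RobotFileParser
from urllib.parse import urljoin

def is_allowed(url, user_agent='MySimpleCrawler/1.0'):
    """Return True if the site's robots.txt permits this user agent to fetch the URL."""
    robots_url = urljoin(url, '/robots.txt')
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # download and parse robots.txt
    return parser.can_fetch(user_agent, url)

print(is_allowed('http://quotes.toscrape.com/'))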

The Tools You'll Need

You'll need a few Python libraries. Install them using pip:

pip install requests
pip install beautifulsoup4
pip install lxml  # A faster parser, optional but recommended

Example 1: A Simple Crawler (One Page)

This crawler will fetch a single page, parse it, and print all the links it finds.

import requests
from bs4 import BeautifulSoup
# The URL of the page we want to crawl
url = 'http://quotes.toscrape.com/'
# 1. Fetch the page content
try:
    response = requests.get(url)
    # Raise an exception if the request was unsuccessful (e.g., 404, 500)
    response.raise_for_status() 
except requests.exceptions.RequestException as e:
    print(f"Error fetching the URL: {e}")
    exit()
# 2. Parse the HTML content
soup = BeautifulSoup(response.text, 'lxml')
# 3. Find all the <a> tags (which define links)
links = soup.find_all('a')
# 4. Print the links
print(f"Found {len(links)} links on {url}:\n")
for link in links:
    # Get the 'href' attribute, which contains the URL
    href = link.get('href')
    # Get the visible text of the link
    text = link.text.strip()
    # Avoid printing empty or irrelevant links
    if href and text:
        print(f"Text: {text} -> URL: {href}")

Example 2: A Multi-Page Crawler (Following Links)

This more advanced crawler follows the links it finds to discover new pages. To keep it from running forever, it tracks visited URLs in a set and works through a queue of pages still to visit.

Warning: A link-following crawler can easily get out of control. For this example, we'll cap both the crawl depth and the total number of pages visited.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
def crawl(start_url, max_pages=10, max_depth=2):
    """
    A breadth-first web crawler that stays on the start URL's domain.
    :param start_url: The initial URL to start crawling from.
    :param max_pages: The maximum number of pages to crawl.
    :param max_depth: The maximum depth to crawl from the start_url.
    """
    # Use a set to store visited URLs to avoid duplicates and infinite loops
    visited_urls = set()
    # Use a list to manage the crawling queue (URL and its depth)
    urls_to_visit = [(start_url, 0)] # (url, current_depth)
    while urls_to_visit and len(visited_urls) < max_pages:
        current_url, current_depth = urls_to_visit.pop(0)
        # Skip if we've already visited this URL or exceeded depth
        if current_url in visited_urls or current_depth > max_depth:
            continue
        print(f"Crawling (Depth {current_depth}): {current_url}")
        visited_urls.add(current_url)
        try:
            # Set a User-Agent to identify our bot
            headers = {'User-Agent': 'MySimpleCrawler/1.0'}
            response = requests.get(current_url, headers=headers, timeout=5)
            response.raise_for_status()
        except requests.exceptions.RequestException as e:
            print(f"Failed to fetch {current_url}: {e}")
            continue
        # Parse the page content
        soup = BeautifulSoup(response.text, 'lxml')
        # Find all links on the page if we haven't reached max depth
        if current_depth < max_depth:
            for link_tag in soup.find_all('a', href=True):
                href = link_tag['href']
                # Make the URL absolute (e.g., /page -> http://domain.com/page)
                absolute_url = urljoin(current_url, href)
                # Strip fragments and query strings (e.g. #section, ?page=2)
                parsed = urlparse(absolute_url)
                absolute_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
                # Add the new URL to our queue if it's from the same domain
                if urlparse(absolute_url).netloc == urlparse(start_url).netloc:
                    urls_to_visit.append((absolute_url, current_depth + 1))
    print("\nCrawling finished.")
    print(f"Total pages visited: {len(visited_urls)}")
# --- Run the crawler ---
if __name__ == "__main__":
    start_url = 'http://quotes.toscrape.com/'
    crawl(start_url, max_pages=15, max_depth=2)
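
One design note, not part of the original code: urls_to_visit.pop(0) on a plain list is O(n), so for large crawls the usual choice is collections.deque, which pops from the left in O(1). A minimal illustration:

from collections import deque

urls_to_visit = deque([('http://quotes.toscrape.com/', 0)])
urls_to_visit.append(('http://quotes.toscrape.com/page/2/', 1))

# popleft() is the O(1) equivalent of list.pop(0)
current_url, current_depth = urls_to_visit.popleft()
print(current_url, current_depth)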

How to Improve and Scale Up

The multi-page example works, but for real-world use you need more robust solutions.

Politeness: Adding Delays

Don't hammer the server. Add a time.sleep() between requests.

import time
# ... inside the crawl loop from Example 2 ...
try:
    response = requests.get(current_url, headers=headers, timeout=5)
    response.raise_for_status()
    # Be polite! Wait a second before the next request.
    time.sleep(1)
except requests.exceptions.RequestException as e:
    print(f"Failed to fetch {current_url}: {e}")
    continue

Data Storage: Saving to a CSV

Let's modify the crawler to scrape quotes and save them to a CSV file.

import csv
import requests
from bs4 import BeautifulSoup
def scrape_quotes_to_csv(url, output_file='quotes.csv'):
    """Scrapes quotes from a single page and saves them to a CSV file."""
    try:
        response = requests.get(url, headers={'User-Agent': 'MyQuoteScraper/1.0'})
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return
    soup = BeautifulSoup(response.text, 'lxml')
    quotes = soup.find_all('div', class_='quote')
    with open(output_file, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        # Write header only if the file is new (optional)
        # if f.tell() == 0:
        #     writer.writerow(['Quote', 'Author'])
        for quote in quotes:
            text = quote.find('span', class_='text').get_text(strip=True)
            author = quote.find('small', class_='author').get_text(strip=True)
            writer.writerow([text, author])
            print(f"Scraped: '{text}' - {author}")
# --- Run the scraper ---
if __name__ == "__main__":
    target_url = 'http://quotes.toscrape.com/'
    scrape_quotes_to_csv(target_url)
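
As noted earlier, larger projects usually store results in a database rather than a flat file. Here is a minimal sketch using the standard-library sqlite3 module; the database file name and table schema are only examples, not part of the original scraper.

import sqlite3

def save_quotes(rows, db_file='quotes.db'):
    """Insert (text, author) pairs into a local SQLite database."""
    conn = sqlite3.connect(db_file)
    conn.execute('CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT)')
    conn.executemany('INSERT INTO quotes VALUES (?, ?)', rows)
    conn.commit()
    conn.close()

save_quotes([('"An example quote."', 'An Author')])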

Advanced: Using Scrapy Framework

For large-scale crawling, Scrapy is the industry-standard Python framework. It's powerful, fast, and handles many complexities for you.

  • Asynchronous Requests: Scrapy is asynchronous, so it can have many requests in flight at once instead of waiting for each response before sending the next, which makes it much faster.
  • Built-in Data Pipelines: It has a robust pipeline for processing scraped data (cleaning, validating, storing in databases).
  • Built-in Throttling & Politeness: New projects obey robots.txt by default (ROBOTSTXT_OBEY), and the AutoThrottle extension can adapt the request rate to the server (see the settings excerpt after the spider example).
  • Extensible: It's easy to add custom middlewares and extensions.

Scrapy Project Example:

  1. Install Scrapy: pip install scrapy
  2. Create a project: scrapy startproject myproject
  3. Generate a spider: cd myproject then scrapy genspider quotes quotes.toscrape.com

Here's what the quotes_spider.py might look like:

# myproject/spiders/quotes_spider.py
import scrapy
class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    # allowed_domains = ['quotes.toscrape.com'] # Optional: restricts crawling to this domain
    start_urls = ['http://quotes.toscrape.com/']
    def parse(self, response):
        # This method is called for each response
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Follow the "Next" link to the next page
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

To run this spider and save the output to a JSON file: scrapy crawl quotes -o quotes.json
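
The politeness behaviour above is controlled from myproject/settings.py. The names below are standard Scrapy settings; the values are only suggestions.

# myproject/settings.py (excerpt)
ROBOTSTXT_OBEY = True           # on by default in generated projects
DOWNLOAD_DELAY = 1              # seconds between requests to the same site
AUTOTHROTTLE_ENABLED = True     # adapt the delay to the server's response times
USER_AGENT = 'MyQuoteScraper/1.0'
FEED_EXPORT_ENCODING = 'utf-8'  # keep non-ASCII text readable in exported files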


Ethical and Legal Considerations

  • robots.txt: Always check this file first. It's the first rule of web scraping.
  • Terms of Service (ToS): Some websites explicitly forbid scraping in their ToS. Violating this can have legal consequences.
  • Public vs. Private Data: Only scrape data that is publicly available. Do not attempt to scrape private or user-generated content without permission.
  • Be Gentle: Use delays, respect robots.txt, and identify your bot. A well-behaved crawler has no noticeable impact on the site it visits.