
How can Python requests combined with XPath parse web pages efficiently?

Using Python to make an HTTP request and then parsing the HTML response with XPath is a common and powerful web scraping technique.


Here's a complete guide, breaking it down into the necessary steps, the required libraries, and a practical example.

The Core Workflow

  1. Request: Use a library like requests to fetch the HTML content of a webpage.
  2. Parse: Use a library like lxml or BeautifulSoup (with a parser like lxml) to convert the raw HTML into a structured document that you can navigate.
  3. Extract: Use XPath expressions to query the structured document and extract the specific data you need.
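In miniature, the whole pipeline looks like this (the URL and XPath expression here are placeholders, not the real targets used later):

import requests
from lxml import html

# 1. Request: fetch the page (placeholder URL)
resp = requests.get('https://example.com', timeout=10)
resp.raise_for_status()
# 2. Parse: build a document tree from the raw HTML
tree = html.fromstring(resp.content)
# 3. Extract: query the tree with XPath (placeholder expression)
print(tree.xpath('//h1/text()'))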

Step 1: Install Necessary Libraries

You'll need two main libraries:

  1. requests: To send HTTP requests and get the HTML content.
  2. lxml: A powerful and fast library for parsing HTML and XML, which supports XPath. You can also use BeautifulSoup with the lxml parser as an alternative.

Open your terminal or command prompt and install them using pip:

pip install requests
pip install lxml

Step 2: The Python Code

Let's scrape the headlines from the BBC News homepage as a practical example. Our goal is to get the main headline and the summary text.


Here is the complete, commented code.

import requests
from lxml import html
# 1. Define the URL to scrape
url = 'https://www.bbc.com/news'
# 2. Send an HTTP GET request to the URL
# It's good practice to set a User-Agent to mimic a real browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
try:
    # Make the request with a timeout to avoid hanging indefinitely
    response = requests.get(url, headers=headers, timeout=10)
    # Raise an exception if the request was unsuccessful (e.g., 404, 500)
    response.raise_for_status()
    # 3. Parse the HTML content using lxml
    # The 'html' parser from lxml is specifically designed for parsing HTML
    tree = html.fromstring(response.content)
    # 4. Define and use XPath expressions to extract data
    # NOTE: XPath selectors can change if the website updates its HTML structure.
    # You can find these by using your browser's "Inspect Element" tool.
    # Extract the main headline
    # This XPath looks for an <a> tag inside an <h2>, within a <div> with a specific class.
    main_headline = tree.xpath('//div[@class="sc-5fef9d2d-0 kYmCZK"]/h2/a/text()')
    # Extract the summary text for the main headline
    main_summary = tree.xpath('//div[@class="sc-5fef9d2d-0 kYmCZK"]/p/text()')
    # Extract other headlines (using a more general selector)
    # This XPath finds the link text inside <h3> tags within a container with a different class.
    other_headlines = tree.xpath('//div[@class="sc-5fef9d2d-2 dCwzUB"]/h3/a/text()')
    # 5. Print the extracted data
    print("--- BBC News Headlines ---\n")
    if main_headline:
        print(f"Main Headline: {main_headline[0].strip()}")
    if main_summary:
        print(f"Summary: {main_summary[0].strip()}")
    print("\n--- Other Headlines ---")
    for i, headline in enumerate(other_headlines, 1):
        print(f"{i}. {headline.strip()}")
except requests.exceptions.RequestException as e:
    print(f"Error during requests to {url}: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Explanation of the Code

The Request

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
  • requests.get(url): Sends an HTTP GET request to the specified URL.
  • headers: We send a User-Agent header. Some websites block requests that don't look like they're coming from a real browser. This helps avoid being blocked.
  • timeout=10: If the server doesn't respond within 10 seconds, the request will fail, preventing your script from hanging indefinitely.
  • response.raise_for_status(): This is a useful method that will raise an HTTPError if the HTTP request returned an unsuccessful status code (4xx or 5xx).
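To see raise_for_status() in action, here is a minimal sketch against httpbin.org, a public testing service that returns whatever status code you ask for:

import requests

try:
    r = requests.get('https://httpbin.org/status/404', timeout=10)
    r.raise_for_status()  # raises HTTPError because the server returned 404
except requests.exceptions.HTTPError as e:
    print(f"HTTP error: {e}")
except requests.exceptions.RequestException as e:
    print(f"Network error: {e}")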

The Parsing

tree = html.fromstring(response.content)
  • response.content: This gives you the raw HTML content of the page as bytes.
  • html.fromstring(...): This function from the lxml library parses the raw HTML into an HtmlElement (the root of the document tree), which we've named tree. This tree object is the key to using XPath.
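A quick self-contained illustration of what fromstring() returns:

from lxml import html

tree = html.fromstring('<html><body><h1>Hello</h1></body></html>')
print(type(tree))                 # <class 'lxml.html.HtmlElement'>
print(tree.xpath('//h1/text()'))  # ['Hello']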

The XPath Extraction (The Core Part)

main_headline = tree.xpath('//div[@class="sc-5fef9d2d-0 kYmCZK"]/h2/a/text()')

Let's break down this XPath expression:

  • //: Selects nodes anywhere in the document that match the selection, no matter where they are. It's a "global" search.
  • div: Selects all <div> elements.
  • [@class="sc-5fef9d2d-0 kYmCZK"]: This is a predicate. It filters the <div> elements to find only those that have an attribute named class with the exact value "sc-5fef9d2d-0 kYmCZK". This is the most common way to pinpoint a specific element.
  • /: Selects a direct child of the previous node.
  • h2: Selects the <h2> child of the <div>.
  • /a: Selects the <a> (link) child of the <h2>.
  • /text(): Selects the text content of the <a> tag.
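You can verify the whole expression on a toy document that mirrors the same structure (the class name promo is made up for this illustration):

from lxml import html

snippet = '<div class="promo"><h2><a href="/story">Top story</a></h2><p>A summary.</p></div>'
tree = html.fromstring(snippet)
print(tree.xpath('//div[@class="promo"]/h2/a/text()'))  # ['Top story']
print(tree.xpath('//div[@class="promo"]/p/text()'))     # ['A summary.']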

Important: The classes in the XPath (sc-5fef9d2d-0, kYmCZK, etc.) are likely generated dynamically and can change at any time. If you run this code in a few months and it stops working, the first thing to check is if the website's HTML structure has changed.
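If you want selectors that survive small markup changes, matching a single class token with contains() is usually sturdier than matching the full class string. A sketch (the class fragments and attribute values below are illustrative, not guaranteed to exist on the live page):

# contains() matches a substring of the class attribute, so the query still
# works if the site adds or reorders other classes on the same element
headlines = tree.xpath('//h2[contains(@class, "headline")]/a/text()')
# Matching on a stable attribute such as data-testid, when the site provides
# one, is even more robust than class-based selection
links = tree.xpath('//a[@data-testid="internal-link"]/@href')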

Handling the Results

  • For node and text selections like these, tree.xpath() returns a list, even if it only finds one matching element (and an empty list if it finds none). That's why we check the result before accessing the first item with main_headline[0]; a small helper for this pattern is sketched below.
  • The strip() method is used to remove any leading or trailing whitespace from the extracted text.
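A minimal sketch of such a helper (first is a hypothetical name, not part of lxml):

def first(results, default=''):
    """Return the first XPath match, stripped, or a default if nothing matched."""
    return results[0].strip() if results else default

headline = first(tree.xpath('//div[@class="sc-5fef9d2d-0 kYmCZK"]/h2/a/text()'))
print(headline or 'No headline found')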

Alternative: Using BeautifulSoup with lxml

Many developers prefer the more user-friendly API of BeautifulSoup. One caveat: BeautifulSoup itself does not support XPath; it offers its own find methods and CSS selectors. You can still get the best of both worlds by parsing with BeautifulSoup (using lxml as the underlying parser) and handing the document back to lxml whenever you need XPath, as the example below shows.


First, install beautifulsoup4: pip install beautifulsoup4

Here's how you would write the same scraper using BeautifulSoup:

import requests
from bs4 import BeautifulSoup
from lxml import etree  # lxml supplies the XPath engine

url = 'https://www.bbc.com/news'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    # Parse with BeautifulSoup, using lxml as the underlying parser
    soup = BeautifulSoup(response.content, 'lxml')
    # BeautifulSoup has no XPath support, so convert the (now cleaned-up)
    # document back into an lxml tree to run the XPath queries
    tree = etree.HTML(str(soup))
    # --- The XPath expressions are identical to the first example ---
    main_headline = tree.xpath('//div[@class="sc-5fef9d2d-0 kYmCZK"]/h2/a/text()')
    main_summary = tree.xpath('//div[@class="sc-5fef9d2d-0 kYmCZK"]/p/text()')
    other_headlines = tree.xpath('//div[@class="sc-5fef9d2d-2 dCwzUB"]/h3/a/text()')
    # --- The printing logic is also identical ---
    print("--- BBC News Headlines (BeautifulSoup + lxml) ---\n")
    if main_headline:
        print(f"Main Headline: {main_headline[0].strip()}")
    if main_summary:
        print(f"Summary: {main_summary[0].strip()}")
    print("\n--- Other Headlines ---")
    for i, headline in enumerate(other_headlines, 1):
        print(f"{i}. {headline.strip()}")
except requests.exceptions.RequestException as e:
    print(f"Error during requests to {url}: {e}")