
How can Python requests combined with XPath parse web pages efficiently?

Using Python to make an HTTP request and then parsing the HTML response with XPath is a common and powerful web scraping technique.


Here's a complete guide, breaking it down into the necessary steps, the required libraries, and a practical example.

The Core Workflow

  1. Request: Use a library like requests to fetch the HTML content of a webpage.
  2. Parse: Use a library like lxml or BeautifulSoup (with a parser like lxml) to convert the raw HTML into a structured document that you can navigate.
  3. Extract: Use XPath expressions to query the structured document and extract the specific data you need.
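In miniature, the whole pipeline looks like this (the URL and XPath expression here are placeholders, not the real targets used later):

import requests
from lxml import html

# 1. Request: fetch the page (placeholder URL)
resp = requests.get('https://example.com', timeout=10)
resp.raise_for_status()
# 2. Parse: build a document tree from the raw HTML
tree = html.fromstring(resp.content)
# 3. Extract: query the tree with XPath (placeholder expression)
print(tree.xpath('//h1/text()'))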

Step 1: Install Necessary Libraries

You'll need two main libraries:

  1. requests: To send HTTP requests and get the HTML content.
  2. lxml: A powerful and fast library for parsing HTML and XML, which supports XPath. You can also use BeautifulSoup with the lxml parser as an alternative.

Open your terminal or command prompt and install them using pip:

pip install requests
pip install lxml

Step 2: The Python Code

Let's scrape the headlines from the BBC News homepage as a practical example. Our goal is to get the main headline and the summary text.


Here is the complete, commented code.

import requests
from lxml import html
# 1. Define the URL to scrape
url = 'https://www.bbc.com/news'
# 2. Send an HTTP GET request to the URL
# It's good practice to set a User-Agent to mimic a real browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
try:
    # Make the request with a timeout to avoid hanging indefinitely
    response = requests.get(url, headers=headers, timeout=10)
    # Raise an exception if the request was unsuccessful (e.g., 404, 500)
    response.raise_for_status()
    # 3. Parse the HTML content using lxml
    # The 'html' parser from lxml is specifically designed for parsing HTML
    tree = html.fromstring(response.content)
    # 4. Define and use XPath expressions to extract data
    # NOTE: XPath selectors can change if the website updates its HTML structure.
    # You can find these by using your browser's "Inspect Element" tool.
    # Extract the main headline
    # This XPath looks for an <a> tag inside an <h2>, within a <div> with a specific class.
    main_headline = tree.xpath('//div[@class="sc-5fef9d2d-0 kYmCZK"]/h2/a/text()')
    # Extract the summary text for the main headline
    main_summary = tree.xpath('//div[@class="sc-5fef9d2d-0 kYmCZK"]/p/text()')
    # Extract other headlines (using a more general selector)
    # This XPath finds the link text inside <h3> tags within a container with a different class.
    other_headlines = tree.xpath('//div[@class="sc-5fef9d2d-2 dCwzUB"]/h3/a/text()')
    # 5. Print the extracted data
    print("--- BBC News Headlines ---\n")
    if main_headline:
        print(f"Main Headline: {main_headline[0].strip()}")
    if main_summary:
        print(f"Summary: {main_summary[0].strip()}")
    print("\n--- Other Headlines ---")
    for i, headline in enumerate(other_headlines, 1):
        print(f"{i}. {headline.strip()}")
except requests.exceptions.RequestException as e:
    print(f"Error during requests to {url}: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Explanation of the Code

The Request

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
  • requests.get(url): Sends an HTTP GET request to the specified URL.
  • headers: We send a User-Agent header. Some websites block requests that don't look like they're coming from a real browser. This helps avoid being blocked.
  • timeout=10: If the server doesn't respond within 10 seconds, the request will fail, preventing your script from hanging indefinitely.
  • response.raise_for_status(): This is a useful method that will raise an HTTPError if the HTTP request returned an unsuccessful status code (4xx or 5xx).
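To see raise_for_status() in action, here is a minimal sketch against httpbin.org, a public testing service that returns whatever status code you ask for:

import requests

try:
    r = requests.get('https://httpbin.org/status/404', timeout=10)
    r.raise_for_status()  # raises HTTPError because the server returned 404
except requests.exceptions.HTTPError as e:
    print(f"HTTP error: {e}")
except requests.exceptions.RequestException as e:
    print(f"Network error: {e}")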

The Parsing

tree = html.fromstring(response.content)
  • response.content: This gives you the raw HTML content of the page as bytes.
  • html.fromstring(...): This function from the lxml library parses the raw HTML into an HtmlElement (the root of the document tree), which we've named tree. This tree object is the key to using XPath.
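A quick self-contained illustration of what fromstring() returns:

from lxml import html

tree = html.fromstring('<html><body><h1>Hello</h1></body></html>')
print(type(tree))                 # <class 'lxml.html.HtmlElement'>
print(tree.xpath('//h1/text()'))  # ['Hello']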

The XPath Extraction (The Core Part)

main_headline = tree.xpath('//div[@class="sc-5fef9d2d-0 kYmCZK"]/h2/a/text()')

Let's break down this XPath expression:

  • //: Selects nodes anywhere in the document that match the selection, no matter where they are. It's a "global" search.
  • div: Selects all <div> elements.
  • [@class="sc-5fef9d2d-0 kYmCZK"]: This is a predicate. It filters the <div> elements to find only those that have an attribute named class with the exact value "sc-5fef9d2d-0 kYmCZK". This is the most common way to pinpoint a specific element.
  • /: Selects a direct child of the previous node.
  • h2: Selects the <h2> child of the <div>.
  • /a: Selects the <a> (link) child of the <h2>.
  • /text(): Selects the text content of the <a> tag.
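You can verify the whole expression on a toy document that mirrors the same structure (the class name promo is made up for this illustration):

from lxml import html

snippet = '<div class="promo"><h2><a href="/story">Top story</a></h2><p>A summary.</p></div>'
tree = html.fromstring(snippet)
print(tree.xpath('//div[@class="promo"]/h2/a/text()'))  # ['Top story']
print(tree.xpath('//div[@class="promo"]/p/text()'))     # ['A summary.']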

Important: The classes in the XPath (sc-5fef9d2d-0, kYmCZK, etc.) are likely generated dynamically and can change at any time. If you run this code in a few months and it stops working, the first thing to check is if the website's HTML structure has changed.
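If you want selectors that survive small markup changes, matching a single class token with contains() is usually sturdier than matching the full class string. A sketch (the class fragments and attribute values below are illustrative, not guaranteed to exist on the live page):

# contains() matches a substring of the class attribute, so the query still
# works if the site adds or reorders other classes on the same element
headlines = tree.xpath('//h2[contains(@class, "headline")]/a/text()')
# Matching on a stable attribute such as data-testid, when the site provides
# one, is even more robust than class-based selection
links = tree.xpath('//a[@data-testid="internal-link"]/@href')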

Handling the Results

  • For node and text selections like these, tree.xpath() returns a list, even if it only finds one matching element (and an empty list if it finds none). That's why we check the result before accessing the first item with main_headline[0]; a small helper for this pattern is sketched below.
  • The strip() method is used to remove any leading or trailing whitespace from the extracted text.
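A minimal sketch of such a helper (first is a hypothetical name, not part of lxml):

def first(results, default=''):
    """Return the first XPath match, stripped, or a default if nothing matched."""
    return results[0].strip() if results else default

headline = first(tree.xpath('//div[@class="sc-5fef9d2d-0 kYmCZK"]/h2/a/text()'))
print(headline or 'No headline found')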

Alternative: Using BeautifulSoup with lxml

Many developers prefer the more user-friendly API of BeautifulSoup. One caveat: BeautifulSoup itself does not support XPath; it offers its own find methods and CSS selectors. You can still get the best of both worlds by parsing with BeautifulSoup (using lxml as the underlying parser) and handing the document back to lxml whenever you need XPath, as the example below shows.


First, install beautifulsoup4: pip install beautifulsoup4

Here's how you would write the same scraper using BeautifulSoup:

import requests
from bs4 import BeautifulSoup
from lxml import etree  # lxml supplies the XPath engine

url = 'https://www.bbc.com/news'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    # Parse with BeautifulSoup, using lxml as the underlying parser
    soup = BeautifulSoup(response.content, 'lxml')
    # BeautifulSoup has no XPath support, so convert the (now cleaned-up)
    # document back into an lxml tree to run the XPath queries
    tree = etree.HTML(str(soup))
    # --- The XPath expressions are identical to the first example ---
    main_headline = tree.xpath('//div[@class="sc-5fef9d2d-0 kYmCZK"]/h2/a/text()')
    main_summary = tree.xpath('//div[@class="sc-5fef9d2d-0 kYmCZK"]/p/text()')
    other_headlines = tree.xpath('//div[@class="sc-5fef9d2d-2 dCwzUB"]/h3/a/text()')
    # --- The printing logic is also identical ---
    print("--- BBC News Headlines (BeautifulSoup + lxml) ---\n")
    if main_headline:
        print(f"Main Headline: {main_headline[0].strip()}")
    if main_summary:
        print(f"Summary: {main_summary[0].strip()}")
    print("\n--- Other Headlines ---")
    for i, headline in enumerate(other_headlines, 1):
        print(f"{i}. {headline.strip()}")
except requests.exceptions.RequestException as e:
    print(f"Error during requests to {url}: {e}")