杰瑞科技汇

How to efficiently fetch webpage text with Python requests?

The requests library is the de facto standard for making HTTP requests in Python, and fetching the text of a URL is one of its most common uses.


Here’s a complete guide, from the basics to more advanced and practical examples.


The Basic Request: requests.get()

The core of getting text is the requests.get() function, which sends an HTTP GET request to a specified URL. The server responds, and requests gives you an object that contains the server's response.

To get the text content, you use the .text attribute on the response object.

Step 1: Install the requests library

If you don't have it installed, open your terminal or command prompt and run:

pip install requests

Step 2: Simple Example

This is the most basic way to fetch and print the text of a webpage.

import requests
# The URL you want to get text from
url = 'https://www.python.org'
try:
    # Send a GET request to the URL
    response = requests.get(url)
    # This will raise an exception for bad status codes (4xx or 5xx)
    response.raise_for_status()
    # Get the text content from the response
    # The .text attribute returns the content as a string
    page_text = response.text
    # Print the first 500 characters of the text
    print(page_text[:500])
except requests.exceptions.RequestException as e:
    # Handle any errors that occur during the request
    print(f"An error occurred: {e}")

What's happening here?

  1. import requests: Imports the library.
  2. requests.get(url): Sends the HTTP GET request. The server sends back a response, which is stored in the response object.
  3. response.raise_for_status(): This is a good practice. It checks if the request was successful (status code 200-299). If not (e.g., 404 Not Found, 500 Server Error), it raises an HTTPError.
  4. response.text: This is the key part. It decodes the response body (which is in bytes) into a string using the encoding specified in the response headers (e.g., Content-Type: text/html; charset=utf-8).
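Conceptually, `.text` is just `.content` decoded with the detected charset. The following self-contained sketch (the string and charset are stand-ins, not real response data) shows the equivalent operation; requests' real logic also falls back to `response.apparent_encoding` when no charset is declared:

```python
# Simplified illustration of what response.text does internally:
# decode the raw response bytes using the charset from the headers.
raw_bytes = "<p>héllo wörld</p>".encode("utf-8")  # stand-in for response.content
charset = "utf-8"                                  # stand-in for response.encoding

decoded = raw_bytes.decode(charset)
print(decoded)  # <p>héllo wörld</p>
```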

Important Attributes of the Response Object

When you get a response, it's not just text. The Response object contains a lot of useful information.

import requests
url = 'https://httpbin.org/get' # A great testing URL
response = requests.get(url)
# --- Status Code ---
# Indicates whether the request was successful (e.g., 200), not found (404), etc.
print(f"Status Code: {response.status_code}")
# --- Response Headers ---
# The headers the server sent back.
print("\nServer Headers:")
print(response.headers)
# --- Request Headers ---
# The headers your request sent. Note: requests adds its own
# defaults (like 'User-Agent') unless you override them.
print("\nRequest Headers (sent by us):")
print(response.request.headers)
# --- Encoding ---
# The encoding used to decode the response content.
# requests tries to guess this from the headers.
print(f"\nEncoding: {response.encoding}")
# --- Raw Content (in bytes) ---
# The raw content of the response, as bytes.
# This is useful if you're dealing with non-text data or want to control the decoding.
print(f"\nRaw Content (first 50 bytes): {response.content[:50]}")

Handling Real-World Complications

In a real application, you'll need to handle more than just a simple request.


a) Handling Errors

Networks are unreliable. The server might be down, the URL might be wrong, or you might lose your connection. Always wrap your requests in a try...except block.

import requests
from requests.exceptions import RequestException, Timeout, HTTPError
url = 'https://this-domain-does-not-exist.com'
try:
    # timeout=5 limits both connecting and waiting for data to 5 seconds each
    response = requests.get(url, timeout=5)
    # If the request was successful, raise_for_status() does nothing.
    # If not, it raises an HTTPError.
    response.raise_for_status()
    print("Success! The page loaded.")
    print(f"Text length: {len(response.text)}")
except HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}") # e.g., 404, 500
except Timeout as err:
    print(f"Request timed out: {err}")
except RequestException as err:
    # This is a catch-all for any requests-related errors
    print(f"An error occurred during the request: {err}")
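For transient failures (503s, rate limits, dropped connections), you can also configure automatic retries using urllib3's Retry class mounted on a Session. The parameter values below are illustrative, not recommendations:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry up to 3 times on common transient status codes,
# with exponential backoff between attempts.
retry = Retry(total=3, backoff_factor=0.5,
              status_forcelist=[429, 500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retry)

session = requests.Session()
# Apply the retry policy to all http:// and https:// URLs
session.mount('https://', adapter)
session.mount('http://', adapter)

# session.get(url, timeout=5) now retries those errors automatically
```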

b) Handling Different Encodings

Sometimes the server doesn't specify the encoding correctly, and response.text might look like gibberish. You can force requests to use a specific encoding.

import requests
# Example URL; substitute any page whose declared charset is wrong or missing
url = 'https://www.nytimes.com/2025/10/27/us/politics/biden-polling.html'
try:
    response = requests.get(url)
    response.raise_for_status()
    # Let's see what encoding requests guessed
    print(f"Guessed Encoding: {response.encoding}") # Often 'ISO-8859-1' for problematic pages
    # The text might be corrupted
    # print(response.text) 
    # You can manually set the encoding. 'utf-8' is a common and safe choice.
    response.encoding = 'utf-8'
    # Now get the text with the correct encoding
    page_text = response.text
    print("\nSuccessfully decoded text with UTF-8:")
    print(page_text[:500])
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")
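To see why forcing the encoding matters, here is a self-contained illustration (no network needed): the same bytes decoded with the wrong versus the right charset. This is exactly the difference that setting `response.encoding` makes before reading `response.text`:

```python
# A UTF-8 encoded page body, as requests would see it in response.content
body = "高效获取网页内容".encode("utf-8")

# If the server mislabels the charset (e.g. as ISO-8859-1), decoding
# produces mojibake instead of the original text:
wrong = body.decode("iso-8859-1")
print(wrong)   # unreadable characters

# Decoding with the correct charset recovers the text:
right = body.decode("utf-8")
print(right)   # 高效获取网页内容
```

If you are unsure of the real charset, `response.apparent_encoding` asks requests to guess it from the body itself.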

c) Adding Headers (e.g., User-Agent)

Some websites block the default requests User-Agent (python-requests/x.y) because it identifies your request as a script. To make your request look like it's coming from a real browser, you can add custom headers.

import requests
url = 'https://httpbin.org/user-agent' # This URL tells you what User-Agent it sees
# A common Chrome User-Agent string
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(url, headers=headers)
print(response.json()) # The server will echo back the User-Agent it received
# Output: {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
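If you make many requests with the same headers, a requests.Session lets you set them once; a Session also reuses the underlying TCP connection across requests. A minimal sketch (the header values are examples):

```python
import requests

session = requests.Session()
# Headers set on the session are sent with every request it makes
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
})

# Every call now carries those headers automatically, e.g.:
# response = session.get('https://httpbin.org/user-agent')
```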

Advanced: Streaming Large Responses

If you are downloading a very large text file (or a large file in general), you don't want to load it all into memory at once. You can "stream" the response, processing it chunk by chunk.

import requests
url = 'https://www.gutenberg.org/files/11/11-0.txt' # Alice in Wonderland (large text file)
try:
    # Use stream=True to download the content in chunks
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        line_count = 0
        # Iterate over the response content line by line.
        # iter_lines() yields bytes by default; decode_unicode=True
        # makes it yield str using the response's encoding.
        for line in response.iter_lines(decode_unicode=True):
            if line: # filter out keep-alive new lines
                # Process each line here
                line_count += 1
                # print(line) # Uncomment to see the lines
                # Stop after 100 lines for this example
                if line_count >= 100:
                    break
        print(f"\nSuccessfully read {line_count} lines.")
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")

Summary

| Task | Code Snippet |
| --- | --- |
| Basic GET request | `response = requests.get('https://example.com')` |
| Get text content | `text = response.text` |
| Check for success | `response.raise_for_status()` |
| Handle errors | wrap in `try...except requests.exceptions.RequestException` |
| Set a timeout | `requests.get(url, timeout=10)` |
| Set custom headers | `requests.get(url, headers={'User-Agent': '...'})` |
| Handle encoding | `response.encoding = 'utf-8'` |
| Stream large files | `with requests.get(url, stream=True) as r:` then `for chunk in r.iter_content(): ...` |
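Putting the pieces together, a small helper like this (an illustrative sketch, not a library API; the function name and parameters are our own) covers the common cases in one place: timeout, custom User-Agent, status check, and an explicit encoding override:

```python
import requests

def fetch_text(url, timeout=10, encoding=None, user_agent=None):
    """Fetch a URL and return its body as text, or None on failure."""
    headers = {'User-Agent': user_agent} if user_agent else {}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()
        if encoding:                      # override a mislabeled charset
            response.encoding = encoding
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None

# Example usage (uncomment to run against a live site):
# text = fetch_text('https://www.python.org', encoding='utf-8')
```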