
Python HTTP encoding

Let's break down HTTP encoding in Python, focusing on the two distinct things the term usually means: character encoding (how text becomes bytes) and Content-Encoding (how the response body is compressed).


This is a fundamental concept for web development, APIs, and scraping.


Character Encoding (The "Text" Encoding)

This is about how text characters (like 'A', 'é', '中', '😊') are represented as bytes for transmission over the network. The standard for the web is UTF-8.
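To make the byte-level difference concrete, here is a small illustration using only the standard library: the same characters produce different bytes under different codecs, and some codecs cannot represent some characters at all.

text = "François 中 😊"

# UTF-8 can represent everything; multi-byte sequences appear for non-ASCII characters
print(text.encode('utf-8'))              # b'Fran\xc3\xa7ois \xe4\xb8\xad \xf0\x9f\x98\x8a'

# Latin-1 represents 'ç' as the single byte 0xE7, but has no encoding for '中' or '😊'
print("François".encode('iso-8859-1'))   # b'Fran\xe7ois'
try:
    text.encode('iso-8859-1')
except UnicodeEncodeError as e:
    print("Cannot encode:", e)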

The Problem: Why It Matters

If a server sends text in a different encoding (like ISO-8859-1, also known as Latin-1) and your Python code assumes it's UTF-8, you'll get a UnicodeDecodeError when you decode the bytes yourself, or silently garbled text (mojibake) from APIs that substitute replacement characters.
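A two-line reproduction of the mismatch with plain bytes.decode(), which (unlike requests' .text, discussed below) raises instead of substituting characters:

latin1_bytes = b'Fran\xe7ois'            # "François" as sent by a Latin-1 server

print(latin1_bytes.decode('iso-8859-1')) # 'François' -- correct
try:
    latin1_bytes.decode('utf-8')         # 0xE7 followed by 'o' is not valid UTF-8
except UnicodeDecodeError as e:
    print("Decoding failed:", e)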

The Solution: Always Specify the Encoding

The most important rule: never rely on a default or assumed encoding. State the encoding explicitly (on the web that almost always means encoding='utf-8') whenever you decode bytes, open files, or override what a server declares.
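For instance, when writing fetched text to disk, pass the encoding to open() instead of relying on the platform default. A minimal sketch; the URL is httpbin's UTF-8 demo page and page.html is just an illustrative filename:

import requests

resp = requests.get('https://httpbin.org/encoding/utf8')
resp.encoding = 'utf-8'   # be explicit rather than trusting what was detected

# Write the decoded text back out with an explicit encoding as well.
with open('page.html', 'w', encoding='utf-8') as f:
    f.write(resp.text)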


Examples with the requests Library (Most Common)

The requests library is the de facto standard for making HTTP requests in Python. It handles encoding automatically, but you should still be aware of how to control it.

Scenario: A server sends text in a different encoding.

Let's simulate a server that sends a French string encoded in ISO-8859-1.

import requests

# Simulate a server that sends content in a non-UTF-8 encoding.
# "François" encoded in ISO-8859-1 (Latin-1): 'ç' is the single byte 0xE7.
iso_encoded_content = b'Fran\xe7ois'

# Build a mock response. The server declares UTF-8 in its headers,
# but the bytes are actually ISO-8859-1 -- the server lies!
response = requests.Response()
response.status_code = 200
response._content = iso_encoded_content
response.headers['Content-Type'] = 'text/html; charset=utf-8'
response.encoding = 'utf-8'  # what requests would normally pick up from that header

# --- The Problem ---
# response.text decodes with errors='replace', so it never raises
# UnicodeDecodeError; a wrong encoding silently produces mojibake instead.
print("Decoded with the declared encoding:", response.text)  # Fran�ois

# Decoding the raw bytes yourself *does* raise, because 0xE7 followed by
# 'o' is not a valid UTF-8 sequence.
try:
    response.content.decode('utf-8')
except UnicodeDecodeError as e:
    print("Caught expected error:", e)
    print("The declared encoding (UTF-8) doesn't match the actual bytes (ISO-8859-1).")

# --- The Solution: Manually Set the Encoding ---
# Override the encoding taken from the headers.
response.encoding = 'iso-8859-1'  # or 'latin-1'

# Now .text decodes correctly.
correct_text = response.text
print("\n--- SOLUTION ---")
print("Manually set encoding to 'iso-8859-1'")
print("Decoded text correctly:", correct_text)  # François
print("Type of the decoded text:", type(correct_text))

Key takeaway: response.text never raises UnicodeDecodeError (internally it decodes with errors='replace'), so a wrong encoding shows up as garbled text rather than an exception. If the output looks wrong, inspect the server's Content-Type header (response.headers['Content-Type']) and manually set response.encoding to the correct value; decoding response.content yourself is the way to surface a hard error.
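When the header is missing or wrong and you don't know the real charset, requests can also guess it from the bytes themselves via response.apparent_encoding (backed by charset_normalizer or chardet). A minimal sketch, using httpbin.org/html as an arbitrary test page:

import requests

response = requests.get('https://httpbin.org/html')

print("Declared by the server:", response.encoding)
print("Guessed from the bytes:", response.apparent_encoding)

# Fall back to the guess when no charset was declared (or the declared one looks wrong).
if response.encoding is None:
    response.encoding = response.apparent_encoding
print(response.text[:80])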


Content-Encoding (The "Compression" Encoding)

This is about compressing the body of the HTTP response to save bandwidth. The client (your Python code) tells the server what compression methods it understands via the Accept-Encoding header.

Common encodings:

  • gzip: the most common and most widely supported.
  • deflate: less common.
  • br: Brotli, a newer and more efficient compression used by many modern servers; requests only advertises and decodes it when an optional Brotli package (brotli or brotlicffi) is installed, which you can check as shown below.
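To see exactly which of these your own environment will advertise, requests exposes its default headers via requests.utils.default_headers(); the Accept-Encoding value depends on which optional compression libraries are installed. A minimal check:

import requests

# The Accept-Encoding requests sends by default: typically 'gzip, deflate',
# with 'br' (and, on newer urllib3, 'zstd') added only when the optional
# compression package is installed.
print(requests.utils.default_headers()['Accept-Encoding'])

# The same value can be confirmed on a real request:
resp = requests.get('https://httpbin.org/get')
print(resp.request.headers.get('Accept-Encoding'))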

How Python Handles It (Automatically)

The requests library handles this for you automatically. When you make a request, requests advertises the compression methods it supports in an Accept-Encoding header (gzip and deflate by default, plus br when a Brotli library is installed). If the server responds with Content-Encoding: gzip, requests automatically decompresses the response body before giving you the content.

You don't need to do anything special!

import requests

# httpbin's /gzip endpoint returns a gzip-compressed response
url = 'https://httpbin.org/gzip'
print("Sending request to:", url)

# requests advertises its supported compression automatically
# (Accept-Encoding: gzip, deflate -- plus br if Brotli support is installed)
response = requests.get(url)

# The server's response header shows which compression it used
print("\nServer's Content-Encoding header:", response.headers.get('Content-Encoding'))  # 'gzip'

# .content is the body *after* automatic decompression -- plain bytes, not gzip data
print("Type of response.content:", type(response.content))

# .text is the decoded string (httpbin declares JSON, decoded as UTF-8)
print("Decoded response text:", response.text)
# It contains a JSON object confirming the request was served gzipped.

Key takeaway: requests handles Content-Encoding for you. Your job is the character-encoding side, making sure the decompressed bytes are decoded into the right text.
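The one place the automatic decompression does not apply is response.raw when streaming; the short sketch below (using httpbin.org/gzip as a convenient test endpoint) verifies this and undoes the compression by hand with the standard gzip module.

import gzip
import requests

# With stream=True, response.raw gives the body exactly as it came off the
# wire -- still gzip-compressed -- because the automatic decoding is applied
# by .content / .text / iter_content(), not by raw.read().
resp = requests.get('https://httpbin.org/gzip', stream=True)
compressed = resp.raw.read()

print(resp.headers.get('Content-Encoding'))   # 'gzip'
print(compressed[:2])                         # b'\x1f\x8b' -- the gzip magic bytes

# Decompressing by hand shows what requests normally does for you:
print(gzip.decompress(compressed)[:60])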


Complete Best Practices Checklist

Here is a summary of how to handle HTTP encoding robustly in Python with requests.

import requests

def fetch_url(url):
    try:
        # 1. Make the request. requests negotiates and undoes compression
        #    (Content-Encoding) automatically.
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raises for 4xx / 5xx status codes

        # 2. Handle character encoding -- the important part.
        #    response.text never raises (it inserts replacement characters),
        #    so decode the raw bytes ourselves to actually *detect* a wrong
        #    or missing charset, falling back through sensible candidates.
        declared = response.encoding  # from the Content-Type header (may be None)
        detected = response.apparent_encoding if response.content else None  # guessed from the bytes
        text_content = None
        for encoding in (declared, detected, 'utf-8', 'latin-1'):
            if not encoding:
                continue
            try:
                text_content = response.content.decode(encoding)
                break
            except (UnicodeDecodeError, LookupError):
                print(f"Warning: decoding with {encoding!r} failed, trying the next candidate...")
        # 'latin-1' maps every byte to a character, so it never fails --
        # but the result may still be the wrong interpretation of the bytes.

        print(f"Successfully decoded content using encoding: {encoding}")
        print(f"Content (first 100 chars): {text_content[:100]}...")

        # 3. The raw (already decompressed) bytes, if you need them.
        raw_bytes = response.content
        print(f"Type of raw content: {type(raw_bytes)}")

        # 4. JSON responses: .json() handles decoding for you.
        if 'application/json' in response.headers.get('Content-Type', ''):
            json_data = response.json()
            print(f"Parsed JSON data: {json_data}")

    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")

# --- Example Usage ---
print("--- Example 1: Standard UTF-8 page ---")
fetch_url('https://www.google.com')

print("\n--- Example 2: API with JSON (handled automatically) ---")
fetch_url('https://api.github.com')  # returns JSON

print("\n--- Example 3: A page that might have encoding issues ---")
# httpbin.org/encoding/utf8 is a safe stand-in; in a real scraper this is
# where the problems discussed above show up.
fetch_url('https://httpbin.org/encoding/utf8')

Summary

Character Encoding
  • What it does: converts text characters (Unicode) to and from bytes.
  • How requests handles it: automatically uses the charset from the Content-Type header; for text/* responses with no declared charset it falls back to ISO-8859-1 (Latin-1).
  • Your responsibility: watch for garbled text or decoding errors; if they appear, inspect the headers and manually set response.encoding to the correct value (e.g., 'utf-8', 'latin-1').

Content-Encoding
  • What it does: compresses the HTTP response body (e.g., with gzip).
  • How requests handles it: automatically adds the Accept-Encoding header to requests and transparently decompresses the body based on the Content-Encoding header.
  • Your responsibility: nothing; requests handles this for you seamlessly.