
Python HTTP encoding

Let's break down HTTP encoding in Python, focusing on the two distinct things the term usually means: character encoding (how text becomes bytes) and Content-Encoding (how the response body is compressed).


This is a fundamental concept for web development, APIs, and scraping.


Character Encoding (The "Text" Encoding)

This is about how text characters (like 'A', 'é', '中', '😊') are represented as bytes for transmission over the network. The standard for the web is UTF-8.
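To make the byte-level difference concrete, here is a small illustration using only the standard library: the same characters produce different bytes under different codecs, and some codecs cannot represent some characters at all.

text = "François 中 😊"

# UTF-8 can represent everything; multi-byte sequences appear for non-ASCII characters
print(text.encode('utf-8'))              # b'Fran\xc3\xa7ois \xe4\xb8\xad \xf0\x9f\x98\x8a'

# Latin-1 represents 'ç' as the single byte 0xE7, but has no encoding for '中' or '😊'
print("François".encode('iso-8859-1'))   # b'Fran\xe7ois'
try:
    text.encode('iso-8859-1')
except UnicodeEncodeError as e:
    print("Cannot encode:", e)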

The Problem: Why It Matters

If a server sends text in a different encoding (like ISO-8859-1, also known as Latin-1) and your Python code assumes it's UTF-8, you'll get a UnicodeDecodeError when you decode the bytes yourself, or silently garbled text (mojibake) from APIs that substitute replacement characters.
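A two-line reproduction of the mismatch with plain bytes.decode(), which (unlike requests' .text, discussed below) raises instead of substituting characters:

latin1_bytes = b'Fran\xe7ois'            # "François" as sent by a Latin-1 server

print(latin1_bytes.decode('iso-8859-1')) # 'François' -- correct
try:
    latin1_bytes.decode('utf-8')         # 0xE7 followed by 'o' is not valid UTF-8
except UnicodeDecodeError as e:
    print("Decoding failed:", e)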

The Solution: Always Specify the Encoding

The most important rule: never rely on a default or assumed encoding. State the encoding explicitly (on the web that almost always means encoding='utf-8') whenever you decode bytes, open files, or override what a server declares.
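For instance, when writing fetched text to disk, pass the encoding to open() instead of relying on the platform default. A minimal sketch; the URL is httpbin's UTF-8 demo page and page.html is just an illustrative filename:

import requests

resp = requests.get('https://httpbin.org/encoding/utf8')
resp.encoding = 'utf-8'   # be explicit rather than trusting what was detected

# Write the decoded text back out with an explicit encoding as well.
with open('page.html', 'w', encoding='utf-8') as f:
    f.write(resp.text)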


Examples with the requests Library (Most Common)

The requests library is the de facto standard for making HTTP requests in Python. It handles encoding automatically, but you should still be aware of how to control it.

Scenario: A server sends text in a different encoding.

Let's simulate a server that sends a French string encoded in ISO-8859-1.

import requests

# Simulate a server that sends content in a non-UTF-8 encoding.
# "François" encoded in ISO-8859-1 (Latin-1): 'ç' is the single byte 0xE7.
iso_encoded_content = b'Fran\xe7ois'

# Build a mock response. The server declares UTF-8 in its headers,
# but the bytes are actually ISO-8859-1 -- the server lies!
response = requests.Response()
response.status_code = 200
response._content = iso_encoded_content
response.headers['Content-Type'] = 'text/html; charset=utf-8'
response.encoding = 'utf-8'  # what requests would normally pick up from that header

# --- The Problem ---
# response.text decodes with errors='replace', so it never raises
# UnicodeDecodeError; a wrong encoding silently produces mojibake instead.
print("Decoded with the declared encoding:", response.text)  # Fran�ois

# Decoding the raw bytes yourself *does* raise, because 0xE7 followed by
# 'o' is not a valid UTF-8 sequence.
try:
    response.content.decode('utf-8')
except UnicodeDecodeError as e:
    print("Caught expected error:", e)
    print("The declared encoding (UTF-8) doesn't match the actual bytes (ISO-8859-1).")

# --- The Solution: Manually Set the Encoding ---
# Override the encoding taken from the headers.
response.encoding = 'iso-8859-1'  # or 'latin-1'

# Now .text decodes correctly.
correct_text = response.text
print("\n--- SOLUTION ---")
print("Manually set encoding to 'iso-8859-1'")
print("Decoded text correctly:", correct_text)  # François
print("Type of the decoded text:", type(correct_text))

Key takeaway: response.text never raises UnicodeDecodeError (internally it decodes with errors='replace'), so a wrong encoding shows up as garbled text rather than an exception. If the output looks wrong, inspect the server's Content-Type header (response.headers['Content-Type']) and manually set response.encoding to the correct value; decoding response.content yourself is the way to surface a hard error.
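When the header is missing or wrong and you don't know the real charset, requests can also guess it from the bytes themselves via response.apparent_encoding (backed by charset_normalizer or chardet). A minimal sketch, using httpbin.org/html as an arbitrary test page:

import requests

response = requests.get('https://httpbin.org/html')

print("Declared by the server:", response.encoding)
print("Guessed from the bytes:", response.apparent_encoding)

# Fall back to the guess when no charset was declared (or the declared one looks wrong).
if response.encoding is None:
    response.encoding = response.apparent_encoding
print(response.text[:80])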


Content-Encoding (The "Compression" Encoding)

This is about compressing the body of the HTTP response to save bandwidth. The client (your Python code) tells the server what compression methods it understands via the Accept-Encoding header.

Common encodings:

  • gzip: the most common and most widely supported.
  • deflate: less common.
  • br: Brotli, a newer and more efficient compression used by many modern servers; requests only advertises and decodes it when an optional Brotli package (brotli or brotlicffi) is installed, which you can check as shown below.
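To see exactly which of these your own environment will advertise, requests exposes its default headers via requests.utils.default_headers(); the Accept-Encoding value depends on which optional compression libraries are installed. A minimal check:

import requests

# The Accept-Encoding requests sends by default: typically 'gzip, deflate',
# with 'br' (and, on newer urllib3, 'zstd') added only when the optional
# compression package is installed.
print(requests.utils.default_headers()['Accept-Encoding'])

# The same value can be confirmed on a real request:
resp = requests.get('https://httpbin.org/get')
print(resp.request.headers.get('Accept-Encoding'))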

How Python Handles It (Automatically)

The requests library handles this for you automatically. When you make a request, requests advertises the compression methods it supports in an Accept-Encoding header (gzip and deflate by default, plus br when a Brotli library is installed). If the server responds with Content-Encoding: gzip, requests automatically decompresses the response body before giving you the content.

You don't need to do anything special!

import requests

# httpbin's /gzip endpoint returns a gzip-compressed response
url = 'https://httpbin.org/gzip'
print("Sending request to:", url)

# requests advertises its supported compression automatically
# (Accept-Encoding: gzip, deflate -- plus br if Brotli support is installed)
response = requests.get(url)

# The server's response header shows which compression it used
print("\nServer's Content-Encoding header:", response.headers.get('Content-Encoding'))  # 'gzip'

# .content is the body *after* automatic decompression -- plain bytes, not gzip data
print("Type of response.content:", type(response.content))

# .text is the decoded string (httpbin declares JSON, decoded as UTF-8)
print("Decoded response text:", response.text)
# It contains a JSON object confirming the request was served gzipped.

Key takeaway: requests handles Content-Encoding for you. Your job is the character-encoding side, making sure the decompressed bytes are decoded into the right text.
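The one place the automatic decompression does not apply is response.raw when streaming; the short sketch below (using httpbin.org/gzip as a convenient test endpoint) verifies this and undoes the compression by hand with the standard gzip module.

import gzip
import requests

# With stream=True, response.raw gives the body exactly as it came off the
# wire -- still gzip-compressed -- because the automatic decoding is applied
# by .content / .text / iter_content(), not by raw.read().
resp = requests.get('https://httpbin.org/gzip', stream=True)
compressed = resp.raw.read()

print(resp.headers.get('Content-Encoding'))   # 'gzip'
print(compressed[:2])                         # b'\x1f\x8b' -- the gzip magic bytes

# Decompressing by hand shows what requests normally does for you:
print(gzip.decompress(compressed)[:60])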


Complete Best Practices Checklist

Here is a summary of how to handle HTTP encoding robustly in Python with requests.

import requests

def fetch_url(url):
    try:
        # 1. Make the request. requests negotiates and undoes compression
        #    (Content-Encoding) automatically.
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raises for 4xx / 5xx status codes

        # 2. Handle character encoding -- the important part.
        #    response.text never raises (it inserts replacement characters),
        #    so decode the raw bytes ourselves to actually *detect* a wrong
        #    or missing charset, falling back through sensible candidates.
        declared = response.encoding  # from the Content-Type header (may be None)
        detected = response.apparent_encoding if response.content else None  # guessed from the bytes
        text_content = None
        for encoding in (declared, detected, 'utf-8', 'latin-1'):
            if not encoding:
                continue
            try:
                text_content = response.content.decode(encoding)
                break
            except (UnicodeDecodeError, LookupError):
                print(f"Warning: decoding with {encoding!r} failed, trying the next candidate...")
        # 'latin-1' maps every byte to a character, so it never fails --
        # but the result may still be the wrong interpretation of the bytes.

        print(f"Successfully decoded content using encoding: {encoding}")
        print(f"Content (first 100 chars): {text_content[:100]}...")

        # 3. The raw (already decompressed) bytes, if you need them.
        raw_bytes = response.content
        print(f"Type of raw content: {type(raw_bytes)}")

        # 4. JSON responses: .json() handles decoding for you.
        if 'application/json' in response.headers.get('Content-Type', ''):
            json_data = response.json()
            print(f"Parsed JSON data: {json_data}")

    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")

# --- Example Usage ---
print("--- Example 1: Standard UTF-8 page ---")
fetch_url('https://www.google.com')

print("\n--- Example 2: API with JSON (handled automatically) ---")
fetch_url('https://api.github.com')  # returns JSON

print("\n--- Example 3: A page that might have encoding issues ---")
# httpbin.org/encoding/utf8 is a safe stand-in; in a real scraper this is
# where the problems discussed above show up.
fetch_url('https://httpbin.org/encoding/utf8')

Summary

Character Encoding
  • What it does: converts text characters (Unicode) to and from bytes.
  • How requests handles it: automatically uses the charset from the Content-Type header; for text/* responses with no declared charset it falls back to ISO-8859-1 (Latin-1).
  • Your responsibility: watch for garbled text or decoding errors; if they appear, inspect the headers and manually set response.encoding to the correct value (e.g., 'utf-8', 'latin-1').

Content-Encoding
  • What it does: compresses the HTTP response body (e.g., with gzip).
  • How requests handles it: automatically adds the Accept-Encoding header to requests and transparently decompresses the body based on the Content-Encoding header.
  • Your responsibility: nothing; requests handles this for you seamlessly.