杰瑞科技汇

Python Encoding and Decoding: How to Handle Characters and Bytes Correctly

Encoding and decoding are fundamental concepts in Python. Let's break them down in a clear, structured way.


The Core Idea: Bytes vs. Strings

Think of it like this:

  • String (str): A sequence of Unicode characters. It's an abstract concept, independent of how the text is stored. 'hello' and '世界' are both strings.
  • Bytes (bytes): A sequence of raw 8-bit values (integers from 0 to 255). This is how data is actually stored in files and transmitted over a network. Bytes have no meaning as text until you decode them.
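
To make the distinction concrete, here is a minimal sketch (the variable names `text` and `data` are illustrative only) comparing the same two characters as a str and as UTF-8 bytes:

```python
text = "世界"                 # a str: two Unicode characters
data = text.encode("utf-8")   # a bytes object holding the UTF-8 form

print(len(text))   # 2  (characters)
print(len(data))   # 6  (bytes)
print(text[0])     # 世  (indexing a str gives a one-character str)
print(data[0])     # 228 (indexing bytes gives an int from 0 to 255)
```

The length of a string counts characters, while the length of a bytes object counts bytes, so the two can differ for the same text.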

Encoding is the process of converting a String into Bytes. Decoding is the process of converting Bytes back into a String.

You must use a specific encoding (such as UTF-8 or ASCII) for this conversion. An encoding works like a lookup table that maps each character to a byte sequence and back.
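
As a rough illustration of the lookup-table idea, the sketch below encodes one word with three standard codecs. The word and the choice of codecs are just examples; the point is that the same string maps to different bytes under each encoding:

```python
word = "caf\u00e9"  # "café", with é written as an escape to avoid ambiguity

utf8_bytes = word.encode("utf-8")
latin1_bytes = word.encode("latin-1")
utf16_bytes = word.encode("utf-16-le")

print(utf8_bytes)    # b'caf\xc3\xa9'
print(latin1_bytes)  # b'caf\xe9'
print(utf16_bytes)   # b'c\x00a\x00f\x00\xe9\x00'

# Each byte sequence decodes back to the same string,
# but only with the codec that produced it.
print(utf8_bytes.decode("utf-8") == latin1_bytes.decode("latin-1"))  # True
```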


The Golden Rule

In almost every case, UTF-8 is the right encoding to use. It's the modern standard, can represent every Unicode character, and is backward-compatible with ASCII.

# This is the most important pattern to remember:
my_string = "hello"
# 1. Take a string and ENCODE it to bytes
my_bytes = my_string.encode('utf-8')
# 2. Take those bytes and DECODE them back to a string
my_string = my_bytes.decode('utf-8')
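
The two claims behind the golden rule can be checked directly. This is a small sketch with made-up sample strings: pure-ASCII text produces identical bytes under ASCII and UTF-8, and arbitrary Unicode text survives a UTF-8 round trip unchanged:

```python
ascii_text = "Hello, World!"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True: UTF-8 is a superset of ASCII

original = "na\u00efve 你好 🚀"  # "naïve" plus Chinese characters and an emoji
round_tripped = original.encode("utf-8").decode("utf-8")
print(round_tripped == original)  # True: nothing is lost
```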

Encoding: String → Bytes

You use the .encode() method on a string.

Example: Encoding a Simple String

# Our original string
my_string = "Hello, World!"
# Encode the string into bytes using UTF-8 encoding
# The result is a 'bytes' object, notice the 'b' prefix
encoded_bytes = my_string.encode('utf-8')
print(f"Original String: {my_string}")
print(f"Type of original: {type(my_string)}")
print("-" * 20)
print(f"Encoded Bytes: {encoded_bytes}")
print(f"Type of encoded: {type(encoded_bytes)}")
# You can see the raw byte values
print(f"Raw byte values: {list(encoded_bytes)}")

Output:

Original String: Hello, World!
Type of original: <class 'str'>
--------------------
Encoded Bytes: b'Hello, World!'
Type of encoded: <class 'bytes'>
Raw byte values: [72, 101, 108, 108, 111, 44, 32, 87, 111, 114, 108, 100, 33]

Example: Encoding with Different Characters (Unicode)

UTF-8 shines here because it can handle any character.

# A string with non-ASCII characters (emoji and Chinese)
my_string = "你好,世界! 🚀"
# Encode it
encoded_bytes = my_string.encode('utf-8')
print(f"Original String: {my_string}")
print(f"Encoded Bytes: {encoded_bytes}")
print(f"Raw byte values: {list(encoded_bytes)}")

Output:

Original String: 你好,世界! 🚀
Encoded Bytes: b'\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x8c\xe4\xb8\x96\xe7\x95\x8c! \xf0\x9f\x9a\x80'
Raw byte values: [228, 189, 160, 229, 165, 189, 239, 188, 140, 228, 184, 150, 231, 149, 140, 33, 32, 240, 159, 154, 128]

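Notice how each Chinese character (including the full-width comma) takes three bytes and the emoji takes four, while the ASCII characters take one byte each. This is normal for UTF-8, which is a variable-width encoding.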
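
To see the variable width directly, the following sketch counts the UTF-8 bytes for a few sample characters (the samples and the `byte_lengths` name are illustrative):

```python
samples = ["A", "\u00e9", "你", "🚀"]  # ASCII letter, é, Chinese character, emoji

byte_lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    byte_lengths[ch] = len(encoded)
    print(f"{ch!r}: U+{ord(ch):04X} -> {len(encoded)} byte(s)")

# 'A': U+0041 -> 1 byte(s)
# 'é': U+00E9 -> 2 byte(s)
# '你': U+4F60 -> 3 byte(s)
# '🚀': U+1F680 -> 4 byte(s)
```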


Decoding: Bytes → String

You use the .decode() method on a bytes object.

Example: Decoding Bytes Back to a String

# Let's use the bytes from our previous example
encoded_bytes = b'Hello, World!'
# Decode the bytes back into a string
decoded_string = encoded_bytes.decode('utf-8')
print(f"Encoded Bytes: {encoded_bytes}")
print(f"Type of encoded: {type(encoded_bytes)}")
print("-" * 20)
print(f"Decoded String: {decoded_string}")
print(f"Type of decoded: {type(decoded_string)}")

Output:

Encoded Bytes: b'Hello, World!'
Type of encoded: <class 'bytes'>
--------------------
Decoded String: Hello, World!
Type of decoded: <class 'str'>

Common Pitfall: The Wrong Encoding

This is where errors happen most often. If you decode bytes with the wrong encoding, you'll either get a UnicodeDecodeError or, when the wrong codec happens to accept the bytes, silently garbled text.

# Let's encode a string with a special character using UTF-8
my_string = "café"
correctly_encoded_bytes = my_string.encode('utf-8')
print(f"UTF-8 Bytes: {correctly_encoded_bytes}") # b'caf\xc3\xa9'
# Now, let's try to decode it using a different encoding, like ASCII
# ASCII only covers byte values 0-127, so the bytes \xc3 and \xa9 make it fail.
try:
    correctly_encoded_bytes.decode('ascii')
except UnicodeDecodeError as e:
    print("\n--- ERROR ---")
    print(f"Failed to decode as ASCII: {e}")

Output:

UTF-8 Bytes: b'caf\xc3\xa9'

--- ERROR ---
Failed to decode as ASCII: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

How to fix this? Always know the encoding of the data you're receiving. If the source doesn't declare one, UTF-8 is the most likely candidate, but treat that as a guess rather than a guarantee.
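
Not every wrong guess raises an exception. The sketch below, using the same "café" bytes, shows one case where the wrong codec succeeds and produces garbled text, followed by the standard `errors=` argument of `.decode()`, which lets you choose what happens to undecodable bytes instead of raising:

```python
data = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'

# Latin-1 assigns a character to every byte value, so this never fails;
# it just yields the wrong text.
garbled = data.decode("latin-1")
print(garbled)  # cafÃ©

# errors= controls how undecodable bytes are handled.
replaced = data.decode("ascii", errors="replace")           # each bad byte becomes U+FFFD
ignored = data.decode("ascii", errors="ignore")             # bad bytes are dropped
escaped = data.decode("ascii", errors="backslashreplace")   # bad bytes become \xNN text

print(replaced)  # caf��
print(ignored)   # caf
print(escaped)   # caf\xc3\xa9
```

These handlers avoid the exception but lose or alter information, so they suit logging and diagnostics better than data you need to preserve exactly.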


Practical Use Cases

Reading from and Writing to Files

When you open a file in text mode ('r', 'w'), Python handles the encoding/decoding for you automatically. If you don't pass an encoding, it uses the locale's preferred encoding, which is UTF-8 on most Linux and macOS systems but is often a legacy code page such as cp1252 on Windows. For that reason, it's best practice to be explicit.

# --- Writing to a file ---
data_to_write = "This is a test with émojis 🚀."
# 'w' for write mode, 'encoding="utf-8"' is explicit
with open("my_file.txt", "w", encoding="utf-8") as f:
    f.write(data_to_write)
print("File written successfully.")
# --- Reading from a file ---
# 'r' for read mode, 'encoding="utf-8"' is explicit
with open("my_file.txt", "r", encoding="utf-8") as f:
    data_read = f.read()
print(f"Data read from file: {data_read}")
# Verify they are the same
print(f"Original == Read from file? {data_to_write == data_read}")
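
To see what text mode does on your behalf, a file can also be opened in binary mode ('rb'), which returns the raw bytes without decoding. The following is a small sketch using a temporary directory and a hypothetical file name (`demo.txt`); it also prints the encoding Python would fall back to when none is given:

```python
import locale
import os
import tempfile

text = "\u00e9mojis 🚀"  # "émojis 🚀"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    # Text mode: Python encodes the string to UTF-8 bytes on write.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: no decoding happens, we get the stored bytes back.
    with open(path, "rb") as f:
        raw = f.read()

print(raw)                          # the UTF-8 bytes stored on disk
print(raw.decode("utf-8") == text)  # True

# The encoding open() uses in text mode when encoding= is omitted
# (the result depends on your platform and settings).
print(locale.getpreferredencoding(False))
```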

Working with Network Requests (e.g., requests library)

When you get a response from a website, the body arrives as bytes. You need to decode it before you can treat it as text.

# This example needs the third-party 'requests' library (pip install requests).
import requests

response = requests.get('https://example.com')

# The raw content from the server is in bytes
raw_body = response.content  # -> b'<html>...'

# The headers usually tell you the encoding, e.g., Content-Type: text/html; charset=UTF-8
# requests exposes the encoding it derived from the headers as response.encoding,
# which is None when it cannot determine one. In that case we guess 'utf-8'.
encoding = response.encoding or 'utf-8'

# Decode the content to a string
html_string = raw_body.decode(encoding)
# response.text does this decoding for you, using response.encoding
# (or a detected encoding when none is set).

print(html_string)
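
If you are handling headers yourself rather than through a library like requests, the charset can be pulled out of a Content-Type value with the standard library. This is a sketch under stated assumptions: the helper name `charset_from_content_type` and the UTF-8 fallback are illustrative choices, and the body bytes are fabricated locally instead of coming from a real server:

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset declared in a Content-Type value, or a default guess."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


# Stand-in for a response body received from the network.
body = "caf\u00e9".encode("utf-8")

charset = charset_from_content_type("text/html; charset=UTF-8")
print(charset)               # utf-8
print(body.decode(charset))  # café

# No charset declared: fall back to the default guess.
print(charset_from_content_type("text/html"))  # utf-8
```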

Summary Table

Operation    Method           Direction                Example
Encode       .encode()        String → Bytes           "hello".encode('utf-8') -> b'hello'
Decode       .decode()        Bytes → String           b'hello'.decode('utf-8') -> 'hello'
File Write   open(..., 'w')   String → File (Bytes)    f.write("text")
File Read    open(..., 'r')   File (Bytes) → String    content = f.read()