杰瑞科技汇

How to Use Python importunicode Correctly?

The phrase "python importunicode" does not correspond to a module in the Python standard library; it most likely refers to handling Unicode text in Python, which is a fundamental concept. Since Python 3, the default string type is Unicode, which makes this much easier than in Python 2.

Here’s a comprehensive guide covering the essentials.

The Core Concept: Unicode in Python 3

In Python 3, the str type is a sequence of Unicode code points (informally, characters). This is the most important thing to remember.

# This is a Unicode string
my_string = "Hello, 世界! 🌎"
# Check its type
print(type(my_string))  # <class 'str'>
# You can access individual Unicode characters
print(my_string[0])     # H
print(my_string[7])     # 世
print(my_string[11])    # 🌎 (This is a single Unicode code point)

The str type is an abstract representation of text. To store it in a file or send it over a network, you need to encode it into a specific byte representation (like UTF-8). When you read it back, you need to decode it from bytes back into a str.
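
As a rough illustration of the difference between text and its byte representation, the sketch below encodes the same string with two codecs and compares the sizes. The variable names are chosen for this example only, and the emoji is written as an escape so the code point is unambiguous.

```python
# A str holds text; bytes hold one particular encoding of that text.
text = "Hello, 世界! \U0001F30E"  # ends with the 🌎 emoji (U+1F30E)

utf8_bytes = text.encode("utf-8")        # str -> bytes
utf16_bytes = text.encode("utf-16-le")   # same text, different byte layout

# The number of code points and the number of bytes are different things.
char_count = len(text)
utf8_size = len(utf8_bytes)
utf16_size = len(utf16_bytes)
print(char_count, utf8_size, utf16_size)  # 12 19 26

# Decoding with the matching codec recovers the original text.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip == text)  # True
```

The same 12 code points take 19 bytes in UTF-8 and 26 bytes in UTF-16, which is why the encoding always has to be stated when text leaves the program.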


The Key Functions: encode() and decode()

encode(): From str to bytes

This method converts a Unicode string (str) into a sequence of bytes (bytes).

text = "café"
# Encode the string to bytes using UTF-8 encoding
utf8_bytes = text.encode('utf-8')
print(f"Original string: {text}")
print(f"Type: {type(text)}")
print(f"Encoded bytes: {utf8_bytes}")
print(f"Type: {type(utf8_bytes)}")

Output:

Original string: café
Type: <class 'str'>
Encoded bytes: b'caf\xc3\xa9'
Type: <class 'bytes'>

Notice how the é is represented by the two bytes \xc3\xa9. This is the UTF-8 encoding for that character.
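
UTF-8 is a variable-width encoding, so the two bytes used for é are just one case. The following sketch, with sample characters picked for illustration, prints how many bytes each one needs.

```python
# Each character occupies between one and four bytes in UTF-8.
samples = [
    "A",           # ASCII letter
    "\u00e9",      # é
    "\u4e16",      # 世
    "\U0001F30E",  # 🌎
]

widths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    widths[ch] = len(encoded)
    print(f"U+{ord(ch):04X} {ch!r}: {len(encoded)} byte(s) -> {encoded}")
```

ASCII characters stay at one byte, which is why plain English text looks identical in ASCII and UTF-8.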

decode(): From bytes to str

This method converts a sequence of bytes (bytes) back into a Unicode string (str).

# We have the bytes from the previous example
utf8_bytes = b'caf\xc3\xa9'
# Decode the bytes back into a string
original_text = utf8_bytes.decode('utf-8')
print(f"Bytes object: {utf8_bytes}")
print(f"Decoded string: {original_text}")
print(f"Type: {type(original_text)}")

Output:

Bytes object: b'caf\xc3\xa9'
Decoded string: café
Type: <class 'str'>
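
Decoding only gives back the original text when the codec matches the one used for encoding. The short sketch below decodes the same bytes twice to show what a mismatch can look like; note that the wrong codec here raises no error at all.

```python
utf8_bytes = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'

right = utf8_bytes.decode("utf-8")    # matching codec
wrong = utf8_bytes.decode("latin-1")  # mismatched codec: no error, wrong text

print(right)  # café
print(wrong)  # cafÃ©
```

Garbled output such as "cafÃ©" is a typical sign that UTF-8 bytes were decoded with a single-byte codec.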

Reading and Writing Files with Unicode

This is where encoding becomes critical. When you open a file, you must specify its encoding. The modern, recommended standard is UTF-8.

Writing to a File (open with encoding)

# List of strings with different scripts
lines_to_write = [
    "Hello from English!",
    "Hola desde español!",
    "مرحبا من العربية!", # Arabic
    "こんにちはから日本語!" # Japanese
]
# Use a 'with' statement for safe file handling
# The 'encoding="utf-8"' argument is the key part here
with open('my_unicode_file.txt', 'w', encoding='utf-8') as f:
    for line in lines_to_write:
        f.write(line + '\n')
print("File 'my_unicode_file.txt' written successfully.")

If you don't specify encoding='utf-8', Python will use your system's default encoding, which might not be what you expect and can lead to errors or data corruption, especially on Windows.
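
To see which default applies on a given machine, something like the following can help. It is a minimal sketch: the temporary directory and the file name sample.txt are placeholders for this example, not part of the article's files.

```python
import locale
import os
import tempfile

# The encoding open() falls back to when none is given depends on the platform.
default_encoding = locale.getpreferredencoding(False)
print(f"Locale default encoding: {default_encoding}")

# Writing and reading with an explicit encoding behaves the same everywhere.
sample = "Hola desde español!\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    with open(path, "rb") as f:
        raw = f.read()  # the exact bytes stored on disk
    with open(path, "r", encoding="utf-8") as f:
        restored = f.read()

print(raw)
print(restored == sample)
```

If the printed default is not a UTF-8 variant, any open() call without an encoding argument will read and write files in that other encoding.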

Reading from a File (open with encoding)

# Read the file we just created
# Again, specify the encoding to read it correctly
with open('my_unicode_file.txt', 'r', encoding='utf-8') as f:
    content = f.read()
print("\n--- File Contents ---")
print(content)
print("---------------------")

Output:

--- File Contents ---
Hello from English!
Hola desde español!
مرحبا من العربية!
こんにちはから日本語!
---------------------

Common Errors and How to Fix Them

UnicodeDecodeError

This happens when you try to read a file that is not encoded in the format you specified.

Scenario: You have a file saved with latin-1 encoding, but you try to read it as utf-8.

# Let's create a file with latin-1 encoding
# The pound sign '£' is encoded as the single byte 0xA3 in latin-1
price_bytes = b'The price is \xa320.' # This is a bytes object
with open('price_latin1.txt', 'wb') as f:
    f.write(price_bytes)
# Now, let's try to read it incorrectly as UTF-8
try:
    with open('price_latin1.txt', 'r', encoding='utf-8') as f:
        content = f.read()
except UnicodeDecodeError as e:
    print(f"Error caught: {e}")

Output:

Error caught: 'utf-8' codec can't decode byte 0xa3 in position 13: invalid start byte

Solution: You must know (or guess) the correct encoding of the source file and use it when reading. Keep in mind that latin-1 maps every possible byte to a character, so it never raises an error; a wrong guess produces garbled text instead.

# Correct way to read the latin-1 file
with open('price_latin1.txt', 'r', encoding='latin-1') as f:
    content = f.read()
print(content) # Output: The price is £20.
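
When the encoding really is unknown, one possible approach is to try a short list of candidates in order. The helper below is a hypothetical sketch written for this article (decode_with_fallback is not a standard function), and the candidate list is an assumption that should be adapted to the data at hand.

```python
def decode_with_fallback(raw, candidates=("utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # No candidate matched: mark undecodable bytes with U+FFFD instead of failing.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


text, used = decode_with_fallback(b"The price is \xa320.")
print(text, used)  # The price is £20. latin-1
```

UTF-8 goes first because it rejects most byte sequences that were not written as UTF-8, whereas latin-1 accepts anything and therefore only makes sense as the last candidate.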

UnicodeEncodeError

This happens when you try to write a string to a file or stream that cannot support all the characters in your string, and you haven't specified an encoding that can handle them.

Scenario: You try to print a string with an emoji to a console that doesn't support UTF-8 (rare these days, but possible).

import sys

text_with_emoji = "This has an emoji: 🚀"
# This will usually work on modern terminals, but might fail in an old one
# or when redirecting output to a file that expects a different encoding.
original_encoding = sys.stdout.encoding
try:
    # If the terminal's encoding is, for example, 'cp1252' (a common Windows encoding)
    # and you don't handle it, you'll get an error.
    sys.stdout.reconfigure(encoding='cp1252') # Simulate an old terminal (Python 3.7+)
    print(text_with_emoji)
except UnicodeEncodeError as e:
    print(f"Error caught: {e}")
finally:
    sys.stdout.reconfigure(encoding=original_encoding) # Restore the real encoding

Solution: Ensure the output stream (file, console, etc.) is configured to use a capable encoding like UTF-8. When writing to files, always specify encoding='utf-8'.
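
If the target encoding cannot be changed, the errors argument of encode() offers a way to degrade gracefully instead of raising. The sketch below compares three standard error handlers; cp1252 is used only as an example of a limited codec.

```python
text_with_emoji = "This has an emoji: \U0001F680"  # ends with the 🚀 emoji

# cp1252 has no rocket, so the default strict mode would raise UnicodeEncodeError.
replaced = text_with_emoji.encode("cp1252", errors="replace")
escaped = text_with_emoji.encode("cp1252", errors="backslashreplace")
dropped = text_with_emoji.encode("cp1252", errors="ignore")

print(replaced)  # b'This has an emoji: ?'
print(escaped)   # b'This has an emoji: \\U0001f680'
print(dropped)   # b'This has an emoji: '
```

All three lose information to some degree, so they are a fallback for display or logging rather than a substitute for writing UTF-8.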


Practical Example: Scraping a Web Page

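Web pages usually declare their encoding, either in the HTTP Content-Type header or in a meta charset tag, and you should respect it. The requests library decodes response.text using the charset named in the header; when the header names no charset its guess can be wrong, so it is worth checking response.encoding. BeautifulSoup then parses the decoded text.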

import requests
from bs4 import BeautifulSoup
# A URL that uses non-ASCII characters
url = 'https://zh.wikipedia.org/wiki/中华人民共和国'
try:
    # 1. Fetch the page content. Requests uses the encoding from the HTTP header.
    response = requests.get(url)
    response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
    # The encoding is often detected automatically from the HTTP headers
    print(f"Detected encoding by requests: {response.encoding}")
    # The .text property gives you a decoded Unicode string
    html_content = response.text
    # 2. Parse the HTML
    soup = BeautifulSoup(html_content, 'html.parser')
    # 3. Extract and print some text
    title = soup.find('title').text
    print(f"\nPage Title: {title}")
    first_paragraph = soup.find('p').text
    print(f"\nFirst Paragraph (first 100 chars): {first_paragraph[:100]}...")
except requests.exceptions.RequestException as e:
    print(f"Error fetching URL: {e}")
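
The idea of respecting the declared encoding can also be sketched without any network access, using only the standard library. The two helper functions below are hypothetical names invented for this example, and the UTF-8 default is an assumption, not something requests itself does.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


def decode_body(raw, content_type):
    """Decode raw response bytes using the charset declared in the header."""
    encoding = charset_from_content_type(content_type)
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # The declared charset is unknown to Python: fall back to UTF-8.
        return raw.decode("utf-8", errors="replace")


# Simulated response body, encoded the way a GB2312 page would send it.
body = "<title>中华人民共和国</title>".encode("gb2312")
print(decode_body(body, "text/html; charset=GB2312"))
```

With a real response, the bytes would come from response.content and the header from response.headers, while response.text performs a comparable decoding step on your behalf.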

Summary

| Task | Python 3 Method / Concept | Key Takeaway |
| --- | --- | --- |
| Default Text Type | str is Unicode by default. | You don't need to do anything special to create a Unicode string. |
| Convert String to Bytes | my_string.encode('utf-8') | Use .encode() when saving to a file or sending over a network. |
| Convert Bytes to String | my_bytes.decode('utf-8') | Use .decode() when reading from a file or receiving data. |
| Reading/Writing Files | open('file.txt', 'r', encoding='utf-8') | Always specify encoding='utf-8' when opening files for text. |
| Common Error | UnicodeDecodeError / UnicodeEncodeError | This almost always means you have a mismatch between your data and encoding. |