杰瑞科技汇

What Is the Difference Between str and unicode in Python?

The difference between str and unicode is a fundamental topic in Python, especially for anyone who handles text. The sections below break it down.

The Short Answer (TL;DR)

  • In Python 2, str was a sequence of bytes, and unicode was a sequence of abstract characters. Python 2 converted between them implicitly with the ASCII codec, so any non-ASCII text had to be converted explicitly.
  • In Python 3, this was simplified. There is only one text type: str. This str type is now a sequence of Unicode characters, making it much easier to handle international text.
  • The byte-based type from Python 2 was renamed to bytes in Python 3.

The Detailed Explanation

To understand the difference, you have to look at the two major versions of Python.


Part 1: Python 2 (The "Old Way" with str vs. unicode)

In Python 2, there were two distinct types for representing text:

str (The Byte String)

  • What it is: A sequence of bytes. It has no inherent knowledge of which characters those bytes represent. It's just raw data.
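  • Encoding: No encoding is stored with the object; it is implicit and assumed. When Python 2 has to convert a str to unicode on its own, it uses the default codec, which is ASCII (see sys.getdefaultencoding()) regardless of your locale. This is a common source of bugs.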
  • Analogy: Think of str as a box of electrical wires. The wires themselves don't mean anything; you need a schematic (an encoding) to know which wire corresponds to which signal (character).
# In Python 2
# This is a byte string: five ASCII bytes.
my_str = "hello"
# len() counts bytes
print len(my_str)  # Output: 5
# What if you put a non-ASCII character in a str literal?
# Without a "# -*- coding: utf-8 -*-" line at the top of the file, this is a SyntaxError.
# With that line, the literal simply holds the UTF-8 bytes, so len() returns 6, not 5.
# my_str = "héllo"

unicode (The Text String)

  • What it is: A sequence of abstract characters. It's a proper, universal representation of text that can hold any character from any language (e.g., 'A', 'é', '世', '😂').
  • Encoding: It is decoupled from any specific byte encoding. It's the "ideal" form of text in memory.
  • Analogy: Think of unicode as the text you see in a word processor. It's the abstract idea of the characters, independent of how they are stored on disk or sent over a network.
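# In Python 2 (source file declared as UTF-8 with "# -*- coding: utf-8 -*-")
# This is a unicode string. It can hold any character.
my_unicode = u"héllo"
# len() counts characters, not bytes
print len(my_unicode)  # Output: 5
# It can also hold emojis
my_unicode_emoji = u"hello 😂"
# Output: 7 on a "wide" build (5 letters + 1 space + 1 emoji);
# 8 on a "narrow" build, where the emoji is stored as a surrogate pair
print len(my_unicode_emoji)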

The Bridge: Encoding and Decoding

To use text from the real world (like reading from a file or a network), you must convert between str and unicode.

  • Decoding: Converting a str (bytes) to unicode. You are decoding the bytes into characters.
    • my_unicode = my_str.decode('utf-8')
  • Encoding: Converting a unicode to str (bytes). You are encoding the characters into bytes.
    • my_str = my_unicode.encode('utf-8')

Python 2 Example:

# We have some text with an accent.
# The 'u' prefix makes it a unicode string.
text_unicode = u"café"
# Let's encode it to UTF-8 bytes
text_utf8_bytes = text_unicode.encode('utf-8') # This is a 'str' in Python 2
print repr(text_utf8_bytes)  # Output: 'caf\xc3\xa9'
print len(text_utf8_bytes) # Output: 5 (3 letters + 2 bytes for 'é')
# Now, let's decode it back to unicode
text_unicode_again = text_utf8_bytes.decode('utf-8')
print repr(text_unicode_again) # Output: u'caf\xe9' (one code point for 'é')
print len(text_unicode_again) # Output: 4

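The Problem: Python 2 silently converted between str and unicode with the ASCII codec whenever you mixed the two. If you forgot to encode or decode explicitly at the right time, any non-ASCII data triggered a UnicodeDecodeError or a UnicodeEncodeError, which was a huge source of frustration for Python 2 developers.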
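
Both error types are easy to reproduce. The following is a minimal sketch in Python 3 syntax that forces the ASCII codec by hand, which approximates what Python 2 did behind the scenes; the variable names are illustrative only.

```python
# Python 3 sketch: reproduce both error types by forcing the ASCII codec,
# the codec Python 2 used for implicit str <-> unicode conversion.
utf8_bytes = b"caf\xc3\xa9"   # UTF-8 encoding of "café"
text = "caf\u00e9"            # the same word as text

try:
    utf8_bytes.decode("ascii")    # byte 0xc3 is outside the ASCII range
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

try:
    text.encode("ascii")          # 'é' has no ASCII representation
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

print(type(decode_error).__name__)  # UnicodeDecodeError
print(type(encode_error).__name__)  # UnicodeEncodeError
```

Decoding fails when bytes fall outside the codec's range; encoding fails when a character has no representation in the target codec.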


Part 2: Python 3 (The "New Way" with str and bytes)

Python 3 fixed this confusion by making a clear distinction between text (str) and binary data (bytes), and by refusing to convert between the two implicitly.

str (The Text String)

  • What it is: This is now the one and only text type. It is a sequence of Unicode characters, just like unicode was in Python 2.
  • Encoding: It is inherently Unicode: a str holds code points, not bytes. No more guessing. len() counts characters (strictly, code points).
  • Analogy: This is the modern, universal text type. It's what you use for all your string manipulation.
# In Python 3
# This is a 'str' object, and it's Unicode by default.
my_str = "hello"
print(type(my_str)) # Output: <class 'str'>
# len() counts characters
print(len(my_str)) # Output: 5
# It handles non-ASCII characters seamlessly
my_str_with_accents = "héllo"
print(len(my_str_with_accents)) # Output: 5
# It also handles emojis
my_str_with_emoji = "hello 😂"
print(len(my_str_with_emoji)) # Output: 7
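
To make the "characters, not bytes" point concrete, here is a small sketch (Python 3) comparing the length of the same text as a str with its length once encoded. The choice of UTF-8 and UTF-16 is only for illustration.

```python
text = "hello \U0001F602"          # same as "hello 😂"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")   # little-endian, no byte-order mark

print(len(text))    # 7 code points
print(len(utf8))    # 10 bytes: 6 ASCII bytes + 4 bytes for the emoji
print(len(utf16))   # 16 bytes: 6 * 2 bytes + a 4-byte surrogate pair
```

The str length stays at 7 whatever encoding is later used to store or transmit it; only the byte count changes.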

bytes (The Byte String)

  • What it is: This is the new name for the old str type. It is a sequence of bytes. It is for raw binary data, not text.
  • Encoding: It has no encoding of its own. You must explicitly tell Python how to encode a str into bytes or decode bytes into a str.
  • Analogy: This is for things like images, network packets, or file formats where the data isn't meant to be interpreted as text.
# In Python 3
# This is a 'bytes' object.
my_bytes = b"hello" # The 'b' prefix is crucial
print(type(my_bytes)) # Output: <class 'bytes'>
# len() counts bytes
print(len(my_bytes)) # Output: 5
# You cannot mix str and bytes directly
# This will raise a TypeError
# my_str + my_bytes 
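
As a runnable sketch of that last point (Python 3), the snippet below triggers the TypeError and then shows the two explicit ways around it. The names combined, as_text and as_bytes are illustrative, and UTF-8 is an assumed encoding.

```python
my_str = "hello"
my_bytes = b" world"

try:
    combined = my_str + my_bytes      # str + bytes is not allowed
except TypeError:
    combined = None

# Convert one side explicitly so both operands have the same type.
as_text = my_str + my_bytes.decode("utf-8")
as_bytes = my_str.encode("utf-8") + my_bytes

print(as_text)   # hello world
print(as_bytes)  # b'hello world'
```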

The Bridge: Encoding and Decoding (Simplified)

The concept is the same, but the types are now more clearly named.

  • Encoding: Converting a str to bytes.
    • my_bytes = my_str.encode('utf-8')
  • Decoding: Converting bytes to a str.
    • my_str = my_bytes.decode('utf-8')

Python 3 Example:

# We have a unicode string (this is just a 'str' in Python 3)
text_str = "café"
# Let's encode it to UTF-8 bytes
text_bytes = text_str.encode('utf-8')
print(text_bytes) # Output: b'caf\xc3\xa9'
print(len(text_bytes)) # Output: 5
# Now, let's decode it back to a string
text_str_again = text_bytes.decode('utf-8')
print(text_str_again) # Output: café
print(len(text_str_again)) # Output: 4

Summary Table

| Feature | Python 2 | Python 3 |
| --- | --- | --- |
| Text type | unicode | str |
| Text type behavior | Sequence of abstract characters | Sequence of abstract characters (Unicode) |
| Byte type | str | bytes |
| Byte type behavior | Sequence of bytes (encoding is implicit) | Sequence of bytes (no encoding) |
| len() on text | Counts characters | Counts characters |
| len() on bytes | Counts bytes | Counts bytes |
| Prefix for text literals | u"..." for unicode | No prefix needed for str |
| Prefix for byte literals | b"..." for bytes | b"..." for bytes |
| Conversion: bytes to text | my_str.decode('utf-8') | my_bytes.decode('utf-8') |
| Conversion: text to bytes | my_unicode.encode('utf-8') | my_str.encode('utf-8') |

Key Takeaway for Modern Python (3.x)

Use str for all your text. It's Unicode by default and handles international characters beautifully. Only use bytes when you are dealing with raw binary data, like reading a file in binary mode, handling network sockets, or working with binary file formats (e.g., PNG, PDF). When you need to move text between your program and the outside world (files, network, console), you will explicitly encode it to bytes or decode it from bytes. This explicitness is what makes Python 3 so much cleaner and less error-prone for text handling.
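
A minimal sketch of that boundary, assuming a scratch file created with tempfile and UTF-8 as the chosen encoding (both are illustrative choices): text is encoded on the way out, and the same file can be read back either as raw bytes or as decoded text.

```python
import os
import tempfile

text = "caf\u00e9"  # "café"

# Hypothetical scratch file; any writable path would do.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Text mode: Python encodes str to bytes using the encoding we name.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: we get the raw bytes back, with no decoding applied.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode again: Python decodes the bytes back into a str.
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()
finally:
    os.remove(path)

print(raw)                 # b'caf\xc3\xa9'
print(round_trip == text)  # True
```

Naming the encoding in open() avoids depending on the platform default, which can differ between machines.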
