What Is the Difference Between str and unicode in Python? - 杰瑞科技汇

The difference between str and unicode is a fundamental topic for anyone dealing with text in Python, and the answer depends on which major version you are using. Let's break it down.

The Short Answer (TL;DR)

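In Python 2, str was a sequence of bytes, and unicode was a sequence of abstract characters. Python 2 would convert between them implicitly using ASCII, which failed on any non-ASCII data, so in practice you had to convert between them explicitly.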
In Python 3, this was simplified. There is only one text type: str. This str type is now a sequence of Unicode characters, making it much easier to handle international text.
The byte-based type from Python 2 was renamed to bytes in Python 3.

The Detailed Explanation

To understand the difference, you have to look at the two major versions of Python.

Part 1: Python 2 (The "Old Way" with `str` vs. `unicode`)

In Python 2, there were two distinct types for representing text:

`str` (The Byte String)

What it is: A sequence of bytes. It has no inherent knowledge of which characters those bytes represent. It's just raw data.
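Encoding: The encoding is implicit. A str does not record which encoding produced its bytes, and whenever Python 2 has to convert a str to unicode on its own it assumes ASCII (the value of sys.getdefaultencoding()), whatever encoding the data actually uses. This is a common source of bugs.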
Analogy: Think of str as a box of electrical wires. The wires themselves don't mean anything; you need a schematic (an encoding) to know which wire corresponds to which signal (character).

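# In Python 2
# This is a byte string; with ASCII-only content it holds one byte per character.
my_str = "hello"
# len() counts bytes
print len(my_str)  # Output: 5
# What if you try to put a non-ASCII character?
# Without a source encoding declaration (e.g. "# -*- coding: utf-8 -*-"),
# Python 2 rejects the file with a SyntaxError.
# With a UTF-8 declaration, the literal simply stores 6 bytes ('é' takes 2):
# my_str = "héllo"
# print len(my_str)  # Output: 6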
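The wiring-schematic analogy can be made concrete. The following is a minimal sketch, written for Python 3 (where the byte type is spelled bytes) so that it runs on a current interpreter; the variable names are illustrative only. It decodes the same two bytes under two different encodings and gets two different pieces of text.

```python
# The same two bytes, read with two different "schematics" (encodings).
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # one character: 'é'
as_latin1 = raw.decode("latin-1")  # two characters: 'Ã©'

print(len(raw), len(as_utf8), len(as_latin1))  # 2 1 2
print(as_utf8 == as_latin1)                    # False
```

Nothing in the bytes themselves says which reading is correct; that knowledge has to come from outside the data.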

`unicode` (The Text String)

What it is: A sequence of abstract characters. It's a proper, universal representation of text that can hold any character from any language (e.g., 'A', 'é', '世', '😂').
Encoding: It is decoupled from any specific byte encoding. It's the "ideal" form of text in memory.
Analogy: Think of unicode as the text you see in a word processor. It's the abstract idea of the characters, independent of how they are stored on disk or sent over a network.

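# In Python 2 (source file saved as UTF-8 with a "# -*- coding: utf-8 -*-" line)
# This is a unicode string. It can hold any character.
my_unicode = u"héllo"
# len() counts characters, not bytes
print len(my_unicode)  # Output: 5
# It can also hold emojis
my_unicode_emoji = u"hello 😂"
# Output: 7 on a "wide" build (5 letters + 1 space + 1 emoji);
# a "narrow" build stores the emoji as a surrogate pair and reports 8.
print len(my_unicode_emoji)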
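To illustrate what "decoupled from any specific byte encoding" means, here is a small sketch in Python 3, where str plays the role that unicode played in Python 2. The choice of encodings is just an example. The text has a single length in characters, while its size in bytes depends entirely on the encoding chosen at output time.

```python
text = "hello 😂"

# Code points are abstract numbers, independent of any byte layout.
code_points = [hex(ord(ch)) for ch in text]

# The same text occupies a different number of bytes under each encoding.
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

print(len(text))        # 7
print(code_points[-1])  # 0x1f602
print(sizes)            # {'utf-8': 10, 'utf-16-le': 16, 'utf-32-le': 28}
```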

The Bridge: Encoding and Decoding

To use text from the real world (like reading from a file or a network), you must convert between str and unicode.

Decoding: Converting a str (bytes) to unicode. You are decoding the bytes into characters.
- my_unicode = my_str.decode('utf-8')
Encoding: Converting a unicode to str (bytes). You are encoding the characters into bytes.
- my_str = my_unicode.encode('utf-8')

Python 2 Example:

# We have some text with an accent
# The 'u' prefix makes it a unicode string
text_unicode = u"café"
# Let's encode it to UTF-8 bytes
text_utf8_bytes = text_unicode.encode('utf-8') # This is a 'str' in Python 2
print repr(text_utf8_bytes)  # Output: 'caf\xc3\xa9'
print len(text_utf8_bytes) # Output: 5 (3 ASCII letters + 2 bytes for 'é')
# Now, let's decode it back to unicode
text_unicode_again = text_utf8_bytes.decode('utf-8')
print repr(text_unicode_again) # Output: u'caf\xe9' (a single code point for 'é')
print len(text_unicode_again) # Output: 4

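The Problem: If you forgot to encode/decode at the right time, Python 2 would attempt an implicit ASCII conversion whenever str and unicode met. Code that worked with ASCII-only input would then fail with a UnicodeDecodeError or a UnicodeEncodeError as soon as non-ASCII data appeared, which was a huge source of frustration for Python 2 developers.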
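The two exception types are easy to reproduce. The sketch below uses Python 3, where the conversions have to be requested explicitly, so it shows the same failures that Python 2 triggered implicitly rather than reproducing Python 2 behavior itself. The errors="replace" handler at the end is one standard way to trade the exception for lossy output.

```python
text = "café"
data = text.encode("utf-8")  # b'caf\xc3\xa9'

# ASCII cannot represent 'é', so encoding fails.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    encode_error = exc.reason

# The bytes 0xC3 0xA9 are not valid ASCII, so decoding fails.
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    decode_error = exc.reason

# Error handlers avoid the exception at the cost of losing information.
lossy_bytes = text.encode("ascii", errors="replace")  # b'caf?'
lossy_text = data.decode("ascii", errors="replace")   # 'caf' plus two U+FFFD

print(encode_error)
print(decode_error)
print(lossy_bytes, len(lossy_text))
```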

Part 2: Python 3 (The "New Way" with `str` and `bytes`)

Python 3 fixed this confusion by making a clear distinction.

`str` (The Text String)

What it is: This is now the one and only text type. It is a sequence of Unicode characters, just like unicode was in Python 2.
Encoding: It is inherently Unicode. No more guessing. len() counts characters (more precisely, Unicode code points).
Analogy: This is the modern, universal text type. It's what you use for all your string manipulation.

# In Python 3
# This is a 'str' object, and it's Unicode by default.
my_str = "hello"
print(type(my_str)) # Output: <class 'str'>
# len() counts characters
print(len(my_str)) # Output: 5
# It handles non-ASCII characters seamlessly
my_str_with_accents = "héllo"
print(len(my_str_with_accents)) # Output: 5
# It also handles emojis
my_str_with_emoji = "hello 😂"
print(len(my_str_with_emoji)) # Output: 7

`bytes` (The Byte String)

What it is: This is the new name for the old str type. It is a sequence of bytes. It is for raw binary data, not text.
Encoding: It has no encoding of its own. You must explicitly tell Python how to encode a str into bytes or decode bytes into a str.
Analogy: This is for things like images, network packets, or file formats where the data isn't meant to be interpreted as text.

# In Python 3
# This is a 'bytes' object.
my_bytes = b"hello" # The 'b' prefix is crucial
print(type(my_bytes)) # Output: <class 'bytes'>
# len() counts bytes
print(len(my_bytes)) # Output: 5
# You cannot mix str and bytes directly
# This will raise a TypeError
# my_str + my_bytes
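The refusal to mix the two types can be seen directly. This is a minimal Python 3 sketch with illustrative variable names: concatenation raises a TypeError, equality is simply False, and the fix is to convert one side explicitly so both operands have the same type.

```python
greeting = "hello"
suffix = b" world"

# Mixing str and bytes is rejected instead of being silently converted.
try:
    greeting + suffix
except TypeError as exc:
    message = str(exc)

# A str and a bytes object never compare equal, even with identical ASCII content.
same = ("hello" == b"hello")

# Convert one side explicitly, depending on which type you want to end up with.
combined_text = greeting + suffix.decode("utf-8")
combined_bytes = greeting.encode("utf-8") + suffix

print(message)
print(same)            # False
print(combined_text)   # hello world
print(combined_bytes)  # b'hello world'
```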

The Bridge: Encoding and Decoding (Simplified)

The concept is the same, but the types are now more clearly named.

Encoding: Converting a str to bytes.
- my_bytes = my_str.encode('utf-8')
Decoding: Converting bytes to a str.
- my_str = my_bytes.decode('utf-8')

Python 3 Example:

# We have a unicode string (this is just a 'str' in Python 3)
text_str = "café"
# Let's encode it to UTF-8 bytes
text_bytes = text_str.encode('utf-8')
print(text_bytes) # Output: b'caf\xc3\xa9'
print(len(text_bytes)) # Output: 5
# Now, let's decode it back to a string
text_str_again = text_bytes.decode('utf-8')
print(text_str_again) # Output: café
print(len(text_str_again)) # Output: 4

Summary Table

Feature	Python 2	Python 3
Text Type	`unicode`	`str`
Text Type Behavior	Sequence of Unicode characters	Sequence of Unicode characters
Byte Type	`str`	`bytes`
Byte Type Behavior	Sequence of bytes (implicitly converted to text as ASCII)	Sequence of bytes (never implicitly converted to text)
`len()` on Text	Counts characters	Counts characters
`len()` on Bytes	Counts bytes	Counts bytes
Text Literal	`u"..."`	`"..."` (no prefix needed)
Byte Literal	`"..."` (`b"..."` also accepted since Python 2.6)	`b"..."`
Decoding (bytes to text)	`my_str.decode('utf-8')`	`my_bytes.decode('utf-8')`
Encoding (text to bytes)	`my_unicode.encode('utf-8')`	`my_str.encode('utf-8')`

Key Takeaway for Modern Python (3.x)

Use str for all your text. It's Unicode by default and handles international characters without extra work. Only use bytes when you are dealing with raw binary data, like reading a file in binary mode, handling network sockets, or working with binary file formats (e.g., PNG, PDF). When text crosses the boundary between your program and the outside world (files, network, console), it is encoded to bytes or decoded from bytes at that boundary, either explicitly with encode()/decode() or by a text-mode file object opened with a stated encoding. Keeping the conversion at the boundary, and never mixing the two types implicitly, is what makes text handling in Python 3 cleaner and less error-prone.
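As an illustration of converting at the boundary, here is a small sketch that writes text to a file and reads it back both ways. The file name sample.txt and the sample text are arbitrary choices for the example. It uses a temporary directory so it leaves nothing behind.

```python
import os
import tempfile

text = "café 世界"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")

    # Text mode: the file object encodes str to bytes using the stated encoding.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: you get the raw bytes exactly as stored on disk.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode again: the file object decodes the bytes back to str.
    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()

print(len(text), len(raw))    # 7 12
print(round_tripped == text)  # True
```

Passing encoding="utf-8" explicitly avoids depending on the platform default, which can differ between systems.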
