What Is the Difference Between str and unicode in Python? - 杰瑞科技汇

This is a fundamental topic in Python, especially for developers who have worked with both Python 2 and Python 3. Let's break it down.


The Short Answer (TL;DR)

In Python 3: str is the native string type, and it is fundamentally Unicode. The str() function creates a Unicode string. The old unicode type from Python 2 has been completely removed.
In Python 2: str was a sequence of bytes, while unicode was a sequence of Unicode code points. str() created a byte string, and unicode() created a Unicode string.

The rest of this answer will explain the "why" and the details behind this distinction.

The Core Problem: Bytes vs. Characters

To understand str and unicode, you must first understand the difference between bytes and characters.

Bytes: A byte is a unit of digital information that consists of 8 bits. It's just a number from 0 to 255. It has no inherent meaning on its own. It could represent part of an image, a piece of an executable program, or a character.
Characters: A character is an abstract symbol, like the letter 'A', the euro sign '€', or a Chinese character '字'. How a computer stores this character is the key.

Historically, different systems used different mappings of bytes to characters. This led to chaos. For example, the byte 0xE9 could mean 'é' on one system and something completely different on another.
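
To make the ambiguity concrete, here is a small Python 3 sketch that decodes the same byte with two legacy code pages. The codec names `latin-1` and `cp1251` are chosen here only for illustration; the article does not name specific systems.

```python
# One byte, two legacy encodings, two different characters.
raw = bytes([0xE9])

as_latin1 = raw.decode("latin-1")   # Western European (ISO 8859-1)
as_cp1251 = raw.decode("cp1251")    # Windows Cyrillic

print(as_latin1)  # é
print(as_cp1251)  # й
```

The byte never changes; only the mapping used to interpret it does.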

The Solution: Unicode

Unicode is a universal character set. It aims to assign a unique number (called a code point) to every character in every language in the world. This code point is typically written as U+xxxx (e.g., U+00E9 for 'é', U+4E2D for '中').
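
As a quick illustration in Python 3, the built-in `ord()` and `chr()` functions convert between a character and its code point. The variable names below are made up for this sketch.

```python
# ord() maps a character to its code point; chr() goes the other way.
code_point_e_acute = ord("é")
char_from_code_point = chr(0x4E2D)

print(f"U+{code_point_e_acute:04X}")  # U+00E9
print(char_from_code_point)           # 中
```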


Unicode itself is just a giant lookup table. It doesn't specify how to store these code points in memory. That's the job of an encoding.

An encoding is a set of rules for converting between Unicode code points and a sequence of bytes.

UTF-8: The most common encoding. It's a variable-width encoding. It uses 1 byte for ASCII characters (which is why it's so space-efficient for English text) and up to 4 bytes for other characters.
UTF-16: Uses 2 or 4 bytes per character. Common in Windows and Java environments.
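ASCII: A 7-bit encoding that can only represent 128 characters (unaccented Latin letters, digits, punctuation, and control codes). UTF-8 encodes these characters with the same byte values, so any ASCII text is also valid UTF-8.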
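
The byte counts described above can be checked with a short Python 3 sketch. The sample characters are arbitrary, and `utf-16-le` is used instead of plain `utf-16` only so that the byte-order mark does not inflate the counts.

```python
# Encode a few characters and compare how many bytes each encoding needs.
samples = ["A", "é", "中", "😀"]

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")  # "-le" avoids the 2-byte byte-order mark
    print(f"U+{ord(ch):04X}  utf-8: {len(utf8)} bytes  utf-16: {len(utf16)} bytes")

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
utf16_lengths = [len(ch.encode("utf-16-le")) for ch in samples]

# ASCII-only text produces identical bytes under ASCII and UTF-8.
ascii_matches_utf8 = "hello".encode("ascii") == "hello".encode("utf-8")
```

In this sample UTF-8 grows from 1 to 4 bytes as the code points get larger, while UTF-16 stays at 2 bytes until it needs a 4-byte surrogate pair for the emoji.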

Python 2: The Two-String World

This is where str and unicode were both necessary.

`str` in Python 2

A str object was a sequence of bytes. It had no idea what those bytes meant. It was just a bag of data.


# Python 2
s = "hello"
print type(s)      # <type 'str'>
print len(s)       # 5 (5 bytes)
print repr(s)      # 'hello' (looks like characters, but it's bytes)
# If you try to put non-ASCII characters in a str, Python 2 gets confused
# unless you declare the source file encoding.
# s = "héllo" # This would raise a SyntaxError in a default .py file

`unicode` in Python 2

A unicode object was a sequence of Unicode code points. It understood characters.

# Python 2
# (a non-ASCII literal needs a "# -*- coding: utf-8 -*-" line at the top of the file)
u = u"héllo"
print type(u)      # <type 'unicode'>
print len(u)       # 5 (5 characters)
print repr(u)      # u'h\xe9llo' (\xe9 is an escape for the code point U+00E9, not a byte; the object holds the character 'é')
# You can create a unicode object from a str by decoding it
s = "h\xc3\xa9llo" # This is the UTF-8 byte representation of "héllo"
u = s.decode('utf-8')
print u            # héllo

The Bridge: `encode()` and `decode()`

To move between the two worlds, you had to explicitly convert them:

unicode_string.encode(encoding): Turns a unicode string into a str (bytes) using a specific encoding.
str_string.decode(encoding): Turns a str (bytes) into a unicode string, assuming it was encoded with a specific encoding.

Example in Python 2:

# A string of bytes, encoded in UTF-8
utf8_bytes = "Caf\xc3\xa9" # Represents "Café"
# To work with it as characters, you must DECODE it
unicode_string = utf8_bytes.decode('utf-8')
print repr(unicode_string) # u'Caf\xe9' (\xe9 is the code point U+00E9, the character 'é')
print type(unicode_string) # <type 'unicode'>
# To save it to a file or send it over a network, you must ENCODE it
back_to_bytes = unicode_string.encode('utf-8')
print repr(back_to_bytes) # 'Caf\xc3\xa9'
print type(back_to_bytes)  # <type 'str'>

Python 3: The One-String World (The Modern Way)

Python 3 was designed to solve the ambiguities of Python 2. The philosophy is: "Text is text, and bytes are bytes. Never the twain shall meet."
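
A minimal Python 3 sketch of that separation, using made-up variable names: combining text and bytes fails immediately rather than being converted behind the scenes.

```python
text = "Café"
data = b"Caf\xc3\xa9"

try:
    combined = text + data  # str + bytes is not allowed in Python 3
except TypeError as exc:
    combined = None
    print(f"TypeError: {exc}")

# The two types do not compare equal either; an explicit decode is required.
equal_without_decode = (text == data)
equal_after_decode = (text == data.decode("utf-8"))
print(equal_without_decode, equal_after_decode)  # False True
```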

`str` in Python 3

This is the new, improved string. It is natively a sequence of Unicode characters. It is what unicode was in Python 2.

# Python 3
s = "hello"
print(type(s))     # <class 'str'>
print(len(s))      # 5 (5 characters)
# Non-ASCII characters work perfectly fine in a source file (if saved as UTF-8)
s = "héllo"
print(type(s))     # <class 'str'>
print(len(s))      # 5 (5 characters)
print(repr(s))     # 'héllo'

`bytes` in Python 3

The old str type from Python 2 was renamed to bytes in Python 3. It is, just like before, a sequence of bytes. It's used for raw data like network packets, file contents, or images.

# Python 3
# To create a bytes object, use a 'b' prefix
b = b"hello"
print(type(b))     # <class 'bytes'>
print(len(b))      # 5 (5 bytes)
print(repr(b))     # b'hello'
# Non-ASCII bytes in a bytes object are shown as \x hex escapes
b = b"h\xc3\xa9llo" # The UTF-8 bytes for "héllo"
print(repr(b))     # b'h\xc3\xa9llo'
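
The difference between counting characters and counting bytes can be sketched as follows. This is an illustrative Python 3 example with made-up variable names, assuming UTF-8 as the encoding.

```python
text = "héllo"
data = text.encode("utf-8")

print(len(text))   # 5 characters
print(len(data))   # 6 bytes, because é takes two bytes in UTF-8

# Indexing a str yields a one-character str; indexing bytes yields an int.
second_char = text[1]
second_byte = data[1]
print(repr(second_char))  # 'é'
print(second_byte)        # 195 (0xC3, the first of the two bytes for é)
```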

The Bridge in Python 3: `encode()` and `decode()`

The concept is the same, but the types are cleaner.

str.encode(encoding): Turns a str (text) into bytes.
bytes.decode(encoding): Turns bytes into a str (text).

Example in Python 3:

# A bytes object, encoded in UTF-8
utf8_bytes = b"Caf\xc3\xa9" # Represents "Café" as bytes
# To work with it as text, you must DECODE it
text_string = utf8_bytes.decode('utf-8')
print(text_string)      # Café
print(type(text_string)) # <class 'str'>
# To save it to a file or send it over a network, you must ENCODE it
back_to_bytes = text_string.encode('utf-8')
print(back_to_bytes)    # b'Caf\xc3\xa9'
print(type(back_to_bytes)) # <class 'bytes'>
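
The round trip above works because the same encoding is used in both directions. The following sketch shows what may happen when it is not, and how the standard `errors` argument changes the outcome; the choice of `ascii` as the mismatched codec is only an example.

```python
utf8_bytes = b"Caf\xc3\xa9"

# Decoding with the wrong codec fails loudly by default.
try:
    utf8_bytes.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# errors="replace" substitutes U+FFFD for each undecodable byte.
lossy_text = utf8_bytes.decode("ascii", errors="replace")

# Encoding has the same failure mode in the other direction.
try:
    "Café".encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# On encoding, errors="replace" substitutes "?" for each unencodable character.
lossy_bytes = "Café".encode("ascii", errors="replace")
print(lossy_bytes)  # b'Caf?'
```

Replacing errors loses information, so it is usually better suited to display or logging than to data that will be stored.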

Summary Table

Feature	Python 2	Python 3
Native Text String	`unicode`	`str`
Native Byte String	`str`	`bytes`
What `str()` does	Creates a `str` (byte string).	Creates a `str` (Unicode string).
What `unicode()` does	Creates a `unicode` (text) string.	Removed. Use `str()` instead.
Core Idea	You have two types, `str` and `unicode`, and you must constantly convert between them using `.encode()` and `.decode()`.	You have two distinct types: `str` for text and `bytes` for binary data. Mixing them raises a `TypeError` instead of triggering a silent implicit conversion, which keeps Python 3 code much cleaner.
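Best Practice	Always work with `unicode` objects in your application logic and only encode to `str` at the very last moment (e.g., when writing to a file or sending a network request).	Always work with `str` objects in your application logic and only encode to `bytes` at the very last moment (e.g., when writing to a file or sending a network request).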
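
One way to apply the best-practice row in Python 3 is to let `open()` handle the conversion at the file boundary. The sketch below is only an illustration: the temporary file and the sample text are hypothetical, and UTF-8 is an assumed choice of encoding.

```python
import os
import tempfile

text = "Café 中"

# Hypothetical scratch file used only for this demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    # Text mode: open() encodes on write and decodes on read.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()

    # Binary mode: the same file seen as raw bytes.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

print(round_tripped == text)        # True
print(raw == text.encode("utf-8"))  # True
```

Passing `encoding` explicitly avoids depending on the platform's default encoding, which can differ between systems.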
