杰瑞科技汇

Python str vs. unicode

The difference between str and unicode is a fundamental and often confusing topic in Python, especially for developers who have worked with both Python 2 and Python 3. Let's break it down.

The Core Idea: Bytes vs. Characters

At the heart of the str vs. unicode issue is the distinction between bytes and characters.

  • Bytes: A sequence of 8-bit values (0-255). This is how data is actually stored on your disk or transmitted over a network. It's just raw, numerical data.
  • Characters: An abstract concept, like the letter 'A', the Chinese character '中', or the euro symbol '€'. A character isn't a number; it's an idea.

The job of an encoding (like UTF-8, ASCII, Latin-1) is to translate between these two:

  • Encoding: Translating characters into bytes.
  • Decoding: Translating bytes into characters.
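
To make the two directions concrete, here is a minimal Python 3 sketch. The sample character "é" and the variable names are chosen only for illustration; the point is that the same character maps to different bytes under different encodings, and that decoding must use the encoding that produced the bytes.

```python
# One abstract character, two different byte representations.
text = "é"

utf8_bytes = text.encode("utf-8")       # encoding: characters -> bytes
latin1_bytes = text.encode("latin-1")

print(utf8_bytes)     # b'\xc3\xa9' (two bytes)
print(latin1_bytes)   # b'\xe9'     (one byte)

# Decoding reverses the mapping when the matching encoding is used.
roundtrip_utf8 = utf8_bytes.decode("utf-8")
roundtrip_latin1 = latin1_bytes.decode("latin-1")

# ASCII has no byte for this character, so encoding to ASCII fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

Decoding with the wrong encoding either raises an error or silently produces the wrong characters, which is why the encoding always has to be known.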

The Difference: Python 2 vs. Python 3

This is the most critical point. The meaning of str and unicode changed dramatically between these two versions.


Python 2 (The "Old" Way)

In Python 2, there were two distinct string types:

str (The "Byte String")

  • What it is: A sequence of bytes.
  • Default Encoding: A str carried no encoding information. Whenever Python 2 had to convert a str to unicode implicitly (for example, when the two were concatenated), it assumed ASCII.
  • Problem: You could create a str containing non-ASCII bytes, but Python would have no idea what encoding they were in. Any implicit conversion then failed with a cryptic UnicodeDecodeError or UnicodeEncodeError.
  • Example:
    # This is a byte string. Python 2 doesn't know its encoding.
    my_str = "Hello, world! 你好" 
    # On my system, this is actually a UTF-8 encoded byte string.
    # But Python 2 just sees it as a sequence of bytes.
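
The failure mode can be approximated in Python 3, which never converts implicitly. The sketch below performs by hand the ASCII decode that Python 2 attempted behind the scenes; the variable names are illustrative only.

```python
# Python 3 approximation of what Python 2 did implicitly
# when a byte string had to be treated as text.
raw = "你好".encode("utf-8")   # six bytes, as a UTF-8 file or terminal would supply

try:
    raw.decode("ascii")        # Python 2 made this attempt silently
    outcome = "decoded"
except UnicodeDecodeError as exc:
    outcome = "UnicodeDecodeError"
    print(exc)

# Decoding with the encoding that actually produced the bytes succeeds.
decoded = raw.decode("utf-8")
```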

unicode (The "Unicode String")

  • What it is: A sequence of abstract characters. It's an internal representation that is not tied to any specific encoding.

  • Purpose: To correctly handle text from all languages without ambiguity.

  • How to create: You create a unicode string by decoding a str (byte string) using a specific encoding, or by writing a literal with a u prefix, such as u"hello".

  • Example:

    # my_str is a byte string (let's assume it's UTF-8 encoded)
    my_str = "Hello, world! 你好"
    # To get a proper unicode string, you must DECODE it
    my_unicode = my_str.decode('utf-8')
    print type(my_str)      # <type 'str'>
    print type(my_unicode)  # <type 'unicode'>
    # Now you can do things that require knowing the character, not the bytes
    print len(my_unicode)   # 16 (it counts characters: 'H','e','l','l','o',...,'你','好')

The Golden Rule in Python 2: "Unicode sandwich".

  • The "bread" is your external interface (reading from a file, getting from a network request). This should be bytes (str).
  • The "filling" is all your internal processing. This should be unicode.
  • You decode bytes to unicode when you read them in, and encode unicode back to bytes when you write them out.
    # Python 2 Golden Rule Example
    # 1. Read bytes from a file (the top slice of bread)
    with open('my_file.txt', 'r') as f:
        # f.read() returns a byte string ('str')
        data_from_file = f.read()
    # 2. Decode to unicode for processing (the filling)
    text_data = data_from_file.decode('utf-8')
    # ... do all your text manipulation here with text_data (unicode) ...
    # 3. Encode back to bytes to write or send (the bottom slice of bread)
    data_to_write = text_data.encode('utf-8')
    with open('another_file.txt', 'w') as f:
        f.write(data_to_write)

Python 3 (The "New" Way)

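Python 3 was designed to fix this confusion by making the text vs. bytes distinction explicit: str always holds text, bytes always holds binary data, and the two are never converted implicitly. UTF-8 is the default encoding for source files and for str.encode() and bytes.decode().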

str (The "Text String")

  • What it is: A sequence of abstract characters. This is what Python 2 called unicode.

  • Default Encoding: The default encoding for your source code files is UTF-8. You can now write non-ASCII characters directly in your strings.

  • Purpose: This is the type you should use for all your text processing.

  • Example:

    # This is a text string. It stores characters, not bytes.
    # Python 3 knows this is a string of characters.
    my_str = "Hello, world! 你好"
    print(type(my_str))     # <class 'str'>
    print(len(my_str))      # 16 (counts characters)
    print(my_str[0])        # 'H'

bytes (The "Byte String")

  • What it is: A sequence of bytes. This is what Python 2 called str.

  • Purpose: Used for raw binary data (like images, network packets, or when you need to interface with a legacy system that only works with bytes).

  • How to create: You create a bytes object by encoding a str (text string).

  • Example:

    # my_str is a text string ('str')
    my_str = "Hello, world! 你好"
    # To get a byte string, you must ENCODE it
    my_bytes = my_str.encode('utf-8')
    print(type(my_bytes))   # <class 'bytes'>
    print(my_bytes)         # b'Hello, world! \xe4\xbd\xa0\xe5\xa5\xbd'
    # The \xe4... are the UTF-8 byte representations for '你' and '好'
    # You can also create a bytes literal with a 'b' prefix
    my_bytes_literal = b"Hello, world!"
    print(type(my_bytes_literal)) # <class 'bytes'>
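
As a small sketch of how the two types behave differently on the same content, the snippet below compares lengths and indexing. It reuses the sample string from the examples above; the variable names are illustrative.

```python
text = "Hello, world! 你好"
data = text.encode("utf-8")

print(len(text))   # 16 characters
print(len(data))   # 20 bytes: 14 ASCII bytes plus 3 bytes for each Chinese character

print(text[-1])    # 好 (indexing a str gives a one-character str)
print(data[-1])    # 189 (indexing bytes gives an int, here 0xbd)
print(data[-3:])   # b'\xe5\xa5\xbd' (slicing bytes gives bytes)
```

The difference in len() is the reason text operations such as truncating or counting characters should be done on str, not on the encoded bytes.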

The Golden Rule in Python 3: It's much simpler.

  • Use str for all text.
  • Use bytes for all binary data.
  • Encode str -> bytes when you need to send or store text.
  • Decode bytes -> str when you receive or read text.
    # Python 3 Golden Rule Example
    # 1. Read bytes from a file
    with open('my_file.txt', 'rb') as f: # Note the 'rb' (read bytes)
        data_from_file = f.read() # data_from_file is 'bytes'
    # 2. Decode to text (str) for processing
    text_data = data_from_file.decode('utf-8') # text_data is 'str'
    # ... do all your text manipulation here with text_data ...
    # 3. Encode back to bytes to write
    data_to_write = text_data.encode('utf-8') # data_to_write is 'bytes'
    with open('another_file.txt', 'wb') as f: # Note the 'wb' (write bytes)
        f.write(data_to_write)
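
In practice the decode and encode steps are often delegated to open() itself. The following is a sketch of that alternative, assuming a throwaway file in a temporary directory (the path and variable names are hypothetical). Passing encoding explicitly matters because the default for text-mode open() has historically depended on the platform's locale.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file used only for this sketch.
    path = os.path.join(tmp_dir, "my_file.txt")

    # Text mode for writing: open() encodes str to UTF-8 bytes on the way out.
    with open(path, "w", encoding="utf-8") as f:
        f.write("Hello, world! 你好")

    # Text mode for reading: open() decodes UTF-8 bytes to str on the way in.
    with open(path, "r", encoding="utf-8") as f:
        text_data = f.read()

    # Binary mode shows the encoded bytes that are actually on disk.
    with open(path, "rb") as f:
        raw = f.read()

print(type(text_data))  # <class 'str'>
print(type(raw))        # <class 'bytes'>
```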

Summary Table

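| Feature    | Python 2 str                                       | Python 2 unicode            | Python 3 str                                      | Python 3 bytes                                     |
|------------|----------------------------------------------------|-----------------------------|---------------------------------------------------|----------------------------------------------------|
| Type       | Byte string                                        | Text string                 | Text string                                       | Byte string                                        |
| Content    | Raw bytes (0-255)                                  | Abstract characters         | Abstract characters                               | Raw bytes (0-255)                                  |
| Encoding   | None stored; ASCII assumed in implicit conversions | N/A (it's the decoded form) | N/A (decoded form); source files default to UTF-8 | N/A (it's the encoded form)                        |
| Literal    | "hello"                                            | u"hello"                    | "hello"                                           | b"hello"                                           |
| How to get | From files, network                                | my_str.decode('utf-8')      | my_bytes.decode('utf-8'), text-mode files         | my_str.encode('utf-8'), binary-mode files, network |
| len()      | Counts bytes                                       | Counts characters           | Counts characters                                 | Counts bytes                                       |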

Key Takeaway

  • In Python 3, you should almost always work with str for text. Only use bytes when you are dealing with binary data or need to interface with a low-level system that requires it.
  • The unicode type does not exist in Python 3. It has been replaced by the improved str type.
  • The most common source of errors in Python 2 was mixing str and unicode without proper encoding/decoding. Python 3's design makes this much harder to do by accident.
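
A short sketch of that last point: in Python 3, combining str and bytes raises a TypeError immediately instead of attempting a hidden conversion. The sample values are illustrative only.

```python
text = "price: "
data = "€5".encode("utf-8")   # bytes, as they might arrive from a socket or file

try:
    combined = text + data    # str + bytes is rejected outright
    mixed_ok = True
except TypeError as exc:
    mixed_ok = False
    print("TypeError:", exc)

# The fix is an explicit decode before combining.
combined = text + data.decode("utf-8")
print(combined)   # price: €5
```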