杰瑞科技汇

python byte string

Of course! Let's dive deep into Python's byte strings. This is a fundamental topic for anyone working with data that isn't plain text, such as network protocols, file I/O, or cryptography.

What is a Byte String?

In Python, a byte string is a sequence of bytes. It's an immutable sequence of integers in the range 0 <= x < 256. The key takeaway is that bytes are raw, 8-bit values. They don't inherently have an encoding like text does.

Think of it like this:

  • str (String): A sequence of Unicode characters (e.g., 'H', 'e', 'l', 'l', 'o'). To store it or send it over a network, it must be encoded into bytes.
  • bytes (Byte String): A sequence of raw bytes (e.g., 72, 101, 108, 108, 111). It's the machine-readable representation of data.

Creating Byte Strings

There are several ways to create a bytes object.

a) From a Literal

You can create a byte string using a literal by prefixing a string with b. The characters you can use are limited to ASCII characters.

# Using a literal
data = b'hello world'
print(data)
# Output: b'hello world'
# Each element is an integer representing a byte
print(data[0])
# Output: 104  (which is 'h' in ASCII)
print(list(data))
# Output: [104, 101, 108, 108, 111]

Important: You cannot use non-ASCII characters directly in a byte literal.

# This will cause a SyntaxError
# data = b'café' 

b) From an Iterable of Integers

You can create a byte string from any iterable of integers where each integer is between 0 and 255.

# From a list of integers
byte_data = bytes([65, 66, 67, 255])
print(byte_data)
# Output: b'ABC\xff'  # \xff is the hex representation for 255
# From a range
byte_range = bytes(range(65, 68)) # 65, 66, 67
print(byte_range)
# Output: b'ABC'

c) From an Existing String (Encoding)

This is the most common way you'll create byte strings: by encoding a str object. You must specify an encoding (like UTF-8, ASCII, etc.).

text_string = "Hello, 世界!" # A Unicode string
# Encode the string into bytes using UTF-8 encoding
utf8_bytes = text_string.encode('utf-8')
print(utf8_bytes)
# Output: b'Hello, \xe4\xb8\x96\xe7\x95\x8c!'
# Encode using a different encoding, like Latin-1 (ISO-8859-1)
latin1_bytes = text_string.encode('latin-1')
print(latin1_bytes)
# Output: b'Hello, \xa4\x96\x8c!'

Converting Back to Strings (Decoding)

To process a byte string as text, you must decode it into a str object. Again, you must know the correct encoding.

# The byte string from the previous example
utf8_bytes = b'Hello, \xe4\xb8\x96\xe7\x95\x8c!'
# Decode the bytes back into a string
text_string = utf8_bytes.decode('utf-8')
print(text_string)
# Output: Hello, 世界!
# Decoding with the wrong encoding leads to errors or "mojibake"
try:
    # This will fail because the bytes are not valid Latin-1
    wrong_decode = utf8_bytes.decode('latin-1')
except UnicodeDecodeError as e:
    print(f"Error: {e}")
    # Output: Error: 'utf-8' codec can't decode byte 0xe4 in position 7: invalid continuation byte

Key Operations and Methods

bytes objects are very similar to str objects. They support many of the same operations.

Slicing and Indexing

data = b'hello world'
print(data[1:4])
# Output: b'ell'
print(data[0])
# Output: 104

Concatenation and Repetition

b1 = b'hello '
b2 = b'world'
print(b1 + b2)
# Output: b'hello world'
print(b1 * 2)
# Output: b'hello hello '

Common Methods

data = b'  Hello World  '
# Find a byte (by integer or ASCII character)
print(data.find(b'W')) # Find the byte for 'W'
# Output: 7
# Startswith and Endswith
print(data.startswith(b'Hello'))
# Output: False (because of the leading spaces)
print(data.endswith(b'World'))
# Output: False (because of the trailing spaces)
# Strip, Lower, Upper
print(data.strip())
# Output: b'Hello World'
print(data.lower())
# Output: b'  hello world  '
# Count occurrences
print(data.count(b'l'))
# Output: 3

Mutable Counterpart: bytearray

Sometimes you need a mutable sequence of bytes. For this, Python provides the bytearray type. It has almost all the methods of bytes but also allows in-place modifications.

# Create a bytearray
ba = bytearray(b'hello')
print(ba)
# Output: bytearray(b'hello')
# Change an item in place
ba[0] = 74 # J in ASCII
print(ba)
# Output: bytearray(b'jello')
# Append a new byte
ba.append(33) # ! in ASCII
print(ba)
# Output: bytearray(b'jello!')
# You can also slice and assign
ba[1:3] = b'ea'
print(ba)
# Output: bytearray(b'jeallo!')

Practical Use Cases

Understanding when to use bytes is crucial.

Network Programming

When you send data over a network (e.g., using the socket library), you are sending raw bytes.

# This is a conceptual example
import socket
host = 'example.com'
port = 80
# Create a socket
# (In a real app, you'd use 'with' for proper cleanup)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# The HTTP request must be sent as bytes
# Note the '\r\n' line endings are part of the HTTP protocol
http_request = (
    f"GET / HTTP/1.1\r\n"
    f"Host: {host}\r\n"
    f"Connection: close\r\n"
    f"\r\n"
).encode('utf-8')
s.connect((host, port))
s.sendall(http_request)
# The response you receive will also be in bytes
response = s.recv(4096)
print(response) # This is a bytes object
s.close()

File I/O (Binary Mode)

When reading or writing non-text files (images, videos, executables, etc.), you must open the file in binary mode ('rb' for read binary, 'wb' for write binary).

# Writing to a binary file
data_to_write = b'\x89PNG\r\n\x1a\n' # The first few bytes of a PNG file
with open('image.png', 'wb') as f:
    f.write(data_to_write)
# Reading from a binary file
with open('image.png', 'rb') as f:
    read_data = f.read()
    print(read_data)
    # Output: b'\x89PNG\r\n\x1a\n'

Cryptography and Hashing

Cryptographic libraries work with raw binary data.

import hashlib
message = "A secret message".encode('utf-8') # Must be bytes
sha256_hash = hashlib.sha256(message).hexdigest()
print(f"SHA-256 Hash: {sha256_hash}")
# Output: SHA-256 Hash: 3a6eb0790f39ac87c94f3856b2dd2c5d110e6811602261a9a923d3bb231658b

Summary Table: str vs. bytes

Feature str (String) bytes (Byte String)
Purpose Human-readable text Raw binary data (machine-readable)
Content Unicode characters Integers from 0 to 255
Encoding Has an encoding (e.g., UTF-8) No inherent encoding. It is the encoded form.
Creation 'hello', "world" b'hello', bytes([72, 101]), 'hello'.encode('utf-8')
Conversion .encode(encoding) .decode(encoding)
Mutability Immutable Immutable (but bytearray is a mutable alternative)
Use Cases Displaying text, user input Network, files, cryptography, images
分享:
扫描分享到社交APP
上一篇
下一篇