杰瑞科技汇

Python Concurrency: How Do You Choose the Right Model?

"Python concurrency" refers to the set of tools and techniques for writing Python programs that perform multiple operations at once. This is crucial for improving performance, especially for I/O-bound tasks (like network requests or file operations), and it can also help with CPU-bound tasks (like complex calculations).


Let's break it down into the key concepts and modules.

The Core Concept: The Global Interpreter Lock (GIL)

Before diving into the tools, you must understand the GIL.

  • What it is: The GIL is a mutex (a lock) that protects access to Python objects, preventing multiple native threads from executing Python bytecode at the same time within a single process.
  • The Implication: For CPU-bound tasks (tasks that are heavy on computation), Python's threading module might not provide a performance boost because only one thread can execute Python code at a time. The GIL acts as a bottleneck.
  • The Exception: The GIL is released during I/O operations (like waiting for a network response or reading a file). This makes threading very effective for I/O-bound tasks.

Because of the GIL, Python uses different tools for different types of concurrency problems:

  1. For I/O-Bound Tasks: Use Threading.
  2. For CPU-Bound Tasks: Use Multiprocessing.
  3. For High-Concurrency I/O: Use asyncio (with async/await syntax).
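To see the GIL's effect for yourself, here is a minimal (and deliberately unscientific) timing sketch: a pure-Python countdown run twice sequentially versus once in each of two threads. The function and iteration count are made up for illustration; on CPython, the threaded version is typically no faster.

```python
import threading
import time

def count_down(n):
    """Pure-Python busy loop: CPU-bound, so the GIL serializes it."""
    while n > 0:
        n -= 1

N = 2_000_000

# Sequential: two calls back to back.
start = time.perf_counter()
count_down(N)
count_down(N)
sequential = time.perf_counter() - start

# Threaded: the same work split across two threads.
start = time.perf_counter()
t1 = threading.Thread(target=count_down, args=(N,))
t2 = threading.Thread(target=count_down, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
# On CPython the threaded run is usually no faster (often slightly
# slower, due to lock contention): that is the GIL at work.
```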

Threading (for I/O-Bound Tasks)

Threading is used when your program spends most of its time waiting. For example, a web scraper that needs to make many network requests. While one thread is waiting for a response, another thread can make a new request.


Key Idea: Run multiple threads within a single process. They share memory, which is great for data sharing but requires careful synchronization (using Lock, Queue, etc.).
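As a quick illustration of that synchronization requirement, here is a minimal sketch using threading.Lock to protect a shared counter (the counter and thread count are arbitrary). Without the lock, `counter += 1` is a read-modify-write that two threads can interleave, silently losing updates.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        # The lock makes the read-modify-write atomic with respect
        # to the other threads.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000 -- always correct with the lock held
```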

Example: Web Scraping with concurrent.futures

The concurrent.futures module provides a high-level interface for asynchronously executing callables. ThreadPoolExecutor is the perfect tool for I/O-bound tasks.

import requests
import concurrent.futures
import time
def fetch_url(url):
    """Fetches the content of a URL and returns the URL and status code."""
    try:
        response = requests.get(url, timeout=5)
        return url, response.status_code
    except requests.RequestException as e:
        return url, str(e)
urls = [
    "https://www.python.org",
    "https://www.google.com",
    "https://www.github.com",
    "https://www.nonexistent-website-12345.com",
    "https://www.stackoverflow.com"
]
# Using a ThreadPoolExecutor to fetch URLs concurrently
start_time = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # map() returns results in the same order as the inputs
    results = list(executor.map(fetch_url, urls))
end_time = time.time()
print("--- Results ---")
for url, status in results:
    print(f"{url}: {status}")
print(f"\nTotal time taken: {end_time - start_time:.2f} seconds")

Why is this faster? If you ran these requests sequentially, you'd have to wait for each one to complete before starting the next. With threading, while one request is "in flight" (waiting for the server), the other threads are working on other requests.
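One detail worth knowing: executor.map() returns results in input order. If you would rather handle each result as soon as its task finishes, concurrent.futures.as_completed is the usual alternative. A dependency-free sketch, with time.sleep standing in for network latency (the fake_fetch helper and its delays are invented for illustration):

```python
import concurrent.futures
import time

def fake_fetch(name, delay):
    """Simulates an I/O-bound call; the sleep releases the GIL."""
    time.sleep(delay)
    return name, delay

jobs = [("slow", 0.3), ("medium", 0.2), ("fast", 0.1)]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    futures = {executor.submit(fake_fetch, n, d): n for n, d in jobs}
    # as_completed yields futures in completion order, not submission order.
    finished = [f.result()[0] for f in concurrent.futures.as_completed(futures)]

print(finished)  # fastest first: ['fast', 'medium', 'slow']
```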


Multiprocessing (for CPU-Bound Tasks)

Multiprocessing gets around the GIL by creating separate processes, each with its own Python interpreter and memory space. This allows for true parallel execution on multi-core CPUs.


Key Idea: Run multiple processes. Each process has its own memory, so data sharing is more complex (requires Queue, Pipe, or Manager). This is the go-to for heavy calculations.
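To make the data-sharing point concrete, here is a minimal sketch of cross-process communication with multiprocessing.Queue. The square_worker function is invented for illustration, and the fork start method keeps the example short on POSIX systems (on Windows you would use spawn, and then all top-level code must sit behind an `if __name__ == "__main__":` guard):

```python
import multiprocessing

def square_worker(task_queue, result_queue):
    """Pulls numbers from one queue, pushes their squares onto another."""
    while True:
        n = task_queue.get()
        if n is None:          # sentinel: no more work
            break
        result_queue.put(n * n)

# "fork" avoids re-importing this module in the child (POSIX-only).
ctx = multiprocessing.get_context("fork")
task_queue, result_queue = ctx.Queue(), ctx.Queue()

worker = ctx.Process(target=square_worker, args=(task_queue, result_queue))
worker.start()

for n in [1, 2, 3, 4]:
    task_queue.put(n)
task_queue.put(None)           # tell the worker to stop
worker.join()

squares = sorted(result_queue.get() for _ in range(4))
print(squares)  # [1, 4, 9, 16]
```

Note how even this toy example needs explicit queues and a shutdown sentinel; with threads, the same data could simply be shared in memory.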

Example: Parallel Image Processing

Let's say we have a list of image files and we want to apply a filter to each one. This is a CPU-bound task.

import os
import time
from concurrent.futures import ProcessPoolExecutor
from PIL import Image  # Requires Pillow: pip install Pillow

def apply_grayscale(image_path):
    """Applies a grayscale filter to an image -- a CPU-bound operation."""
    try:
        with Image.open(image_path) as img:
            img_gray = img.convert("L")
            output_path = f"gray_{os.path.basename(image_path)}"
            img_gray.save(output_path)
            return f"Processed {image_path} -> {output_path}"
    except Exception as e:
        return f"Error processing {image_path}: {e}"

# The guard is required for multiprocessing: on Windows and macOS, new
# processes re-import this module, and unguarded top-level code would
# run again in every child.
if __name__ == "__main__":
    # Create some dummy image files for the example
    os.makedirs("images", exist_ok=True)
    image_files = [f"images/image_{i}.png" for i in range(5)]
    for path in image_files:
        Image.new("RGB", (100, 100), color="red").save(path)

    start_time = time.time()
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(apply_grayscale, image_files))
    end_time = time.time()

    print("--- Results ---")
    for result in results:
        print(result)
    print(f"\nTotal time taken: {end_time - start_time:.2f} seconds")

    # Clean up the generated files
    for f in image_files:
        gray_path = f"gray_{os.path.basename(f)}"
        if os.path.exists(gray_path):
            os.remove(gray_path)

Why is this faster? The apply_grayscale function is CPU-intensive. By using separate processes, the work can be distributed across multiple CPU cores, and each core can work on a different image simultaneously. Threading would be ineffective here due to the GIL.


Asyncio (for I/O-Bound Tasks with High Concurrency)

Asyncio is a different paradigm. Instead of using threads or processes, it uses a single thread and an event loop to manage many "tasks." When a task performs an I/O operation (like await an_http_request()), it yields control back to the event loop, allowing other tasks to run.

Key Idea: Cooperative multitasking. Tasks must explicitly yield control using await. This is extremely efficient for handling thousands of concurrent I/O connections (e.g., a web server, chat app).
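Before adding any third-party library, the cooperative model can be seen with nothing but the standard library: three tasks that each "wait" 0.1 seconds finish in roughly 0.1 seconds total, because every await asyncio.sleep() hands control back to the event loop so the other tasks can run.

```python
import asyncio
import time

async def task(name, delay):
    # await yields control to the event loop while "waiting",
    # letting the other tasks make progress on the same thread.
    await asyncio.sleep(delay)
    return name

async def main():
    start = time.perf_counter()
    # Three 0.1s waits run concurrently on ONE thread.
    names = await asyncio.gather(task("a", 0.1), task("b", 0.1), task("c", 0.1))
    elapsed = time.perf_counter() - start
    return names, elapsed

names, elapsed = asyncio.run(main())
print(names, f"{elapsed:.2f}s")  # finishes in ~0.1s, not 0.3s
```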

Example: Fetching URLs with asyncio and aiohttp

This is the modern, high-performance way to do I/O concurrency in Python.

import asyncio
import aiohttp
import time
async def fetch_url_async(session, url):
    """Asynchronously fetches a URL."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as response:
            return url, response.status
    except Exception as e:
        return url, str(e)
async def main():
    urls = [
        "https://www.python.org",
        "https://www.google.com",
        "https://www.github.com",
        "https://www.nonexistent-website-12345.com",
        "https://www.stackoverflow.com"
    ]
    start_time = time.time()
    # aiohttp requires a ClientSession
    async with aiohttp.ClientSession() as session:
        # Create a list of tasks to run concurrently
        tasks = [fetch_url_async(session, url) for url in urls]
        # asyncio.gather runs all tasks concurrently and waits for them all to finish
        results = await asyncio.gather(*tasks)
    end_time = time.time()
    print("--- Results ---")
    for url, status in results:
        print(f"{url}: {status}")
    print(f"\nTotal time taken: {end_time - start_time:.2f} seconds")
# Run the main async function
asyncio.run(main())

Summary: When to Use What

  • Threading: best for I/O-bound work (network, disk). Multiple threads in one process; the GIL is released during I/O. Memory overhead: low (threads share memory). Data sharing: easy, but needs locks.
  • Multiprocessing: best for CPU-bound work (math, video processing). Multiple processes, each with its own interpreter and memory. Memory overhead: high (each process has its own memory space). Data sharing: harder (requires queues or a manager).
  • Asyncio: best for high-concurrency I/O (web servers, thousands of connections). A single thread running an event loop with cooperative tasks (await). Memory overhead: very low. Data sharing: trivial, since everything lives in one thread.

A Note on asyncio vs. Threading

For simple I/O tasks like the example above, both ThreadPoolExecutor and asyncio can be very fast. The choice often comes down to:

  • Simplicity: ThreadPoolExecutor can be simpler if you're not already using async/await.
  • Scale: asyncio can handle a much larger number of concurrent connections (e.g., 10,000) because it doesn't have the overhead of creating and managing thousands of OS threads.
  • Ecosystem: Modern web frameworks like FastAPI and Django Channels are built on asyncio. If you're using them, you should use async functions for your I/O.
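One practical bridge between the two worlds: if you are inside asyncio but need to call a blocking function that has no async equivalent, asyncio.to_thread() (Python 3.9+) runs it in a worker thread without stalling the event loop. A small sketch, with time.sleep standing in for the blocking call:

```python
import asyncio
import time

def blocking_io(n):
    """Stand-in for a blocking call with no async equivalent."""
    time.sleep(0.1)
    return n * 2

async def main():
    # Each blocking call runs in a thread from the default executor,
    # so the event loop stays free to schedule other tasks.
    return await asyncio.gather(*(asyncio.to_thread(blocking_io, n) for n in range(3)))

results = asyncio.run(main())
print(results)  # [0, 2, 4]
```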