
Python Shared Arrays

Let's dive into the concept of "shared arrays" in Python.


What is a Shared Array?

A shared array is a data structure that allows multiple independent processes to read from and write to the same block of memory. This differs from passing a standard Python list or NumPy array between processes, where each process ends up with its own copy of the data. Copying large arrays is slow and memory-intensive.

The primary use case for shared arrays is inter-process communication (IPC), especially when using Python's multiprocessing module to perform parallel or concurrent computations.


The Challenge with multiprocessing

When you use multiprocessing.Process, each process runs in its own memory space. If you pass a large NumPy array to a child process, the entire array is copied (pickled) and sent over to the child process. This is known as "pass-by-value".

Example of the problem:

import numpy as np
import multiprocessing as mp
def worker(arr):
    # This function modifies a COPY of the array, not the original
    arr[0] = 999
    print(f"Inside process, arr[0] is: {arr[0]}")
if __name__ == "__main__":
    # Create a large array
    large_array = np.arange(1000000)
    print(f"Initial value in main: {large_array[0]}")
    p = mp.Process(target=worker, args=(large_array,))
    p.start()
    p.join()
    print(f"Value in main after process: {large_array[0]}")

Output:

Initial value in main: 0
Inside process, arr[0] is: 999
Value in main after process: 0  # <-- The original array was NOT changed!

As you can see, the change made by the child process (999) is lost because the child worked on a copy.
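To see why this copy is also costly, you can measure the pickled payload directly. This is a minimal sketch using the standard pickle module; multiprocessing serializes arguments the same way before sending them to the child:

```python
import pickle

import numpy as np

# Pickling an array serializes its entire data buffer, which is
# essentially what multiprocessing sends to the child process.
arr = np.arange(1_000_000)
payload = pickle.dumps(arr)
print(arr.nbytes)    # size of the raw data buffer (8 MB for int64 on 64-bit)
print(len(payload))  # pickled payload: the whole buffer plus a small header
```

For a million-element array this means megabytes serialized, transferred, and deserialized on every process start.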


Solution 1: multiprocessing.Array (The Built-in)

Python's multiprocessing module provides its own Array type. It's a low-level, C-style array that can be shared.

Key Features:

  • Low-Level: It's a C-style array, so you must specify a typecode for the data type ('i' for a C signed int, 'd' for a C double, 'f' for a C float, etc.).
  • Locking: Its optional lock parameter (on by default) guards against race conditions when multiple processes write simultaneously.
  • NumPy interop: You can wrap the array in a NumPy view, which is very efficient because no data is copied.

How to Use It:

import numpy as np
import multiprocessing as mp
def worker(shared_array):
    # Get a NumPy view of the shared array
    arr_np = np.ctypeslib.as_array(shared_array.get_obj())
    # Now we can modify the original memory
    arr_np[0] = 999
    print(f"Inside process, arr[0] is: {arr_np[0]}")
if __name__ == "__main__":
    # 'i' is the typecode for a C signed int
    # ('d' would be a C double, 'f' a C float)
    shared_array = mp.Array('i', [0] * 5) 
    print(f"Initial values in main: {list(shared_array)}")
    p = mp.Process(target=worker, args=(shared_array,))
    p.start()
    p.join()
    print(f"Values in main after process: {list(shared_array)}")

Output:

Initial values in main: [0, 0, 0, 0, 0]
Inside process, arr[0] is: 999
Values in main after process: [999, 0, 0, 0, 0] # <-- SUCCESS! The original was changed.

Pros and Cons of multiprocessing.Array

  • Pros:
    • Built into the standard library (no extra installation needed).
    • Very efficient for basic data types.
    • Backed by ctypes, so the underlying buffer is a plain C array that C extensions can work with directly.
  • Cons:
    • Low-level and less flexible than NumPy.
    • Requires you to manage data types manually.
    • Creating a NumPy view (np.ctypeslib.as_array) adds a small overhead, but it's much better than a full copy.
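The lock mentioned above matters for compound operations: with the default lock=True, each individual element access is protected, but a read-modify-write like arr[0] += 1 is still two separate accesses. A minimal sketch of holding the lock explicitly with get_lock():

```python
import multiprocessing as mp

def add_many(shared_array, n):
    # Hold the array's lock across the whole read-modify-write,
    # so increments from different processes cannot interleave
    for _ in range(n):
        with shared_array.get_lock():
            shared_array[0] += 1

if __name__ == "__main__":
    shared_array = mp.Array('i', [0])  # lock=True is the default
    workers = [mp.Process(target=add_many, args=(shared_array, 1000))
               for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(shared_array[0])  # 4000: no increments were lost
```

Without the explicit `with shared_array.get_lock():`, some of the 4000 increments would typically be lost to races.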

Solution 2: multiprocessing.shared_memory (The Modern Approach)

Python 3.8 introduced the multiprocessing.shared_memory module. This is a more powerful and flexible way to share memory. It's designed to work seamlessly with NumPy arrays.

Key Features:

  • NumPy Native: It's the preferred way to share NumPy arrays.
  • Flexible: You can create a shared memory block (SharedMemory) and then attach NumPy arrays to it. You can even have multiple arrays point to the same block.
  • No Locking: It does not provide built-in locking. You are responsible for synchronization (e.g., using multiprocessing.Lock if needed).
  • Cleanup: It's crucial to explicitly close() and unlink() shared memory blocks to avoid memory leaks.
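As a sketch of that flexibility, two NumPy arrays with different dtypes can be attached to the same block; both are views over the same bytes. The explicit '<i8' dtype here is an assumption to pin a little-endian int64 layout regardless of platform:

```python
from multiprocessing import shared_memory

import numpy as np

# One 16-byte block, viewed two ways
shm = shared_memory.SharedMemory(create=True, size=16)
as_ints = np.ndarray((2,), dtype='<i8', buffer=shm.buf)       # two little-endian int64s
as_bytes = np.ndarray((16,), dtype=np.uint8, buffer=shm.buf)  # the same 16 raw bytes

as_ints[:] = [1, 2]
first_byte = int(as_bytes[0])  # low byte of the first int64
print(first_byte)  # 1

# Drop the views before releasing the block, then clean up
del as_ints, as_bytes
shm.close()
shm.unlink()
```

Note the `del` before `close()`: closing the block while NumPy views still reference `shm.buf` raises a BufferError.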

How to Use It:

import numpy as np
import multiprocessing as mp
from multiprocessing import shared_memory
def worker(shm_name, shape, dtype):
    # Attach to the existing shared memory block using its name
    shm = shared_memory.SharedMemory(name=shm_name)
    # Create a NumPy array backed by the shared memory block
    shared_arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    # Modify the original memory
    shared_arr[0] = 999
    print(f"Inside process, arr[0] is: {shared_arr[0]}")
    # IMPORTANT: Drop the view, then close the connection in the child
    del shared_arr
    shm.close()
if __name__ == "__main__":
    # Create a NumPy array in the main process
    arr = np.arange(10)
    print(f"Initial values in main: {arr}")
    # Allocate a shared memory block big enough for the array
    # (allocation alone does not copy any data)
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
    # Create a NumPy view of the shared memory
    shared_arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
    # Copy the data from the original array into the shared memory view
    np.copyto(shared_arr, arr)
    # Create a process, passing the name of the shared memory block
    p = mp.Process(target=worker, args=(shm.name, arr.shape, arr.dtype))
    p.start()
    p.join()
    # The main process can now see the changes
    print(f"Values in main after process: {shared_arr}")
    # CRITICAL: Clean up the shared memory block
    del shared_arr  # drop the view before closing the buffer
    shm.close()
    shm.unlink() # Release the shared memory back to the OS

Output:

Initial values in main: [0 1 2 3 4 5 6 7 8 9]
Inside process, arr[0] is: 999
Values in main after process: [999   1   2   3   4   5   6   7   8   9] # <-- SUCCESS!

Pros and Cons of multiprocessing.shared_memory

  • Pros:
    • The modern, recommended way for sharing NumPy arrays.
    • Extremely efficient and flexible.
    • Avoids the C-style array limitations.
  • Cons:
    • Requires manual cleanup (close() and unlink()), which can be error-prone.
    • No built-in locking, requiring you to implement your own synchronization.
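If the manual close()/unlink() bookkeeping worries you, the standard library also provides multiprocessing.managers.SharedMemoryManager (Python 3.8+), which unlinks every block it created when its context exits. A minimal sketch:

```python
from multiprocessing.managers import SharedMemoryManager

import numpy as np

with SharedMemoryManager() as smm:
    # Blocks created through the manager are tracked for cleanup
    shm = smm.SharedMemory(size=10 * 8)
    arr = np.ndarray((10,), dtype=np.int64, buffer=shm.buf)
    arr[:] = 0
    arr[0] = 999
    result = int(arr[0])  # read the value out before the block is released
    del arr               # drop the view before the manager shuts down

print(result)  # 999
# On exit, the manager has already unlinked the block for us
```

This trades a little setup for not having to remember unlink() on every exit path.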

Solution 3: multiprocessing.Manager (The Easiest but Slowest)

The multiprocessing.Manager provides a way to create a server process that manages shared objects (like lists, dictionaries, and even custom classes). Other processes can use proxies to access these objects.

How to Use It:

import multiprocessing as mp
def worker(shared_list):
    # Modify the shared list through the proxy
    shared_list[0] = 999
    print(f"Inside process, list[0] is: {shared_list[0]}")
if __name__ == "__main__":
    with mp.Manager() as manager:
        # Create a shared list using the manager
        shared_list = manager.list([0] * 5)
        print(f"Initial values in main: {list(shared_list)}")
        p = mp.Process(target=worker, args=(shared_list,))
        p.start()
        p.join()
        print(f"Values in main after process: {list(shared_list)}")

Output:

Initial values in main: [0, 0, 0, 0, 0]
Inside process, list[0] is: 999
Values in main after process: [999, 0, 0, 0, 0] # <-- SUCCESS!

Pros and Cons of multiprocessing.Manager

  • Pros:
    • Very easy to use with familiar Python objects (lists, dicts).
    • Built-in synchronization (the manager handles locking).
  • Cons:
    • Very slow. Every access to the shared object (e.g., shared_list[0]) involves a network-like call to the manager process. This makes it unsuitable for performance-critical, high-frequency data access.
    • Can become a bottleneck if many processes are constantly accessing the shared object.
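One way to soften that per-access cost, sketched below, is to batch: copy the proxy's contents into a local list, do the work locally, then write everything back with a single slice assignment instead of many per-element proxy calls:

```python
import multiprocessing as mp

def bump_all(shared_list):
    # One proxy round-trip to read everything...
    local = list(shared_list)
    local = [x + 1 for x in local]
    # ...and one proxy round-trip to write everything back
    shared_list[:] = local

if __name__ == "__main__":
    with mp.Manager() as manager:
        shared_list = manager.list([0, 1, 2])
        p = mp.Process(target=bump_all, args=(shared_list,))
        p.start()
        p.join()
        print(list(shared_list))  # [1, 2, 3]
```

Two manager round-trips instead of one per element makes a large difference for big lists, though the data is still copied both ways.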

Summary and Recommendation

  • multiprocessing.Array
    • Best for: simple, fixed-type C-style arrays.
    • Performance: High. Ease of use: Medium.
    • Key considerations: low-level; requires manual type management.
  • multiprocessing.shared_memory
    • Best for: sharing NumPy arrays (recommended for modern Python).
    • Performance: Very High. Ease of use: Medium.
    • Key considerations: manual cleanup (unlink) is essential; no built-in locks.
  • multiprocessing.Manager
    • Best for: sharing complex Python objects (lists, dicts) with minimal code.
    • Performance: Low. Ease of use: Very High.
    • Key considerations: can become a major performance bottleneck.

Final Recommendation:

  • If you are using Python 3.8+ and need to share NumPy arrays, multiprocessing.shared_memory is the best and most efficient choice.
  • If you need to share simple, fixed-size arrays and want to avoid dependencies, multiprocessing.Array is a solid, built-in option.
  • If you need to share complex Python objects (like a list of dictionaries) and performance is not the primary concern, multiprocessing.Manager is the easiest to implement.