
How can Python's gevent process PDFs efficiently?

Let's dive into using Python's gevent library to work with PDFs. This is a powerful combination, especially for I/O-bound tasks.


The Core Idea: When to Use gevent with PDFs

gevent is a concurrency library that uses "greenlets" to handle thousands of lightweight tasks. It excels at applications that are I/O-bound, meaning they spend most of their time waiting for external operations to complete (like network requests, database calls, or reading/writing files).
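The basic pattern is spawn-then-join. Here is a minimal sketch (greet and the names are placeholders, and gevent.sleep(0) stands in for an I/O wait):

```python
import gevent

def greet(name):
    # gevent.sleep(0) yields to the scheduler, standing in for an I/O wait
    gevent.sleep(0)
    return f"hello, {name}"

# spawn() starts each greenlet; joinall() waits for all of them to finish
jobs = [gevent.spawn(greet, n) for n in ("alice", "bob")]
gevent.joinall(jobs)
results = [job.value for job in jobs]
print(results)  # → ['hello, alice', 'hello, bob']
```

Each greenlet's return value is available afterwards on its value attribute.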

PDF processing can fall into two categories:

  1. I/O-Bound (A perfect fit for gevent):

    • Splitting a large PDF: Reading a big PDF file and splitting it into multiple smaller PDF files. The main work is reading/writing chunks of data from the disk.
    • Merging many PDFs: Taking a list of many small PDF files and merging them into one. The overhead is opening, reading, and closing each file.
    • Watermarking multiple PDFs: Applying a watermark to a large batch of PDF files. Each watermarking operation is independent and involves file I/O.
    • Converting PDFs to images/text: Running a command-line tool (like pdftoppm or pdftotext) on many PDFs in parallel.
  2. CPU-Bound (A poor fit for gevent):

    • OCR (Optical Character Recognition): Extracting text from a scanned image-based PDF is extremely CPU-intensive. gevent won't speed this up because it can't make a single CPU core work faster on one task. For this, you'd use the multiprocessing module.
    • Complex PDF rendering or analysis: Deeply parsing a complex PDF structure might also be CPU-heavy.

The Golden Rule: Use gevent when your bottleneck is waiting for the disk, network, or another process. Use multiprocessing when your bottleneck is the CPU itself.
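To see why overlapping waits matters, here is a small self-contained demo that simulates I/O with gevent.sleep (no real PDFs needed; the 0.2 s delay is arbitrary):

```python
import time
import gevent

def fake_io_task(n):
    # gevent.sleep yields while "waiting", just like a patched blocking call
    gevent.sleep(0.2)
    return n

start = time.time()
jobs = [gevent.spawn(fake_io_task, i) for i in range(10)]
gevent.joinall(jobs)
elapsed = time.time() - start
# the ten 0.2 s waits overlap, so this is roughly 0.2 s, not 2 s
print(f"elapsed: {elapsed:.2f}s")
```

Run sequentially, the same ten waits would take about two seconds; under gevent they all overlap.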


Example 1: Splitting a Large PDF (I/O-Bound)

This is a classic use case. Imagine you have a 500-page PDF report and you want to split it into 5 files of 100 pages each. We can do this concurrently.

We'll use pdfrw, a lightweight, pure-Python library for manipulating PDFs (PyPDF2 and PyMuPDF are common alternatives).

Step 1: Install Libraries

pip install gevent pdfrw

Step 2: The Python Code

This script will take a large PDF and split it into chunks. Each chunk is processed in its own greenlet.

import gevent
from gevent import monkey
# Monkey-patch the standard library as early as possible, before other
# imports, so cooperative versions of socket, time, etc. are in place
monkey.patch_all()
import os
from pdfrw import PdfReader, PdfWriter
def split_pdf(input_path, output_dir, num_chunks):
    """
    Splits a single PDF into multiple chunks concurrently using gevent.
    """
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    print(f"Reading input PDF: {input_path}")
    pdf_reader = PdfReader(input_path)
    total_pages = len(pdf_reader.pages)
    print(f"Total pages found: {total_pages}")
    if num_chunks <= 0:
        raise ValueError("Number of chunks must be positive.")
    if num_chunks > total_pages:
        print(f"Warning: Requested more chunks ({num_chunks}) than pages ({total_pages}).")
        num_chunks = total_pages
    pages_per_chunk = total_pages // num_chunks
    remainder_pages = total_pages % num_chunks
    jobs = []
    page_start = 0
    for i in range(num_chunks):
        # Calculate the number of pages for this chunk
        pages_in_this_chunk = pages_per_chunk
        if i < remainder_pages:
            pages_in_this_chunk += 1
        page_end = page_start + pages_in_this_chunk
        # Create a unique output path for each chunk
        output_path = os.path.join(output_dir, f"chunk_{i+1}.pdf")
        # Create a job (greenlet) for this chunk
        job = gevent.spawn(process_chunk, input_path, output_path, page_start, page_end)
        jobs.append(job)
        print(f"Queued job to create {output_path} (pages {page_start + 1} to {page_end})")
        page_start = page_end
    # Wait for all greenlets to finish
    gevent.joinall(jobs)
    print("\nAll chunks processed successfully!")
def process_chunk(input_path, output_path, page_start, page_end):
    """
    Processes a single chunk of the PDF.
    This is an I/O-bound task.
    """
    print(f"-> Processing {output_path}...")
    try:
        # Read the full PDF (this is I/O)
        pdf_reader = PdfReader(input_path)
        # Create a new PDF writer
        pdf_writer = PdfWriter()
        # Add the specified pages to the writer
        for i in range(page_start, page_end):
            pdf_writer.addpage(pdf_reader.pages[i])
        # Write the output file (this is also I/O)
        pdf_writer.write(output_path)
        print(f"-> Successfully wrote {output_path}")
    except Exception as e:
        print(f"!! Error processing {output_path}: {e}")
if __name__ == "__main__":
    # Create a dummy large PDF for testing if you don't have one
    # For this example, assume 'large_report.pdf' exists
    INPUT_PDF = 'large_report.pdf'
    OUTPUT_DIR = 'split_pdfs'
    NUM_CHUNKS = 5
    # Check if input file exists
    if not os.path.exists(INPUT_PDF):
        print(f"Error: Input file '{INPUT_PDF}' not found.")
        print("Please create a dummy PDF file named 'large_report.pdf' or change the INPUT_PDF variable.")
    else:
        split_pdf(INPUT_PDF, OUTPUT_DIR, NUM_CHUNKS)

How it Works:

  1. monkey.patch_all(): This modifies parts of Python's standard library (socket, ssl, time, subprocess, and more) so that blocking operations automatically "yield" control, allowing other greenlets to run. One caveat: regular disk file I/O cannot be made non-blocking on most platforms, so plain open()/read() calls can still block the loop briefly; gevent's biggest wins come from sockets, pipes, and subprocesses, or from wrapping files in gevent.fileobject.FileObjectThread.
  2. gevent.spawn(func, *args): This starts a new greenlet, which is a lightweight coroutine. It schedules func to be executed with the given arguments.
  3. gevent.joinall(jobs): This pauses the main program until all the greenlets in the jobs list have completed their tasks.
  4. Concurrency: While one greenlet is waiting for the disk to write chunk_1.pdf, another greenlet can start reading pages for chunk_2.pdf. This overlap of waiting and working is what provides the performance boost.
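One refinement worth knowing: spawning one greenlet per file works for a handful of chunks, but with thousands of files you can exhaust file descriptors. gevent.pool.Pool caps how many greenlets run at once (the pool size and work function here are illustrative):

```python
import gevent
from gevent.pool import Pool

def work(i):
    gevent.sleep(0.01)  # stand-in for reading/writing one PDF chunk
    return i * i

pool = Pool(4)  # at most 4 greenlets active at any moment
# imap_unordered yields results as they complete, in arbitrary order
results = sorted(pool.imap_unordered(work, range(10)))
print(results)  # → [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

pool.spawn is a drop-in replacement for gevent.spawn when you want the same cap on the split_pdf example above.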

Example 2: Watermarking Multiple PDFs (I/O-Bound)

Another excellent use case. You have a directory of PDFs and want to add a watermark to each one.

Step 1: Install Libraries

pip install gevent pdfrw

(pdfrw's PageMerge handles the watermark overlay, so no extra dependency is needed.)

Step 2: The Python Code

import gevent
from gevent import monkey
# Patch before importing other modules so cooperative versions are in place
monkey.patch_all()
import os
from pdfrw import PdfReader, PdfWriter, PageMerge
def add_watermark(input_pdf_path, output_pdf_path, watermark_path):
    """Adds a watermark to a single PDF file."""
    try:
        print(f"-> Processing {os.path.basename(input_pdf_path)}...")
        # Read the main PDF and the watermark
        pdf_reader = PdfReader(input_pdf_path)
        watermark = PdfReader(watermark_path).pages[0]
        # Create a writer
        pdf_writer = PdfWriter()
        # Merge watermark with each page
        for page in pdf_reader.pages:
            PageMerge(page).add(watermark, prepend=True).render()  # render() applies the merge
            pdf_writer.addpage(page)
        # Write the output file
        pdf_writer.write(output_pdf_path)
        print(f"-> Watermarked copy saved to {os.path.basename(output_pdf_path)}")
    except Exception as e:
        print(f"!! Error with {input_pdf_path}: {e}")
def watermark_pdfs_in_directory(input_dir, output_dir, watermark_path):
    """Watermarks all PDFs in a directory concurrently."""
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    jobs = []
    for filename in os.listdir(input_dir):
        if filename.lower().endswith('.pdf'):
            input_path = os.path.join(input_dir, filename)
            # Create a unique output name to avoid overwrites
            output_filename = f"watermarked_{filename}"
            output_path = os.path.join(output_dir, output_filename)
            # Spawn a greenlet for each PDF
            job = gevent.spawn(add_watermark, input_path, output_path, watermark_path)
            jobs.append(job)
    # Wait for all watermarking jobs to finish
    gevent.joinall(jobs)
    print("\nWatermarking complete!")
if __name__ == "__main__":
    # Create a 'watermark.pdf' with a single page for testing
    # Assume you have a 'source_pdfs' directory with PDFs to watermark
    INPUT_DIR = 'source_pdfs'
    OUTPUT_DIR = 'watermarked_pdfs'
    WATERMARK_PDF = 'watermark.pdf'
    if not os.path.exists(INPUT_DIR):
        print(f"Error: Input directory '{INPUT_DIR}' not found.")
    elif not os.path.exists(WATERMARK_PDF):
        print(f"Error: Watermark file '{WATERMARK_PDF}' not found.")
    else:
        watermark_pdfs_in_directory(INPUT_DIR, OUTPUT_DIR, WATERMARK_PDF)

Important Considerations

  • Global Interpreter Lock (GIL): All greenlets run inside a single OS thread, so only one greenlet executes Python bytecode at any moment. This is why gevent is only effective for I/O-bound tasks, not CPU-bound ones: while one greenlet waits for I/O, the event loop switches to another, but CPU work is never parallelized.
  • Compatibility: monkey.patch_all() can sometimes cause issues with libraries that have their own C extensions and perform their own blocking I/O. If you encounter strange bugs, it might be due to an incompatibility. You can patch specific modules instead: monkey.patch_socket(), monkey.patch_time(), etc.
  • Error Handling: By default, gevent.joinall(jobs) does not re-raise exceptions from failed greenlets (raise_error defaults to False); each greenlet records its result in its value attribute and any exception in its exception attribute. Pass raise_error=True if you want the first failure to propagate immediately.
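A sketch of that error-handling pattern (might_fail and the file names are hypothetical stand-ins for real PDF jobs):

```python
import gevent

def might_fail(path):
    if path.endswith("bad.pdf"):
        raise ValueError(f"cannot process {path}")
    return f"ok: {path}"

jobs = [gevent.spawn(might_fail, p) for p in ("a.pdf", "bad.pdf")]
# raise_error defaults to False: failures are stored on each greenlet
gevent.joinall(jobs)
for job in jobs:
    if job.successful():
        print(job.value)
    else:
        print(f"failed: {job.exception}")
```

This lets one bad file fail without aborting the whole batch.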

gevent vs. multiprocessing for PDFs

| Feature | gevent | multiprocessing |
|---|---|---|
| Concurrency model | Greenlets (cooperative multitasking) | Processes (true parallelism) |
| Best for | I/O-bound tasks (disk, network) | CPU-bound tasks (OCR, math) |
| Overhead | Very low (scheduled in one process) | High (each process has its own memory) |
| Memory usage | Low (shared memory space) | High (memory is duplicated) |
| GIL | Works around it for I/O | Bypasses it completely |
| Example use case | Splitting, merging, watermarking many PDFs | OCR on one or a few large PDFs |

For a complex application, you might even combine them: use a multiprocessing.Pool to distribute CPU-heavy OCR jobs across your CPU cores, and within each of those worker processes, use gevent to handle concurrent file I/O.
