
How can Python's gevent process PDFs efficiently?

Let's dive into using Python's gevent library to work with PDFs. This is a powerful combination, especially for I/O-bound tasks.


The Core Idea: When to Use gevent with PDFs

gevent is a concurrency library that uses "greenlets" to handle thousands of lightweight tasks. It excels at applications that are I/O-bound, meaning they spend most of their time waiting for external operations to complete (like network requests, database calls, or reading/writing files).
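The basic pattern is spawn-then-join. Here is a minimal sketch (greet and the names are placeholders, and gevent.sleep(0) stands in for an I/O wait):

```python
import gevent

def greet(name):
    # gevent.sleep(0) yields to the scheduler, standing in for an I/O wait
    gevent.sleep(0)
    return f"hello, {name}"

# spawn() starts each greenlet; joinall() waits for all of them to finish
jobs = [gevent.spawn(greet, n) for n in ("alice", "bob")]
gevent.joinall(jobs)
results = [job.value for job in jobs]
print(results)  # → ['hello, alice', 'hello, bob']
```

Each greenlet's return value is available afterwards on its value attribute.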

PDF processing can fall into two categories:

  1. I/O-Bound (A perfect fit for gevent):

    • Splitting a large PDF: Reading a big PDF file and splitting it into multiple smaller PDF files. The main work is reading/writing chunks of data from the disk.
    • Merging many PDFs: Taking a list of many small PDF files and merging them into one. The overhead is opening, reading, and closing each file.
    • Watermarking multiple PDFs: Applying a watermark to a large batch of PDF files. Each watermarking operation is independent and involves file I/O.
    • Converting PDFs to images/text: Running a command-line tool (like pdftoppm or pdftotext) on many PDFs in parallel.
  2. CPU-Bound (A poor fit for gevent):

    • OCR (Optical Character Recognition): Extracting text from a scanned image-based PDF is extremely CPU-intensive. gevent won't speed this up because it can't make a single CPU core work faster on one task. For this, you'd use the multiprocessing module.
    • Complex PDF rendering or analysis: Deeply parsing a complex PDF structure might also be CPU-heavy.

The Golden Rule: Use gevent when your bottleneck is waiting for the disk, network, or another process. Use multiprocessing when your bottleneck is the CPU itself.
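To see why overlapping waits matters, here is a small self-contained demo that simulates I/O with gevent.sleep (no real PDFs needed; the 0.2 s delay is arbitrary):

```python
import time
import gevent

def fake_io_task(n):
    # gevent.sleep yields while "waiting", just like a patched blocking call
    gevent.sleep(0.2)
    return n

start = time.time()
jobs = [gevent.spawn(fake_io_task, i) for i in range(10)]
gevent.joinall(jobs)
elapsed = time.time() - start
# the ten 0.2 s waits overlap, so this is roughly 0.2 s, not 2 s
print(f"elapsed: {elapsed:.2f}s")
```

Run sequentially, the same ten waits would take about two seconds; under gevent they all overlap.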


Example 1: Splitting a Large PDF (I/O-Bound)

This is a classic use case. Imagine you have a 500-page PDF report and you want to split it into 5 files of 100 pages each. We can do this concurrently.

We'll use pdfrw, a lightweight, pure-Python library for manipulating PDFs (PyPDF2 and PyMuPDF are common alternatives).

Step 1: Install Libraries

pip install gevent pdfrw

Step 2: The Python Code

This script will take a large PDF and split it into chunks. Each chunk is processed in its own greenlet.

import gevent
from gevent import monkey
# Monkey-patch the standard library as early as possible, before other
# imports, so cooperative versions of socket, time, etc. are in place
monkey.patch_all()
import os
from pdfrw import PdfReader, PdfWriter
def split_pdf(input_path, output_dir, num_chunks):
    """
    Splits a single PDF into multiple chunks concurrently using gevent.
    """
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    print(f"Reading input PDF: {input_path}")
    pdf_reader = PdfReader(input_path)
    total_pages = len(pdf_reader.pages)
    print(f"Total pages found: {total_pages}")
    if num_chunks <= 0:
        raise ValueError("Number of chunks must be positive.")
    if num_chunks > total_pages:
        print(f"Warning: Requested more chunks ({num_chunks}) than pages ({total_pages}).")
        num_chunks = total_pages
    pages_per_chunk = total_pages // num_chunks
    remainder_pages = total_pages % num_chunks
    jobs = []
    page_start = 0
    for i in range(num_chunks):
        # Calculate the number of pages for this chunk
        pages_in_this_chunk = pages_per_chunk
        if i < remainder_pages:
            pages_in_this_chunk += 1
        page_end = page_start + pages_in_this_chunk
        # Create a unique output path for each chunk
        output_path = os.path.join(output_dir, f"chunk_{i+1}.pdf")
        # Create a job (greenlet) for this chunk
        job = gevent.spawn(process_chunk, input_path, output_path, page_start, page_end)
        jobs.append(job)
        print(f"Queued job to create {output_path} (pages {page_start + 1} to {page_end})")
        page_start = page_end
    # Wait for all greenlets to finish
    gevent.joinall(jobs)
    print("\nAll chunks processed successfully!")
def process_chunk(input_path, output_path, page_start, page_end):
    """
    Processes a single chunk of the PDF.
    This is an I/O-bound task.
    """
    print(f"-> Processing {output_path}...")
    try:
        # Read the full PDF (this is I/O)
        pdf_reader = PdfReader(input_path)
        # Create a new PDF writer
        pdf_writer = PdfWriter()
        # Add the specified pages to the writer
        for i in range(page_start, page_end):
            pdf_writer.addpage(pdf_reader.pages[i])
        # Write the output file (this is also I/O)
        pdf_writer.write(output_path)
        print(f"-> Successfully wrote {output_path}")
    except Exception as e:
        print(f"!! Error processing {output_path}: {e}")
if __name__ == "__main__":
    # Create a dummy large PDF for testing if you don't have one
    # For this example, assume 'large_report.pdf' exists
    INPUT_PDF = 'large_report.pdf'
    OUTPUT_DIR = 'split_pdfs'
    NUM_CHUNKS = 5
    # Check if input file exists
    if not os.path.exists(INPUT_PDF):
        print(f"Error: Input file '{INPUT_PDF}' not found.")
        print("Please create a dummy PDF file named 'large_report.pdf' or change the INPUT_PDF variable.")
    else:
        split_pdf(INPUT_PDF, OUTPUT_DIR, NUM_CHUNKS)

How it Works:

  1. monkey.patch_all(): This modifies parts of Python's standard library (socket, ssl, time, subprocess, and more) so that blocking operations automatically "yield" control, allowing other greenlets to run. One caveat: regular disk file I/O cannot be made non-blocking on most platforms, so plain open()/read() calls can still block the loop briefly; gevent's biggest wins come from sockets, pipes, and subprocesses, or from wrapping files in gevent.fileobject.FileObjectThread.
  2. gevent.spawn(func, *args): This starts a new greenlet, which is a lightweight coroutine. It schedules func to be executed with the given arguments.
  3. gevent.joinall(jobs): This pauses the main program until all the greenlets in the jobs list have completed their tasks.
  4. Concurrency: While one greenlet is waiting for the disk to write chunk_1.pdf, another greenlet can start reading pages for chunk_2.pdf. This overlap of waiting and working is what provides the performance boost.
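One refinement worth knowing: spawning one greenlet per file works for a handful of chunks, but with thousands of files you can exhaust file descriptors. gevent.pool.Pool caps how many greenlets run at once (the pool size and work function here are illustrative):

```python
import gevent
from gevent.pool import Pool

def work(i):
    gevent.sleep(0.01)  # stand-in for reading/writing one PDF chunk
    return i * i

pool = Pool(4)  # at most 4 greenlets active at any moment
# imap_unordered yields results as they complete, in arbitrary order
results = sorted(pool.imap_unordered(work, range(10)))
print(results)  # → [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

pool.spawn is a drop-in replacement for gevent.spawn when you want the same cap on the split_pdf example above.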

Example 2: Watermarking Multiple PDFs (I/O-Bound)

Another excellent use case. You have a directory of PDFs and want to add a watermark to each one.

Step 1: Install Libraries

pip install gevent pdfrw

(pdfrw's PageMerge handles the watermark overlay, so no extra dependency is needed.)

Step 2: The Python Code

import gevent
from gevent import monkey
# Patch before importing other modules so cooperative versions are in place
monkey.patch_all()
import os
from pdfrw import PdfReader, PdfWriter, PageMerge
def add_watermark(input_pdf_path, output_pdf_path, watermark_path):
    """Adds a watermark to a single PDF file."""
    try:
        print(f"-> Processing {os.path.basename(input_pdf_path)}...")
        # Read the main PDF and the watermark
        pdf_reader = PdfReader(input_pdf_path)
        watermark = PdfReader(watermark_path).pages[0]
        # Create a writer
        pdf_writer = PdfWriter()
        # Merge watermark with each page
        for page in pdf_reader.pages:
            PageMerge(page).add(watermark, prepend=True).render()  # render() applies the merge
            pdf_writer.addpage(page)
        # Write the output file
        pdf_writer.write(output_pdf_path)
        print(f"-> Watermarked copy saved to {os.path.basename(output_pdf_path)}")
    except Exception as e:
        print(f"!! Error with {input_pdf_path}: {e}")
def watermark_pdfs_in_directory(input_dir, output_dir, watermark_path):
    """Watermarks all PDFs in a directory concurrently."""
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    jobs = []
    for filename in os.listdir(input_dir):
        if filename.lower().endswith('.pdf'):
            input_path = os.path.join(input_dir, filename)
            # Create a unique output name to avoid overwrites
            output_filename = f"watermarked_{filename}"
            output_path = os.path.join(output_dir, output_filename)
            # Spawn a greenlet for each PDF
            job = gevent.spawn(add_watermark, input_path, output_path, watermark_path)
            jobs.append(job)
    # Wait for all watermarking jobs to finish
    gevent.joinall(jobs)
    print("\nWatermarking complete!")
if __name__ == "__main__":
    # Create a 'watermark.pdf' with a single page for testing
    # Assume you have a 'source_pdfs' directory with PDFs to watermark
    INPUT_DIR = 'source_pdfs'
    OUTPUT_DIR = 'watermarked_pdfs'
    WATERMARK_PDF = 'watermark.pdf'
    if not os.path.exists(INPUT_DIR):
        print(f"Error: Input directory '{INPUT_DIR}' not found.")
    elif not os.path.exists(WATERMARK_PDF):
        print(f"Error: Watermark file '{WATERMARK_PDF}' not found.")
    else:
        watermark_pdfs_in_directory(INPUT_DIR, OUTPUT_DIR, WATERMARK_PDF)

Important Considerations

  • Global Interpreter Lock (GIL): All greenlets run inside a single OS thread, so only one greenlet executes Python bytecode at any moment. This is why gevent is only effective for I/O-bound tasks, not CPU-bound ones: while one greenlet waits for I/O, the event loop switches to another, but CPU work is never parallelized.
  • Compatibility: monkey.patch_all() can sometimes cause issues with libraries that have their own C extensions and perform their own blocking I/O. If you encounter strange bugs, it might be due to an incompatibility. You can patch specific modules instead: monkey.patch_socket(), monkey.patch_time(), etc.
  • Error Handling: By default, gevent.joinall(jobs) does not re-raise exceptions from failed greenlets (raise_error defaults to False); each greenlet records its result in its value attribute and any exception in its exception attribute. Pass raise_error=True if you want the first failure to propagate immediately.
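A sketch of that error-handling pattern (might_fail and the file names are hypothetical stand-ins for real PDF jobs):

```python
import gevent

def might_fail(path):
    if path.endswith("bad.pdf"):
        raise ValueError(f"cannot process {path}")
    return f"ok: {path}"

jobs = [gevent.spawn(might_fail, p) for p in ("a.pdf", "bad.pdf")]
# raise_error defaults to False: failures are stored on each greenlet
gevent.joinall(jobs)
for job in jobs:
    if job.successful():
        print(job.value)
    else:
        print(f"failed: {job.exception}")
```

This lets one bad file fail without aborting the whole batch.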

gevent vs. multiprocessing for PDFs

| Feature | gevent | multiprocessing |
|---|---|---|
| Concurrency model | Greenlets (cooperative multitasking) | Processes (true parallelism) |
| Best for | I/O-bound tasks (disk, network) | CPU-bound tasks (OCR, math) |
| Overhead | Very low (scheduled in one process) | High (each process has its own memory) |
| Memory usage | Low (shared memory space) | High (memory is duplicated) |
| GIL | Works around it for I/O | Bypasses it completely |
| Example use case | Splitting, merging, watermarking many PDFs | OCR on one or a few large PDFs |

For a complex application, you might even combine them: use a multiprocessing.Pool to distribute CPU-heavy OCR jobs across your CPU cores, and within each of those worker processes, use gevent to handle concurrent file I/O.
