Of course! Let's dive into using Python's gevent library to work with PDFs. This is a powerful combination, especially for I/O-bound tasks.

The Core Idea: When to Use gevent with PDFs
gevent is a concurrency library that uses "greenlets" to handle thousands of lightweight tasks. It excels at applications that are I/O-bound, meaning they spend most of their time waiting for external operations to complete (like network requests, database calls, or reading/writing files).
PDF processing can fall into two categories:

**I/O-bound (a perfect fit for gevent):**

- **Splitting a large PDF:** Reading a big PDF file and splitting it into multiple smaller PDF files. The main work is reading/writing chunks of data from the disk.
- **Merging many PDFs:** Taking a list of many small PDF files and merging them into one. The overhead is opening, reading, and closing each file.
- **Watermarking multiple PDFs:** Applying a watermark to a large batch of PDF files. Each watermarking operation is independent and involves file I/O.
- **Converting PDFs to images/text:** Running a command-line tool (like `pdftoppm` or `pdftotext`) on many PDFs in parallel.

**CPU-bound (a poor fit for gevent):**

- **OCR (Optical Character Recognition):** Extracting text from a scanned, image-based PDF is extremely CPU-intensive. gevent won't speed this up because it can't make a single CPU core work faster on one task. For this, you'd use the `multiprocessing` module.
- **Complex PDF rendering or analysis:** Deeply parsing a complex PDF structure can also be CPU-heavy.

The Golden Rule: Use gevent when your bottleneck is waiting for the disk, network, or another process. Use `multiprocessing` when your bottleneck is the CPU itself.
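To see why overlapping waits pays off, here is a minimal toy sketch (no PDFs involved, and `gevent.sleep` standing in for any blocking I/O wait):

```python
import time

import gevent


def fake_io_task(task_id, delay):
    # gevent.sleep stands in for a blocking I/O wait (disk, network,
    # subprocess); it yields control so other greenlets can run meanwhile.
    gevent.sleep(delay)
    return task_id


start = time.monotonic()
jobs = [gevent.spawn(fake_io_task, i, 0.2) for i in range(10)]
gevent.joinall(jobs)
elapsed = time.monotonic() - start

# All ten 0.2-second waits overlap, so total time is close to 0.2s, not 2s.
print(f"10 tasks finished in {elapsed:.2f}s")
print([job.value for job in jobs])
```

Ten sequential waits would take about two seconds; here they complete together in roughly the time of one.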
Example 1: Splitting a Large PDF (I/O-Bound)
This is a classic use case. Imagine you have a 500-page PDF report and you want to split it into 5 files of 100 pages each. We can do this concurrently.
We'll use pdfrw, a small, pure-Python library for manipulating PDFs. The same pattern works with alternatives such as PyPDF2 or PyMuPDF.
Step 1: Install Libraries
pip install gevent pdfrw
Step 2: The Python Code
This script will take a large PDF and split it into chunks. Each chunk is processed in its own greenlet.
```python
import os

import gevent
from gevent import monkey
from pdfrw import PdfReader, PdfWriter

# Monkey-patch the standard library to make it cooperative
monkey.patch_all()


def split_pdf(input_path, output_dir, num_chunks):
    """
    Splits a single PDF into multiple chunks concurrently using gevent.
    """
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    print(f"Reading input PDF: {input_path}")
    pdf_reader = PdfReader(input_path)
    total_pages = len(pdf_reader.pages)
    print(f"Total pages found: {total_pages}")

    if num_chunks <= 0:
        raise ValueError("Number of chunks must be positive.")
    if num_chunks > total_pages:
        print(f"Warning: Requested more chunks ({num_chunks}) than pages ({total_pages}).")
        num_chunks = total_pages

    pages_per_chunk = total_pages // num_chunks
    remainder_pages = total_pages % num_chunks

    jobs = []
    page_start = 0
    for i in range(num_chunks):
        # Calculate the number of pages for this chunk
        pages_in_this_chunk = pages_per_chunk
        if i < remainder_pages:
            pages_in_this_chunk += 1
        page_end = page_start + pages_in_this_chunk

        # Create a unique output path for each chunk
        output_path = os.path.join(output_dir, f"chunk_{i + 1}.pdf")

        # Create a job (greenlet) for this chunk
        job = gevent.spawn(process_chunk, input_path, output_path, page_start, page_end)
        jobs.append(job)
        print(f"Queued job to create {output_path} (pages {page_start + 1} to {page_end})")

        page_start = page_end

    # Wait for all greenlets to finish
    gevent.joinall(jobs)
    print("\nAll chunks processed successfully!")


def process_chunk(input_path, output_path, page_start, page_end):
    """
    Processes a single chunk of the PDF. This is an I/O-bound task.
    """
    print(f"-> Processing {output_path}...")
    try:
        # Read the full PDF (this is I/O)
        pdf_reader = PdfReader(input_path)

        # Add the specified pages to a new PDF writer
        pdf_writer = PdfWriter()
        for i in range(page_start, page_end):
            pdf_writer.addpage(pdf_reader.pages[i])

        # Write the output file (this is also I/O)
        pdf_writer.write(output_path)
        print(f"-> Successfully wrote {output_path}")
    except Exception as e:
        print(f"!! Error processing {output_path}: {e}")


if __name__ == "__main__":
    # For this example, assume 'large_report.pdf' exists
    INPUT_PDF = 'large_report.pdf'
    OUTPUT_DIR = 'split_pdfs'
    NUM_CHUNKS = 5

    if not os.path.exists(INPUT_PDF):
        print(f"Error: Input file '{INPUT_PDF}' not found.")
        print("Please create a dummy PDF file named 'large_report.pdf' or change the INPUT_PDF variable.")
    else:
        split_pdf(INPUT_PDF, OUTPUT_DIR, NUM_CHUNKS)
```
How it Works:

- `monkey.patch_all()`: This is the magic line. It modifies parts of Python's standard library (like `os`, `socket`, `time`) so that blocking operations automatically "yield" control, allowing other greenlets to run.
- `gevent.spawn(func, *args)`: This starts a new greenlet, a lightweight coroutine, and schedules `func` to be executed with the given arguments.
- `gevent.joinall(jobs)`: This pauses the main program until all the greenlets in the `jobs` list have completed their tasks.
- Concurrency: While one greenlet is waiting for the disk to write `chunk_1.pdf`, another greenlet can start reading pages for `chunk_2.pdf`. This overlap of waiting and working is what provides the performance boost.
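A tiny self-contained sketch of what `monkey.patch_all()` actually does: after patching, even a plain `time.sleep` call becomes cooperative, so two greenlets can wait at the same time instead of one after the other.

```python
import gevent
from gevent import monkey

# Patch the standard library; time.sleep now yields to other greenlets
monkey.patch_all()

import time


def wait_and_return(name):
    time.sleep(0.3)  # cooperative after patching, not a hard block
    return name


start = time.monotonic()
jobs = [gevent.spawn(wait_and_return, n) for n in ("first", "second")]
gevent.joinall(jobs)
elapsed = time.monotonic() - start

# Both 0.3s sleeps overlap: total elapsed is ~0.3s, not 0.6s.
print(elapsed, [j.value for j in jobs])
```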
Example 2: Watermarking Multiple PDFs (I/O-Bound)
Another excellent use case. You have a directory of PDFs and want to add a watermark to each one.
Step 1: Install Libraries
pip install gevent pdfrw
Step 2: The Python Code
```python
import os

import gevent
from gevent import monkey
from pdfrw import PdfReader, PdfWriter, PageMerge

monkey.patch_all()


def add_watermark(input_pdf_path, output_pdf_path, watermark_path):
    """Adds a watermark to a single PDF file."""
    try:
        print(f"-> Processing {os.path.basename(input_pdf_path)}...")

        # Read the main PDF and the watermark
        pdf_reader = PdfReader(input_pdf_path)
        watermark = PdfReader(watermark_path).pages[0]

        # Merge the watermark underneath each page
        pdf_writer = PdfWriter()
        for page in pdf_reader.pages:
            PageMerge(page).add(watermark, prepend=True).render()
            pdf_writer.addpage(page)

        # Write the output file
        pdf_writer.write(output_pdf_path)
        print(f"-> Watermarked PDF saved to {os.path.basename(output_pdf_path)}")
    except Exception as e:
        print(f"!! Error with {input_pdf_path}: {e}")


def watermark_pdfs_in_directory(input_dir, output_dir, watermark_path):
    """Watermarks all PDFs in a directory concurrently."""
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    jobs = []
    for filename in os.listdir(input_dir):
        if filename.lower().endswith('.pdf'):
            input_path = os.path.join(input_dir, filename)

            # Create a unique output name to avoid overwrites
            output_path = os.path.join(output_dir, f"watermarked_{filename}")

            # Spawn a greenlet for each PDF
            jobs.append(gevent.spawn(add_watermark, input_path, output_path, watermark_path))

    # Wait for all watermarking jobs to finish
    gevent.joinall(jobs)
    print("\nWatermarking complete!")


if __name__ == "__main__":
    # Assume a 'source_pdfs' directory with PDFs to watermark
    # and a single-page 'watermark.pdf'
    INPUT_DIR = 'source_pdfs'
    OUTPUT_DIR = 'watermarked_pdfs'
    WATERMARK_PDF = 'watermark.pdf'

    if not os.path.exists(INPUT_DIR):
        print(f"Error: Input directory '{INPUT_DIR}' not found.")
    elif not os.path.exists(WATERMARK_PDF):
        print(f"Error: Watermark file '{WATERMARK_PDF}' not found.")
    else:
        watermark_pdfs_in_directory(INPUT_DIR, OUTPUT_DIR, WATERMARK_PDF)
```
Important Considerations
- **Global Interpreter Lock (GIL):** gevent runs all greenlets in a single OS thread, so only one greenlet executes Python bytecode at a time. This is why gevent is only effective for I/O-bound tasks, not CPU-bound ones: while one greenlet waits on I/O, others can run, but they can never execute Python code in parallel.
- **Regular file I/O:** `monkey.patch_all()` makes sockets, DNS lookups, and `time.sleep` cooperative, but reads and writes on regular disk files still block at the OS level. For disk-heavy workloads, gevent provides `gevent.fileobject.FileObjectThread`, which pushes file operations to a thread pool so other greenlets keep running.
- **Compatibility:** `monkey.patch_all()` can sometimes cause issues with libraries that have their own C extensions and perform their own blocking I/O. If you encounter strange bugs, it might be due to an incompatibility. You can patch specific modules instead: `monkey.patch_socket()`, `monkey.patch_time()`, etc.
- **Error Handling:** By default, `gevent.joinall(jobs)` does not re-raise exceptions from failed greenlets. Pass `raise_error=True` to make it fail fast, or keep the default and check the `value` and `exception` attributes of each job to handle errors more gracefully.
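Here is a small sketch of the graceful error-handling pattern, with a toy task standing in for a PDF job that might hit a corrupt file:

```python
import gevent


def risky_task(n):
    # A toy task that fails for odd inputs, standing in for a PDF
    # operation that might raise on a corrupt file.
    if n % 2:
        raise ValueError(f"bad input: {n}")
    return n * 10


jobs = [gevent.spawn(risky_task, n) for n in range(4)]
# raise_error defaults to False: failures are recorded on the job, not raised
gevent.joinall(jobs)

for n, job in enumerate(jobs):
    if job.successful():
        print(f"task {n} -> {job.value}")
    else:
        print(f"task {n} failed: {job.exception}")
```

Each `Greenlet` exposes `successful()`, `value`, and `exception`, so one corrupt PDF doesn't abort the whole batch.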
gevent vs. multiprocessing for PDFs

| Feature | gevent | multiprocessing |
|---|---|---|
| Concurrency Model | Greenlets (cooperative multitasking) | Processes (true parallelism) |
| Best For | I/O-bound tasks (disk, network) | CPU-bound tasks (OCR, math) |
| Overhead | Very low (scheduled in one process) | High (each process has its own memory) |
| Memory Usage | Low (shared memory space) | High (memory is duplicated) |
| GIL | Works around it for I/O | Bypasses it completely |
| Example Use Case | Splitting, merging, watermarking many PDFs | OCR on a single or few large PDFs |
For a complex application, you might even combine them: use a multiprocessing.Pool to distribute CPU-heavy OCR jobs across your CPU cores, and within each of those worker processes, use gevent to handle concurrent file I/O.
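As a rough sketch of that hybrid layout, with toy tasks in place of real OCR and file I/O (this assumes the `fork` start method, so it won't run as-is on Windows):

```python
import multiprocessing


def process_batch(paths):
    # Each worker process imports gevent itself, so the parent stays unpatched.
    import gevent

    def handle_one(path):
        gevent.sleep(0.05)  # stands in for reading/writing one PDF
        return f"done:{path}"

    # Inside this worker, file I/O for the batch overlaps via greenlets
    jobs = [gevent.spawn(handle_one, p) for p in paths]
    gevent.joinall(jobs)
    return [job.value for job in jobs]


def run(batches):
    # One process per CPU-heavy batch; gevent handles concurrency inside each
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=2) as pool:
        return pool.map(process_batch, batches)


results = run([["a.pdf", "b.pdf"], ["c.pdf"]])
print(results)
```

Importing gevent only inside the worker avoids mixing monkey-patching with `multiprocessing` in the parent process, which is a common source of subtle bugs.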
