杰瑞科技汇

How Do You Work with PDFs Using Python WebDriver?

Using Python with a WebDriver to generate or interact with PDFs is a common task in web scraping, testing, and automation. The right approach depends on your goal.

Let's break it down into the most common scenarios:

  1. Generate a PDF from a webpage: This is the most frequent use case. You want to capture a webpage in a high-fidelity, paginated, vector-based format rather than as a bitmap screenshot.
  2. Extract text from a PDF: You have a PDF file and want to read its content.
  3. Fill out a PDF form: You have an interactive PDF form and want to programmatically fill it in.
  4. Interact with a PDF embedded in a webpage: You want to click links, fill forms, or extract text from a PDF that is being displayed by the browser's PDF viewer.

Here’s a detailed guide covering each scenario.


Scenario 1: Generate a PDF from a Webpage (Most Common)

The goal here is a faithful representation of the webpage as it appears in a browser. That means handing the page to something that can print to PDF: either an external converter such as wkhtmltopdf or the browser's own print engine.

Method A: Using pdfkit (Recommended & Simplest)

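pdfkit is a thin wrapper around the command-line tool wkhtmltopdf. It is easy to set up and produces good-quality PDFs for many pages. Note that wkhtmltopdf renders with its own bundled WebKit engine rather than the browser Selenium drives, so pages that rely on modern CSS or heavy JavaScript may look different from what Chrome shows.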

Step 1: Install Prerequisites

  1. Install wkhtmltopdf: You need this command-line tool on your system.

    • Windows: Download the installer from the official site. Make sure to add it to your system's PATH during installation.
    • macOS (using Homebrew): brew install wkhtmltopdf
    • Linux (Debian/Ubuntu): sudo apt-get install wkhtmltopdf
  2. Install Python libraries:

    pip install pdfkit selenium webdriver-manager
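If wkhtmltopdf is installed but not on PATH, pdfkit can be pointed at the binary explicitly through pdfkit.configuration(wkhtmltopdf=...). The sketch below shows one way to wire that up. The helper names find_wkhtmltopdf and make_pdfkit_config are hypothetical, and any fallback path you pass in is an assumption about your own machine.

```python
import shutil
from typing import Optional


def find_wkhtmltopdf(fallback: Optional[str] = None) -> Optional[str]:
    """Return the wkhtmltopdf path found on PATH, else the fallback (which may be None)."""
    return shutil.which("wkhtmltopdf") or fallback


def make_pdfkit_config(binary_path: str):
    """Build a pdfkit configuration object that points at an explicit binary."""
    # Imported lazily so the lookup helper above works even without pdfkit installed.
    import pdfkit

    return pdfkit.configuration(wkhtmltopdf=binary_path)
```

The resulting object would then be passed along with the conversion call, for example pdfkit.from_file("temp_page.html", "out.pdf", configuration=make_pdfkit_config(path)).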

Step 2: Write the Python Script

This script uses Selenium to load a page and then pdfkit to save it as a PDF.

import os
import pdfkit
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
# --- 1. Setup Selenium WebDriver ---
# Use webdriver-manager to automatically handle the driver
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
url = "https://www.python.org"
html_path = "temp_page.html"
# The 'options' dictionary allows you to customize the PDF output
options = {
    'page-size': 'A4',
    'encoding': "UTF-8",
    'margin-top': '10mm',
    'margin-right': '10mm',
    'margin-bottom': '10mm',
    'margin-left': '10mm',
}
try:
    # --- 2. Navigate to the desired URL ---
    driver.get(url)
    # --- 3. Save the page's HTML to a temporary file ---
    # This passes the HTML as Selenium currently sees it to pdfkit.
    # You can also pass the URL directly with pdfkit.from_url().
    with open(html_path, "w", encoding="utf-8") as f:
        f.write(driver.page_source)
    # --- 4. Convert the HTML file to a PDF using pdfkit ---
    pdfkit.from_file(html_path, "python_org.pdf", options=options)
    print("PDF generated successfully!")
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # --- 5. Clean up ---
    # Always close the browser and delete the temporary file, even on failure
    driver.quit()
    if os.path.exists(html_path):
        os.remove(html_path)
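One caveat with writing driver.page_source to a local file: the saved HTML no longer knows its original address, so relative links to stylesheets and images may not resolve when wkhtmltopdf reads it. A possible workaround is to insert a <base href> tag before saving. This is a minimal sketch, not part of the original script; add_base_tag is a hypothetical helper, and it assumes the URL contains no double quotes.

```python
import re


def add_base_tag(html: str, base_url: str) -> str:
    """Insert <base href="..."> right after the opening <head> tag, if there is one."""
    if re.search(r"<base\b", html, flags=re.IGNORECASE):
        return html  # the page already declares its own base URL
    base_tag = f'<base href="{base_url}">'
    # Match <head> or <head ...attributes...> (but not <header>), first occurrence only.
    # A callable replacement keeps backslashes in base_url from being read as escapes.
    new_html, _count = re.subn(
        r"(<head\b[^>]*>)",
        lambda match: match.group(1) + base_tag,
        html,
        count=1,
        flags=re.IGNORECASE,
    )
    # If no <head> was found, the HTML is returned unchanged.
    return new_html
```

In the script above, this would mean writing add_base_tag(driver.page_source, driver.current_url) to the temporary file instead of the raw page source.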

Method B: Using Chrome's Built-in "Save as PDF" Feature

This method uses Chrome's own print-to-PDF engine, so the output matches what the browser renders. Selenium cannot drive the print dialog that Ctrl+P opens, because that dialog is a native browser/OS window rather than part of the page. Selenium 4 instead provides driver.print_page(), which returns the PDF as a base64-encoded string without opening any dialog.

Step 1: Install Selenium

pip install selenium webdriver-manager

Step 2: Write the Python Script

This script opens a page in headless Chrome, asks the browser to print it, and writes the decoded bytes to a file.

import base64
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.print_page_options import PrintOptions
from webdriver_manager.chrome import ChromeDriverManager
# Run headless: older Chrome versions only support printing to PDF in headless mode.
# The "--headless=new" flag assumes a reasonably recent Chrome.
chrome_options = Options()
chrome_options.add_argument("--headless=new")
# Setup Selenium WebDriver
driver = webdriver.Chrome(
    service=ChromeService(ChromeDriverManager().install()),
    options=chrome_options,
)
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
# Define the path for the output PDF
pdf_path = "wikipedia_python.pdf"
try:
    # driver.get() returns once the page's load event has fired
    driver.get(url)
    # Configure the print job; background graphics are off by default
    print_options = PrintOptions()
    print_options.background = True
    # print_page() returns the PDF as a base64-encoded string
    pdf_base64 = driver.print_page(print_options)
    with open(pdf_path, "wb") as f:
        f.write(base64.b64decode(pdf_base64))
    print(f"PDF saved to '{pdf_path}'.")
except Exception as e:
    print(f"Could not print the page to PDF: {e}")
finally:
    # Close the browser
    driver.quit()

Note: Automating the "Save As" dialog directly is notoriously difficult because it is a native OS window, not a web element. Doing so would require OS-specific tools such as pywinauto (Windows), AppleScript (macOS), or pyautogui, and is not reliable across setups. print_page() avoids the dialog altogether. Method A (pdfkit) remains a simple alternative when wkhtmltopdf renders the page well enough.
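For finer control over the output, Chromium-based drivers also expose the DevTools Protocol command Page.printToPDF through Selenium's execute_cdp_cmd(). No dialog is involved; the browser hands back the PDF as base64 text. The following is a sketch under those assumptions. The function names decode_pdf and print_to_pdf are hypothetical, and the driver argument is assumed to be an already-created Chrome or Edge WebDriver, ideally running headless.

```python
import base64


def decode_pdf(data_base64: str) -> bytes:
    """Decode base64 text and check that the result starts with the PDF signature."""
    pdf_bytes = base64.b64decode(data_base64)
    if not pdf_bytes.startswith(b"%PDF"):
        raise ValueError("decoded data does not look like a PDF")
    return pdf_bytes


def print_to_pdf(driver, output_path: str, landscape: bool = False) -> None:
    """Ask the browser (via CDP) to render the current page to PDF and write it to disk."""
    # execute_cdp_cmd is only available on Chromium-based drivers.
    result = driver.execute_cdp_cmd(
        "Page.printToPDF",
        {"printBackground": True, "landscape": landscape},
    )
    with open(output_path, "wb") as f:
        f.write(decode_pdf(result["data"]))
```

After driver.get(url), a call such as print_to_pdf(driver, "page.pdf") would write the file. The signature check is a cheap guard against saving an error payload under a .pdf name.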


Scenario 2: Extract Text from a PDF

If you have a PDF file (whether you generated it or not) and you want to read its text, use a PDF parsing library such as PyPDF2 (now maintained under the name pypdf) or pdfplumber. pdfplumber is often the better choice because it preserves text layout more accurately.

Step 1: Install the library

pip install pdfplumber

Step 2: Write the Python Script

import pdfplumber
pdf_file_path = "python_org.pdf" # Use the PDF from the first example
try:
    with pdfplumber.open(pdf_file_path) as pdf:
        full_text = ""
        # Loop through all pages
        for i, page in enumerate(pdf.pages):
            # Extract text from the page
            text = page.extract_text()
            if text:
                full_text += text + "\n"
                print(f"--- Page {i+1} ---")
                print(text)
                print("\n")
    # You can now save the extracted text to a .txt file
    with open("extracted_text.txt", "w", encoding="utf-8") as f:
        f.write(full_text)
    print("Text successfully extracted and saved to extracted_text.txt")
except FileNotFoundError:
    print(f"Error: The file '{pdf_file_path}' was not found.")
except Exception as e:
    print(f"An error occurred while reading the PDF: {e}")
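For comparison, the same extraction can be sketched with pypdf, the other library mentioned above. This is an illustrative alternative rather than a replacement for the pdfplumber script; the helper names join_pages and extract_text_pypdf are hypothetical, and the sketch assumes pypdf has been installed with pip install pypdf.

```python
from typing import Iterable, Optional


def join_pages(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text with blank lines, skipping pages that yielded nothing."""
    return "\n\n".join(text.strip() for text in page_texts if text and text.strip())


def extract_text_pypdf(pdf_path: str) -> str:
    """Read every page with pypdf and return the combined text."""
    # Imported lazily so join_pages can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return join_pages(page.extract_text() for page in reader.pages)
```

Calling extract_text_pypdf("python_org.pdf") would return one string with pages separated by blank lines. Scanned PDFs that contain only images yield no text with either library and would need OCR instead.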

Scenario 3: Fill Out a PDF Form

For this, you'll need a library that can manipulate PDF forms. pdfrw is excellent for this.

Step 1: Install the library

pip install pdfrw

Step 2: Get a PDF with an AcroForm

Find a PDF with fillable form fields. You can often find templates online. Let's assume you have form_template.pdf and the field names are name, email, and comments.

Step 3: Write the Python Script

from pdfrw import PdfReader, PdfWriter, IndirectPdfDict
# Path to your PDF form
template_path = "form_template.pdf"
# Path to save the filled form
output_path = "filled_form.pdf"
# Data to fill into the form
form_data = {
    "/name": "John Doe",
    "/email": "john.doe@example.com",
    "/comments": "This form was filled automatically using Python."
}
# Read the template PDF
template_pdf = PdfReader(template_path)
# Get the AcroForm dictionary from the PDF's root
# If it doesn't exist, you can't fill
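The script above is cut off at this point in the source. As a hedged illustration only, and not the author's original code, here is one common pdfrw pattern for filling text fields: walk each page's widget annotations, match them by field name, set the value, and ask the viewer to regenerate appearances. The helper names normalize_keys and fill_form are hypothetical. The sketch assumes pdfrw 0.4, that field names are stored directly on the widget's /T entry (fields nested under a parent are skipped), and that the "/name"-style keys used above map to plain field names once the slash is stripped.

```python
from typing import Dict


def normalize_keys(form_data: Dict[str, str]) -> Dict[str, str]:
    """Strip a leading '/' so keys match the plain field names stored in the PDF."""
    return {key.lstrip("/"): value for key, value in form_data.items()}


def fill_form(template_path: str, output_path: str, form_data: Dict[str, str]) -> int:
    """Fill matching text fields and return how many were updated."""
    # Imported lazily so normalize_keys can be used without pdfrw installed.
    from pdfrw import PdfDict, PdfObject, PdfReader, PdfString, PdfWriter

    values = normalize_keys(form_data)
    pdf = PdfReader(template_path)
    if pdf.Root.AcroForm is None:
        raise ValueError("this PDF has no AcroForm, so there is nothing to fill")

    filled = 0
    for page in pdf.pages:
        for annotation in page.Annots or []:
            # Form fields are widget annotations whose /T entry holds the field name.
            if annotation.Subtype != "/Widget" or annotation.T is None:
                continue
            name = annotation.T.to_unicode()
            if name in values:
                annotation.update(PdfDict(V=PdfString.encode(values[name])))
                filled += 1

    # Ask the viewer to regenerate field appearances so the new values are visible.
    pdf.Root.AcroForm.update(PdfDict(NeedAppearances=PdfObject("true")))
    PdfWriter().write(output_path, pdf)
    return filled
```

With the variables defined earlier, fill_form(template_path, output_path, form_data) would write filled_form.pdf and report how many fields matched. A return value of 0 usually means the field names in the PDF differ from the keys in form_data.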
