杰瑞科技汇

How Do You Handle PDFs with Python and Selenium?

Using Python with Selenium to handle PDFs is a common task in web scraping and automation. The key thing to understand is that Selenium itself cannot parse or read the content of a PDF file.

Selenium is a browser automation tool. Its job is to drive a web browser. So, when you click a link that opens a PDF, Selenium's job is to make the browser download the file. Then, you use a separate Python library to read the downloaded PDF file.

Here’s a complete guide covering the most common scenarios:

  1. Scenario 1: Downloading a PDF from a link (Most Common)
  2. Scenario 2: Handling PDFs that open in the browser tab (Inline)
  3. Scenario 3: Reading Text from the Downloaded PDF

Prerequisites

First, you need to install the necessary libraries.

# For browser automation
pip install selenium
# For managing browser drivers (easiest way)
pip install webdriver-manager
# For reading PDF content (we'll use PyPDF2 for this example)
pip install PyPDF2
# For downloading files directly over HTTP (used in Scenario 2)
pip install requests

Scenario 1: Downloading a PDF from a Link (The Standard Approach)

This is the most robust method. You configure the browser to automatically save PDFs to a specific folder without opening them, and then you process the file.

Step 1: Set Up Browser Preferences for Automatic Download

We'll use Chrome for this example. The core idea is to change Chrome's download.default_directory and plugins.always_open_pdf_externally settings.

import os
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
# --- 1. Configure Download Directory ---
# Create a directory to save the PDFs if it doesn't exist
download_dir = os.path.join(os.getcwd(), "pdf_downloads")
os.makedirs(download_dir, exist_ok=True)
# --- 2. Set Chrome Options ---
chrome_options = webdriver.ChromeOptions()
prefs = {
    "download.default_directory": download_dir,  # Where Chrome saves downloads
    "download.prompt_for_download": False,  # Save automatically, without a dialog
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True,  # Download PDFs instead of opening them in Chrome's viewer
    "safebrowsing.enabled": True,  # Keep Safe Browsing enabled (True does not disable it)
}
chrome_options.add_experimental_option("prefs", prefs)
# --- 3. Initialize the WebDriver ---
# Use webdriver-manager to handle the driver automatically
service = ChromeService(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)
# --- 4. Automate the Download ---
try:
    # This URL points directly at a PDF, so with the preferences above
    # Chrome downloads it as soon as we navigate to it.
    driver.get("https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf")
    # If the PDF sits behind a link on an HTML page instead, load that page and
    # click the link (inspect the page to find the correct selector), for example:
    # driver.find_element("link text", "Download the dummy PDF file").click()
    print(f"PDF download initiated. Check the '{download_dir}' folder.")
    # Wait for the download to complete.
    # A fixed sleep is the simplest option; for production, poll the folder
    # until the file appears and its size is stable.
    time.sleep(5)
finally:
    # Close the browser
    driver.quit()
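
The fixed `time.sleep(5)` above can be replaced by polling the download folder. The following is a minimal sketch, not a definitive implementation: `wait_for_download` is a hypothetical helper name, and it assumes Chrome's convention of writing in-progress downloads to `*.crdownload` temporary files. It treats a download as finished once no temporary file remains and the PDF sizes are non-zero and unchanged between two polls.

```python
import time
from pathlib import Path


def wait_for_download(download_dir, timeout=30.0, poll_interval=0.5):
    """Return the path of the newest finished PDF in download_dir.

    Hypothetical helper: raises TimeoutError if no finished PDF
    shows up within `timeout` seconds.
    """
    directory = Path(download_dir)
    deadline = time.monotonic() + timeout
    last_sizes = {}
    while time.monotonic() < deadline:
        # Assumption: Chrome keeps unfinished downloads in *.crdownload files
        in_progress = list(directory.glob("*.crdownload"))
        pdfs = list(directory.glob("*.pdf"))
        if pdfs and not in_progress:
            sizes = {p: p.stat().st_size for p in pdfs}
            # Accept only non-empty files whose sizes match the previous poll
            if sizes == last_sizes and all(size > 0 for size in sizes.values()):
                return max(pdfs, key=lambda p: p.stat().st_mtime)
            last_sizes = sizes
        else:
            last_sizes = {}
        time.sleep(poll_interval)
    raise TimeoutError(f"No completed PDF appeared in {directory} within {timeout}s")
```

Call it in place of the sleep, for example `pdf_path = wait_for_download(download_dir)`. Note that this sketch returns the newest PDF in the folder, so it works best when the folder starts empty or is used for one download at a time.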

Step 2: Reading the Downloaded PDF

Now that you have the PDF file in your pdf_downloads folder, you can use a library like PyPDF2 to read its text content.

import os
import PyPDF2
# Path to the downloaded file
# We need to find the latest file in the directory
download_dir = "pdf_downloads"
try:
    # Get a list of all files in the directory
    files = os.listdir(download_dir)
    # Filter for PDF files
    pdf_files = [f for f in files if f.endswith('.pdf')]
    if not pdf_files:
        print("No PDF files found in the download directory.")
    else:
        # Get the most recently modified PDF file
        latest_pdf = max(pdf_files, key=lambda f: os.path.getmtime(os.path.join(download_dir, f)))
        pdf_path = os.path.join(download_dir, latest_pdf)
        print(f"Reading PDF: {pdf_path}")
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            # Get the number of pages
            num_pages = len(reader.pages)
            print(f"The PDF has {num_pages} pages.")
            # Extract text from all pages
            text = ""
            for page in reader.pages:
                # extract_text() can come back empty for image-only pages
                text += (page.extract_text() or "") + "\n"
            print("\n--- Extracted Text ---")
            print(text)
except FileNotFoundError:
    print(f"Error: The directory '{download_dir}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")
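
Before handing a file to the PDF reader, it can help to confirm that the download really is a PDF, since a failed request sometimes saves an HTML error page under a `.pdf` name. The sketch below is one possible check using only the standard library; `is_probably_pdf` is a hypothetical helper, and looking for the `%PDF-` marker in the first 1024 bytes is a heuristic, not a full validation.

```python
from pathlib import Path


def is_probably_pdf(path, probe_bytes=1024):
    """Return True if the start of the file contains the '%PDF-' marker.

    Hypothetical helper: a quick heuristic, not a full PDF validation.
    """
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as f:
        head = f.read(probe_bytes)
    return b"%PDF-" in head
```

In the script above, this could guard the parsing step, for example `if is_probably_pdf(pdf_path): ...`, with a clear error message otherwise.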

Scenario 2: Handling PDFs that Open in the Browser Tab (Inline)

Sometimes, websites are configured to open PDFs directly in the browser tab instead of downloading them. Selenium can switch to this new "tab" and get its URL.
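
When the PDF opens in a new tab, switching to it could look roughly like this. It is a sketch under stated assumptions: `switch_to_new_tab` is a hypothetical helper, and it relies only on the standard WebDriver members `window_handles`, `switch_to.window()` and `current_url`. You record the open handles before clicking, then switch to whichever handle is new.

```python
def switch_to_new_tab(driver, handles_before):
    """Switch `driver` to the first window handle not in `handles_before`.

    Hypothetical helper: returns the URL of the newly focused tab,
    or None if no new tab has opened.
    """
    for handle in driver.window_handles:
        if handle not in handles_before:
            driver.switch_to.window(handle)
            return driver.current_url
    return None


# Typical usage with a real driver (not executed here):
# handles_before = set(driver.window_handles)
# link.click()
# pdf_url = switch_to_new_tab(driver, handles_before)
```

A new tab may take a moment to appear, so in practice you may need to retry until the function returns something other than `None`.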

The Challenge

You cannot use driver.page_source to get the PDF's content. The browser displays the file in its built-in viewer, so the page source is not the document's text. Instead, you need to get the direct URL to the PDF file and then download it yourself using a library like requests.

Solution: Get the PDF URL and Download with requests

import requests
import os
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
# --- 1. Initialize WebDriver (no special download options needed) ---
service = ChromeService(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
# --- 2. Automate to get the PDF URL ---
try:
    # Navigate to a page that opens a PDF inline
    driver.get("https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf")
    # The PDF is the only content on the page, so its URL is the current page's URL
    pdf_url = driver.current_url
    print(f"Found PDF URL: {pdf_url}")
    # --- 3. Download the PDF using the 'requests' library ---
    # Create a directory to save the PDFs
    download_dir = "pdf_downloads_inline"
    if not os.path.exists(download_dir):
        os.makedirs(download_dir)
    # Get the filename from the URL, ignoring any query string
    filename = os.path.basename(pdf_url.split("?", 1)[0]) or "download.pdf"
    save_path = os.path.join(download_dir, filename)
    # Send a GET request to the URL (the timeout avoids hanging on a stalled server)
    response = requests.get(pdf_url, stream=True, timeout=30)
    response.raise_for_status() # Raise an exception for bad status codes
    # Write the content to a file
    with open(save_path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    print(f"PDF downloaded successfully to: {save_path}")
finally:
    # Close the browser
    driver.quit()
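
One caveat with this approach: requests starts its own HTTP session and does not share the browser's cookies, so a PDF behind a login may return an error page instead of the file. A possible workaround is to copy the cookies from Selenium into a requests session. The sketch below assumes the list-of-dicts format returned by `driver.get_cookies()`; `copy_cookies_to_session` is a hypothetical helper name.

```python
def copy_cookies_to_session(selenium_cookies, session):
    """Copy cookies from driver.get_cookies() into a requests.Session-like object.

    Hypothetical helper: `session` only needs a `cookies.set(...)` method,
    which requests.Session provides.
    """
    for cookie in selenium_cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return session


# Typical usage with real objects (not executed here):
# import requests
# session = copy_cookies_to_session(driver.get_cookies(), requests.Session())
# response = session.get(pdf_url, stream=True, timeout=30)
```

Collect the cookies before calling `driver.quit()`, since the driver cannot be queried after the browser closes.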

Summary and Best Practices

| Method | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Scenario 1: Browser Download Settings | Most reliable. Fully automated, no extra libraries needed. | Requires specific browser configuration. | When you have control over the browser and want a completely automated download process. |
| Scenario 2: Get URL + requests | Very flexible. Works for any file type, not just PDFs. | Requires an extra library (requests). Slightly more code. | When PDFs open inline or in new tabs, or when you can't modify browser settings. |
| Scenario 3: PyPDF2 / pdfplumber | Essential for data extraction. Allows you to search, read, and analyze PDF content. | Cannot download or handle complex PDFs (scanned images, forms). | The mandatory second step after you have successfully downloaded a PDF file. |

Pro Tip: For more advanced PDF text extraction, especially with complex layouts, consider using the pdfplumber library. It's built on top of pdfminer.six and is often better at preserving text structure.

pip install pdfplumber

import pdfplumber
with pdfplumber.open("my_downloaded_file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.extract_text())
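
The snippet above reads only the first page. To collect text from the whole document, a small helper along these lines may be enough. It is a sketch: `extract_all_text` is a hypothetical name, and it assumes only that the opened pdfplumber object exposes `pages` and that each page has `extract_text()`, as used above.

```python
def extract_all_text(pdf, separator="\n"):
    """Join the text of every page of an opened pdfplumber PDF object.

    Hypothetical helper: pages without a text layer contribute an empty string.
    """
    return separator.join((page.extract_text() or "") for page in pdf.pages)


# Typical usage (not executed here):
# import pdfplumber
# with pdfplumber.open("my_downloaded_file.pdf") as pdf:
#     print(extract_all_text(pdf))
```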