Python Selenium如何操作PDF？-杰瑞科技汇

Of course! Using Python with Selenium to handle PDFs is a common task in web scraping and automation. The key thing to understand is that Selenium itself cannot parse or read the content of a PDF file.

（图片来源网络，侵删）

Selenium is a browser automation tool. Its job is to drive a web browser. So, when you click a link that opens a PDF, Selenium's job is to make the browser download the file. Then, you use a separate Python library to read the downloaded PDF file.

Here’s a complete guide covering the most common scenarios:

Scenario 1: Downloading a PDF from a link (Most Common)
Scenario 2: Handling PDFs that open in the browser tab (Inline)
Scenario 3: Reading Text from the Downloaded PDF

Prerequisites

First, you need to install the necessary libraries.

# For browser automation
pip install selenium
# For managing browser drivers (easiest way)
pip install webdriver-manager
# For reading PDF content (we'll use PyPDF2 for this example)
pip install PyPDF2

Scenario 1: Downloading a PDF from a Link (The Standard Approach)

This is the most robust method. You configure the browser to automatically save PDFs to a specific folder without opening them, and then you process the file.

（图片来源网络，侵删）

Step 1: Set Up Browser Preferences for Automatic Download

We'll use Chrome for this example. The core idea is to change Chrome's download.default_directory and plugins.always_open_pdf_externally settings.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
import os
# --- 1. Configure Download Directory ---
# Create a directory to save the PDFs if it doesn't exist
download_dir = os.path.join(os.getcwd(), "pdf_downloads")
if not os.path.exists(download_dir):
    os.makedirs(download_dir)
# --- 2. Set Chrome Options ---
chrome_options = webdriver.ChromeOptions()
# Set the download directory
prefs = {
    "download.default_directory": download_dir,
    "download.prompt_for_download": False,  # To automatically download the file
    "download.directory_upgrade": True,
    "safebrowsing.enabled": True # Disable safe browsing to avoid warnings
}
chrome_options.add_experimental_option("prefs", prefs)
# --- 3. Initialize the WebDriver ---
# Use webdriver-manager to handle the driver automatically
service = ChromeService(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)
# --- 4. Automate the Download ---
try:
    # Navigate to a page with a PDF link
    driver.get("https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf")
    # Find the link to the PDF (you'll need to inspect the page to find the correct selector)
    # For this dummy PDF, the link text is "Download the dummy PDF file"
    pdf_link = driver.find_element("link text", "Download the dummy PDF file")
    pdf_link.click()
    print(f"PDF download initiated. Check the '{download_dir}' folder.")
    # Wait for the download to complete (optional but recommended)
    # A simple way is to wait for the file to appear and its size to be stable.
    # For a more robust solution, consider using WebDriverWait with a custom condition.
    import time
    time.sleep(5) # Simple wait, replace with a better method for production
finally:
    # Close the browser
    driver.quit()

Step 2: Reading the Downloaded PDF

Now that you have the PDF file in your pdf_downloads folder, you can use a library like PyPDF2 to read its text content.

import os
import PyPDF2
# Path to the downloaded file
# We need to find the latest file in the directory
download_dir = "pdf_downloads"
try:
    # Get a list of all files in the directory
    files = os.listdir(download_dir)
    # Filter for PDF files
    pdf_files = [f for f in files if f.endswith('.pdf')]
    if not pdf_files:
        print("No PDF files found in the download directory.")
    else:
        # Get the most recently modified PDF file
        latest_pdf = max(pdf_files, key=lambda f: os.path.getmtime(os.path.join(download_dir, f)))
        pdf_path = os.path.join(download_dir, latest_pdf)
        print(f"Reading PDF: {pdf_path}")
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            # Get the number of pages
            num_pages = len(reader.pages)
            print(f"The PDF has {num_pages} pages.")
            # Extract text from all pages
            text = ""
            for page_num in range(num_pages):
                page = reader.pages[page_page_num]
                text += page.extract_text() + "\n"
            print("\n--- Extracted Text ---")
            print(text)
except FileNotFoundError:
    print(f"Error: The directory '{download_dir}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")

Scenario 2: Handling PDFs that Open in the Browser Tab (Inline)

Sometimes, websites are configured to open PDFs directly in the browser tab instead of downloading them. Selenium can switch to this new "tab" and get its URL.

The Challenge

You cannot use driver.page_source to get the PDF's content. It will be a garbled binary mess. You need to get the direct URL to the PDF file and then download it manually using a library like requests.

（图片来源网络，侵删）

Solution: Get the PDF URL and Download with `requests`

import requests
import os
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
# --- 1. Initialize WebDriver (no special download options needed) ---
service = ChromeService(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
# --- 2. Automate to get the PDF URL ---
try:
    # Navigate to a page that opens a PDF inline
    driver.get("https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf")
    # The PDF is the only content on the page, so its URL is the current page's URL
    pdf_url = driver.current_url
    print(f"Found PDF URL: {pdf_url}")
    # --- 3. Download the PDF using the 'requests' library ---
    # Create a directory to save the PDFs
    download_dir = "pdf_downloads_inline"
    if not os.path.exists(download_dir):
        os.makedirs(download_dir)
    # Get the filename from the URL
    filename = os.path.basename(pdf_url)
    save_path = os.path.join(download_dir, filename)
    # Send a GET request to the URL
    response = requests.get(pdf_url, stream=True)
    response.raise_for_status() # Raise an exception for bad status codes
    # Write the content to a file
    with open(save_path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    print(f"PDF downloaded successfully to: {save_path}")
finally:
    # Close the browser
    driver.quit()

Summary and Best Practices

Method	Pros	Cons	Best For
Scenario 1: Browser Download Settings	Most reliable. Fully automated, no extra libraries needed.	Requires specific browser configuration.	When you have control over the browser and want a completely automated download process.
Scenario 2: Get URL + `requests`	Very flexible. Works for any file type, not just PDFs.	Requires an extra library (`requests`). Slightly more code.	When PDFs open inline or in new tabs, or when you can't modify browser settings.
Scenario 3: `PyPDF2` / `pdfplumber`	Essential for data extraction. Allows you to search, read, and analyze PDF content.	Cannot download or handle complex PDFs (scanned images, forms).	The mandatory second step after you have successfully downloaded a PDF file.

Pro Tip: For more advanced PDF text extraction, especially with complex layouts, consider using the pdfplumber library. It's built on top of pdfminer.six and is often better at preserving text structure.

pip install pdfplumber

import pdfplumber
with pdfplumber.open("my_downloaded_file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.extract_text())

Python Selenium如何操作PDF？

Prerequisites

Scenario 1: Downloading a PDF from a Link (The Standard Approach)

Step 1: Set Up Browser Preferences for Automatic Download

Step 2: Reading the Downloaded PDF

Scenario 2: Handling PDFs that Open in the Browser Tab (Inline)

The Challenge

Solution: Get the PDF URL and Download with `requests`

Summary and Best Practices

99ANYc3cd6

mac系统如何从u盘启动？

ArrayList和数组有何区别？

HBase Thrift Python如何连接与使用？

Java数组和ArrayList到底该用哪个？

Python 2.7 urllib2 如何正确使用？

HDR Light Studio教程如何快速上手？

苹果手机刷ID后如何刷机？教程视频详解。

Python defaultdict如何自动初始化字典键值？

Python时间戳如何转datetime？

Java中float转int，精度丢失怎么办？

Java wait 与 sleep 底层实现与应用场景有何不同？

Python中initialvalue是什么？如何使用？

Win7和Ubuntu如何双系统安装？

superpads数字教程the

Java如何调用C语言开发的WebService？

Oracle存储过程如何调用Java方法？

Python Selenium如何操作PDF？

Prerequisites

Scenario 1: Downloading a PDF from a Link (The Standard Approach)

Step 1: Set Up Browser Preferences for Automatic Download

Step 2: Reading the Downloaded PDF

Scenario 2: Handling PDFs that Open in the Browser Tab (Inline)

The Challenge

Solution: Get the PDF URL and Download with requests

Summary and Best Practices

相关推荐

HDR Light Studio教程如何快速上手？

Solution: Get the PDF URL and Download with `requests`