
How Do Docker, Python, and Scrapy Work Together to Scrape Data Efficiently?

Using Docker with Python and Scrapy is an excellent way to build portable, reproducible, and scalable web scraping projects. It solves common problems such as dependency conflicts and "it works on my machine" issues.


Here’s a comprehensive guide covering everything from the basics to a production-ready setup.


Why Use Docker with Scrapy?

  1. Reproducible Environment: You can define the exact versions of Python and Scrapy needed for your project. Anyone (or any CI/CD pipeline) can spin up the exact same environment with a single command.
  2. Isolation: Your scraping project is isolated from your system's global Python packages. This prevents conflicts between different projects (e.g., one needing Scrapy 2.8, another needing 2.11).
  3. Scalability: Docker is the foundation of container orchestration tools like Kubernetes and Docker Swarm. If you need to run 50 scraper instances simultaneously, you can easily do so by scaling your Docker containers.
  4. Portability: You can build your Scrapy project into a Docker image and run it on any system that has Docker installed (Linux, macOS, Windows), without worrying about the underlying OS.
  5. Simplified Deployment: Deploying a scraper to a server or a cloud service (like AWS ECS, Google Cloud Run) is as simple as pushing your Docker image to a container registry.

Step-by-Step Guide: Building a Scrapy Project with Docker

We will create a simple Scrapy project to scrape quotes from http://quotes.toscrape.com/.

Step 1: Create the Scrapy Project

First, set up a standard Scrapy project. It's best practice to do this inside a dedicated directory for your Docker setup.

# Create a project directory
mkdir scrapy-docker-project
cd scrapy-docker-project
# Create the Scrapy project
scrapy startproject my_scraper
# This will create the following structure:
# my_scraper/
# ├── my_scraper/
# │   ├── __init__.py
# │   ├── items.py
# │   ├── middlewares.py
# │   ├── pipelines.py
# │   ├── settings.py
# │   └── spiders/
# │       └── __init__.py
# └── scrapy.cfg

Step 2: Create a Spider

Let's create a spider to scrape quotes.

cd my_scraper
scrapy genspider quotes quotes.toscrape.com

This creates my_scraper/spiders/quotes.py. Edit this file to look like this:

my_scraper/spiders/quotes.py

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com"]
    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }
        # Follow pagination link
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
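
Before containerizing anything, it helps to confirm the spider works with a quick local run (this assumes Scrapy is installed on your machine, e.g. via pip install scrapy). Run it from the directory that contains scrapy.cfg:

scrapy crawl quotes -O quotes.json

The -O flag writes all scraped items to quotes.json, overwriting the file if it already exists.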

Step 3: Create the Dockerfile

This is the most important file. It's a blueprint for building your Docker image. Create a file named Dockerfile in the root of your project (i.e., inside scrapy-docker-project/my_scraper/).

my_scraper/Dockerfile


# Use an official Python runtime as a parent image
FROM python:3.11-slim
# Set the working directory in the container
WORKDIR /app
# Copy the requirements file into the container at /app
# We copy this first to leverage Docker's layer caching.
# It will only rebuild this layer if requirements.txt changes.
COPY requirements.txt .
# Install any needed packages specified in requirements.txt
# --no-cache-dir reduces image size
RUN pip install --no-cache-dir -r requirements.txt
# Copy the current directory contents into the container at /app
# This includes your Scrapy project code.
COPY . .
# Running as a non-root user is a security best practice (Scrapy itself doesn't require it).
# Create a user and group named 'scraper'
RUN groupadd -r scraper && useradd -r -g scraper scraper
# Change ownership of the /app directory to the 'scraper' user
RUN chown -R scraper:scraper /app
USER scraper
# Scrapy logs at DEBUG level by default, which is noisy; raise it to INFO
# by passing the LOG_LEVEL setting on the command line (or set it in settings.py).
# Command to run your Scrapy spider when the container launches
CMD ["scrapy", "crawl", "quotes", "-s", "LOG_LEVEL=INFO"]

Step 4: Create the requirements.txt File

This file lists all the Python dependencies for your project. Create it in the root of your project.

my_scraper/requirements.txt

scrapy

You can add other packages here, like python-dotenv for managing secrets or pymongo for saving to a database.
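
To get the fully reproducible builds described earlier, pin the exact versions you have tested against instead of installing whatever happens to be latest. The version number below is only illustrative; check PyPI for current releases:

scrapy==2.11.2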

Step 5: Build and Run the Docker Image

Now you're ready to build your image and run the container.

  1. Build the Image: Open your terminal in the scrapy-docker-project/my_scraper/ directory and run:

    # The '.' at the end specifies the build context (the current directory)
    docker build -t my-scraper-image .

    This command tells Docker to build an image with the tag my-scraper-image using the instructions in the Dockerfile in the current directory.

  2. Run the Container: Once the build is complete, run the container:

    docker run --rm my-scraper-image
    • --rm: Automatically removes the container when it exits. This is great for testing.
    • my-scraper-image: The name of the image you just built.

You should see the output of your quotes spider being printed to your terminal!
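
By default the scraped items only appear in the container's logs. To keep a JSON feed on the host, one option is to bind-mount a host directory and override the default command (the data/ directory name is just an example):

docker run --rm -v "$(pwd)/data:/app/data" my-scraper-image scrapy crawl quotes -O data/quotes.json

On Linux, make sure the host data/ directory is writable by the container's non-root scraper user (a quick chmod is fine for testing; match UIDs for anything serious).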



Advanced Docker Compose Setup

For more complex applications (e.g., a scraper that needs a database like MongoDB or Redis), you should use Docker Compose. It allows you to define and run multi-container Docker applications.

Let's modify our setup to include MongoDB to store the scraped data.

Step 1: Install pymongo

Update your requirements.txt to include the MongoDB driver.

my_scraper/requirements.txt

scrapy
pymongo

Step 2: Create a Scrapy Pipeline

A pipeline processes items scraped by your spiders. Let's create one to save items to MongoDB.

my_scraper/pipelines.py

from itemadapter import ItemAdapter
from pymongo import MongoClient
class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )
    def open_spider(self, spider):
        self.client = MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
    def close_spider(self, spider):
        self.client.close()
    def process_item(self, item, spider):
        self.db[spider.name].insert_one(ItemAdapter(item).asdict())
        return item

Step 3: Configure Scrapy Settings

Enable the pipeline in your Scrapy settings file.

my_scraper/settings.py

import os

# ... other settings ...

# Enable the pipeline
ITEM_PIPELINES = {
   'my_scraper.pipelines.MongoPipeline': 300,
}

# Docker Compose provides these as environment variables (see Step 4 below).
# Scrapy does not read environment variables on its own, so pull them in here.
# The defaults are convenient for local development outside Docker.
MONGO_URI = os.environ.get("MONGO_URI", "mongodb://localhost:27017/")
MONGO_DATABASE = os.environ.get("MONGO_DATABASE", "quotes_db")

Step 4: Create the docker-compose.yml File

Create this file in the root of your project (my_scraper/).

my_scraper/docker-compose.yml

version: '3.8'
services:
  # The MongoDB service
  mongo:
    image: mongo:latest
    container_name: my-mongo
    volumes:
      - mongo-data:/data/db
    ports:
      - "27017:27017"
  # The Scrapy Scraper service
  scraper:
    build: .
    container_name: my-scraper
    depends_on:
      - mongo
    environment:
      - MONGO_URI=mongodb://mongo:27017/  # 'mongo' is the service name
      - MONGO_DATABASE=quotes_db
    # To run a different command, e.g., to save output to a file
    # command: scrapy crawl quotes -o output.json
# Define the named volume for persistent data storage
volumes:
  mongo-data:
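
Note that depends_on only waits for the mongo container to start, not for MongoDB to be ready to accept connections. pymongo connects lazily, so this usually works anyway, but if you see connection errors at startup you can gate the scraper on a healthcheck. A sketch of the fragment to merge into the file above (it assumes the Compose v2 CLI and a mongo image recent enough to include mongosh):

  mongo:
    healthcheck:
      test: ["CMD", "mongosh", "--quiet", "--eval", "db.adminCommand('ping')"]
      interval: 5s
      timeout: 5s
      retries: 10
  scraper:
    depends_on:
      mongo:
        condition: service_healthy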

Step 5: Run with Docker Compose

Now, instead of docker run, you use Docker Compose to build the image and start both containers (MongoDB and the scraper) together.
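
A typical workflow with the Compose v2 CLI looks like this (substitute docker-compose if you are on the older standalone binary; the verification query assumes the mongo image ships mongosh):

# Build the scraper image and start MongoDB plus the scraper
docker compose up --build

# Or run in the background and follow only the scraper's logs
docker compose up -d --build
docker compose logs -f scraper

# Spot-check the data: the pipeline writes to a collection named after the spider ('quotes')
docker compose exec mongo mongosh quotes_db --eval "db.quotes.countDocuments()"

# Stop and remove the containers; add -v to also delete the mongo-data volume
docker compose down

The scraper container exits once the crawl finishes, while the mongo service keeps running until you bring the stack down.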
