Of course! Using Docker with Python Scrapy is an excellent practice for creating portable, reproducible, and scalable web scraping projects. It solves common problems like dependency conflicts and "it works on my machine" issues.

Here’s a comprehensive guide covering everything from the basics to a production-ready setup.
Why Use Docker with Scrapy?
- Reproducible Environment: You can define the exact versions of Python and Scrapy needed for your project. Anyone (or any CI/CD pipeline) can spin up the exact same environment with a single command.
- Isolation: Your scraping project is isolated from your system's global Python packages. This prevents conflicts between different projects (e.g., one needing Scrapy 2.8, another needing 2.11).
- Scalability: Docker is the foundation of container orchestration tools like Kubernetes and Docker Swarm. If you need to run 50 scraper instances simultaneously, you can easily do so by scaling your Docker containers.
- Portability: You can build your Scrapy project into a Docker image and run it on any system that has Docker installed (Linux, macOS, Windows), without worrying about the underlying OS.
- Simplified Deployment: Deploying a scraper to a server or a cloud service (like AWS ECS, Google Cloud Run) is as simple as pushing your Docker image to a container registry.
Step-by-Step Guide: Building a Scrapy Project with Docker
We will create a simple Scrapy project to scrape quotes from http://quotes.toscrape.com/.
Step 1: Create the Scrapy Project
First, set up a standard Scrapy project. It's best practice to do this inside a dedicated directory for your Docker setup.
# Create a project directory
mkdir scrapy-docker-project
cd scrapy-docker-project

# Create the Scrapy project
scrapy startproject my_scraper

# This will create the following structure:
# my_scraper/
# ├── my_scraper/
# │   ├── __init__.py
# │   ├── items.py
# │   ├── middlewares.py
# │   ├── pipelines.py
# │   ├── settings.py
# │   └── spiders/
# │       └── __init__.py
# └── scrapy.cfg
Step 2: Create a Spider
Let's create a spider to scrape quotes.
cd my_scraper
scrapy genspider quotes quotes.toscrape.com
This creates my_scraper/spiders/quotes.py. Edit this file to look like this:
my_scraper/spiders/quotes.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        # Follow the pagination link
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
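Before containerizing anything, it can be handy to sanity-check the spider with a plain Python script that drives Scrapy's CrawlerProcess. This is only a convenience sketch: the file name run_quotes.py is an assumption, and the script is expected to live next to scrapy.cfg so that get_project_settings() can locate your project settings.

run_quotes.py

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from my_scraper.spiders.quotes import QuotesSpider

if __name__ == "__main__":
    # Load settings.py so pipelines and other project settings apply
    process = CrawlerProcess(get_project_settings())
    # Schedule the spider and block until the crawl finishes
    process.crawl(QuotesSpider)
    process.start()

Running python run_quotes.py from the project root behaves much like scrapy crawl quotes.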
Step 3: Create the Dockerfile
This is the most important file. It's a blueprint for building your Docker image. Create a file named Dockerfile in the root of your project (i.e., inside scrapy-docker-project/my_scraper/).
my_scraper/Dockerfile

# Use an official Python runtime as a parent image
FROM python:3.11-slim

# Set the working directory in the container
WORKDIR /app

# Copy the requirements file into the container at /app.
# We copy this first to leverage Docker's layer caching:
# this layer is only rebuilt if requirements.txt changes.
COPY requirements.txt .

# Install the packages listed in requirements.txt
# --no-cache-dir reduces the image size
RUN pip install --no-cache-dir -r requirements.txt

# Copy the current directory contents into the container at /app.
# This includes your Scrapy project code.
COPY . .

# Running as a non-root user is a security best practice.
# Create a user and group named 'scraper'
RUN groupadd -r scraper && useradd -r -g scraper scraper

# Change ownership of the /app directory to the 'scraper' user
RUN chown -R scraper:scraper /app
USER scraper

# Command to run your Scrapy spider when the container launches.
# LOG_LEVEL=INFO keeps the container logs readable (Scrapy defaults to DEBUG).
CMD ["scrapy", "crawl", "quotes", "-s", "LOG_LEVEL=INFO"]
Step 4: Create the requirements.txt File
This file lists all the Python dependencies for your project. Create it in the root of your project.
my_scraper/requirements.txt
scrapy
You can add other packages here, like python-dotenv for managing secrets or pymongo for saving to a database.
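As a small illustration of the python-dotenv option, the snippet below shows how settings.py could pull values from a local .env file. Treat it as a sketch: the .env file and the MONGO_URI default shown here are assumptions (MONGO_URI only becomes relevant in the Docker Compose section later on).

# settings.py (excerpt) — optional: load values from a .env file
import os

from dotenv import load_dotenv  # provided by the python-dotenv package

# Read key=value pairs from .env into os.environ, if the file exists
load_dotenv()

# Fall back to a local MongoDB instance when the variable is not set
MONGO_URI = os.getenv("MONGO_URI", "mongodb://localhost:27017/")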
Step 5: Build and Run the Docker Image
Now you're ready to build your image and run the container.
- Build the image: Open your terminal in the scrapy-docker-project/my_scraper/ directory and run:

# The '.' at the end specifies the build context (the current directory)
docker build -t my-scraper-image .

This command tells Docker to build an image tagged my-scraper-image using the instructions in the Dockerfile in the current directory.

- Run the container: Once the build is complete, run the container:

docker run --rm my-scraper-image

--rm: Automatically removes the container when it exits. This is great for testing.
my-scraper-image: The name of the image you just built.
You should see the output of your quotes spider being printed to your terminal!

Advanced Docker Compose Setup
For more complex applications (e.g., a scraper that needs a database like MongoDB or Redis), you should use Docker Compose. It allows you to define and run multi-container Docker applications.
Let's modify our setup to include MongoDB to store the scraped data.
Step 1: Install pymongo
Update your requirements.txt to include the MongoDB driver.
my_scraper/requirements.txt
scrapy
pymongo
Step 2: Create a Scrapy Pipeline
A pipeline processes items scraped by your spiders. Let's create one to save items to MongoDB.
my_scraper/pipelines.py
from itemadapter import ItemAdapter
from pymongo import MongoClient


class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
        )

    def open_spider(self, spider):
        self.client = MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[spider.name].insert_one(ItemAdapter(item).asdict())
        return item
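Re-running the crawl with the pipeline above will insert the same quotes again. If that matters to you, a common refinement is a unique index plus duplicate handling, sketched below as an additional class in the same pipelines.py. Treat it as an optional variant; the choice of the quote text as the unique key is an assumption made for this example.

# pipelines.py (continued) — optional de-duplicating variant of MongoPipeline
from itemadapter import ItemAdapter
from pymongo.errors import DuplicateKeyError


class DedupMongoPipeline(MongoPipeline):
    """Skips items that were already stored, based on a unique index on 'text'."""

    def open_spider(self, spider):
        super().open_spider(spider)
        # Enforce uniqueness on the quote text (assumed key for this example)
        self.db[spider.name].create_index("text", unique=True)

    def process_item(self, item, spider):
        doc = ItemAdapter(item).asdict()
        try:
            self.db[spider.name].insert_one(doc)
        except DuplicateKeyError:
            # Already stored on a previous run; keep the item flowing to other pipelines
            spider.logger.debug("Duplicate quote skipped: %r", doc.get("text"))
        return item

If you use this variant, point ITEM_PIPELINES at my_scraper.pipelines.DedupMongoPipeline instead of MongoPipeline in the next step.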
Step 3: Configure Scrapy Settings
Enable the pipeline in your Scrapy settings file.
my_scraper/settings.py
import os

# ... other settings ...

# Enable the pipeline
ITEM_PIPELINES = {
    'my_scraper.pipelines.MongoPipeline': 300,
}

# Docker Compose provides these as environment variables (see Step 4 below).
# Scrapy settings are not populated from the environment automatically, so
# read them here, with sensible defaults for local development.
MONGO_URI = os.environ.get("MONGO_URI", "mongodb://localhost:27017/")
MONGO_DATABASE = os.environ.get("MONGO_DATABASE", "quotes_db")
Step 4: Create the docker-compose.yml File
Create this file in the root of your project (my_scraper/).
my_scraper/docker-compose.yml
version: '3.8'

services:
  # The MongoDB service
  mongo:
    image: mongo:latest
    container_name: my-mongo
    volumes:
      - mongo-data:/data/db
    ports:
      - "27017:27017"

  # The Scrapy scraper service
  scraper:
    build: .
    container_name: my-scraper
    depends_on:
      - mongo
    environment:
      - MONGO_URI=mongodb://mongo:27017/  # 'mongo' is the service name
      - MONGO_DATABASE=quotes_db
    # To run a different command, e.g., to save output to a file:
    # command: scrapy crawl quotes -o output.json

# Define the named volume for persistent data storage
volumes:
  mongo-data:
Step 5: Run with Docker Compose
Now, instead of docker run, you'll use Docker Compose. From the directory that contains docker-compose.yml, run docker compose up --build (or docker-compose up --build with the standalone v1 binary). This builds the scraper image if needed, starts the MongoDB container first (because of depends_on), and then runs the scraper. Note that depends_on only controls start order, not readiness, but pymongo retries the connection for up to 30 seconds by default, which is usually plenty here. When the crawl finishes, shut everything down with docker compose down (add -v to also delete the mongo-data volume).

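Once the crawl has finished, you can confirm that the data actually landed in MongoDB. Below is a minimal sketch, assuming pymongo is installed on the host, the compose stack is still up, and port 27017 is published to the host as in the docker-compose.yml above (the file name check_db.py is just an example).

check_db.py

from pymongo import MongoClient

# Connect to the MongoDB container through the port published in docker-compose.yml
client = MongoClient("mongodb://localhost:27017/")

# Database name comes from MONGO_DATABASE; the collection is named after the spider
collection = client["quotes_db"]["quotes"]

print("Stored quotes:", collection.count_documents({}))
print("Example document:", collection.find_one({}, {"_id": 0}))

client.close()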