
How do I use Python Scrapy on Ubuntu?

Here is a comprehensive guide to getting started with Scrapy on Ubuntu, covering installation, project setup, writing your first spider, and running your first crawl.


Table of Contents

  1. Prerequisites: What you need before you start.
  2. Installation: Step-by-step guide to install Scrapy and its dependencies.
  3. Project Setup: How to create and structure a Scrapy project.
  4. Your First Spider: Writing a simple spider to scrape quotes from a website.
  5. Running the Spider: Executing your spider and saving the output.
  6. Next Steps: Where to go from here.

Prerequisites

Before you begin, make sure you have the following:

  • Ubuntu System: This guide is for Ubuntu 20.04/22.04. The commands are similar for other versions.
  • Python 3: Scrapy requires a recent Python 3 (current Scrapy releases no longer support Python 3.6). Ubuntu 20.04 and 22.04 come with Python 3 pre-installed.
  • pip: Python's package installer, which is also usually included.
  • Terminal Access: You'll need to open a terminal to run commands.

You can check your Python and pip versions with:

python3 --version
pip3 --version

Installation

The recommended way to install Scrapy is using pip inside a Python virtual environment. This helps keep your project dependencies isolated from your system's Python packages.

Step 1: Update Your System

It's always a good practice to update your package list before installing new software.

sudo apt update
sudo apt upgrade -y

Step 2: Install System Dependencies

Scrapy has some dependencies that need to be installed at the system level. The most important one is libssl-dev, which is required for pyOpenSSL.

sudo apt install python3-dev python3-pip libssl-dev libffi-dev build-essential -y
  • python3-dev: Development headers for Python.
  • python3-pip: The Python package installer.
  • libssl-dev, libffi-dev, build-essential: Needed to build some of Scrapy's C-based dependencies.

Step 3: Create and Activate a Virtual Environment

This is a crucial step for managing project dependencies.

  1. Install venv (if not already installed):

    sudo apt install python3-venv -y
  2. Create a project directory and navigate into it:

    mkdir my_scrapy_project
    cd my_scrapy_project
  3. Create the virtual environment:

    python3 -m venv scrapy_env

    This will create a scrapy_env folder in your project directory.

  4. Activate the virtual environment:

    source scrapy_env/bin/activate

    Your terminal prompt should change to show the name of the active environment, like this: (scrapy_env) user@hostname:~/my_scrapy_project$.
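
    When you are finished with the project later, you can leave the virtual environment with venv's built-in deactivate command; for now, keep it active for the next step:

    deactivate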

Step 4: Install Scrapy

Now that your virtual environment is active, you can safely install Scrapy using pip.

pip install scrapy

pip will now install Scrapy and all its required packages (like Twisted, lxml, etc.) inside your scrapy_env folder, leaving your system's Python installation untouched.
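
You can quickly verify the installation by asking Scrapy for its version (the exact version reported will depend on what pip resolved):

scrapy version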


Project Setup

Scrapy uses a project-based structure. It's best to create a new project for each scraping task.

  1. Create a Scrapy Project: Make sure your virtual environment is still active. Then, run the startproject command:

    scrapy startproject tutorial

    This will create a tutorial directory with the following structure:

    tutorial/
    ├── scrapy.cfg          # deploy configuration file
    └── tutorial/            # project's Python module, you'll import your code from here
        ├── __init__.py
        ├── items.py        # project items definition file
        ├── middlewares.py  # project middlewares file
        ├── pipelines.py    # project pipelines file
        ├── settings.py     # project settings file
        └── spiders/        # a directory where you'll put your spiders
            └── __init__.py
  2. Navigate into the Project Directory:

    cd tutorial
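
Optionally, take a look at tutorial/settings.py before writing a spider. The snippet below is a minimal sketch of commonly adjusted built-in Scrapy settings; the values are illustrative, not required for this tutorial:

# tutorial/settings.py (excerpt) -- illustrative values only
BOT_NAME = "tutorial"

# Respect robots.txt rules (the project template enables this by default)
ROBOTSTXT_OBEY = True

# Pause between requests so you don't hammer the target site
DOWNLOAD_DELAY = 1

# Identify your crawler; replace with your own details
USER_AGENT = "tutorial (+https://example.com/contact)"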

Your First Spider

A spider is a class you define that Scrapy uses to scrape information from a website (or a group of websites).

Let's create a spider to scrape quotes from http://quotes.toscrape.com/, a website designed specifically for scraping practice.

  1. Create the Spider File: Create a new file in the tutorial/spiders directory. Let's name it quotes_spider.py.

    # Run this from the project root (the outer tutorial/ directory, the one containing scrapy.cfg)
    touch tutorial/spiders/quotes_spider.py
  2. Write the Spider Code: Open quotes_spider.py with your favorite text editor (e.g., nano, vim, or VS Code) and add the following code:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"  # Unique name for the spider
        # The list of URLs the spider will begin to crawl from
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        # This is the core method that will be called for each URL
        def parse(self, response):
            # The 'response' object holds the page content
            # We use CSS selectors to extract data
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').get(),
                    'author': quote.css('small.author::text').get(),
                    'tags': quote.css('div.tags a.tag::text').getall(),
                }
            # Example of following a link to the next page
            # next_page = response.css('li.next a::attr(href)').get()
            # if next_page is not None:
            #     yield response.follow(next_page, callback=self.parse)

    Code Explanation:

    • name = "quotes": A unique name to identify and run the spider.
    • start_urls: A list of URLs where the spider will start crawling.
    • parse(self, response): This is the default callback method that Scrapy calls for each start_url. It's where the scraping logic lives.
    • response.css('div.quote'): Uses CSS selectors to find all <div> elements with the class quote.
    • quote.css('span.text::text').get(): Gets the text content from the <span> with class text.
    • yield: Instead of return, spiders yield Python dictionaries (or scrapy.Item objects). This is how Scrapy collects the scraped data.
    • ::text: The ::text pseudo-element extracts the inner text of an element (you can try these selectors interactively, as shown in the shell sketch after this list).
    • .getall(): Returns a list of all matches.
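
A handy way to experiment with these selectors before putting them in a spider is Scrapy's interactive shell. A quick sketch (lines starting with >>> are typed at the shell prompt; output is abridged):

scrapy shell 'http://quotes.toscrape.com/page/1/'
# Inside the shell, the downloaded page is available as `response`:
>>> response.css('div.quote span.text::text').get()
'“The world as we have created it is a process of our thinking. ...”'
>>> response.css('small.author::text').getall()
['Albert Einstein', 'J.K. Rowling', ...]
>>> exit()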

Running the Spider

Now you're ready to run your spider and see it in action.

  1. Navigate to the Project Root: Make sure you are in the tutorial directory (the one containing scrapy.cfg).

  2. Run the Spider: Use the scrapy crawl command, followed by the spider's name (quotes).

    scrapy crawl quotes

You will see Scrapy's logs in your terminal as it starts the engine, makes requests, and receives responses. The output will look something like this:

2025-10-27 10:30:00 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: tutorial)
...
2025-10-27 10:30:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
... (and so on)
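
Scrapy logs at DEBUG level by default. If the output is too noisy, you can raise the log level with the --loglevel (-L) option; note that at WARNING level the scraped items are no longer echoed to the terminal, so save them to a file as shown in the next section:

scrapy crawl quotes -L WARNING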

Saving the Output

Scrapy makes it easy to save the scraped data to a file. Use the -o or --output flag.

  • Save to a JSON file:

    scrapy crawl quotes -o quotes.json

    This will create a quotes.json file in your tutorial directory.
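
Other feed formats work the same way; Scrapy picks the exporter from the file extension. Note that -o appends to an existing file (which can corrupt a plain JSON file across runs), while Scrapy 2.1 and later also accept -O to overwrite instead:

scrapy crawl quotes -o quotes.csv       # CSV
scrapy crawl quotes -o quotes.jsonl     # JSON Lines (one object per line)
scrapy crawl quotes -O quotes.json      # overwrite instead of append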
