
How do I use Python Scrapy on Ubuntu?

Here is a comprehensive guide to getting started with Scrapy on Ubuntu, covering installation, project setup, writing your first spider, and running your first crawl.


Table of Contents

  1. Prerequisites: What you need before you start.
  2. Installation: Step-by-step guide to install Scrapy and its dependencies.
  3. Project Setup: How to create and structure a Scrapy project.
  4. Your First Spider: Writing a simple spider to scrape quotes from a website.
  5. Running the Spider: Executing your spider and saving the output.
  6. Next Steps: Where to go from here.

Prerequisites

Before you begin, make sure you have the following:

  • Ubuntu System: This guide is for Ubuntu 20.04/22.04. The commands are similar for other versions.
  • Python 3: Scrapy requires a recent Python 3 (current Scrapy releases no longer support Python 3.6). Ubuntu 20.04 and 22.04 come with Python 3 pre-installed.
  • pip: Python's package installer, which is also usually included.
  • Terminal Access: You'll need to open a terminal to run commands.

You can check your Python and pip versions with:

python3 --version
pip3 --version

Installation

The recommended way to install Scrapy is using pip inside a Python virtual environment. This helps keep your project dependencies isolated from your system's Python packages.

Step 1: Update Your System

It's always a good practice to update your package list before installing new software.

sudo apt update
sudo apt upgrade -y

Step 2: Install System Dependencies

Scrapy has some dependencies that need to be installed at the system level. The most important one is libssl-dev, which is required for pyOpenSSL.

sudo apt install python3-dev python3-pip libssl-dev libffi-dev build-essential -y
  • python3-dev: Development headers for Python.
  • python3-pip: The Python package installer.
  • libssl-dev, libffi-dev, build-essential: Needed to build some of Scrapy's C-based dependencies.

Step 3: Create and Activate a Virtual Environment

This is a crucial step for managing project dependencies.

  1. Install venv (if not already installed):

    sudo apt install python3-venv -y
  2. Create a project directory and navigate into it:

    mkdir my_scrapy_project
    cd my_scrapy_project
  3. Create the virtual environment:

    python3 -m venv scrapy_env

    This will create a scrapy_env folder in your project directory.

  4. Activate the virtual environment:

    source scrapy_env/bin/activate

    Your terminal prompt should change to show the name of the active environment, like this: (scrapy_env) user@hostname:~/my_scrapy_project$.
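
    When you are finished with the project later, you can leave the virtual environment with venv's built-in deactivate command; for now, keep it active for the next step:

    deactivate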

Step 4: Install Scrapy

Now that your virtual environment is active, you can safely install Scrapy using pip.

pip install scrapy

pip will now install Scrapy and all its required packages (like Twisted, lxml, etc.) inside your scrapy_env folder, leaving your system's Python installation untouched.
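
You can quickly verify the installation by asking Scrapy for its version (the exact version reported will depend on what pip resolved):

scrapy version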


Project Setup

Scrapy uses a project-based structure. It's best to create a new project for each scraping task.

  1. Create a Scrapy Project: Make sure your virtual environment is still active. Then, run the startproject command:

    scrapy startproject tutorial

    This will create a tutorial directory with the following structure:

    tutorial/
    ├── scrapy.cfg          # deploy configuration file
    └── tutorial/            # project's Python module, you'll import your code from here
        ├── __init__.py
        ├── items.py        # project items definition file
        ├── middlewares.py  # project middlewares file
        ├── pipelines.py    # project pipelines file
        ├── settings.py     # project settings file
        └── spiders/        # a directory where you'll put your spiders
            └── __init__.py
  2. Navigate into the Project Directory:

    cd tutorial
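
Optionally, take a look at tutorial/settings.py before writing a spider. The snippet below is a minimal sketch of commonly adjusted built-in Scrapy settings; the values are illustrative, not required for this tutorial:

# tutorial/settings.py (excerpt) -- illustrative values only
BOT_NAME = "tutorial"

# Respect robots.txt rules (the project template enables this by default)
ROBOTSTXT_OBEY = True

# Pause between requests so you don't hammer the target site
DOWNLOAD_DELAY = 1

# Identify your crawler; replace with your own details
USER_AGENT = "tutorial (+https://example.com/contact)"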

Your First Spider

A spider is a class you define that Scrapy uses to scrape information from a website (or a group of websites).

Let's create a spider to scrape quotes from http://quotes.toscrape.com/, a website designed specifically for scraping practice.

  1. Create the Spider File: Create a new file in the tutorial/spiders directory. Let's name it quotes_spider.py.

    # Run this from the project root (the outer tutorial/ directory, the one containing scrapy.cfg)
    touch tutorial/spiders/quotes_spider.py
  2. Write the Spider Code: Open quotes_spider.py with your favorite text editor (e.g., nano, vim, or VS Code) and add the following code:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"  # Unique name for the spider
        # The list of URLs the spider will begin to crawl from
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        # This is the core method that will be called for each URL
        def parse(self, response):
            # The 'response' object holds the page content
            # We use CSS selectors to extract data
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').get(),
                    'author': quote.css('small.author::text').get(),
                    'tags': quote.css('div.tags a.tag::text').getall(),
                }
            # Example of following a link to the next page
            # next_page = response.css('li.next a::attr(href)').get()
            # if next_page is not None:
            #     yield response.follow(next_page, callback=self.parse)

    Code Explanation:

    • name = "quotes": A unique name to identify and run the spider.
    • start_urls: A list of URLs where the spider will start crawling.
    • parse(self, response): This is the default callback method that Scrapy calls for each start_url. It's where the scraping logic lives.
    • response.css('div.quote'): Uses CSS selectors to find all <div> elements with the class quote.
    • quote.css('span.text::text').get(): Gets the text content from the <span> with class text.
    • yield: Instead of return, spiders yield Python dictionaries (or scrapy.Item objects). This is how Scrapy collects the scraped data.
    • ::text: The ::text pseudo-element extracts the inner text of an element (you can try these selectors interactively, as shown in the shell sketch after this list).
    • .getall(): Returns a list of all matches.
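
A handy way to experiment with these selectors before putting them in a spider is Scrapy's interactive shell. A quick sketch (lines starting with >>> are typed at the shell prompt; output is abridged):

scrapy shell 'http://quotes.toscrape.com/page/1/'
# Inside the shell, the downloaded page is available as `response`:
>>> response.css('div.quote span.text::text').get()
'“The world as we have created it is a process of our thinking. ...”'
>>> response.css('small.author::text').getall()
['Albert Einstein', 'J.K. Rowling', ...]
>>> exit()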

Running the Spider

Now you're ready to run your spider and see it in action.

  1. Navigate to the Project Root: Make sure you are in the tutorial directory (the one containing scrapy.cfg).

  2. Run the Spider: Use the scrapy crawl command, followed by the spider's name (quotes).

    scrapy crawl quotes

You will see Scrapy's logs in your terminal as it starts the engine, makes requests, and receives responses. The output will look something like this:

2025-10-27 10:30:00 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: tutorial)
...
2025-10-27 10:30:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
... (and so on)
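
Scrapy logs at DEBUG level by default. If the output is too noisy, you can raise the log level with the --loglevel (-L) option; note that at WARNING level the scraped items are no longer echoed to the terminal, so save them to a file as shown in the next section:

scrapy crawl quotes -L WARNING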

Saving the Output

Scrapy makes it easy to save the scraped data to a file. Use the -o or --output flag.

  • Save to a JSON file:

    scrapy crawl quotes -o quotes.json

    This will create a quotes.json file in your tutorial directory.
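
Other feed formats work the same way; Scrapy picks the exporter from the file extension. Note that -o appends to an existing file (which can corrupt a plain JSON file across runs), while Scrapy 2.1 and later also accept -O to overwrite instead:

scrapy crawl quotes -o quotes.csv       # CSV
scrapy crawl quotes -o quotes.jsonl     # JSON Lines (one object per line)
scrapy crawl quotes -O quotes.json      # overwrite instead of append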
