Of course! Building a Python web crawler is a fantastic project that combines several important skills: networking, data parsing, and data storage.

Let's break down the process from the absolute basics to a more advanced, production-ready crawler.
## What is a Web Crawler?
A web crawler (or spider) is a bot that systematically browses the web, typically for the purpose of web indexing. It does this by:
- Starting with a list of initial URLs (a "seed" list).
- Fetching the content of a URL.
- Parsing the content to find new URLs.
- Adding these new URLs to a list of pages to visit.
- Repeating the process until it has visited all desired pages or a stopping condition is met.
## The Core Concepts

- Fetching (HTTP Requests): Getting the HTML content of a web page. We use libraries like `requests`.
- Parsing (HTML Parsing): Navigating the HTML structure to extract the data we want or to find links. We use libraries like `BeautifulSoup4` or `lxml`.
- Politeness & Respect: A good crawler is polite to the website it's crawling.
  - Rate Limiting: Don't send too many requests in a short period. This can overload the server and get your IP address blocked.
  - Check `robots.txt`: Most websites have a file at `www.example.com/robots.txt` that tells bots which parts of the site they are not allowed to access. Always respect this (a small check is sketched after this list).
  - Identify Yourself: Use a `User-Agent` in your request headers to identify your bot.
- Storage: Saving the scraped data. This can be as simple as printing to the console or saving to a file (CSV, JSON). For larger projects, you'd use a database.
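For the `robots.txt` point, Python's standard library already includes `urllib.robotparser`, so a quick check can be done without extra dependencies. Here is a minimal sketch; the bot name and the URLs are only examples:

```python
from urllib import robotparser

# Load and parse the site's robots.txt, then ask if a specific URL is allowed.
rp = robotparser.RobotFileParser()
rp.set_url('http://quotes.toscrape.com/robots.txt')
rp.read()

allowed = rp.can_fetch('MySimpleCrawler/1.0', 'http://quotes.toscrape.com/page/2/')
print(f"Allowed to fetch: {allowed}")
```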
## The Tools You'll Need

You'll need a few Python libraries. Install them using pip:

```bash
pip install requests
pip install beautifulsoup4
pip install lxml  # A faster parser; optional but recommended
```
## Example 1: A Simple Crawler (One Page)
This crawler will fetch a single page, parse it, and print all the links it finds.

```python
import requests
from bs4 import BeautifulSoup

# The URL of the page we want to crawl
url = 'http://quotes.toscrape.com/'

# 1. Fetch the page content
try:
    response = requests.get(url)
    # Raise an exception if the request was unsuccessful (e.g., 404, 500)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Error fetching the URL: {e}")
    exit()

# 2. Parse the HTML content
soup = BeautifulSoup(response.text, 'lxml')

# 3. Find all the <a> tags (which define links)
links = soup.find_all('a')

# 4. Print the links
print(f"Found {len(links)} links on {url}:\n")
for link in links:
    # Get the 'href' attribute, which contains the URL
    href = link.get('href')
    # Get the visible text of the link
    text = link.text.strip()
    # Avoid printing empty or irrelevant links
    if href and text:
        print(f"Text: {text} -> URL: {href}")
```
## Example 2: A Link-Following Crawler (Multiple Pages)
This more advanced crawler follows the links it finds to discover new pages. To prevent it from running forever, we'll use a set to keep track of visited URLs.
Warning: A crawler that follows every link can easily get out of control. For this example, we'll limit it to a maximum depth and a maximum number of pages to crawl.
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def crawl(start_url, max_pages=10, max_depth=2):
    """
    A simple breadth-first web crawler.

    :param start_url: The initial URL to start crawling from.
    :param max_pages: The maximum number of pages to crawl.
    :param max_depth: The maximum depth to crawl from the start_url.
    """
    # Use a set to store visited URLs to avoid duplicates and infinite loops
    visited_urls = set()
    # Use a list to manage the crawling queue (URL and its depth)
    urls_to_visit = [(start_url, 0)]  # (url, current_depth)

    while urls_to_visit and len(visited_urls) < max_pages:
        current_url, current_depth = urls_to_visit.pop(0)

        # Skip if we've already visited this URL or exceeded the depth limit
        if current_url in visited_urls or current_depth > max_depth:
            continue

        print(f"Crawling (Depth {current_depth}): {current_url}")
        visited_urls.add(current_url)

        try:
            # Set a User-Agent to identify our bot
            headers = {'User-Agent': 'MySimpleCrawler/1.0'}
            response = requests.get(current_url, headers=headers, timeout=5)
            response.raise_for_status()
        except requests.exceptions.RequestException as e:
            print(f"Failed to fetch {current_url}: {e}")
            continue

        # Parse the page content
        soup = BeautifulSoup(response.text, 'lxml')

        # Find all links on the page if we haven't reached max depth
        if current_depth < max_depth:
            for link_tag in soup.find_all('a', href=True):
                href = link_tag['href']
                # Make the URL absolute (e.g., /page -> http://domain.com/page)
                absolute_url = urljoin(current_url, href)
                # Clean the URL: keep only scheme, domain, and path (drops query strings and #fragments)
                parsed = urlparse(absolute_url)
                absolute_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
                # Add the new URL to our queue if it's from the same domain
                if parsed.netloc == urlparse(start_url).netloc:
                    urls_to_visit.append((absolute_url, current_depth + 1))

    print("\nCrawling finished.")
    print(f"Total pages visited: {len(visited_urls)}")

# --- Run the crawler ---
if __name__ == "__main__":
    start_url = 'http://quotes.toscrape.com/'
    crawl(start_url, max_pages=15, max_depth=2)
```
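One small design note: `list.pop(0)` has to shift every remaining element, so for larger crawls `collections.deque` is a near drop-in replacement for the queue. This is an optional tweak, not something the example above requires:

```python
from collections import deque

start_url = 'http://quotes.toscrape.com/'  # same seed as above

# A deque pops from the left in O(1), unlike list.pop(0), which is O(n).
urls_to_visit = deque([(start_url, 0)])
current_url, current_depth = urls_to_visit.popleft()
print(current_url, current_depth)
```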
## How to Improve and Scale Up

The link-following example is a good start, but for real-world use you need more robust solutions.
### Politeness: Adding Delays

Don't hammer the server. Add a `time.sleep()` between requests.

```python
import time

# ... inside the crawl loop from Example 2 ...
try:
    response = requests.get(current_url, headers=headers, timeout=5)
    response.raise_for_status()
    # Be polite! Wait for a second before the next request.
    time.sleep(1)
except requests.exceptions.RequestException as e:
    print(f"Failed to fetch {current_url}: {e}")
    continue
```
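If you want the delay logic in one place, a tiny helper like the hypothetical `polite_pause` below (the name and defaults are my own) adds a little random jitter so requests don't arrive on a perfectly regular beat:

```python
import random
import time

def polite_pause(base_delay=1.0, jitter=0.5):
    """Sleep for base_delay seconds plus a small random extra."""
    time.sleep(base_delay + random.uniform(0, jitter))

# Usage: call polite_pause() after each successful request instead of time.sleep(1).
polite_pause()
```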
### Data Storage: Saving to a CSV
Let's modify the crawler to scrape quotes and save them to a CSV file.
```python
import csv
import requests
from bs4 import BeautifulSoup

def scrape_quotes_to_csv(url, output_file='quotes.csv'):
    """Scrapes quotes from a single page and saves them to a CSV file."""
    try:
        response = requests.get(url, headers={'User-Agent': 'MyQuoteScraper/1.0'})
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return

    soup = BeautifulSoup(response.text, 'lxml')
    quotes = soup.find_all('div', class_='quote')

    with open(output_file, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        # Write the header only if the file is new (optional)
        # if f.tell() == 0:
        #     writer.writerow(['Quote', 'Author'])
        for quote in quotes:
            text = quote.find('span', class_='text').get_text(strip=True)
            author = quote.find('small', class_='author').get_text(strip=True)
            writer.writerow([text, author])
            print(f"Scraped: '{text}' - {author}")

# --- Run the scraper ---
if __name__ == "__main__":
    target_url = 'http://quotes.toscrape.com/'
    scrape_quotes_to_csv(target_url)
```
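The Core Concepts section mentioned using a database for larger projects. A minimal sketch with Python's built-in `sqlite3` module might look like this; the table name, schema, and the `save_quotes_to_sqlite` helper are my own, not part of the tutorial code above:

```python
import sqlite3

def save_quotes_to_sqlite(rows, db_path='quotes.db'):
    """Store (text, author) pairs in a SQLite table. Schema is an assumption."""
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT)')
    conn.executemany('INSERT INTO quotes (text, author) VALUES (?, ?)', rows)
    conn.commit()
    conn.close()

# Example usage with a single hypothetical row:
save_quotes_to_sqlite([("An example quote.", "An Example Author")])
```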
## Advanced: Using the Scrapy Framework
For large-scale crawling, Scrapy is the industry-standard Python framework. It's powerful, fast, and handles many complexities for you.
- Asynchronous Requests: Scrapy is asynchronous, meaning it can send many requests at once without waiting for a response, making it much faster.
- Built-in Data Pipelines: It has a robust pipeline system for processing scraped data (cleaning, validating, storing in databases); a minimal pipeline is sketched after this list.
- Automatic Throttling & Politeness: It has built-in support for rate limiting (AutoThrottle) and for respecting `robots.txt`.
- Extensible: It's easy to add custom middlewares and extensions.
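To make the data-pipeline point concrete, here is a minimal sketch of what a pipeline can look like. The class name and the whitespace-stripping logic are my own; a pipeline is enabled through `ITEM_PIPELINES` in the project's `settings.py`:

```python
# myproject/pipelines.py -- a minimal sketch of a Scrapy item pipeline.
# Enable it in settings.py, e.g.:
#   ITEM_PIPELINES = {"myproject.pipelines.StripWhitespacePipeline": 300}

class StripWhitespacePipeline:
    def process_item(self, item, spider):
        # Strip leading/trailing whitespace from every string field of the item.
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item
```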
**Scrapy Project Example:**
- Install Scrapy: `pip install scrapy`
- Create a project: `scrapy startproject myproject`
- Generate a spider: `cd myproject`, then `scrapy genspider quotes quotes.toscrape.com`
Here's what `quotes_spider.py` might look like:
```python
# myproject/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    # allowed_domains = ['quotes.toscrape.com']  # Optional: restricts crawling to this domain
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # This method is called for each response
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow the "Next" link to the next page
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```
To run this spider and save the output to a JSON file:
```bash
scrapy crawl quotes -o quotes.json
```
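Scrapy infers the export format from the file extension, so (with a reasonably recent version) `-o quotes.csv` or `-o quotes.jl` (JSON Lines) work the same way; note that `-o` appends to an existing file.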
## Ethical and Legal Considerations

- `robots.txt`: Always check this file first. It's the first rule of web scraping.
- Terms of Service (ToS): Some websites explicitly forbid scraping in their ToS. Violating this can have legal consequences.
- Public vs. Private Data: Only scrape data that is publicly available. Do not attempt to scrape private or user-generated content without permission.
- Be Gentle: Use delays, respect `robots.txt`, and identify your bot. A good crawler puts as little load on the site as possible.
