Of course! Here's a comprehensive guide to using boilerpipe in Python, including what it is, why you'd use it, how to install it, and detailed code examples.
What is Boilerpipe?
boilerpipe is a Python library designed for extracting the main content from a webpage. It is a wrapper around the popular Java library of the same name.
The "boilerplate" of a webpage refers to the non-content parts: the navigation menus, headers, footers, ads, sidebars, and copyright notices. These elements are often present on every page and can clutter the text you're actually interested in.
boilerpipe uses heuristics based on shallow text features — things like the word count, text density, and link density of each block — to intelligently identify and remove this boilerplate, leaving you with the core article or content.
Why Use Boilerpipe?
When you scrape a webpage using a library like requests and BeautifulSoup, you get the full HTML. To get the text, you might do something like this:
```python
# This is a naive approach
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')
all_text = soup.get_text()
```
This all_text would be a mess, containing navigation links, ad copy, and other irrelevant text.
boilerpipe solves this by giving you just the main body text, which is incredibly useful for:
- Content Summarization: Getting the key points of an article.
- Natural Language Processing (NLP): Feeding clean text to models for sentiment analysis, topic modeling, etc.
- Search Engine Indexing: Creating a cleaner index of a page's content.
- Data Mining: Extracting clean data from news articles or blogs.
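The contrast is easy to see on a toy page. The sketch below (standard library only, no boilerpipe) filters blocks by link density, which is one of the shallow text features boilerpipe's heuristics build on. The block-tag set, the 0.5 threshold, and the parser here are illustrative assumptions, not boilerpipe's actual implementation:

```python
from html.parser import HTMLParser

class BlockCollector(HTMLParser):
    """Group text by block-level element and track how much of it sits inside links."""
    BLOCKS = {"nav", "div", "footer", "p"}

    def __init__(self):
        super().__init__()
        self.blocks = []       # each: {"text": [words], "link_words": int}
        self.in_link = False

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCKS:
            self.blocks.append({"text": [], "link_words": 0})
        elif tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        words = data.split()
        if words and self.blocks:
            self.blocks[-1]["text"].extend(words)
            if self.in_link:
                self.blocks[-1]["link_words"] += len(words)

def main_content(page, max_link_density=0.5):
    """Keep only blocks whose text is not mostly link anchors."""
    parser = BlockCollector()
    parser.feed(page)
    kept = []
    for block in parser.blocks:
        n = len(block["text"])
        if n and block["link_words"] / n <= max_link_density:
            kept.append(" ".join(block["text"]))
    return "\n".join(kept)

html = """<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<div>Boilerpipe strips page chrome such as menus and footers,
keeping the main article text for downstream processing.</div>
<footer><a href="/terms">Terms</a></footer>
</body></html>"""

print(main_content(html))
```

The navigation and footer blocks are almost entirely link text, so they are dropped, while the prose-heavy `div` survives — the same intuition, at toy scale, that boilerpipe applies per text block.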
Installation
First, install the library from PyPI. Because the package drives the original Java implementation through JPype, you also need a Java runtime installed; if the original package gives you trouble on Python 3, there is a boilerpipe3 fork on PyPI.

```
pip install boilerpipe
```
Basic Usage
The most common use case is to extract the main text from a URL. boilerpipe makes this very simple.
Let's start with a basic example.
```python
from boilerpipe.extract import Extractor

# The URL of an article with plenty of surrounding boilerplate
url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'

# 'DefaultExtractor' is a good general-purpose choice.
extractor = Extractor(extractor='DefaultExtractor', url=url)
extracted_text = extractor.getText()
print(extracted_text)
```
When you run this, you'll get a long string containing the main content of the Wikipedia article, stripped of the navigation, sidebars, and footer text.
Key Concepts: Extractors
The power of boilerpipe lies in its different extractors: pre-configured algorithms that work best for different types of web pages. You choose one by passing its name to the Extractor constructor:

```python
from boilerpipe.extract import Extractor

# Extractor names available in the underlying Java library include:
# 'DefaultExtractor', 'ArticleExtractor', 'ArticleSentencesExtractor',
# 'KeepEverythingExtractor', 'LargestContentExtractor',
# 'NumWordsRulesExtractor', 'CanolaExtractor'
extractor = Extractor(extractor='ArticleExtractor', url='https://example.com')
```
DefaultExtractor
This is the recommended starting point: a general-purpose full-text extractor that works well across many page layouts, including news articles and blog posts.
```python
from boilerpipe.extract import Extractor

url = 'https://www.bbc.com/news/technology-66893093'
extracted_text = Extractor(extractor='DefaultExtractor', url=url).getText()
print(extracted_text[:500] + "...")  # print the first 500 characters
```
ArticleExtractor
This is a specialized extractor designed specifically for news articles and similar content. It's often more accurate than DefaultExtractor for this specific type of page.
```python
from boilerpipe.extract import Extractor

url = 'https://www.reuters.com/technology/ibm-watson-ai-chief-resigns-amid-restructuring-2025-09-05/'
extracted_text = Extractor(extractor='ArticleExtractor', url=url).getText()
print(extracted_text[:500] + "...")
```
KeepEverythingExtractor
This is the opposite of filtering: it keeps every text block on the page, so nothing is removed. It's useful as a baseline to compare other extractors against, or when you want boilerpipe's text handling (such as whitespace normalization) without dropping any content.
```python
from boilerpipe.extract import Extractor

url = 'https://example.com'
all_text = Extractor(extractor='KeepEverythingExtractor', url=url).getText()
print(all_text)
```
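If the whitespace cleanup is the part you care about, it is easy to approximate with the standard library alone. This is a stand-in sketch, not boilerpipe's internal routine:

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs within lines and drop blank lines."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

raw = "  Hello\t\tworld  \n\n\n  second   line  "
print(normalize_whitespace(raw))
```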
CanolaExtractor
A lightweight and fast extractor (pass extractor='CanolaExtractor'). It is generally less accurate than ArticleExtractor or DefaultExtractor, but can be a good choice for pages with a very simple structure where throughput is critical.
Advanced Usage: Extracting Titles and Links
The Python wrapper is focused on text, but you can combine it with a light HTML parse to recover other useful details.

Extracting the Title

The wrapper does not expose the page title directly, so a common pattern is to fetch the HTML yourself, read the title with BeautifulSoup, and hand the same HTML to boilerpipe for the body:

```python
import requests
from bs4 import BeautifulSoup
from boilerpipe.extract import Extractor

url = 'https://techcrunch.com/2025/09/01/apple-iphone-15-pro/'
html = requests.get(url).text

# The <title> tag comes from the raw HTML
title = BeautifulSoup(html, 'html.parser').title
print(f"Page Title: {title.string if title else '(none)'}\n")

# The main text comes from boilerpipe
content = Extractor(extractor='ArticleExtractor', html=html).getText()
print(f"Content (first 200 chars): {content[:200]}...")
```
Extracting Links
You can also get the links that appear inside the main content (rather than in menus or footers): getHTML() returns the extracted blocks with their markup intact, so you can parse anchors out of just that fragment:

```python
import requests
from bs4 import BeautifulSoup
from boilerpipe.extract import Extractor

url = 'https://techcrunch.com/2025/09/01/apple-iphone-15-pro/'
html = requests.get(url).text

# Extracted content with its markup preserved
content_html = Extractor(extractor='ArticleExtractor', html=html).getHTML()

print("Links found in the main content:")
for a in BeautifulSoup(content_html, 'html.parser').find_all('a', href=True)[:5]:
    print(f"- Text: '{a.get_text(strip=True)}', URL: {a['href']}")
```
Working with Local HTML Files
You're not limited to URLs. You can also process HTML content that you've already downloaded and stored in a string or a file.
From a String
```python
import requests
from boilerpipe.extract import Extractor

# Download the HTML first
html_string = requests.get('https://www.nature.com/articles/d41586-023-02835-1').text

# Extract from the string
extracted_text = Extractor(extractor='DefaultExtractor', html=html_string).getText()
print(extracted_text[:300] + "...")
```
From a File
Let's say you have a file named article.html.
```python
from boilerpipe.extract import Extractor

# Read the file content
with open('article.html', 'r', encoding='utf-8') as f:
    html_content = f.read()

# Extract from the file content
extracted_text = Extractor(extractor='DefaultExtractor', html=html_content).getText()
print(extracted_text[:300] + "...")
```
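Scaling the single-file case up to a directory of saved pages is mostly pathlib plumbing. In the sketch below, extract_text is a hypothetical placeholder that simply echoes its input; in real use you would call Extractor(extractor='DefaultExtractor', html=html).getText() there instead:

```python
import tempfile
from pathlib import Path

def extract_text(html: str) -> str:
    """Hypothetical stand-in; swap in boilerpipe's Extractor here."""
    return html  # placeholder: returns the input unchanged

def process_directory(folder) -> dict:
    """Map each .html file name in the folder to its extracted text."""
    results = {}
    for path in sorted(Path(folder).glob("*.html")):
        html = path.read_text(encoding="utf-8")
        results[path.name] = extract_text(html)
    return results

# Build a tiny demo folder so the sketch is self-contained.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.html").write_text("<p>first</p>", encoding="utf-8")
(tmp / "b.html").write_text("<p>second</p>", encoding="utf-8")
print(process_directory(tmp))
```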
Full Example: A Simple Web Scraper
Here's a practical example that combines requests and boilerpipe to create a simple script that fetches and cleans articles from a list of URLs.
```python
import requests
from bs4 import BeautifulSoup
from boilerpipe.extract import Extractor

# A list of article URLs to process
urls = [
    'https://www.bbc.com/news/technology-66893093',
    'https://www.theverge.com/2025/9/12/23865555/apple-iphone-15-pro-max-usb-c-a16-bionic',
    'https://arstechnica.com/gadgets/2025/09/iphone-15-pro-max-review-a-superb-camera-and-a-titanium-edge/',
]

def get_clean_content(url):
    """Fetches a URL and returns the main article content."""
    try:
        print(f"Fetching: {url}")
        # Fetch the HTML ourselves so we control headers and error handling
        response = requests.get(url, headers={'User-Agent': 'My-Scraper/1.0'})
        response.raise_for_status()  # raise an exception for bad status codes

        # Extract the main content; passing the raw HTML avoids a second fetch
        content = Extractor(extractor='ArticleExtractor', html=response.text).getText()

        # boilerpipe does not expose the title, so read it from the raw HTML
        title_tag = BeautifulSoup(response.text, 'html.parser').title
        title = title_tag.string.strip() if title_tag and title_tag.string else ''

        return {'url': url, 'title': title, 'content': content}
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None
    except Exception as e:
        print(f"Error processing {url}: {e}")
        return None

# Process each URL
for url in urls:
    article_data = get_clean_content(url)
    if article_data:
        print("\n" + "=" * 50)
        print(f"Title: {article_data['title']}")
        print(f"Content Preview:\n{article_data['content'][:300]}...")
        print("=" * 50 + "\n")
```
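Once the scraper returns dictionaries like the ones above, persisting them is straightforward. A sketch using the JSON Lines convention (one record per line), with made-up sample records standing in for real scrape results:

```python
import json
import tempfile
from pathlib import Path

articles = [  # sample records shaped like get_clean_content's output
    {"url": "https://example.com/a", "title": "First", "content": "Body A"},
    {"url": "https://example.com/b", "title": "Second", "content": "Body B"},
]

out = Path(tempfile.mkdtemp()) / "articles.jsonl"
with out.open("w", encoding="utf-8") as f:
    for article in articles:
        f.write(json.dumps(article, ensure_ascii=False) + "\n")

# Reading back is symmetric: one json.loads per line
loaded = [json.loads(line) for line in out.read_text(encoding="utf-8").splitlines()]
print(len(loaded), loaded[0]["title"])
```

JSON Lines is convenient here because you can append records as each URL finishes, without rewriting the whole file.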
Comparison with Alternatives
| Library | Approach | Pros | Cons |
|---|---|---|---|
| boilerpipe | Heuristics & text analysis | Good for general pages, fast, easy to use. | Requires a Java runtime; can be less accurate on very modern or poorly structured sites. |
| trafilatura | Hybrid rules & heuristics | Often more accurate, especially on modern websites; also extracts metadata (author, date). | Can be slightly slower. |
| readability-lxml | Port of Readability.js | Excellent for news articles and blog posts; very intuitive. | Primarily focused on articles, less general-purpose than boilerpipe. |
| newspaper3k | High-level article API | Full-fledged article toolkit: extraction, summarization, keyword extraction. | Heavyweight if you only need extraction. |
Recommendation:
- For a quick and easy solution for general web pages, start with boilerpipe.
- If boilerpipe isn't accurate enough, especially for modern news sites, trafilatura is an excellent next choice.
- If you are building a dedicated news scraper, readability-lxml or newspaper3k are fantastic options.
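If you want code to follow that recommendation automatically, you can probe for installed backends at runtime with the standard library. The preference order below is an assumption to adjust for your needs:

```python
from importlib.util import find_spec

# Import names for the libraries in the table above, best first
PREFERRED = ["trafilatura", "boilerpipe", "readability", "newspaper"]

def pick_backend(candidates=PREFERRED):
    """Return the first extraction library that is importable, else None."""
    for name in candidates:
        if find_spec(name) is not None:
            return name
    return None

print(pick_backend() or "no extraction backend installed")
```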
