Of course! Here's a comprehensive guide to using boilerpipe in Python, including what it is, why you'd use it, how to install it, and detailed code examples.
What is Boilerpipe?
boilerpipe is a Python library designed for extracting the main content from a webpage. It is a wrapper around the popular Java library of the same name.
The "boilerplate" of a webpage refers to the non-content parts: the navigation menus, headers, footers, ads, sidebars, and copyright notices. These elements are often present on every page and can clutter the text you're actually interested in.
boilerpipe uses heuristics based on shallow text features — things like the word count, text density, and link density of each block — to intelligently identify and remove this boilerplate, leaving you with the core article or content.
Why Use Boilerpipe?
When you scrape a webpage using a library like requests and BeautifulSoup, you get the full HTML. To get the text, you might do something like this:
```python
# This is a naive approach
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')
all_text = soup.get_text()
```
This all_text would be a mess, containing navigation links, ad copy, and other irrelevant text.
boilerpipe solves this by giving you just the main body text, which is incredibly useful for:
- Content Summarization: Getting the key points of an article.
- Natural Language Processing (NLP): Feeding clean text to models for sentiment analysis, topic modeling, etc.
- Search Engine Indexing: Creating a cleaner index of a page's content.
- Data Mining: Extracting clean data from news articles or blogs.
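The contrast is easy to see on a toy page. The sketch below (standard library only, no boilerpipe) filters blocks by link density, which is one of the shallow text features boilerpipe's heuristics build on. The block-tag set, the 0.5 threshold, and the parser here are illustrative assumptions, not boilerpipe's actual implementation:

```python
from html.parser import HTMLParser

class BlockCollector(HTMLParser):
    """Group text by block-level element and track how much of it sits inside links."""
    BLOCKS = {"nav", "div", "footer", "p"}

    def __init__(self):
        super().__init__()
        self.blocks = []       # each: {"text": [words], "link_words": int}
        self.in_link = False

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCKS:
            self.blocks.append({"text": [], "link_words": 0})
        elif tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        words = data.split()
        if words and self.blocks:
            self.blocks[-1]["text"].extend(words)
            if self.in_link:
                self.blocks[-1]["link_words"] += len(words)

def main_content(page, max_link_density=0.5):
    """Keep only blocks whose text is not mostly link anchors."""
    parser = BlockCollector()
    parser.feed(page)
    kept = []
    for block in parser.blocks:
        n = len(block["text"])
        if n and block["link_words"] / n <= max_link_density:
            kept.append(" ".join(block["text"]))
    return "\n".join(kept)

html = """<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<div>Boilerpipe strips page chrome such as menus and footers,
keeping the main article text for downstream processing.</div>
<footer><a href="/terms">Terms</a></footer>
</body></html>"""

print(main_content(html))
```

The navigation and footer blocks are almost entirely link text, so they are dropped, while the prose-heavy `div` survives — the same intuition, at toy scale, that boilerpipe applies per text block.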
Installation
First, install the library from PyPI. Because the package drives the original Java implementation through JPype, you also need a Java runtime installed; if the original package gives you trouble on Python 3, there is a boilerpipe3 fork on PyPI.

```
pip install boilerpipe
```
Basic Usage
The most common use case is to extract the main text from a URL. boilerpipe makes this very simple.
Let's start with a basic example.
```python
from boilerpipe.extract import Extractor

# The URL of an article with plenty of surrounding boilerplate
url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'

# 'DefaultExtractor' is a good general-purpose choice.
extractor = Extractor(extractor='DefaultExtractor', url=url)
extracted_text = extractor.getText()
print(extracted_text)
```
When you run this, you'll get a long string containing the main content of the Wikipedia article, stripped of the navigation, sidebars, and footer text.
Key Concepts: Extractors
The power of boilerpipe lies in its different extractors: pre-configured algorithms that work best for different types of web pages. You choose one by passing its name to the Extractor constructor:

```python
from boilerpipe.extract import Extractor

# Extractor names available in the underlying Java library include:
# 'DefaultExtractor', 'ArticleExtractor', 'ArticleSentencesExtractor',
# 'KeepEverythingExtractor', 'LargestContentExtractor',
# 'NumWordsRulesExtractor', 'CanolaExtractor'
extractor = Extractor(extractor='ArticleExtractor', url='https://example.com')
```
DefaultExtractor
This is the recommended starting point: a general-purpose full-text extractor that works well across many page layouts, including news articles and blog posts.
```python
from boilerpipe.extract import Extractor

url = 'https://www.bbc.com/news/technology-66893093'
extracted_text = Extractor(extractor='DefaultExtractor', url=url).getText()
print(extracted_text[:500] + "...")  # print the first 500 characters
```
ArticleExtractor
This is a specialized extractor designed specifically for news articles and similar content. It's often more accurate than DefaultExtractor for this specific type of page.
```python
from boilerpipe.extract import Extractor

url = 'https://www.reuters.com/technology/ibm-watson-ai-chief-resigns-amid-restructuring-2025-09-05/'
extracted_text = Extractor(extractor='ArticleExtractor', url=url).getText()
print(extracted_text[:500] + "...")
```
KeepEverythingExtractor
This is the opposite of filtering: it keeps every text block on the page, so nothing is removed. It's useful as a baseline to compare other extractors against, or when you want boilerpipe's text handling (such as whitespace normalization) without dropping any content.
```python
from boilerpipe.extract import Extractor

url = 'https://example.com'
all_text = Extractor(extractor='KeepEverythingExtractor', url=url).getText()
print(all_text)
```
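If the whitespace cleanup is the part you care about, it is easy to approximate with the standard library alone. This is a stand-in sketch, not boilerpipe's internal routine:

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs within lines and drop blank lines."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

raw = "  Hello\t\tworld  \n\n\n  second   line  "
print(normalize_whitespace(raw))
```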
CanolaExtractor
A lightweight and fast extractor (pass extractor='CanolaExtractor'). It is generally less accurate than ArticleExtractor or DefaultExtractor, but can be a good choice for pages with a very simple structure where throughput is critical.
Advanced Usage: Extracting Titles and Links
The Python wrapper is focused on text, but you can combine it with a light HTML parse to recover other useful details.

Extracting the Title

The wrapper does not expose the page title directly, so a common pattern is to fetch the HTML yourself, read the title with BeautifulSoup, and hand the same HTML to boilerpipe for the body:

```python
import requests
from bs4 import BeautifulSoup
from boilerpipe.extract import Extractor

url = 'https://techcrunch.com/2025/09/01/apple-iphone-15-pro/'
html = requests.get(url).text

# The <title> tag comes from the raw HTML
title = BeautifulSoup(html, 'html.parser').title
print(f"Page Title: {title.string if title else '(none)'}\n")

# The main text comes from boilerpipe
content = Extractor(extractor='ArticleExtractor', html=html).getText()
print(f"Content (first 200 chars): {content[:200]}...")
```
Extracting Links
You can also get the links that appear inside the main content (rather than in menus or footers): getHTML() returns the extracted blocks with their markup intact, so you can parse anchors out of just that fragment:

```python
import requests
from bs4 import BeautifulSoup
from boilerpipe.extract import Extractor

url = 'https://techcrunch.com/2025/09/01/apple-iphone-15-pro/'
html = requests.get(url).text

# Extracted content with its markup preserved
content_html = Extractor(extractor='ArticleExtractor', html=html).getHTML()

print("Links found in the main content:")
for a in BeautifulSoup(content_html, 'html.parser').find_all('a', href=True)[:5]:
    print(f"- Text: '{a.get_text(strip=True)}', URL: {a['href']}")
```
Working with Local HTML Files
You're not limited to URLs. You can also process HTML content that you've already downloaded and stored in a string or a file.
From a String
```python
import requests
from boilerpipe.extract import Extractor

# Download the HTML first
html_string = requests.get('https://www.nature.com/articles/d41586-023-02835-1').text

# Extract from the string
extracted_text = Extractor(extractor='DefaultExtractor', html=html_string).getText()
print(extracted_text[:300] + "...")
```
From a File
Let's say you have a file named article.html.
```python
from boilerpipe.extract import Extractor

# Read the file content
with open('article.html', 'r', encoding='utf-8') as f:
    html_content = f.read()

# Extract from the file content
extracted_text = Extractor(extractor='DefaultExtractor', html=html_content).getText()
print(extracted_text[:300] + "...")
```
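Scaling the single-file case up to a directory of saved pages is mostly pathlib plumbing. In the sketch below, extract_text is a hypothetical placeholder that simply echoes its input; in real use you would call Extractor(extractor='DefaultExtractor', html=html).getText() there instead:

```python
import tempfile
from pathlib import Path

def extract_text(html: str) -> str:
    """Hypothetical stand-in; swap in boilerpipe's Extractor here."""
    return html  # placeholder: returns the input unchanged

def process_directory(folder) -> dict:
    """Map each .html file name in the folder to its extracted text."""
    results = {}
    for path in sorted(Path(folder).glob("*.html")):
        html = path.read_text(encoding="utf-8")
        results[path.name] = extract_text(html)
    return results

# Build a tiny demo folder so the sketch is self-contained.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.html").write_text("<p>first</p>", encoding="utf-8")
(tmp / "b.html").write_text("<p>second</p>", encoding="utf-8")
print(process_directory(tmp))
```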
Full Example: A Simple Web Scraper
Here's a practical example that combines requests and boilerpipe to create a simple script that fetches and cleans articles from a list of URLs.
```python
import requests
from bs4 import BeautifulSoup
from boilerpipe.extract import Extractor

# A list of article URLs to process
urls = [
    'https://www.bbc.com/news/technology-66893093',
    'https://www.theverge.com/2025/9/12/23865555/apple-iphone-15-pro-max-usb-c-a16-bionic',
    'https://arstechnica.com/gadgets/2025/09/iphone-15-pro-max-review-a-superb-camera-and-a-titanium-edge/',
]

def get_clean_content(url):
    """Fetches a URL and returns the main article content."""
    try:
        print(f"Fetching: {url}")
        # Fetch the HTML ourselves so we control headers and error handling
        response = requests.get(url, headers={'User-Agent': 'My-Scraper/1.0'})
        response.raise_for_status()  # raise an exception for bad status codes

        # Extract the main content; passing the raw HTML avoids a second fetch
        content = Extractor(extractor='ArticleExtractor', html=response.text).getText()

        # boilerpipe does not expose the title, so read it from the raw HTML
        title_tag = BeautifulSoup(response.text, 'html.parser').title
        title = title_tag.string.strip() if title_tag and title_tag.string else ''

        return {'url': url, 'title': title, 'content': content}
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None
    except Exception as e:
        print(f"Error processing {url}: {e}")
        return None

# Process each URL
for url in urls:
    article_data = get_clean_content(url)
    if article_data:
        print("\n" + "=" * 50)
        print(f"Title: {article_data['title']}")
        print(f"Content Preview:\n{article_data['content'][:300]}...")
        print("=" * 50 + "\n")
```
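Once the scraper returns dictionaries like the ones above, persisting them is straightforward. A sketch using the JSON Lines convention (one record per line), with made-up sample records standing in for real scrape results:

```python
import json
import tempfile
from pathlib import Path

articles = [  # sample records shaped like get_clean_content's output
    {"url": "https://example.com/a", "title": "First", "content": "Body A"},
    {"url": "https://example.com/b", "title": "Second", "content": "Body B"},
]

out = Path(tempfile.mkdtemp()) / "articles.jsonl"
with out.open("w", encoding="utf-8") as f:
    for article in articles:
        f.write(json.dumps(article, ensure_ascii=False) + "\n")

# Reading back is symmetric: one json.loads per line
loaded = [json.loads(line) for line in out.read_text(encoding="utf-8").splitlines()]
print(len(loaded), loaded[0]["title"])
```

JSON Lines is convenient here because you can append records as each URL finishes, without rewriting the whole file.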
Comparison with Alternatives
| Library | Approach | Pros | Cons |
|---|---|---|---|
| boilerpipe | Heuristics & text analysis | Good for general pages, fast, easy to use. | Requires a Java runtime; can be less accurate on very modern or poorly structured sites. |
| trafilatura | Hybrid rules & heuristics | Often more accurate, especially on modern websites; also extracts metadata (author, date). | Can be slightly slower. |
| readability-lxml | Port of Readability.js | Excellent for news articles and blog posts; very intuitive. | Primarily focused on articles, less general-purpose than boilerpipe. |
| newspaper3k | High-level article API | Full-fledged article toolkit: extraction, summarization, keyword extraction. | Heavyweight if you only need extraction. |
Recommendation:
- For a quick and easy solution for general web pages, start with boilerpipe.
- If boilerpipe isn't accurate enough, especially for modern news sites, trafilatura is an excellent next choice.
- If you are building a dedicated news scraper, readability-lxml or newspaper3k are fantastic options.
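If you want code to follow that recommendation automatically, you can probe for installed backends at runtime with the standard library. The preference order below is an assumption to adjust for your needs:

```python
from importlib.util import find_spec

# Import names for the libraries in the table above, best first
PREFERRED = ["trafilatura", "boilerpipe", "readability", "newspaper"]

def pick_backend(candidates=PREFERRED):
    """Return the first extraction library that is importable, else None."""
    for name in candidates:
        if find_spec(name) is not None:
            return name
    return None

print(pick_backend() or "no extraction backend installed")
```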
