Of course! Let's dive into html.parser, which is Python's built-in module for parsing HTML. It's part of the standard library, so you don't need to install anything extra.

We'll cover:
- The Basics: How to use `HTMLParser` by subclassing it.
- Handling Different Tags: `handle_starttag`, `handle_endtag`, `handle_data`.
- Handling Attributes: How to access attributes in a start tag.
- A Complete, Practical Example: Scraping all the links (`<a>` tags) and their text from a webpage.
- Pros and Cons: When to use `html.parser` vs. other libraries like `BeautifulSoup`.
The Basics: Subclassing HTMLParser
The core idea is to create your own class that inherits from html.parser.HTMLParser and then override its methods to react to different parts of the HTML document.
The parser will call your methods automatically as it processes the HTML text.
Let's start with a simple example that just prints out what it finds.

```python
import html.parser


class MyHTMLParser(html.parser.HTMLParser):
    """A simple HTML parser to print tags and data."""

    def handle_starttag(self, tag, attrs):
        """Called when a start tag is found, e.g., <div>."""
        print(f"Start tag: {tag}")

    def handle_endtag(self, tag):
        """Called when an end tag is found, e.g., </div>."""
        print(f"End tag: {tag}")

    def handle_data(self, data):
        """Called for text content between tags."""
        if data.strip():  # skip whitespace-only text between tags
            print(f"Data: '{data}'")


# --- Main execution ---
if __name__ == "__main__":
    parser = MyHTMLParser()

    # A sample HTML string
    html_string = """
    <html>
      <head>
        <title>A Simple Page</title>
      </head>
      <body>
        <h1>Welcome!</h1>
        <p>This is a paragraph with a <a href="https://example.com">link</a>.</p>
      </body>
    </html>
    """

    # Feed the HTML string to the parser
    parser.feed(html_string)
```
Output:

```
Start tag: html
Start tag: head
Start tag: title
Data: 'A Simple Page'
End tag: title
End tag: head
Start tag: body
Start tag: h1
Data: 'Welcome!'
End tag: h1
Start tag: p
Data: 'This is a paragraph with a '
Start tag: a
Data: 'link'
End tag: a
Data: '.'
End tag: p
End tag: body
End tag: html
```
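
A couple of details worth knowing: `feed()` can be called repeatedly, so you can parse a document in chunks (for example, as it arrives over a network connection), and `close()` tells the parser to process anything left in its internal buffer. Be aware that text content may then arrive split across several `handle_data` calls. A minimal sketch, reusing the `MyHTMLParser` class from above:

```python
# Incremental parsing: feed() may be called any number of times.
parser = MyHTMLParser()
for chunk in ["<p>Hello, ", "world!</", "p>"]:
    parser.feed(chunk)   # tags split across chunks are buffered internally
parser.close()           # process any remaining buffered data
```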
Handling Different Tags and Data
The previous example showed the three most common methods. Here's a more complete list of useful methods you can override:
| Method | Description | Example |
|---|---|---|
| `handle_starttag(tag, attrs)` | An opening tag, like `<p>` or `<div class="main">`. | `handle_starttag('p', [('class', 'main')])` |
| `handle_endtag(tag)` | A closing tag, like `</p>`. | `handle_endtag('p')` |
| `handle_data(data)` | The text content between tags. | `handle_data('Hello world')` |
| `handle_comment(data)` | An HTML comment, like `<!-- comment -->`. | `handle_comment('This is a comment')` |
| `handle_decl(decl)` | A declaration, like `<!DOCTYPE html>`. | `handle_decl('DOCTYPE html')` |
| `handle_pi(data)` | A processing instruction, like `<?xml-stylesheet ... ?>`. | Less common in standard HTML. |
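
The comment and declaration handlers work the same way as the others. A small sketch (the class name `CommentParser` is just for illustration), with the expected output shown in comments:

```python
import html.parser

class CommentParser(html.parser.HTMLParser):
    """Prints declarations and comments as they are encountered."""

    def handle_decl(self, decl):
        print(f"Declaration: {decl}")

    def handle_comment(self, data):
        print(f"Comment: {data.strip()}")

parser = CommentParser()
parser.feed("<!DOCTYPE html><!-- navigation starts here --><p>Hi</p>")
# Declaration: DOCTYPE html
# Comment: navigation starts here
```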
Handling Attributes
The attrs argument in handle_starttag is a list of (name, value) tuples for the attributes of the tag.
Let's modify our parser to print the attributes of <a> tags.

```python
import html.parser


class LinkParser(html.parser.HTMLParser):
    """Parses a page to find all links (<a> tags)."""

    def handle_starttag(self, tag, attrs):
        """Called for every start tag."""
        if tag == 'a':
            # attrs is a list of (name, value) tuples,
            # e.g., [('href', 'https://example.com'), ('class', 'link')]
            print(f"Found a link: {dict(attrs)}")


# --- Main execution ---
if __name__ == "__main__":
    parser = LinkParser()

    html_string = """
    <html>
      <body>
        <a href="https://python.org">Python</a>
        <a href="/about" class="nav-link">About</a>
        <div>No link here</div>
      </body>
    </html>
    """

    parser.feed(html_string)
```
Output:

```
Found a link: {'href': 'https://python.org'}
Found a link: {'href': '/about', 'class': 'nav-link'}
```
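
One detail to watch for: attributes written without a value (like `checked` or `disabled`) show up in `attrs` with `None` as the value, so guard against `None` before using a value. A small sketch (the class name `InputParser` is just for illustration):

```python
import html.parser

class InputParser(html.parser.HTMLParser):
    """Shows that value-less attributes are reported as (name, None)."""

    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            for name, value in attrs:
                # value is None for boolean attributes like 'checked'
                print(f"{name} = {value!r}")

parser = InputParser()
parser.feed('<input type="checkbox" checked>')
# type = 'checkbox'
# checked = None
```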
A Complete, Practical Example: Scraping Links
This is a very common task. Let's build a parser that extracts all the URLs from <a> tags and the text they are linking to.
```python
import html.parser


class LinkExtractor(html.parser.HTMLParser):
    """An HTML parser that extracts all links and their text."""

    def __init__(self):
        super().__init__()
        self.links = []        # To store the results
        self.in_link = False   # True while we are inside an <a href="..."> tag

    def handle_starttag(self, tag, attrs):
        """Check for the start of an <a> tag and capture its attributes."""
        if tag == 'a':
            # attrs is a list of (name, value) tuples
            attrs_dict = dict(attrs)
            if 'href' in attrs_dict:
                # Store the href and start accumulating the link text
                self.links.append({
                    'url': attrs_dict['href'],
                    'text': ''
                })
                self.in_link = True

    def handle_data(self, data):
        """Accumulate text data while we are inside a link."""
        if self.in_link and self.links:
            self.links[-1]['text'] += data.strip()

    def handle_endtag(self, tag):
        """Stop accumulating text when the current <a> tag closes."""
        if tag == 'a':
            self.in_link = False


# --- Main execution ---
if __name__ == "__main__":
    parser = LinkExtractor()

    # A more realistic HTML snippet
    html_content = """
    <div id="content">
      <h1>Main Page</h1>
      <p>Here are some interesting links:</p>
      <a href="https://www.python.org" target="_blank">The official Python website</a>.
      <p>Also, check out <a href="/docs">the documentation</a>.</p>
      <p>This is a link with no text.</p>
      <a href="https://pypi.org"></a>
    </div>
    """

    parser.feed(html_content)

    # Print the extracted links
    print("Extracted Links:")
    for link in parser.links:
        print(f"  - URL: {link['url']}")
        print(f"    Text: '{link['text']}'")
```
Output:

```
Extracted Links:
  - URL: https://www.python.org
    Text: 'The official Python website'
  - URL: /docs
    Text: 'the documentation'
  - URL: https://pypi.org
    Text: ''
```
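
To run this against a live page, you can fetch the HTML with `urllib.request` (also in the standard library) and feed the decoded text to the parser. A hedged sketch reusing the `LinkExtractor` class above, assuming the URL is reachable and the page is UTF-8 encoded:

```python
import urllib.request

url = "https://example.com"  # placeholder URL; use any page you're allowed to fetch
with urllib.request.urlopen(url) as response:
    html_text = response.read().decode("utf-8", errors="replace")

parser = LinkExtractor()
parser.feed(html_text)
for link in parser.links:
    print(f"{link['url']} -> {link['text']}")
```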
Pros and Cons: html.parser vs. BeautifulSoup
While html.parser is great and built-in, it's not always the best tool for the job.
html.parser (Python's built-in)
Pros:
- No Installation: It's part of the standard library. You can use it anywhere Python is installed.
- Good for Learning: Understanding how it works gives you a solid foundation for how parsers operate.
- Decent Performance: It's implemented in pure Python, so it's slower than C-based parsers like `lxml`, but still fast enough for most everyday tasks.
Cons:
- Verbose: You have to write a lot of boilerplate code to handle different cases and structure your data.
- Fragile: It does very little error recovery. Malformed HTML usually won't raise an exception, but it isn't repaired either, so messy markup can quietly break the logic in your handlers.
- No DOM Tree: It's an event-driven parser, not a Document Object Model (DOM) parser. You don't get a tree structure that you can easily navigate (e.g., `find('div', class_='main').find_all('p')`). You just get a stream of events (see the sketch below).
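
Because there is no tree, knowing "where you are" in the document means tracking state yourself. A common pattern is to keep a stack of currently open tags; a rough sketch (the class name `ContextParser` is just for illustration, and real-world HTML with omitted end tags needs more care):

```python
import html.parser

class ContextParser(html.parser.HTMLParser):
    """Tracks the stack of open tags to approximate 'where we are'."""

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching tag, if it is on the stack at all
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            print(" > ".join(self.stack), "->", data.strip())

ContextParser().feed("<div id='main'><p>Hello <b>there</b></p></div>")
# div > p -> Hello
# div > p > b -> there
```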
BeautifulSoup (a third-party library)
BeautifulSoup is a wrapper around other parsers (including html.parser). It provides a much more user-friendly, high-level interface.
Pros:
- Extremely Easy to Use: Its API is designed for humans. Finding elements is trivial:

  ```python
  # With BeautifulSoup
  from bs4 import BeautifulSoup

  soup = BeautifulSoup(html_content, 'html.parser')
  all_links = soup.find_all('a')
  for link in all_links:
      print(link['href'], link.text)
  ```

- Robust: It's very good at handling broken or imperfect HTML.
- Powerful Navigation: It builds a DOM tree, allowing for complex searching and traversal (`soup.select('div#content p > a')`).
- Supports Multiple Parsers: You can use `lxml` (which is much faster and more robust) or `html.parser` as a fallback.
Cons:
- Requires Installation: You need to install it (`pip install beautifulsoup4`).
- Slower than `lxml`: While fast itself, it's generally slower than using the `lxml` parser directly.
Conclusion: Which one should I use?
- Use `html.parser` if:
  - You are in an environment where you can't install third-party packages.
  - You are learning about HTML parsing for the first time.
  - Your task is very simple and you want to avoid dependencies.

- Use `BeautifulSoup` (with the `lxml` parser if possible) if:
  - You are doing any serious web scraping.
  - You need to parse real-world HTML, which is often messy.
  - You want to write clean, readable, and maintainable code.
  - You need to perform complex queries on the document structure.
For almost any real-world project, BeautifulSoup is the recommended choice. However, knowing how html.parser works is a valuable skill.
