
Python 3: html.parser

Of course! Let's dive into html.parser, which is Python's built-in module for parsing HTML. It's part of the standard library, so you don't need to install anything extra.


We'll cover:

  1. The Basics: How to use HTMLParser by subclassing it.
  2. Handling Different Tags: handle_starttag, handle_endtag, handle_data.
  3. Handling Attributes: How to access attributes in a start tag.
  4. A Complete, Practical Example: Scraping all the links (<a> tags) and their text from a webpage.
  5. Pros and Cons: When to use html.parser vs. other libraries like BeautifulSoup.

The Basics: Subclassing HTMLParser

The core idea is to create your own class that inherits from html.parser.HTMLParser and then override its methods to react to different parts of the HTML document.

The parser will call your methods automatically as it processes the HTML text.

Let's start with a simple example that just prints out what it finds.

import html.parser
class MyHTMLParser(html.parser.HTMLParser):
    """A simple HTML parser to print tags and data."""
    def handle_starttag(self, tag, attrs):
        """Called when a start tag is found, e.g., <div>."""
        print(f"Start tag: {tag}")
    def handle_endtag(self, tag):
        """Called when an end tag is found, e.g., </div>."""
        print(f"End tag: {tag}")
    def handle_data(self, data):
        """Called for text content between tags."""
        if data.strip():  # skip whitespace-only text between tags
            print(f"Data:     '{data}'")
# --- Main execution ---
if __name__ == "__main__":
    parser = MyHTMLParser()
    # A sample HTML string
    html_string = """
    <html>
        <head>
            <title>A Simple Page</title>
        </head>
        <body>
            <h1>Welcome!</h1>
            <p>This is a paragraph with a <a href="https://example.com">link</a>.</p>
        </body>
    </html>
    """
    # Feed the HTML string to the parser
    parser.feed(html_string)

Output:

Start tag: html
Start tag: head
Start tag: title
Data:     'A Simple Page'
End tag: title
End tag: head
Start tag: body
Start tag: h1
Data:     'Welcome!'
End tag: h1
Start tag: p
Data:     'This is a paragraph with a '
Start tag: a
Data:     'link'
End tag: a
Data:     '.'
End tag: p
End tag: body
End tag: html
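
One detail worth noting: feed() is incremental. You can push HTML to the parser in chunks (for example, as it downloads), and the parser buffers incomplete markup across calls; close() tells it the input is finished. A minimal sketch (the class name DataPrinter is just for illustration):

import html.parser
class DataPrinter(html.parser.HTMLParser):
    def handle_data(self, data):
        print(f"Data: {data!r}")
p = DataPrinter()
for chunk in ("<p>Hel", "lo</p>"):  # the text is split across two chunks
    p.feed(chunk)                   # partial input is buffered internally
p.close()                           # flush anything still buffered
# Prints: Data: 'Hello'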

Handling Different Tags and Data

The previous example showed the three most common methods. Here's a more complete list of useful methods you can override:

  • handle_starttag(tag, attrs): an opening tag, like <p> or <div class="main">. Called as handle_starttag('p', [('class', 'main')]).
  • handle_endtag(tag): a closing tag, like </p>. Called as handle_endtag('p').
  • handle_data(data): the text content between tags. Called as handle_data('Hello world').
  • handle_comment(data): an HTML comment, like <!-- a comment -->. Called as handle_comment(' a comment ').
  • handle_decl(decl): a declaration, like <!DOCTYPE html>. Called as handle_decl('DOCTYPE html').
  • handle_pi(data): a processing instruction, like <?xml-stylesheet ... ?>. Less common in standard HTML.
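
If you want to see the comment and declaration hooks in action, here is a quick sketch (the class name EventLogger is just for illustration):

import html.parser
class EventLogger(html.parser.HTMLParser):
    """Prints comments and declarations as the parser meets them."""
    def handle_comment(self, data):
        print(f"Comment:     {data!r}")
    def handle_decl(self, decl):
        print(f"Declaration: {decl!r}")
EventLogger().feed("<!DOCTYPE html><!-- page header --><p>hi</p>")
# Declaration: 'DOCTYPE html'
# Comment:     ' page header '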

Handling Attributes

The attrs argument in handle_starttag is a list of (name, value) tuples for the attributes of the tag.

Let's modify our parser to print the attributes of <a> tags.

import html.parser
class LinkParser(html.parser.HTMLParser):
    """Parses a page to find all links (<a> tags)."""
    def handle_starttag(self, tag, attrs):
        """Called for every start tag."""
        if tag == 'a':
            # attrs is a list of (name, value) tuples
            # e.g., [('href', 'https://example.com'), ('class', 'link')]
            print(f"Found a link: {dict(attrs)}")
# --- Main execution ---
if __name__ == "__main__":
    parser = LinkParser()
    html_string = """
    <html>
        <body>
            <a href="https://python.org">Python</a>
            <a href="/about" class="nav-link">About</a>
            <div>No link here</div>
        </body>
    </html>
    """
    parser.feed(html_string)

Output:

Found a link: {'href': 'https://python.org'}
Found a link: {'href': '/about', 'class': 'nav-link'}
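
One wrinkle to watch for: attributes written without a value (like disabled or checked) arrive with None as their value, so guard for that before using it. A small sketch:

import html.parser
class ButtonParser(html.parser.HTMLParser):
    """Prints every attribute of <button> tags."""
    def handle_starttag(self, tag, attrs):
        if tag == 'button':
            for name, value in attrs:
                # Boolean attributes such as 'disabled' come through as (name, None)
                print(f"{name} = {value!r}")
ButtonParser().feed('<button type="submit" disabled>Go</button>')
# type = 'submit'
# disabled = None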

A Complete, Practical Example: Scraping Links

This is a very common task. Let's build a parser that extracts all the URLs from <a> tags and the text they are linking to.

import html.parser
class LinkExtractor(html.parser.HTMLParser):
    """
    An HTML parser that extracts all links and their text.
    """
    def __init__(self):
        super().__init__()
        self.links = []       # To store the results
        self.in_link = False  # True while we are inside an <a href=...> ... </a>
    def handle_starttag(self, tag, attrs):
        """Check for the start of an <a> tag and capture its href."""
        if tag == 'a':
            # attrs is a list of (name, value) tuples
            attrs_dict = dict(attrs)
            if 'href' in attrs_dict:
                # Store the href and start accumulating the link text
                self.links.append({'url': attrs_dict['href'], 'text': ''})
                self.in_link = True
    def handle_data(self, data):
        """Accumulate text data, but only while we are inside a link."""
        if self.in_link and self.links:
            self.links[-1]['text'] += data
    def handle_endtag(self, tag):
        """On </a>, stop accumulating and tidy the collected text."""
        if tag == 'a' and self.in_link:
            self.links[-1]['text'] = self.links[-1]['text'].strip()
            self.in_link = False
# --- Main execution ---
if __name__ == "__main__":
    parser = LinkExtractor()
    # A more realistic HTML snippet
    html_content = """
    <div id="content">
        <h1>Main Page</h1>
        <p>Here are some interesting links:</p>
        <a href="https://www.python.org" target="_blank">The official Python website</a>.
        <p>Also, check out <a href="/docs">the documentation</a>.</p>
        <p>This is a link with no text.</p>
        <a href="https://pypi.org"></a>
    </div>
    """
    parser.feed(html_content)
    # Print the extracted links
    print("Extracted Links:")
    for link in parser.links:
        print(f"  - URL: {link['url']}")
        print(f"    Text: '{link['text']}'")

Output:

Extracted Links:
  - URL: https://www.python.org
    Text: 'The official Python website'
  - URL: /docs
    Text: 'the documentation'
  - URL: https://pypi.org
    Text: ''
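
To point the extractor at a live page, you can fetch the HTML with urllib from the standard library. A sketch that reuses the LinkExtractor class above (swap in any URL you like):

import urllib.request
url = "https://www.python.org"  # any page you want to scan
with urllib.request.urlopen(url) as resp:
    # Decode using the charset the server declares, defaulting to UTF-8
    charset = resp.headers.get_content_charset() or "utf-8"
    html_text = resp.read().decode(charset)
parser = LinkExtractor()
parser.feed(html_text)
print(f"Found {len(parser.links)} links")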

Pros and Cons: html.parser vs. BeautifulSoup

While html.parser is great and built-in, it's not always the best tool for the job.

html.parser (Python's built-in)

Pros:

  • No Installation: It's part of the standard library. You can use it anywhere Python is installed.
  • Good for Learning: Understanding how it works gives you a solid foundation for how parsers operate.
  • Decent Performance: It's reasonably fast for small to medium documents, even though it's implemented in pure Python rather than C.

Cons:

  • Verbose: You have to write a lot of boilerplate code to handle different cases and structure your data.
  • Fragile: It makes little attempt to repair "bad" or malformed HTML. Broken markup won't raise errors, but it can produce a confusing stream of events.
  • No DOM Tree: It's an event-driven parser, not a Document Object Model (DOM) parser. You don't get a tree structure that you can easily navigate (e.g., find('div', class_='main').find_all('p')); you just get a stream of events.

BeautifulSoup (a third-party library)

BeautifulSoup is a wrapper around other parsers (including html.parser). It provides a much more user-friendly, high-level interface.

Pros:

  • Extremely Easy to Use: Its API is designed for humans. Finding elements is trivial.
    # With BeautifulSoup (after: pip install beautifulsoup4)
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')
    all_links = soup.find_all('a')
    for link in all_links:
        print(link.get('href'), link.text)  # .get avoids a KeyError if href is missing
  • Robust: It's very good at handling broken or imperfect HTML.
  • Powerful Navigation: It builds a DOM tree, allowing for complex searching and traversal (soup.select('div#content p > a')).
  • Supports Multiple Parsers: You can use lxml (which is much faster and more robust) or fall back to html.parser, as sketched just below.
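
On that last point, the backend is just the second argument to the BeautifulSoup constructor; a quick sketch (lxml needs a separate pip install lxml):

from bs4 import BeautifulSoup
html_doc = "<p>hi</p>"
soup = BeautifulSoup(html_doc, "lxml")         # fast C-based parser, if installed
soup = BeautifulSoup(html_doc, "html.parser")  # stdlib fallback, nothing to install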

Cons:

  • Requires Installation: You need to install it (pip install beautifulsoup4).
  • Slower than raw lxml: BeautifulSoup adds its own layer on top of the underlying parser, so it's generally slower than using lxml directly.

Conclusion: Which one should I use?

  • Use html.parser if:

    • You are in an environment where you can't install third-party packages.
    • You are learning about HTML parsing for the first time.
    • Your task is very simple and you want to avoid dependencies.
  • Use BeautifulSoup (with the lxml parser if possible) if:

    • You are doing any serious web scraping.
    • You need to parse real-world HTML, which is often messy.
    • You want to write clean, readable, and maintainable code.
    • You need to perform complex queries on the document structure.

For almost any real-world project, BeautifulSoup is the recommended choice. However, knowing how html.parser works is a valuable skill.
