杰瑞科技汇

How do I parse XML with Python's urllib?

Of course! Let's dive into using Python's urllib library to fetch and parse XML data. This is a common task for consuming web services, RSS feeds, or other APIs that return XML.

We'll cover three main parts:

  1. Fetching XML Data: Using urllib.request to get XML content from a URL.
  2. Parsing XML: Using Python's built-in xml.etree.ElementTree module to work with the XML data.
  3. A Complete, Runnable Example: Putting it all together with a real-world example (an RSS feed).

Fetching XML Data with urllib.request

The urllib.request module is used to open URLs (both HTTP and HTTPS). The key function is urllib.request.urlopen().

Important: It's highly recommended to use a try...except block to handle network errors like a bad URL, no internet connection, or the server being down.

import urllib.request
import urllib.error
# A URL that returns XML data (a public RSS feed)
xml_url = "https://news.ycombinator.com/rss"
try:
    # Open the URL and read the response
    # We use a 'with' statement to ensure the connection is properly closed
    with urllib.request.urlopen(xml_url) as response:
        # Read the response content. It's returned as bytes, so we decode it to a string.
        xml_data = response.read().decode('utf-8')
    print("Successfully fetched XML data:")
    print(xml_data[:500] + "...") # Print the first 500 characters
except urllib.error.URLError as e:
    print(f"Error fetching URL: {e.reason}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Explanation:

  • with urllib.request.urlopen(xml_url) as response:: This opens the URL. The with statement is good practice as it automatically handles closing the network connection.
  • response.read(): This reads the entire content of the response from the server. For large files, you might want to read it in chunks, but for most APIs, reading it all at once is fine.
  • .decode('utf-8'): The data from read() is in bytes. We need to decode it into a standard string (UTF-8 is the most common encoding for web content).
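Rather than hard-coding 'utf-8', you can ask the response for the charset declared in its Content-Type header. A minimal sketch of this, using a data: URL so it runs without a network connection (a real endpoint would be an http(s) URL):

```python
import urllib.request

# A data: URL standing in for a real XML endpoint, so the sketch runs offline.
url = "data:text/xml;charset=utf-8,<root><msg>hi</msg></root>"

with urllib.request.urlopen(url) as response:
    # Read the charset declared in the Content-Type header instead of
    # assuming UTF-8; fall back to UTF-8 only if none is declared.
    charset = response.headers.get_content_charset() or "utf-8"
    xml_data = response.read().decode(charset)

print(charset)   # utf-8
print(xml_data)  # <root><msg>hi</msg></root>
```

This is a small robustness win: servers occasionally serve XML in other encodings (e.g. ISO-8859-1), and decoding with the wrong charset raises UnicodeDecodeError or silently corrupts text.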

Parsing XML with xml.etree.ElementTree

Once you have the XML string, you can parse it into a structured tree that you can easily navigate. Python's standard library includes xml.etree.ElementTree, which is perfect for this.

The core concepts are:

  • ElementTree: The main class that represents the entire XML document.
  • Element: A single node in the XML tree (e.g., <item>, <title>).
  • Tag: The name of the element (e.g., 'title').
  • Text: The content inside the element (e.g., "Python is awesome").
  • Attributes: Key-value pairs on an element (e.g., <link href="...">).

Key Methods and Properties of an Element:

  • .tag: The tag name of the element.
  • .text: The text content inside the element.
  • .attrib: A dictionary of the element's attributes.
  • .find('tag_name'): Finds the first child element with the given tag.
  • .findall('tag_name'): Finds all child elements with the given tag, returning a list.
  • .iter('tag_name'): Creates a tree iterator to loop over all elements with the given tag, anywhere in the tree.
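The methods above can be demonstrated on a small, self-contained document (the `<library>` example below is made up for illustration):

```python
import xml.etree.ElementTree as ET

doc = """
<library>
  <book id="1"><title lang="en">Dune</title></book>
  <book id="2"><title lang="fr">Vendredi</title></book>
</library>
"""

root = ET.fromstring(doc)

print(root.tag)                    # library
first = root.find('book')          # first <book> child of <library>
print(first.attrib)                # {'id': '1'}
print(len(root.findall('book')))   # 2 -- direct <book> children only

# .iter() walks the whole subtree, so it finds <title> elements
# even though they are grandchildren of the root.
for title in root.iter('title'):
    print(title.get('lang'), title.text)
```

Note the difference in scope: `findall()` only searches direct children, while `iter()` descends through the entire subtree.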

Complete Example: Parsing an RSS Feed

Let's combine fetching and parsing to get the titles and links from the Hacker News RSS feed.

XML Structure of an RSS Feed (simplified):

<rss>
  <channel>
    <title>Hacker News</title>
    <link>https://news.ycombinator.com</link>
    <description>...</description>
    <item>
      <title>How to use urllib and ElementTree</title>
      <link>https://example.com/article</link>
      <description>...</description>
    </item>
    <item>
      <title>Another interesting article</title>
      <link>https://example.com/article2</link>
      <description>...</description>
    </item>
  </channel>
</rss>

Python Code:

import urllib.request
import urllib.error
import xml.etree.ElementTree as ET
def parse_hacker_news_rss():
    """
    Fetches and parses the Hacker News RSS feed to print article titles and links.
    """
    xml_url = "https://news.ycombinator.com/rss"
    try:
        # 1. Fetch the XML data
        with urllib.request.urlopen(xml_url) as response:
            xml_data = response.read().decode('utf-8')
        # 2. Parse the XML string
        # ET.fromstring() parses XML from a string and returns the root Element.
        # (Use ET.parse() instead for a file or file-like object.)
        root = ET.fromstring(xml_data)
        # The RSS root has a <channel> tag, which contains the <item> tags.
        # We find the channel first.
        channel = root.find('channel')
        if channel is not None:
            # 3. Extract data from the <item> elements
            # .findall() returns a list of all 'item' elements under the channel
            items = channel.findall('item')
            print(f"Found {len(items)} articles:\n" + "-"*30)
            for item in items:
                # Find the 'title' and 'link' tags within each 'item'
                title_element = item.find('title')
                link_element = item.find('link')
                # Check if the tags were found before trying to access their text
                if title_element is not None and link_element is not None:
                    title = title_element.text
                    link = link_element.text
                    print(f"Title: {title}")
                    print(f"Link:  {link}\n")
    except urllib.error.URLError as e:
        print(f"Error fetching URL: {e.reason}")
    except ET.ParseError as e:
        print(f"Error parsing XML: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
# Run the function
if __name__ == "__main__":
    parse_hacker_news_rss()
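As a variation, you can skip the manual read()/decode() step entirely: urlopen() returns a file-like object, which ET.parse() can consume directly. A sketch, again using a data: URL in place of the real feed so it runs offline:

```python
import urllib.request
import xml.etree.ElementTree as ET

# Stand-in for a real feed URL such as "https://news.ycombinator.com/rss".
url = "data:text/xml,<rss><channel><item><title>Hello</title></item></channel></rss>"

with urllib.request.urlopen(url) as response:
    # ET.parse() reads straight from the stream; the parser handles
    # decoding itself based on the XML declaration (defaulting to UTF-8).
    tree = ET.parse(response)

root = tree.getroot()
# find() also accepts simple paths, so nested elements can be
# reached in one call instead of chained find()s.
print(root.find('channel/item/title').text)  # Hello
```

Letting the parser handle decoding is slightly more robust for documents that declare a non-UTF-8 encoding in their XML prolog.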

Output of the script:

Found 30 articles:
------------------------------
Title: Show HN: I built a tool to convert YouTube videos to animated GIFs
Link:  https://giffromvideo.com

Title: Show HN: LLM-based autocompletion for the terminal
Link:  https://github.com/antonmedv/fill

Title: The 2025 State of JS
Link:  https://2025.stateofjs.com/en-US/libraries/

Title: Show HN: Open source 3D city builder
Link:  https://github.com/brunoSimon/city-simmer

... (and so on)

Important Considerations and Best Practices

  1. Error Handling: As shown, always handle URLError for network issues and ET.ParseError for malformed XML.
  2. Namespaces: Real-world XML (especially from services like Google, Amazon, etc.) often uses namespaces. A namespace declaration looks like xmlns:dc="http://purl.org/dc/elements/1.1/". A plain find('dc:title') will fail, because ElementTree doesn't know what the dc: prefix maps to.
    • Solution: Define a dictionary mapping prefixes to namespace URIs and pass it as the second argument to find() and findall(). Alternatively, use the fully qualified {namespace-uri}tag syntax directly, with no dictionary.
      namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'}
      # Now you can find the element like this:
      # title_element = item.find('dc:title', namespaces)
  3. Security: Be cautious when parsing XML from untrusted sources. ElementTree does not fetch external entities, but maliciously crafted documents can still cause denial of service through entity expansion (the "Billion Laughs" attack) or extreme nesting. For untrusted sources, consider using defusedxml, a drop-in replacement for xml.etree.ElementTree that protects against these vulnerabilities.
    pip install defusedxml

    Then just import it instead: import defusedxml.ElementTree as ET. It will raise exceptions for potentially dangerous constructs.
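The namespace handling from point 2 can be sketched end-to-end. The feed snippet below is hypothetical, but the Dublin Core namespace URI is the real one commonly seen in RSS feeds:

```python
import xml.etree.ElementTree as ET

xml_data = """
<rss xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>Namespaced item</title>
      <dc:creator>Jane Doe</dc:creator>
    </item>
  </channel>
</rss>
"""

root = ET.fromstring(xml_data)
item = root.find('channel/item')

# Option 1: prefix syntax plus a prefix-to-URI dictionary
namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'}
creator = item.find('dc:creator', namespaces)

# Option 2: the fully qualified {namespace-uri}tag syntax, no dictionary
creator2 = item.find('{http://purl.org/dc/elements/1.1/}creator')

print(creator.text)   # Jane Doe
print(creator2.text)  # Jane Doe
```

Both options find the same element; the dictionary form is usually easier to read when a document mixes several namespaces.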
