Let's dive into using Python's urllib library to fetch XML data and the standard library's xml.etree.ElementTree module to parse it. This is a common task when consuming web services, RSS feeds, or other APIs that return XML.
We'll cover three main parts:
- Fetching XML Data: Using urllib.request to get XML content from a URL.
- Parsing XML: Using Python's built-in xml.etree.ElementTree module to work with the XML data.
- A Complete, Runnable Example: Putting it all together with a real-world example (an RSS feed).
Fetching XML Data with urllib.request
The urllib.request module is used to open URLs (both HTTP and HTTPS). The key function is urllib.request.urlopen().
Important: It's highly recommended to use a try...except block to handle network errors like a bad URL, no internet connection, or the server being down.
import urllib.request
import urllib.error

# A URL that returns XML data (a public RSS feed)
xml_url = "https://news.ycombinator.com/rss"

try:
    # Open the URL and read the response.
    # The 'with' statement ensures the connection is properly closed.
    with urllib.request.urlopen(xml_url) as response:
        # The content is returned as bytes, so we decode it to a string.
        xml_data = response.read().decode('utf-8')

    print("Successfully fetched XML data:")
    print(xml_data[:500] + "...")  # Print the first 500 characters
except urllib.error.URLError as e:
    print(f"Error fetching URL: {e.reason}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
Explanation:
- with urllib.request.urlopen(xml_url) as response: This opens the URL. The with statement is good practice as it automatically handles closing the network connection.
- response.read(): This reads the entire content of the response from the server. For large files, you might want to read it in chunks, but for most APIs, reading it all at once is fine.
- .decode('utf-8'): The data from read() is in bytes. We need to decode it into a standard string (UTF-8 is the most common encoding for web content).
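For the chunked-reading case mentioned above, here is a minimal sketch. The helper name read_in_chunks and its max_bytes limit are assumptions made for illustration, not part of urllib. The helper works on any binary file-like object, so the response object from urlopen() could be passed to it in the same way as the in-memory stream used here.

```python
import io


def read_in_chunks(stream, chunk_size=8192, max_bytes=10_000_000):
    """Read a binary file-like object piece by piece.

    Hypothetical helper: raises ValueError once more than max_bytes
    have been read, so an oversized body is not buffered in full.
    """
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        total += len(chunk)
        if total > max_bytes:
            raise ValueError(f"body is larger than {max_bytes} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


# Offline demonstration: an in-memory stream stands in for the response
fake_response = io.BytesIO(b"<rss><channel/></rss>")
body = read_in_chunks(fake_response, chunk_size=4)
print(body.decode("utf-8"))  # <rss><channel/></rss>
```

With a real request, the call would sit inside the with block, for example body = read_in_chunks(response).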
Parsing XML with xml.etree.ElementTree
Once you have the XML string, you can parse it into a structured tree that you can easily navigate. Python's standard library includes xml.etree.ElementTree, which is perfect for this.
The core concepts are:
- ElementTree: The main class that represents the entire XML document.
- Element: A single node in the XML tree (e.g., <item>, <title>).
  - Tag: The name of the element (e.g., 'title').
  - Text: The content inside the element (e.g., "Python is awesome").
  - Attributes: Key-value pairs on an element (e.g., <link href="...">).
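To make the distinction between ElementTree and Element concrete, here is a small sketch. The sample document is made up for illustration. It shows that ET.parse() returns an ElementTree whose getroot() gives the top-level Element, while ET.fromstring() returns the root Element directly.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample document, used only for illustration
sample = '<feed><link href="https://example.com/feed">Python is awesome</link></feed>'

# ET.parse() accepts a filename or file-like object and returns an ElementTree
tree = ET.parse(io.StringIO(sample))
root = tree.getroot()  # the top-level Element, here <feed>

# ET.fromstring() takes the XML text itself and returns the root Element
same_root = ET.fromstring(sample)

link = root[0]  # first child of <feed>
print(type(tree).__name__)      # ElementTree
print(root.tag, same_root.tag)  # feed feed
print(link.tag)                 # link
print(link.text)                # Python is awesome
print(link.attrib)              # {'href': 'https://example.com/feed'}
```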
Key Methods and Properties of an Element:
- .tag: The tag name of the element.
- .text: The text content inside the element.
- .attrib: A dictionary of the element's attributes.
- .find('tag_name'): Finds the first child element with the given tag.
- .findall('tag_name'): Finds all child elements with the given tag, returning a list.
- .iter('tag_name'): Creates a tree iterator to loop over all elements with the given tag, anywhere in the tree.
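The difference between find(), findall(), and iter() is easiest to see side by side. The sketch below uses a made-up document in which one <book> is nested one level deeper than the others.

```python
import xml.etree.ElementTree as ET

# Made-up sample: two books are direct children, one sits inside <shelf>
sample = """<library>
  <book id="1"><title>First</title></book>
  <book id="2"><title>Second</title></book>
  <shelf><book id="3"><title>Third</title></book></shelf>
</library>"""

root = ET.fromstring(sample)

# find() returns only the first direct child with that tag
first = root.find("book")
print(first.get("id"))  # 1

# findall() returns every direct child with that tag, as a list
direct_ids = [book.get("id") for book in root.findall("book")]
print(direct_ids)  # ['1', '2']

# iter() walks the whole subtree, so the nested book is included
all_ids = [book.get("id") for book in root.iter("book")]
print(all_ids)  # ['1', '2', '3']

# find() returns None when nothing matches, so check before using .text
print(root.find("magazine"))  # None
```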
Complete Example: Parsing an RSS Feed
Let's combine fetching and parsing to get the titles and links from the Hacker News RSS feed.
XML Structure of an RSS Feed (simplified):
<rss>
  <channel>
    <title>Hacker News</title>
    <link>https://news.ycombinator.com</link>
    <description>...</description>
    <item>
      <title>How to use urllib and ElementTree</title>
      <link>https://example.com/article</link>
      <description>...</description>
    </item>
    <item>
      <title>Another interesting article</title>
      <link>https://example.com/article2</link>
      <description>...</description>
    </item>
  </channel>
</rss>
Python Code:
import urllib.request
import urllib.error
import xml.etree.ElementTree as ET


def parse_hacker_news_rss():
    """
    Fetches and parses the Hacker News RSS feed to print article titles and links.
    """
    xml_url = "https://news.ycombinator.com/rss"

    try:
        # 1. Fetch the XML data
        with urllib.request.urlopen(xml_url) as response:
            xml_data = response.read().decode('utf-8')

        # 2. Parse the XML string
        # ET.fromstring() parses XML from a string and returns the root element.
        # (Use ET.parse() instead when reading from a file or file-like object.)
        root = ET.fromstring(xml_data)

        # The RSS root has a <channel> tag, which contains the <item> tags.
        # We find the channel first.
        channel = root.find('channel')
        if channel is not None:
            # 3. Extract data from the <item> elements
            # .findall() returns a list of all 'item' elements under the channel
            items = channel.findall('item')
            print(f"Found {len(items)} articles:\n" + "-" * 30)

            for item in items:
                # Find the 'title' and 'link' tags within each 'item'
                title_element = item.find('title')
                link_element = item.find('link')

                # Check that the tags were found before accessing their text
                if title_element is not None and link_element is not None:
                    title = title_element.text
                    link = link_element.text
                    print(f"Title: {title}")
                    print(f"Link: {link}\n")

    except urllib.error.URLError as e:
        print(f"Error fetching URL: {e.reason}")
    except ET.ParseError as e:
        print(f"Error parsing XML: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")


# Run the function
if __name__ == "__main__":
    parse_hacker_news_rss()
Output of the script:
Found 30 articles:
------------------------------
Title: Show HN: I built a tool to convert YouTube videos to animated GIFs
Link: https://giffromvideo.com

Title: Show HN: LLM-based autocompletion for the terminal
Link: https://github.com/antonmedv/fill

Title: The 2025 State of JS
Link: https://2025.stateofjs.com/en-US/libraries/

Title: Show HN: Open source 3D city builder
Link: https://github.com/brunoSimon/city-simmer

... (and so on)
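Because the live feed changes constantly, it can help to keep the parsing step separate from the network call so it can be tried offline. The sketch below assumes a hypothetical helper named extract_articles and a made-up sample feed that follows the simplified structure shown earlier. It uses findtext(), which returns the text of the first matching child or None if there is no match.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed; the third item deliberately has no <link>
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Hacker News</title>
    <link>https://news.ycombinator.com</link>
    <item>
      <title>How to use urllib and ElementTree</title>
      <link>https://example.com/article</link>
    </item>
    <item>
      <title>Another interesting article</title>
      <link>https://example.com/article2</link>
    </item>
    <item>
      <title>Item without a link</title>
    </item>
  </channel>
</rss>"""


def extract_articles(xml_text):
    """Return a list of (title, link) tuples for items that have both."""
    root = ET.fromstring(xml_text)
    articles = []
    # "channel/item" matches every <item> directly under <channel>
    for item in root.findall("channel/item"):
        title = item.findtext("title")
        link = item.findtext("link")
        if title is not None and link is not None:
            articles.append((title, link))
    return articles


articles = extract_articles(SAMPLE_RSS)
for title, link in articles:
    print(f"Title: {title}")
    print(f"Link: {link}\n")
```

The same function could be given the decoded string fetched with urllib.request, which keeps the network error handling and the parsing logic in separate places.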
Important Considerations and Best Practices
- Error Handling: As shown, always handle URLError for network issues and ET.ParseError for malformed XML.
- Namespaces: Real-world XML (especially from services like Google, Amazon, etc.) often uses namespaces. A namespace declaration looks like xmlns:dc="http://purl.org/dc/elements/1.1/". Calling find('dc:title') without telling ElementTree what the dc prefix means will not find the element, so you need to handle namespaces explicitly.
  - Solution: Define a dictionary that maps each prefix to its namespace URI and pass it to find() and findall(). Alternatively, write the full URI in braces using the {namespace}tag syntax.
    namespaces = {'dc': 'http://purl.org/dc/elements/1.1/'}
    # Now you can find the element like this:
    # title_element = item.find('dc:title', namespaces)
- Security: Be cautious when parsing XML from untrusted sources. While ElementTree is generally safe, maliciously crafted XML can cause denial of service through entity expansion (e.g., the "Billion Laughs" attack), depending on the version of the underlying Expat parser. For untrusted sources, consider a hardened parser like defusedxml, which offers drop-in replacements for the xml.etree.ElementTree parsing functions and protects against these vulnerabilities.
    pip install defusedxml
  Then just import it instead: import defusedxml.ElementTree as ET. It will raise exceptions for potentially dangerous constructs.
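The namespace handling described above can be sketched as follows. The sample document and the dc:creator element are made up for illustration. Both lookups shown here locate the same element: one uses a prefix together with a mapping, the other spells out the namespace URI in braces.

```python
import xml.etree.ElementTree as ET

# Made-up sample that declares the Dublin Core namespace under the prefix "dc"
sample = """<rss xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>Plain title</title>
      <dc:creator>Jane Doe</dc:creator>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(sample)
item = root.find("channel/item")

# Option 1: map the prefix to its URI and pass the mapping to find()
namespaces = {"dc": "http://purl.org/dc/elements/1.1/"}
creator = item.find("dc:creator", namespaces)
print(creator.text)  # Jane Doe

# Option 2: write the full URI in braces, no mapping needed
same_creator = item.find("{http://purl.org/dc/elements/1.1/}creator")
print(same_creator.text)  # Jane Doe

# Internally, the tag is stored with the URI, not the prefix
print(creator.tag)  # {http://purl.org/dc/elements/1.1/}creator
```

The prefix in the dictionary is local to your code: it only has to map to the same URI as the document's declaration, not use the same prefix name.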
