杰瑞科技汇

Python如何正确导入并使用xml模块?

Of course! Here's a comprehensive guide on how to import and work with XML data in Python, covering the built-in xml.etree.ElementTree module and the popular third-party library lxml.


The Standard Library: xml.etree.ElementTree

Python comes with a built-in module called xml.etree.ElementTree (often imported as ET). It's a powerful, easy-to-use, and memory-efficient way to parse and create XML data. It's perfect for most common tasks.

Key Concepts:

  • Element: The basic building block of an XML tree. An element has a tag, attributes, and text content.
  • Tree: The entire XML document is represented as a tree of elements.
  • Parsing: The process of reading an XML string or file and converting it into an ElementTree object.
  • Finding: Navigating the tree to find specific elements using tags, paths, or expressions.

Example 1: Parsing an XML File

Let's say you have an XML file named library.xml:

library.xml

<?xml version="1.0"?>
<library>
  <book category="FICTION">lang="en">The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <year>1925</year>
  </book>
  <book category="SCIENCE">lang="en">A Brief History of Time</title>
    <author>Stephen Hawking</author>
    <year>1988</year>
  </book>
  <book category="CHILDREN">lang="en">The Hobbit</title>
    <author>J.R.R. Tolkien</author>
    <year>1937</year>
  </book>
</library>

Here's how to parse it and extract information:

import xml.etree.ElementTree as ET
try:
    # Parse the XML file
    tree = ET.parse('library.xml')
    # Get the root element of the tree
    root = tree.getroot()
    print(f"Root tag: {root.tag}\n")
    # --- 1. Iterating over all 'book' elements ---
    print("--- Iterating over all books ---")
    for book in root.findall('book'):
        # find() searches for a child element with the given tag
        title = book.find('title').text
        author = book.find('author').text
        year = book.find('year').text
        category = book.get('category')  # .get() retrieves an attribute
        print(f"Title: {title}, Author: {author}, Year: {year}, Category: {category}")
    print("\n")
    # --- 2. Finding a specific element using XPath-like paths ---
    print("--- Finding a specific book by author ---")
    # findall with a path searches the entire tree
    hobbit_book = root.find(".//book[author='J.R.R. Tolkien']")
    if hobbit_book is not None:
        print(f"Found book: {hobbit_book.find('title').text}")
    print("\n")
    # --- 3. Accessing attributes and text ---
    print("--- Accessing attributes and text ---")
    first_book_title = root.find('book/title')
    # Get the 'lang' attribute
    lang_attr = first_book_title.get('lang')
    # Get the text contenttext = first_book_title.text
    print(f"First book title language: {lang_attr}")
    print(f"First book title text: {title_text}")
except FileNotFoundError:
    print("Error: library.xml not found.")
except ET.ParseError:
    print("Error: Could not parse the XML file.")

Output:

Root tag: library
--- Iterating over all books --- The Great Gatsby, Author: F. Scott Fitzgerald, Year: 1925, Category: FICTION A Brief History of Time, Author: Stephen Hawking, Year: 1988, Category: SCIENCE The Hobbit, Author: J.R.R. Tolkien, Year: 1937, Category: CHILDREN
--- Finding a specific book by author ---
Found book: The Hobbit
--- Accessing attributes and text ---
First book title language: en
First book title text: The Great Gatsby

Example 2: Creating XML from Scratch

You can also build an XML tree programmatically.

import xml.etree.ElementTree as ET
# Create the root element
root = ET.Element("inventory")
# Create a new product element
product1 = ET.SubElement(root, "product", id="P1001")
ET.SubElement(product1, "name").text = "Laptop"
ET.SubElement(product1, "quantity").text = "10"
# Create another product
product2 = ET.SubElement(root, "product", id="P1002")
ET.SubElement(product2, "name").text = "Mouse"
ET.SubElement(product2, "quantity").text = "50"
# Create a new XML tree from the root element
tree = ET.ElementTree(root)
# Write the tree to a file
# 'encoding="utf-8"' and 'xml_declaration=True' are best practices
tree.write("inventory.xml", encoding="utf-8", xml_declaration=True)
print("inventory.xml has been created.")

Generated inventory.xml:

<?xml version='1.0' encoding='utf-8'?>
<inventory>
  <product id="P1001">
    <name>Laptop</name>
    <quantity>10</quantity>
  </product>
  <product id="P1002">
    <name>Mouse</name>
    <quantity>50</quantity>
  </product>
</inventory>

The Powerful Third-Party Library: lxml

For more complex tasks, performance-critical applications, or when you need full XPath 1.0 support, the lxml library is the best choice. It's faster than the standard library and has more features.

First, you need to install it:

pip install lxml

lxml has a very similar API to xml.etree.ElementTree, so the code is often a drop-in replacement.

Example 1: Parsing with lxml

Let's use the same library.xml.

from lxml import etree
# Parse the XML file
tree = etree.parse('library.xml')
root = tree.getroot()
print(f"Root tag: {root.tag}\n")
# --- 1. Iterating over all 'book' elements ---
print("--- Iterating over all books ---")
for book in root.xpath('//book'): # lxml uses the full power of XPath= book.xpath('title/text()')[0]
    author = book.xpath('author/text()')[0]
    year = book.xpath('year/text()')[0]
    category = book.get('category')
    print(f"Title: {title}, Author: {author}, Year: {year}, Category: {category}")
print("\n")
# --- 2. Using more complex XPath expressions ---
print("--- Using complex XPath ---")
# Find all titles of books in the 'FICTION' category
fiction_titles = root.xpath('//book[@category="FICTION"]/title/text()')
print(f"Fiction titles: {fiction_titles}")
# Find the author of the book with the title 'The Hobbit'
hobbit_author = root.xpath('//book[title="The Hobbit"]/author/text()')
print(f"Author of 'The Hobbit': {hobbit_author[0]}")
print("\n")
# --- 3. Pretty Printing ---
# A nice feature of lxml is easy pretty printing
print("--- Pretty Printing ---")
print(etree.tostring(root, pretty_print=True, encoding='unicode'))

Key lxml Advantages:

  • Full XPath 1.0 Support: root.xpath() is incredibly powerful and can select nodes based on complex conditions, attributes, and text content.
  • Performance: lxml is significantly faster for both parsing and querying.
  • HTML Parsing: lxml.html is excellent for parsing broken or messy HTML, which xml.etree.ElementTree cannot handle.
  • Validation: It can validate XML against a DTD or Schema.

Summary: ElementTree vs. lxml

Feature xml.etree.ElementTree (Standard Library) lxml (Third-Party)
Installation Built-in, no installation needed. pip install lxml
Performance Good, but slower for large/complex files. Excellent, much faster.
XPath Support Limited (basic find() and findall()). Full XPath 1.0 support (xpath() method).
HTML Parsing Not supported. Excellent support (lxml.html).
Validation No built-in validation. Can validate against DTD and XML Schema.
Ease of Use Very easy for simple tasks. API is similar, slightly more complex for advanced features.
Best For Simple scripts, quick parsing, environments where you can't install packages. Production applications, large XML files, complex queries, web scraping.

Conclusion

For most day-to-day XML handling, start with the built-in xml.etree.ElementTree. It's simple, reliable, and gets the job done.

If you find yourself needing more power, better performance, or the ability to write complex XPath queries, switch to lxml. It's the industry standard for advanced XML processing in Python.

分享:
扫描分享到社交APP
上一篇
下一篇