杰瑞科技汇

java html xpath

Of course! This is a very common and powerful combination in web scraping and data extraction. Here’s a comprehensive guide covering how to use Java to parse HTML with XPath.


We'll use the popular Jsoup library for HTML parsing and then apply XPath expressions to the parsed document.

Why Use Java + XPath for HTML?

  • XPath Power: XPath is a powerful query language designed for selecting nodes from an XML (and by extension, HTML) document. It's much more expressive and flexible than CSS selectors for complex queries.
  • Java's Robustness: Java is a stable, platform-independent language, making it ideal for building reliable, long-running scraping applications or backend services.
  • Jsoup's Simplicity: Jsoup is a fantastic library that parses HTML into a Document Object Model (DOM), which can then be queried using both CSS selectors and XPath.

Step 1: Setup Your Java Project

First, you need to add the Jsoup library to your project. If you're using a build tool like Maven or Gradle, it's very easy.

Using Maven (pom.xml)

Add this dependency to your pom.xml file:

<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.17.2</version> <!-- Check for the latest version -->
    </dependency>
</dependencies>

Using Gradle (build.gradle)

Add this dependency to your build.gradle file:

dependencies {
    implementation 'org.jsoup:jsoup:1.17.2' // Check for the latest version
}

Step 2: Load and Parse HTML

The first step is to get the HTML content and parse it into a Jsoup Document object. You can do this from a URL, a file, or a string.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
public class HtmlParser {
    public static void main(String[] args) {
        String url = "https://en.wikipedia.org/wiki/Main_Page";
        try {
            // 1. Fetch and parse the HTML from a URL
            Document doc = Jsoup.connect(url)
                                .userAgent("Mozilla/5.0") // Set a user-agent to avoid being blocked
                                .get();
            // You can also parse from a string or a file
            // String html = "<html><head><title>My Title</title></head><body><p>Hello, world!</p></body></html>";
            // Document docFromString = Jsoup.parse(html);
            System.out.println("Title: " + doc.title());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
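If you prefer to manage the HTTP request yourself (custom timeouts, redirect policy, proxies) and hand the resulting string to Jsoup.parse(), the JDK's built-in HttpClient (Java 11+) is one option. A minimal sketch; the URL is just a placeholder, and sending is separated from request construction so the setup is easy to test:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public class ManualFetch {
    // Builds a polite GET request with an explicit User-Agent;
    // actually sending it is left to the caller.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder()
                .uri(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("User-Agent", "Mozilla/5.0") // same anti-blocking tip as in the Jsoup example
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildRequest("https://en.wikipedia.org/wiki/Main_Page");
        System.out.println(req.uri() + " UA=" + req.headers().firstValue("User-Agent").orElse("?"));
        // To actually fetch and parse:
        // String html = java.net.http.HttpClient.newHttpClient()
        //         .send(req, java.net.http.HttpResponse.BodyHandlers.ofString()).body();
        // Document doc = Jsoup.parse(html);
    }
}
```

Jsoup.connect() is more convenient for simple cases; the manual route pays off when you need fine-grained control over connection pooling or retries.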

Step 3: Using XPath to Query the Document

This is the core of the process. Jsoup itself only added basic XPath support relatively recently (Element.selectXpath(...), since 1.14.3), so a common approach is to pair it with a dedicated library that evaluates XPath expressions against a Jsoup Document.

A great, lightweight library for this is JsoupXpath.

Setup jsoup-xpath

Add this dependency to your pom.xml:

<dependencies>
    <!-- Jsoup for HTML parsing -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.17.2</version>
    </dependency>
    <!-- Jsoup-XPath for XPath support -->
    <dependency>
        <groupId>cn.wanghaomiao</groupId>
        <artifactId>JsoupXpath</artifactId>
        <version>2.3.2</version> <!-- Check for the latest version -->
    </dependency>
</dependencies>
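For Gradle projects, the equivalent coordinates would be as follows; double-check the exact artifact id and latest version on Maven Central, since forks of this library have been published under slightly different names:

```groovy
dependencies {
    // Jsoup for HTML parsing
    implementation 'org.jsoup:jsoup:1.17.2'
    // JsoupXpath for XPath support -- verify coordinates on Maven Central
    implementation 'cn.wanghaomiao:JsoupXpath:2.3.2'
}
```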

How to Use JsoupXpath

The entry point is the JXDocument class. You create one from parsed HTML and call its selN() method with your XPath expression, which returns a List<JXNode>.

Let's use a sample HTML string for our examples:

<!DOCTYPE html>
<html>
<head><title>Sample Page</title>
</head>
<body>
    <h1>Welcome to the Page</h1>
    <div id="content">
        <p class="main-paragraph">This is the first paragraph.</p>
        <p class="main-paragraph">This is the second paragraph with a <a href="https://example.com">link</a>.</p>
    </div>
    <ul class="items">
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
    </ul>
</body>
</html>

Example Code with XPath Queries

import cn.wanghaomiao.xpath.model.JXDocument;
import cn.wanghaomiao.xpath.model.JXNode;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.List;
public class XPathExample {
    public static void main(String[] args) {
        String html = "<!DOCTYPE html><html><head><title>Sample Page</title></head>" +
                       "<body><h1>Welcome to the Page</h1>" +
                       "<div id=\"content\"><p class=\"main-paragraph\">This is the first paragraph.</p>" +
                       "<p class=\"main-paragraph\">This is the second paragraph with a <a href=\"https://example.com\">link</a>.</p></div>" +
                       "<ul class=\"items\"><li>Item 1</li><li>Item 2</li><li>Item 3</li></ul></body></html>";
        // Create a JXDocument straight from the HTML string
        // (jsoup-xpath 2.x API; older 1.x versions used new JXDocument(Jsoup.parse(html)))
        JXDocument jxdoc = JXDocument.create(html);
        try {
            // 1. Select the title of the page
            // XPath: /html/head/title
            System.out.println("--- 1. Title ---");
            List<JXNode> titles = jxdoc.selN("//title"); // //title matches a <title> element anywhere in the document
            for (JXNode node : titles) {
                System.out.println("Title Text: " + node.toString());
            }
            // 2. Select all paragraph elements with class 'main-paragraph'
            // XPath: //p[@class='main-paragraph']
            System.out.println("\n--- 2. Main Paragraphs ---");
            List<JXNode> paragraphs = jxdoc.selN("//p[@class='main-paragraph']");
            for (JXNode node : paragraphs) {
                System.out.println("Paragraph Text: " + node.toString());
            }
            // 3. Select all list items from the 'items' unordered list
            // XPath: //ul[@class='items']/li
            System.out.println("\n--- 3. List Items ---");
            List<JXNode> listItems = jxdoc.selN("//ul[@class='items']/li");
            for (JXNode node : listItems) {
                System.out.println("List Item Text: " + node.toString());
            }
            // 4. Select the 'href' attribute of the link inside a paragraph
            // XPath: //p[@class='main-paragraph']//a/@href
            System.out.println("\n--- 4. Link Href ---");
            List<JXNode> hrefs = jxdoc.selN("//p[@class='main-paragraph']//a/@href");
            for (JXNode node : hrefs) {
                System.out.println("Link Href: " + node.toString());
            }
            // 5. Select the text of the second paragraph
            // XPath: (//p[@class='main-paragraph'])[2]/text()
            System.out.println("\n--- 5. Second Paragraph Text ---");
            List<JXNode> secondParagraphText = jxdoc.selN("(//p[@class='main-paragraph'])[2]/text()");
            for (JXNode node : secondParagraphText) {
                System.out.println("Text: " + node.toString().trim());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
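For comparison, the simpler of these queries can be handled with Jsoup's native CSS selectors alone, with no extra library. A sketch against the same sample markup (the helper method name is just illustrative):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CssSelectorExample {
    // CSS-selector equivalent of //p[@class='main-paragraph']//a/@href
    static String firstLinkHref(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p.main-paragraph a").attr("href");
    }

    public static void main(String[] args) {
        String html = "<div id=\"content\"><p class=\"main-paragraph\">First.</p>"
                + "<p class=\"main-paragraph\">Second with a <a href=\"https://example.com\">link</a>.</p></div>";
        Document doc = Jsoup.parse(html);
        // CSS equivalent of //p[@class='main-paragraph']
        for (Element p : doc.select("p.main-paragraph")) {
            System.out.println(p.text());
        }
        System.out.println(firstLinkHref(html)); // prints "https://example.com"
    }
}
```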

Common XPath Expressions for HTML

Goal                XPath Expression                      Description
All Elements        //*                                   Selects all elements in the document.
Element by Tag      //div                                 Selects all <div> elements.
Element by ID       //*[@id='main-content']               Selects any element with id="main-content".
Element by Class    //p[@class='intro']                   Selects all <p> elements with class="intro".
Attribute Value     //a[@href='https://...']              Selects all <a> elements with a specific href.
Text Content        //h1/text()                           Selects the text node of all <h1> elements.
Contains Text       //*[contains(text(), 'Welcome')]      Selects any element whose text contains "Welcome".
Starts With         //a[starts-with(@href, 'https')]      Selects all <a> elements whose href starts with "https".
Parent Element      //a/..                                Selects the parent of all <a> elements.
Specific Child      //ul/li[1]                            Selects the first <li> child of every <ul>.
Last Child          //ul/li[last()]                       Selects the last <li> child of every <ul>.
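For well-formed markup (XHTML, or HTML that a parser has already cleaned up), several of the expressions above can be tried with the JDK's built-in XPath engine (javax.xml.xpath), with no extra dependency. A small sketch on a tiny inline document:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class JdkXPathDemo {
    // Runs a few of the table's expressions against a small well-formed document.
    static String demo() throws Exception {
        String xhtml = "<html><body><h1>Welcome</h1>"
                + "<ul><li>Item 1</li><li>Item 2</li><li>Item 3</li></ul></body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Text content: //h1/text()
        String heading = (String) xpath.evaluate("//h1/text()", doc, XPathConstants.STRING);
        // Last child: //ul/li[last()]
        String lastItem = (String) xpath.evaluate("//ul/li[last()]", doc, XPathConstants.STRING);
        // All <li> children of the list, as a node set
        NodeList items = (NodeList) xpath.evaluate("//ul/li", doc, XPathConstants.NODESET);
        return heading + " / " + lastItem + " / " + items.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // prints "Welcome / Item 3 / 3"
    }
}
```

Note that real-world HTML is rarely well-formed enough for the JDK's XML parser, which is exactly why libraries like Jsoup sit in front of it.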

Important Considerations and Best Practices

  1. HTML vs. XML: HTML is often "messy" (unclosed tags, case-insensitive). Jsoup does a great job of cleaning it up to make it more XML-like, which is essential for XPath to work reliably.
  2. Robustness: Websites change. If a website changes its class names or structure, your XPath queries will break. Always build your scraper with this in mind.
  3. Ethics and robots.txt: Always check a website's robots.txt file (e.g., https://example.com/robots.txt) to see which parts of the site they ask scrapers not to visit. Respect these rules.
  4. Rate Limiting: Don't send too many requests in a short period. You can get your IP address blocked. Add delays between requests.
  5. Error Handling: Network connections can fail, and HTML can be unexpected. Always wrap your code in try-catch blocks.
  6. Alternatives to XPath: While XPath is powerful, CSS selectors are also supported natively by Jsoup (doc.select("div.content > p")) and are often more readable for simple to medium queries. Many developers prefer CSS selectors for this reason. Use XPath when you need its advanced features like axes (ancestor, following-sibling) or complex conditions.
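The rate-limiting point above can be as simple as enforcing a minimum pause between requests. A minimal, illustrative helper (the one-second interval is an arbitrary choice; tune it per site):

```java
public class PoliteDelay {
    private final long minIntervalMillis;
    private long lastRequestAt = 0;

    public PoliteDelay(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // Blocks until at least minIntervalMillis has passed since the previous call.
    public synchronized void await() throws InterruptedException {
        long now = System.currentTimeMillis();
        long waitFor = lastRequestAt + minIntervalMillis - now;
        if (waitFor > 0) {
            Thread.sleep(waitFor);
        }
        lastRequestAt = System.currentTimeMillis();
    }

    public static void main(String[] args) throws InterruptedException {
        PoliteDelay delay = new PoliteDelay(1000); // at most one request per second
        for (int i = 0; i < 3; i++) {
            delay.await();
            System.out.println("fetch page " + i); // e.g. Jsoup.connect(...).get()
        }
    }
}
```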