杰瑞科技汇

java html xpath

Of course! This is a very common and powerful combination in web scraping and data extraction. Here’s a comprehensive guide covering how to use Java to parse HTML with XPath.


We'll use the popular Jsoup library for HTML parsing and then apply XPath expressions to the parsed document.

Why Use Java + XPath for HTML?

  • XPath Power: XPath is a powerful query language designed for selecting nodes from an XML (and by extension, HTML) document. It's much more expressive and flexible than CSS selectors for complex queries.
  • Java's Robustness: Java is a stable, platform-independent language, making it ideal for building reliable, long-running scraping applications or backend services.
  • Jsoup's Simplicity: Jsoup is a fantastic library that parses HTML into a Document Object Model (DOM), which can then be queried using both CSS selectors and XPath.

Step 1: Setup Your Java Project

First, you need to add the Jsoup library to your project. If you're using a build tool like Maven or Gradle, it's very easy.

Using Maven (pom.xml)

Add this dependency to your pom.xml file:

<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.17.2</version> <!-- Check for the latest version -->
    </dependency>
</dependencies>

Using Gradle (build.gradle)

Add this dependency to your build.gradle file:

dependencies {
    implementation 'org.jsoup:jsoup:1.17.2' // Check for the latest version
}

Step 2: Load and Parse HTML

The first step is to get the HTML content and parse it into a Jsoup Document object. You can do this from a URL, a file, or a string.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
public class HtmlParser {
    public static void main(String[] args) {
        String url = "https://en.wikipedia.org/wiki/Main_Page";
        try {
            // 1. Fetch and parse the HTML from a URL
            Document doc = Jsoup.connect(url)
                                .userAgent("Mozilla/5.0") // Set a user-agent to avoid being blocked
                                .get();
            // You can also parse from a string or a file
            // String html = "<html><head><title>My Title</title></head><body><p>Hello, world!</p></body></html>";
            // Document docFromString = Jsoup.parse(html);
            System.out.println("Title: " + doc.title());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
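If you prefer to manage the HTTP request yourself (custom timeouts, redirect policy, proxies) and hand the resulting string to Jsoup.parse(), the JDK's built-in HttpClient (Java 11+) is one option. A minimal sketch; the URL is just a placeholder, and sending is separated from request construction so the setup is easy to test:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public class ManualFetch {
    // Builds a polite GET request with an explicit User-Agent;
    // actually sending it is left to the caller.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder()
                .uri(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("User-Agent", "Mozilla/5.0") // same anti-blocking tip as in the Jsoup example
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildRequest("https://en.wikipedia.org/wiki/Main_Page");
        System.out.println(req.uri() + " UA=" + req.headers().firstValue("User-Agent").orElse("?"));
        // To actually fetch and parse:
        // String html = java.net.http.HttpClient.newHttpClient()
        //         .send(req, java.net.http.HttpResponse.BodyHandlers.ofString()).body();
        // Document doc = Jsoup.parse(html);
    }
}
```

Jsoup.connect() is more convenient for simple cases; the manual route pays off when you need fine-grained control over connection pooling or retries.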

Step 3: Using XPath to Query the Document

This is the core of the process. Jsoup itself only added basic XPath support relatively recently (Element.selectXpath(...), since 1.14.3), so a common approach is to pair it with a dedicated library that evaluates XPath expressions against a Jsoup Document.

A great, lightweight library for this is JsoupXpath.

Setup jsoup-xpath

Add this dependency to your pom.xml:

<dependencies>
    <!-- Jsoup for HTML parsing -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.17.2</version>
    </dependency>
    <!-- Jsoup-XPath for XPath support -->
    <dependency>
        <groupId>cn.wanghaomiao</groupId>
        <artifactId>JsoupXpath</artifactId>
        <version>2.3.2</version> <!-- Check for the latest version -->
    </dependency>
</dependencies>
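For Gradle projects, the equivalent coordinates would be as follows; double-check the exact artifact id and latest version on Maven Central, since forks of this library have been published under slightly different names:

```groovy
dependencies {
    // Jsoup for HTML parsing
    implementation 'org.jsoup:jsoup:1.17.2'
    // JsoupXpath for XPath support -- verify coordinates on Maven Central
    implementation 'cn.wanghaomiao:JsoupXpath:2.3.2'
}
```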

How to Use JsoupXpath

The entry point is the JXDocument class. You create one from parsed HTML and call its selN() method with your XPath expression, which returns a List<JXNode>.

Let's use a sample HTML string for our examples:

<!DOCTYPE html>
<html>
<head><title>Sample Page</title>
</head>
<body>
    <h1>Welcome to the Page</h1>
    <div id="content">
        <p class="main-paragraph">This is the first paragraph.</p>
        <p class="main-paragraph">This is the second paragraph with a <a href="https://example.com">link</a>.</p>
    </div>
    <ul class="items">
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
    </ul>
</body>
</html>

Example Code with XPath Queries

import cn.wanghaomiao.xpath.model.JXDocument;
import cn.wanghaomiao.xpath.model.JXNode;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.List;
public class XPathExample {
    public static void main(String[] args) {
        String html = "<!DOCTYPE html><html><head><title>Sample Page</title></head>" +
                       "<body><h1>Welcome to the Page</h1>" +
                       "<div id=\"content\"><p class=\"main-paragraph\">This is the first paragraph.</p>" +
                       "<p class=\"main-paragraph\">This is the second paragraph with a <a href=\"https://example.com\">link</a>.</p></div>" +
                       "<ul class=\"items\"><li>Item 1</li><li>Item 2</li><li>Item 3</li></ul></body></html>";
        // Create a JXDocument straight from the HTML string
        // (jsoup-xpath 2.x API; older 1.x versions used new JXDocument(Jsoup.parse(html)))
        JXDocument jxdoc = JXDocument.create(html);
        try {
            // 1. Select the title of the page
            // XPath: /html/head/title
            System.out.println("--- 1. Title ---");
            List<JXNode> titles = jxdoc.selN("//title"); // //title matches a <title> element anywhere in the document
            for (JXNode node : titles) {
                System.out.println("Title Text: " + node.toString());
            }
            // 2. Select all paragraph elements with class 'main-paragraph'
            // XPath: //p[@class='main-paragraph']
            System.out.println("\n--- 2. Main Paragraphs ---");
            List<JXNode> paragraphs = jxdoc.selN("//p[@class='main-paragraph']");
            for (JXNode node : paragraphs) {
                System.out.println("Paragraph Text: " + node.toString());
            }
            // 3. Select all list items from the 'items' unordered list
            // XPath: //ul[@class='items']/li
            System.out.println("\n--- 3. List Items ---");
            List<JXNode> listItems = jxdoc.selN("//ul[@class='items']/li");
            for (JXNode node : listItems) {
                System.out.println("List Item Text: " + node.toString());
            }
            // 4. Select the 'href' attribute of the link inside a paragraph
            // XPath: //p[@class='main-paragraph']//a/@href
            System.out.println("\n--- 4. Link Href ---");
            List<JXNode> hrefs = jxdoc.selN("//p[@class='main-paragraph']//a/@href");
            for (JXNode node : hrefs) {
                System.out.println("Link Href: " + node.toString());
            }
            // 5. Select the text of the second paragraph
            // XPath: (//p[@class='main-paragraph'])[2]/text()
            System.out.println("\n--- 5. Second Paragraph Text ---");
            List<JXNode> secondParagraphText = jxdoc.selN("(//p[@class='main-paragraph'])[2]/text()");
            for (JXNode node : secondParagraphText) {
                System.out.println("Text: " + node.toString().trim());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
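For comparison, the simpler of these queries can be handled with Jsoup's native CSS selectors alone, with no extra library. A sketch against the same sample markup (the helper method name is just illustrative):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CssSelectorExample {
    // CSS-selector equivalent of //p[@class='main-paragraph']//a/@href
    static String firstLinkHref(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p.main-paragraph a").attr("href");
    }

    public static void main(String[] args) {
        String html = "<div id=\"content\"><p class=\"main-paragraph\">First.</p>"
                + "<p class=\"main-paragraph\">Second with a <a href=\"https://example.com\">link</a>.</p></div>";
        Document doc = Jsoup.parse(html);
        // CSS equivalent of //p[@class='main-paragraph']
        for (Element p : doc.select("p.main-paragraph")) {
            System.out.println(p.text());
        }
        System.out.println(firstLinkHref(html)); // prints "https://example.com"
    }
}
```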

Common XPath Expressions for HTML

Goal                XPath Expression                      Description
All Elements        //*                                   Selects all elements in the document.
Element by Tag      //div                                 Selects all <div> elements.
Element by ID       //*[@id='main-content']               Selects any element with id="main-content".
Element by Class    //p[@class='intro']                   Selects all <p> elements with class="intro".
Attribute Value     //a[@href='https://...']              Selects all <a> elements with a specific href.
Text Content        //h1/text()                           Selects the text node of all <h1> elements.
Contains Text       //*[contains(text(), 'Welcome')]      Selects any element whose text contains "Welcome".
Starts With         //a[starts-with(@href, 'https')]      Selects all <a> elements whose href starts with "https".
Parent Element      //a/..                                Selects the parent of all <a> elements.
Specific Child      //ul/li[1]                            Selects the first <li> child of every <ul>.
Last Child          //ul/li[last()]                       Selects the last <li> child of every <ul>.
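For well-formed markup (XHTML, or HTML that a parser has already cleaned up), several of the expressions above can be tried with the JDK's built-in XPath engine (javax.xml.xpath), with no extra dependency. A small sketch on a tiny inline document:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class JdkXPathDemo {
    // Runs a few of the table's expressions against a small well-formed document.
    static String demo() throws Exception {
        String xhtml = "<html><body><h1>Welcome</h1>"
                + "<ul><li>Item 1</li><li>Item 2</li><li>Item 3</li></ul></body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Text content: //h1/text()
        String heading = (String) xpath.evaluate("//h1/text()", doc, XPathConstants.STRING);
        // Last child: //ul/li[last()]
        String lastItem = (String) xpath.evaluate("//ul/li[last()]", doc, XPathConstants.STRING);
        // All <li> children of the list, as a node set
        NodeList items = (NodeList) xpath.evaluate("//ul/li", doc, XPathConstants.NODESET);
        return heading + " / " + lastItem + " / " + items.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // prints "Welcome / Item 3 / 3"
    }
}
```

Note that real-world HTML is rarely well-formed enough for the JDK's XML parser, which is exactly why libraries like Jsoup sit in front of it.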

Important Considerations and Best Practices

  1. HTML vs. XML: HTML is often "messy" (unclosed tags, case-insensitive). Jsoup does a great job of cleaning it up to make it more XML-like, which is essential for XPath to work reliably.
  2. Robustness: Websites change. If a website changes its class names or structure, your XPath queries will break. Always build your scraper with this in mind.
  3. Ethics and robots.txt: Always check a website's robots.txt file (e.g., https://example.com/robots.txt) to see which parts of the site they ask scrapers not to visit. Respect these rules.
  4. Rate Limiting: Don't send too many requests in a short period. You can get your IP address blocked. Add delays between requests.
  5. Error Handling: Network connections can fail, and HTML can be unexpected. Always wrap your code in try-catch blocks.
  6. Alternatives to XPath: While XPath is powerful, CSS selectors are also supported natively by Jsoup (doc.select("div.content > p")) and are often more readable for simple to medium queries. Many developers prefer CSS selectors for this reason. Use XPath when you need its advanced features like axes (ancestor, following-sibling) or complex conditions.
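The rate-limiting point above can be as simple as enforcing a minimum pause between requests. A minimal, illustrative helper (the one-second interval is an arbitrary choice; tune it per site):

```java
public class PoliteDelay {
    private final long minIntervalMillis;
    private long lastRequestAt = 0;

    public PoliteDelay(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // Blocks until at least minIntervalMillis has passed since the previous call.
    public synchronized void await() throws InterruptedException {
        long now = System.currentTimeMillis();
        long waitFor = lastRequestAt + minIntervalMillis - now;
        if (waitFor > 0) {
            Thread.sleep(waitFor);
        }
        lastRequestAt = System.currentTimeMillis();
    }

    public static void main(String[] args) throws InterruptedException {
        PoliteDelay delay = new PoliteDelay(1000); // at most one request per second
        for (int i = 0; i < 3; i++) {
            delay.await();
            System.out.println("fetch page " + i); // e.g. Jsoup.connect(...).get()
        }
    }
}
```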