Java HTML to Text

Converting HTML to plain text in Java is a common task, especially for email processing, content analysis, or displaying a summary of a web page.

Here’s a comprehensive guide covering the best methods, from simple built-in libraries to robust, production-grade solutions.

Summary of Methods

Method	Pros	Cons	Best For
`Jsoup`	Excellent choice. Very easy to use, robust, handles malformed HTML, and has a dedicated `.text()` method.	External dependency (but very popular and well-maintained).	Almost all use cases. Web scraping, content extraction, general-purpose HTML parsing.
`javax.swing.text.html`	Built into Java (no dependency). Good for simple, well-formed HTML.	Verbose, not designed for general-purpose HTML parsing, can be brittle with real-world "messy" HTML.	Quick, simple tasks in environments where you can't add external libraries.
Regular Expressions	No dependencies.	Extremely brittle and not recommended. Fails on complex or malformed HTML.	Only for the simplest, most predictable HTML snippets. Avoid in production.

Method 1: Using Jsoup (Recommended)

Jsoup is a fantastic open-source library designed for parsing, manipulating, and cleaning HTML. It's the go-to solution for this task in the Java world because it's powerful, easy to use, and very forgiving of imperfect HTML.

Step 1: Add Jsoup Dependency

You need to include Jsoup in your project. If you're using a build tool like Maven or Gradle, add the following dependency.

Maven (pom.xml):

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version> <!-- Check for the latest version -->
</dependency>

Gradle (build.gradle):

implementation 'org.jsoup:jsoup:1.17.2' // Check for the latest version

Step 2: Write the Java Code

Jsoup provides a static Jsoup.parse() method and a .text() method on the Document object that does all the heavy lifting.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class HtmlToTextWithJsoup {
    public static void main(String[] args) {
        String html = "<html><head><title>My Title</title></head>"
                     + "<body><h1>A Heading</h1>"
                     + "<p>This is a <b>paragraph</b> with a <a href='https://example.com'>link</a>.</p>"
                     + "<script>alert('This should be ignored');</script>"
                     + "<style>body { color: red; }</style></body></html>";
        // Parse the HTML string
        Document doc = Jsoup.parse(html);
        // Get the plain text
        String text = doc.text();
        System.out.println("--- Original HTML ---");
        System.out.println(html);
        System.out.println("\n--- Converted Text ---");
        System.out.println(text);
    }
}

Output:

--- Original HTML ---
<html><head><title>My Title</title></head><body><h1>A Heading</h1><p>This is a <b>paragraph</b> with a <a href='https://example.com'>link</a>.</p><script>alert('This should be ignored');</script><style>body { color: red; }</style></body></html>
--- Converted Text ---
My Title A Heading This is a paragraph with a link.

Note that doc.text() returns the text of the whole document, including the <title>. To get only the body text, call doc.body().text() instead.

Key Advantages of Jsoup:

Simplicity: The doc.text() method is all you need.
Robustness: It handles real-world HTML that is often not perfectly formed.
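Script and style handling: text() skips the contents of <script> and <style> elements, which is usually what you want. This is text extraction, not sanitization of untrusted HTML.
Whitespace Handling: It collapses runs of spaces and newlines into single spaces, making the output clean. Line breaks between block elements such as paragraphs are not preserved.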

Method 2: Using `javax.swing.text.html` (Built-in)

This approach uses classes from the Swing library, which are part of the standard Java Development Kit (JDK). It doesn't require any external dependencies.

Step 1: Write the Java Code

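This method is more verbose. You wrap the HTML in a StringReader, pass it to a ParserDelegator, and supply an HTMLEditorKit.ParserCallback whose handleText() method receives each run of text as the parser encounters it. A JTextPane-based alternative is shown in the comment at the end of the code.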

import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.StringReader;
public class HtmlToTextWithSwing {
    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>My Title</title></head>"
                     + "<body><h1>A Heading</h1>"
                     + "<p>This is a <b>paragraph</b>.</p></body></html>";
        // Use a custom ParserCallback to handle the text extraction
        HTMLEditorKit.ParserCallback parserCallback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // This method is called for each piece of text
                System.out.print(new String(data));
            }
        };
        // Create a new parser delegator and parse the HTML
        new ParserDelegator().parse(new StringReader(html), parserCallback, true);
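        // A simpler, but less flexible way using JTextPane.
        // Note: textPane.getText() would return HTML markup, so read the plain text from the Document instead.
        /*
        JTextPane textPane = new JTextPane(); // needs: import javax.swing.JTextPane;
        textPane.setContentType("text/html");
        textPane.setText(html);
        // Document.getText(int, int) throws javax.swing.text.BadLocationException
        String text = textPane.getDocument().getText(0, textPane.getDocument().getLength());
        System.out.println(text);
        */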
    }
}

Output:

Disadvantages of this method:

Verbose: Requires more boilerplate code.
Brittle: The Swing text parser is not as robust as dedicated HTML parsers. It can fail on complex or non-standard HTML.
GUI toolkit dependency: These classes belong to Swing (the java.desktop module) and were written to support Swing's HTML rendering, not general-purpose text extraction. Relying on UI classes, especially JTextPane, for non-GUI tasks is generally not a good practice.
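
The example above prints each text run as it arrives, so nothing separates a heading from the paragraph that follows it. As a rough sketch of one way around that, the callback below collects the text into a StringBuilder and adds a line break after block-level elements. The class name SwingTextExtractor and its extract() helper are hypothetical names chosen for illustration, and the line-break policy is an assumption, not something the JDK prescribes.

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper (not part of the JDK): gathers parsed text into one string.
class SwingTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void handleText(char[] data, int pos) {
        // Called once per run of text between tags
        out.append(data);
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        // Assumption: a line break after each block-level element (<h1>, <p>, ...)
        if (tag.isBlock()) {
            out.append('\n');
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // <br> has no end tag, so it is reported as a simple tag
        if (tag == HTML.Tag.BR) {
            out.append('\n');
        }
    }

    static String extract(String html) {
        SwingTextExtractor callback = new SwingTextExtractor();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // A StringReader is not expected to fail, but parse() declares IOException
            throw new UncheckedIOException(e);
        }
        // Collapse consecutive line breaks and trim the ends
        return callback.out.toString().replaceAll("\\n{2,}", "\n").trim();
    }

    public static void main(String[] args) {
        System.out.println(extract("<h1>A Heading</h1><p>This is a <b>paragraph</b>.</p>"));
    }
}
```

Returning a String instead of printing inside the callback makes the result easier to reuse and to test. The brittleness noted above still applies, since the underlying parser is the same.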

Method 3: Using Regular Expressions (Not Recommended)

You might be tempted to use a regex to strip all HTML tags. This is a very bad idea for anything but the most trivial cases.

HTML is not a regular language, and regex is not designed to parse it. A simple regex will fail on nested tags, attributes with > characters, or malformed HTML.

public class HtmlToTextWithRegex {
    public static void main(String[] args) {
        String html = "<p>This is a <b>paragraph</b>.</p>";
        // A simple regex to remove tags
        String text = html.replaceAll("<[^>]*>", "");
        System.out.println("--- Original HTML ---");
        System.out.println(html);
        System.out.println("\n--- Converted Text (with Regex) ---");
        System.out.println(text);
    }
}

Output:

--- Original HTML ---
<p>This is a <b>paragraph</b>.</p>
--- Converted Text (with Regex) ---
This is a paragraph.

Why it's a bad idea: Consider this slightly more complex HTML: "<div class='main'>Hello <span id='1'>world</span></div>"

A simple replaceAll("<[^>]*>", "") still works here, but what if an attribute value contains a >, as in <a href="example.com?q=1>2">link</a>? The pattern stops at the first >, so the result is 2">link instead of link.
It also cannot handle comments (<!-- ... -->) or script/style tags correctly without an extremely complex and unreadable regex: the tags are removed, but the JavaScript or CSS between them stays in the output.
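
To make these failure modes concrete, here is a small sketch that runs the same naive pattern over a few inputs. The class name RegexStripDemo and the sample strings are illustrative assumptions; only the first and second samples come from the text above.

```java
class RegexStripDemo {
    // The same naive tag-stripping pattern used in the example above
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String[] samples = {
            "<p>This is a <b>paragraph</b>.</p>",      // simple case: works
            "<a href=\"example.com?q=1>2\">link</a>",  // '>' inside an attribute value
            "<script>alert('x');</script>Hello",       // script body is left behind
            "<!-- a > b -->Hello"                      // '>' inside a comment
        };
        for (String sample : samples) {
            System.out.println(sample + "  ->  [" + stripTags(sample) + "]");
        }
    }
}
```

Only the first sample comes out as intended. The second yields 2">link, the third keeps the JavaScript (alert('x');Hello), and the fourth leaves the tail of the comment ( b -->Hello) in the output.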

Conclusion and Recommendation

For any serious application, use Jsoup. It's the industry standard for Java HTML processing, it's easy to set up, and it will save you from countless headaches caused by malformed HTML.

Use the built-in javax.swing.text.html method only if you are in a highly restricted environment where you absolutely cannot add any external libraries.

Never use regular expressions for parsing HTML.
