杰瑞科技汇

java html to text

Converting HTML to plain text is a common task in Java, for example in email processing, content analysis, or generating a summary of a web page.

This guide covers three approaches: the Jsoup library, the JDK's built-in Swing HTML parser, and regular expressions.

Summary of Methods

| Method | Pros | Cons | Best For |
|---|---|---|---|
| Jsoup | Excellent choice. Very easy to use, robust, handles malformed HTML, and has a dedicated .text() method. | External dependency (but very popular and well-maintained). | Almost all use cases: web scraping, content extraction, general-purpose HTML parsing. |
| javax.swing.text.html | Built into Java (no dependency). Good for simple, well-formed HTML. | Verbose, not designed for general-purpose HTML parsing, can be brittle with real-world "messy" HTML. | Quick, simple tasks in environments where you can't add external libraries. |
| Regular Expressions | No dependencies. | Extremely brittle and not recommended. Fails on complex or malformed HTML. | Only the simplest, most predictable HTML snippets. Avoid in production. |

Method 1: Using Jsoup (Recommended)

Jsoup is a fantastic open-source library designed for parsing, manipulating, and cleaning HTML. It's the go-to solution for this task in the Java world because it's powerful, easy to use, and very forgiving of imperfect HTML.

Step 1: Add Jsoup Dependency

You need to include Jsoup in your project. If you're using a build tool like Maven or Gradle, add the following dependency.

Maven (pom.xml):

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version> <!-- Check for the latest version -->
</dependency>

Gradle (build.gradle):

implementation 'org.jsoup:jsoup:1.17.2' // Check for the latest version

Step 2: Write the Java Code

Jsoup provides a static Jsoup.parse() method and a .text() method on the Document object that does all the heavy lifting.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class HtmlToTextWithJsoup {
    public static void main(String[] args) {
        String html = "<html><head><title>My Title</title></head>"
                     + "<body><h1>A Heading</h1>"
                     + "<p>This is a <b>paragraph</b> with a <a href='https://example.com'>link</a>.</p>"
                     + "<script>alert('This should be ignored');</script>"
                     + "<style>body { color: red; }</style></body></html>";
        // Parse the HTML string
        Document doc = Jsoup.parse(html);
        // Get the plain text
        String text = doc.text();
        System.out.println("--- Original HTML ---");
        System.out.println(html);
        System.out.println("\n--- Converted Text ---");
        System.out.println(text);
    }
}

Output:

--- Original HTML ---
<html><head><title>My Title</title></head><body><h1>A Heading</h1><p>This is a <b>paragraph</b> with a <a href='https://example.com'>link</a>.</p><script>alert('This should be ignored');</script><style>body { color: red; }</style></body></html>

--- Converted Text ---
My Title A Heading This is a paragraph with a link.

Note that doc.text() includes the text of the <title> element. To extract only the body content, call doc.body().text() instead.

Key Advantages of Jsoup:

  • Simplicity: The doc.text() method is all you need.
  • Robustness: It handles real-world HTML that is often not perfectly formed.
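  • Script and style handling: text() skips the contents of <script> and <style> elements, which is usually what you want. This is text extraction, not sanitization of untrusted HTML.
  • Whitespace Handling: It collapses runs of spaces and newlines into single spaces, which keeps the output clean. As a result, paragraph and line breaks are not preserved.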

Method 2: Using javax.swing.text.html (Built-in)

This approach uses classes from the Swing library, which are part of the standard Java Development Kit (JDK). It doesn't require any external dependencies.

Step 1: Write the Java Code

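This method is more verbose. You wrap the HTML in a StringReader, pass it to a ParserDelegator, and supply an HTMLEditorKit.ParserCallback whose handleText method receives each run of text. An alternative, shown in the comment at the end of the code, loads the HTML into a JTextPane and extracts the text from it.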

import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.StringReader;
public class HtmlToTextWithSwing {
    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>My Title</title></head>"
                     + "<body><h1>A Heading</h1>"
                     + "<p>This is a <b>paragraph</b>.</p></body></html>";
        StringBuilder text = new StringBuilder();
        // Use a custom ParserCallback to collect the text
        HTMLEditorKit.ParserCallback parserCallback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once for each run of text between tags
                text.append(data);
            }
        };
        // Parse the HTML; the third argument (ignoreCharSet = true) tells the
        // parser to ignore any charset declaration in the document
        new ParserDelegator().parse(new StringReader(html), parserCallback, true);
        System.out.println(text);
        // Alternative using JTextPane. getText() on an HTML pane returns HTML
        // markup, so read the plain text from the underlying Document instead:
        /*
        JTextPane textPane = new JTextPane();
        textPane.setContentType("text/html");
        textPane.setText(html);
        Document doc = textPane.getDocument();
        String plainText = doc.getText(0, doc.getLength()); // throws BadLocationException
        System.out.println(plainText);
        */
    }
}

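Output: the text content of the document, in document order. handleText is called once per run of text, so any separators between elements, such as line breaks after headings, have to come from your callback rather than from the parser.
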

Disadvantages of this method:

  • Verbose: Requires more boilerplate code.
  • Brittle: The Swing text parser is not as robust as dedicated HTML parsers. It can fail on complex or non-standard HTML.
  • Swing dependency: These classes live in the java.desktop module and were written to support GUI text components. Pulling GUI classes into a non-GUI task, such as a server-side service, is generally not good practice.
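
The callback approach from this section can be extended to keep one line per block element and to leave out the title. The following is a minimal sketch, not a complete converter: the class name BlockAwareHtmlText and the method toText are hypothetical names chosen for illustration, and the handling covers only <title>, <br>, and tags that Swing's HTML.Tag reports as block-level.

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper (not part of the JDK): extracts text with one line per block element.
class BlockAwareHtmlText {

    static String toText(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int titleDepth = 0; // greater than 0 while inside <title>

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TITLE) {
                    titleDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TITLE) {
                    titleDepth = Math.max(0, titleDepth - 1);
                } else if (tag.isBlock()) {
                    newline(); // end of a block element such as <h1> or <p>
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.BR) {
                    newline();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (titleDepth == 0) {
                    out.append(data);
                }
            }

            // Append a line break unless the buffer is empty or already ends with one.
            private void newline() {
                int n = out.length();
                if (n > 0 && out.charAt(n - 1) != '\n') {
                    out.append('\n');
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>My Title</title></head>"
                    + "<body><h1>A Heading</h1>"
                    + "<p>This is a <b>paragraph</b>.</p></body></html>";
        System.out.println(toText(html));
    }
}
```

Tracking the title with a counter rather than filtering text afterwards keeps the callback stateless with respect to content, at the cost of a little extra code. This is part of what the "Verbose" point above refers to.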

Method 3: Using Regular Expressions (Not Recommended)

You might be tempted to use a regex to strip all HTML tags. This is a very bad idea for anything but the most trivial cases.

HTML is not a regular language, and regex is not designed to parse it. A simple regex will fail on attributes that contain > characters, on comments, on script and style content, and on malformed HTML.

public class HtmlToTextWithRegex {
    public static void main(String[] args) {
        String html = "<p>This is a <b>paragraph</b>.</p>";
        // A simple regex to remove tags: "<", then any characters except ">", then ">".
        // String.replaceAll compiles the pattern itself, so no java.util.regex imports are needed.
        String text = html.replaceAll("<[^>]*>", "");
        System.out.println("--- Original HTML ---");
        System.out.println(html);
        System.out.println("\n--- Converted Text (with Regex) ---");
        System.out.println(text);
    }
}

Output:

--- Original HTML ---
<p>This is a <b>paragraph</b>.</p>

--- Converted Text (with Regex) ---
This is a paragraph.

Why it's a bad idea: The pattern <[^>]*> happens to work on simple markup such as "<div class='main'>Hello <span id='1'>world</span></div>", but it breaks as soon as the input gets less predictable.

  • An attribute value may contain >. For <a href="example.com?q=1>2">link</a>, the match for the opening tag stops at the first >, so the result is 2">link instead of link.
  • It removes the <script> and <style> tags but leaves their contents in the output, mishandles comments (<!-- comment -->) that contain >, and does not decode entities such as &amp;. Covering these cases takes an extremely complex and unreadable regex.
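
These failure modes are easy to reproduce. The sketch below applies the same naive pattern to three small inputs; RegexStripDemo and stripTags are hypothetical names used only for this demonstration, not a recommended API.

```java
// Hypothetical demo class: shows where the naive tag-stripping regex goes wrong.
class RegexStripDemo {

    // The naive approach discussed in this section: delete anything that looks like a tag.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // A '>' inside an attribute value ends the match early.
        System.out.println(stripTags("<a href=\"example.com?q=1>2\">link</a>"));
        // prints: 2">link

        // The script tags are removed, but the script body stays in the output.
        System.out.println(stripTags("<script>alert(1)</script>Hi"));
        // prints: alert(1)Hi

        // Character entities are left encoded.
        System.out.println(stripTags("<p>Tom &amp; Jerry</p>"));
        // prints: Tom &amp; Jerry
    }
}
```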

Conclusion and Recommendation

For any serious application, use Jsoup. It is a widely used, well-maintained library for HTML processing in Java, it's easy to set up, and it will save you from countless headaches caused by malformed HTML.

Use the built-in javax.swing.text.html method only if you are in a highly restricted environment where you absolutely cannot add any external libraries.

Avoid regular expressions for parsing HTML in anything beyond trivial, fully predictable snippets.
