Of course! Converting HTML to plain text in Java is a common task, especially for email processing, content analysis, or displaying a summary of a web page.

Here’s a comprehensive guide covering the best methods, from simple built-in libraries to robust, production-grade solutions.
Summary of Methods
| Method | Pros | Cons | Best For |
|---|---|---|---|
| Jsoup | Excellent choice. Very easy to use, robust, handles malformed HTML, and has a dedicated .text() method. | External dependency (but very popular and well-maintained). | Almost all use cases: web scraping, content extraction, general-purpose HTML parsing. |
| javax.swing.text.html | Built into Java (no dependency). Good for simple, well-formed HTML. | Verbose, not designed for general-purpose HTML parsing, can be brittle with real-world "messy" HTML. | Quick, simple tasks in environments where you can't add external libraries. |
| Regular Expressions | No dependencies. | Extremely brittle and not recommended. Fails on complex or malformed HTML. | Only for the simplest, most predictable HTML snippets. Avoid in production. |
Method 1: Using Jsoup (Recommended)
Jsoup is a fantastic open-source library designed for parsing, manipulating, and cleaning HTML. It's the go-to solution for this task in the Java world because it's powerful, easy to use, and very forgiving of imperfect HTML.
Step 1: Add Jsoup Dependency
You need to include Jsoup in your project. If you're using a build tool like Maven or Gradle, add the following dependency.
Maven (pom.xml):

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.17.2</version> <!-- Check for the latest version -->
    </dependency>

Gradle (build.gradle):

    implementation 'org.jsoup:jsoup:1.17.2' // Check for the latest version

Step 2: Write the Java Code
Jsoup provides a static Jsoup.parse() method and a .text() method on the Document object that does all the heavy lifting.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class HtmlToTextWithJsoup {
        public static void main(String[] args) {
            String html = "<html><head><title>My Title</title></head>"
                    + "<body><h1>A Heading</h1>"
                    + "<p>This is a <b>paragraph</b> with a <a href='https://example.com'>link</a>.</p>"
                    + "<script>alert('This should be ignored');</script>"
                    + "<style>body { color: red; }</style></body></html>";

            // Parse the HTML string
            Document doc = Jsoup.parse(html);

            // Get the plain text
            String text = doc.text();

            System.out.println("--- Original HTML ---");
            System.out.println(html);
            System.out.println("\n--- Converted Text ---");
            System.out.println(text);
        }
    }

Output (note that doc.text() also includes the text of the <title> element; call doc.body().text() if you only want the body content):

    --- Original HTML ---
    <html><head><title>My Title</title></head><body><h1>A Heading</h1><p>This is a <b>paragraph</b> with a <a href='https://example.com'>link</a>.</p><script>alert('This should be ignored');</script><style>body { color: red; }</style></body></html>

    --- Converted Text ---
    My Title A Heading This is a paragraph with a link.

Key Advantages of Jsoup:
- Simplicity: The doc.text() method is all you need.
- Robustness: It handles real-world HTML that is often not perfectly formed.
- Safety: The text output leaves out the contents of <script> and <style> elements, which is usually what you want.
- Whitespace Handling: It intelligently collapses multiple spaces and newlines into single spaces, making the output clean.
Method 2: Using javax.swing.text.html (Built-in)
This approach uses classes from the Swing library, which are part of the standard Java Development Kit (JDK). It doesn't require any external dependencies.

Step 1: Write the Java Code
This method is more verbose. You wrap the HTML in a StringReader, hand it to a ParserDelegator, and supply an HTMLEditorKit.ParserCallback whose handleText method receives each run of text as the parser encounters it. An alternative based on JTextPane is sketched in the comment at the end of the example.

    import java.io.IOException;
    import java.io.StringReader;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    public class HtmlToTextWithSwing {
        public static void main(String[] args) throws IOException {
            String html = "<html><head><title>My Title</title></head>"
                    + "<body><h1>A Heading</h1>"
                    + "<p>This is a <b>paragraph</b>.</p></body></html>";

            // Use a custom ParserCallback to handle the text extraction
            HTMLEditorKit.ParserCallback parserCallback = new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleText(char[] data, int pos) {
                    // Called once for each run of text the parser finds
                    System.out.print(new String(data));
                }
            };

            // Parse the HTML; the last argument (ignoreCharSet = true) makes the
            // parser ignore any charset declaration inside the document
            new ParserDelegator().parse(new StringReader(html), parserCallback, true);

            // A shorter, but less flexible way using JTextPane
            // (needs javax.swing.JTextPane and javax.swing.text.Document):
            /*
            JTextPane textPane = new JTextPane();
            textPane.setContentType("text/html");
            textPane.setText(html);
            // textPane.getText() would return HTML markup again,
            // so read the plain text from the underlying document instead
            Document doc = textPane.getDocument();
            String text = doc.getText(0, doc.getLength()); // throws BadLocationException
            System.out.println(text);
            */
        }
    }

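Output: because the callback uses System.out.print, each text run is written as it arrives, with nothing separating the text of one element from the next.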
Disadvantages of this method:
- Verbose: Requires more boilerplate code.
- Brittle: The Swing text parser is not as robust as dedicated HTML parsers. It can fail on complex or non-standard HTML.
- Not for UI: The JTextPane variant in particular uses a GUI component purely for text extraction, not for displaying content. Using Swing components for non-GUI tasks is generally not a good practice.
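The ParserCallback approach from this section can also collect the text into a string instead of printing it. The following is a minimal sketch under stated assumptions, not a definitive implementation: the class name SwingTextCollector and the method htmlToText are hypothetical, and emitting a newline after every flow-breaking tag (HTML.Tag.breaksFlow()) and for each <br> is just one reasonable guess at what readable plain-text output looks like. It does no filtering beyond what the Swing parser itself does.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the names are illustrative only.
final class SwingTextCollector {

    // Returns the text content of the given HTML, one line per block-like element.
    static String htmlToText(String html) {
        StringBuilder out = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Append each text run exactly as the parser delivers it
                out.append(data);
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                // Assumption: a tag that breaks the text flow should end the current line
                if (tag.breaksFlow()) {
                    out.append('\n');
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // <br> has no end tag, so it arrives here
                if (tag == HTML.Tag.BR) {
                    out.append('\n');
                }
            }
        };

        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // A StringReader should not fail, but the parse method declares IOException
            throw new UncheckedIOException(e);
        }

        // Collapse runs of line breaks and surrounding whitespace into single newlines
        return out.toString().replaceAll("\\s*\\n\\s*", "\n").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>A Heading</h1>"
                + "<p>This is a <b>paragraph</b>.</p></body></html>";
        System.out.println(htmlToText(html));
    }
}
```

Returning a string and wrapping the checked IOException keeps the helper easy to call from other code. Only the parser classes are used, so no Swing component is ever created.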
Method 3: Using Regular Expressions (Not Recommended)
You might be tempted to use a regex to strip all HTML tags. This is a very bad idea for anything but the most trivial cases.
HTML is not a regular language, and regex is not designed to parse it. A simple regex will fail on nested tags, attributes with > characters, or malformed HTML.

    public class HtmlToTextWithRegex {
        public static void main(String[] args) {
            String html = "<p>This is a <b>paragraph</b>.</p>";

            // A simple regex to remove tags
            String text = html.replaceAll("<[^>]*>", "");

            System.out.println("--- Original HTML ---");
            System.out.println(html);
            System.out.println("\n--- Converted Text (with Regex) ---");
            System.out.println(text);
        }
    }

Output:

    --- Original HTML ---
    <p>This is a <b>paragraph</b>.</p>

    --- Converted Text (with Regex) ---
    This is a paragraph.

Why it's a bad idea:
A simple replaceAll("<[^>]*>", "") copes with well-behaved markup such as <div class='main'>Hello <span id='1'>world</span></div>, but it breaks as soon as the input gets less predictable:
- An attribute value can contain a > character, as in <a href="example.com?q=1>2">link</a>. The regex treats the first > as the end of the tag and leaves 2">link in the output.
- It cannot handle comments (<!-- comment -->) or script/style tags correctly without an extremely complex and unreadable regex. For example, it removes the <script> tags themselves but keeps the JavaScript between them.
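The failure modes listed above are easy to reproduce. The sketch below runs the same tag-stripping regex over four inputs: one well-behaved snippet and three that trip it up. The class RegexStripDemo and the method stripTags are hypothetical names used only for this illustration.

```java
// Hypothetical demo class; stripTags is an illustrative name, not a library API.
final class RegexStripDemo {

    // The naive approach: delete anything that looks like <...>
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String[] inputs = {
            // Well-behaved markup: works as intended
            "<div class='main'>Hello <span id='1'>world</span></div>",
            // A '>' inside an attribute value ends the match too early
            "<a href=\"example.com?q=1>2\">link</a>",
            // A '>' inside a comment has the same effect
            "<!-- 1 > 0 -->Visible",
            // The script tags are removed, but the code between them stays
            "<script>alert('x');</script>Done"
        };

        for (String input : inputs) {
            System.out.println(input + "  ->  [" + stripTags(input) + "]");
        }
    }
}
```

The four results are Hello world, 2">link, a leading space followed by 0 -->Visible, and alert('x');Done. Only the first is what a reader would expect.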
Conclusion and Recommendation
For any serious application, use Jsoup. It's a widely used, well-maintained library for HTML processing in Java, it's easy to set up, and it will save you from countless headaches caused by malformed HTML.
Use the built-in javax.swing.text.html method only if you are in a highly restricted environment where you absolutely cannot add any external libraries.
Never use regular expressions for parsing HTML.
