Of course! Converting a DOCX file to HTML in Java is a common requirement for web applications, document processing, or content management systems. Here’s a comprehensive guide covering the most popular and robust libraries, with complete code examples.

Summary of Options
| Library | Pros | Cons | Best For |
|---|---|---|---|
| Apache POI (with the XDocReport XHTML converter) | Industry standard, very powerful, handles complex formatting. | Can be complex for simple tasks, HTML output can be verbose, and the converter is a separate dependency. | Complex DOCX files where you need fine-grained control over conversion. |
| docx4j | Excellent for DOCX, handles complex structures (headers, footers, relationships) well. | Another complex library, steeper learning curve. | High-fidelity conversion of complex Word documents, especially with advanced features. |
| jsoup | Extremely simple and fast for parsing HTML. | Cannot read DOCX files. You must first produce HTML from the DOCX with another tool (such as the POI-based converter in Method 1). | Cleaning or manipulating the already extracted HTML content. Often used as a second step. |
Method 1: Using Apache POI (Recommended & Powerful)
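Apache POI is the go-to library for all things Microsoft Office in Java. POI reads and writes DOCX files through its XWPF package, but it does not ship a DOCX-to-HTML converter itself. The XHTMLConverter used in this method comes from the separate XDocReport project (group ID fr.opensagres.xdocreport), which builds on XWPF and is surprisingly good for most use cases.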
Step 1: Add Dependencies
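You'll need the Apache POI core library, its OOXML (Office Open XML) support, and the XDocReport XHTML converter.
If you're using Maven, add this to your pom.xml:
<dependencies>
    <!-- Apache POI Core -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>5.2.4</version>
    </dependency>
    <!-- Apache POI for Office Open XML formats (like .docx) -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>5.2.4</version>
    </dependency>
    <!-- XDocReport converter that provides XHTMLConverter (not part of POI).
         2.0.4 is an assumption: pick the release built against your POI version. -->
    <dependency>
        <groupId>fr.opensagres.xdocreport</groupId>
        <artifactId>fr.opensagres.poi.xwpf.converter.xhtml</artifactId>
        <version>2.0.4</version>
    </dependency>
</dependencies>
If you're using Gradle, add this to your build.gradle:

implementation 'org.apache.poi:poi:5.2.4'
implementation 'org.apache.poi:poi-ooxml:5.2.4'
implementation 'fr.opensagres.xdocreport:fr.opensagres.poi.xwpf.converter.xhtml:2.0.4'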
Step 2: Write the Java Code
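The XWPFDocument class loads the DOCX into memory. It has no HTML export of its own: the conversion is done by XHTMLConverter, which writes the document to an OutputStream as XHTML, while XHTMLOptions controls how embedded images are extracted and referenced.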
Here is a complete, runnable example:
// Requires the XDocReport converter (fr.opensagres.poi.xwpf.converter.xhtml) in addition to POI.
// Package names below are those of the 2.x releases; 1.x used org.apache.poi.xwpf.converter.* instead.
import fr.opensagres.poi.xwpf.converter.core.BasicURIResolver;
import fr.opensagres.poi.xwpf.converter.core.FileImageExtractor;
import fr.opensagres.poi.xwpf.converter.core.XWPFConverterException;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DocxToHtmlConverter {
    public static void main(String[] args) {
        // 1. Define input and output paths
        Path docxPath = Paths.get("input.docx");
        Path htmlPath = Paths.get("output.html");
        Path imageDir = Paths.get("images");
        try {
            // 2. Create the image directory (a no-op if it already exists)
            Files.createDirectories(imageDir);
            // 3. Load the DOCX file
            try (InputStream docxInputStream = Files.newInputStream(docxPath);
                 XWPFDocument document = new XWPFDocument(docxInputStream)) {
                // 4. Configure HTML conversion options
                XHTMLOptions options = XHTMLOptions.create();
                // Write embedded images under imageDir, keeping their in-package path
                options.setExtractor(new FileImageExtractor(imageDir.toFile()));
                // Prefix the <img> src values so they point into the same folder
                options.URIResolver(new BasicURIResolver(imageDir.toString()));
                // 5. Perform the conversion
                try (OutputStream htmlOutputStream = Files.newOutputStream(htmlPath)) {
                    XHTMLConverter.getInstance().convert(document, htmlOutputStream, options);
                }
                System.out.println("Conversion successful! HTML saved to: " + htmlPath);
                System.out.println("Images saved to: " + imageDir);
            } catch (XWPFConverterException e) {
                System.err.println("Error during DOCX to HTML conversion: " + e.getMessage());
            }
        } catch (IOException e) {
            System.err.println("Error reading DOCX file or writing HTML file: " + e.getMessage());
        }
    }
}
Explanation:
- Paths: We define the paths for the input .docx file and the output .html file. We also create a directory (images) to store any images embedded in the document.
- Load Document: new XWPFDocument(docxInputStream) loads the DOCX file into memory.
- XHTMLOptions: This class customizes the conversion. XHTMLOptions.create() creates a default set of options. options.setExtractor(new FileImageExtractor(imageDir.toFile())) tells the converter to extract all images from the DOCX and save them under the specified directory, keeping the path they have inside the package (e.g., images/word/media/image1.png). options.URIResolver(new BasicURIResolver(...)) makes the HTML reference these images using matching relative paths.
- XHTMLConverter: This is the core class that performs the conversion. Its convert method takes the XWPFDocument, the output stream for the HTML, and the options.
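Whether the generated image references resolve in a browser depends on where the image folder sits relative to the HTML file. The following is a minimal stdlib-only sketch of computing that relative prefix. ImagePathSketch and relativeImagePrefix are hypothetical names used for illustration and are not part of POI or the converter.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ImagePathSketch {

    /**
     * Returns the prefix an <img src> value needs in order to reach imageDir
     * from the folder that contains htmlFile, always using forward slashes.
     * Hypothetical helper; assumes both paths are on the same file system root.
     */
    static String relativeImagePrefix(Path htmlFile, Path imageDir) {
        Path htmlFolder = htmlFile.toAbsolutePath().normalize().getParent();
        Path relative = htmlFolder.relativize(imageDir.toAbsolutePath().normalize());
        // URLs use '/', whatever the platform separator is
        String uri = relative.toString().replace('\\', '/');
        return uri.isEmpty() ? "" : uri + "/";
    }

    public static void main(String[] args) {
        // HTML file and image folder side by side
        System.out.println(relativeImagePrefix(Paths.get("output.html"), Paths.get("images")));
        // HTML file one level deeper than the image folder
        System.out.println(relativeImagePrefix(Paths.get("site/pages/output.html"), Paths.get("site/images")));
    }
}
```

If the prefix the converter writes and the prefix computed this way disagree, the images exist on disk but the page shows broken links, which is the most common symptom of a misconfigured image folder.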
Method 2: Using docx4j (Excellent Alternative)
docx4j is another powerful library, often praised for its high-fidelity conversion, especially with complex documents. It also handles headers, footers, and other document parts very well.
Step 1: Add Dependencies
Maven (pom.xml):
<dependencies>
    <!-- docx4j with the JAXB reference implementation. It pulls in docx4j-core,
         which already contains the HTML exporter. 11.4.9 is an assumption:
         check Maven Central for the current release. -->
    <dependency>
        <groupId>org.docx4j</groupId>
        <artifactId>docx4j-JAXB-ReferenceImpl</artifactId>
        <version>11.4.9</version>
    </dependency>
    <!-- docx4j-export-fo is only needed for PDF output via XSL-FO, not for HTML -->
</dependencies>
Step 2: Write the Java Code
docx4j converts the DOCX directly to XHTML, by default through an XSLT stylesheet. (XSL-FO is the intermediate format of its PDF export, not of the HTML export.)
import org.docx4j.Docx4J;
import org.docx4j.convert.out.HTMLSettings;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;

public class Docx4jConverter {
    public static void main(String[] args) {
        // 1. Define input and output paths
        String docxPath = "input.docx";
        String htmlPath = "output_docx4j.html";
        String imageDir = "images_docx4j";
        // Create image directory
        new File(imageDir).mkdirs();
        try {
            // 2. Load the DOCX package
            WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new File(docxPath));
            // 3. Configure conversion options
            HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
            // Older 3.x releases name this setter setWmlPackage
            htmlSettings.setOpcPackage(wordMLPackage);
            // Folder where embedded images are written
            htmlSettings.setImageDirPath(imageDir);
            // Prefix used in the generated <img> src attributes
            htmlSettings.setImageTargetUri(imageDir);
            // 4. Perform the conversion (DOCX -> XHTML via XSLT)
            try (OutputStream htmlOutputStream = new FileOutputStream(htmlPath)) {
                Docx4J.toHTML(htmlSettings, htmlOutputStream, Docx4J.FLAG_EXPORT_PREFER_XSL);
            }
            System.out.println("Conversion successful! HTML saved to: " + htmlPath);
            System.out.println("Images saved to: " + imageDir);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Explanation:
- Load Package: WordprocessingMLPackage.load() is the entry point for docx4j.
- HTMLSettings: Holds the export options. setImageDirPath() is the folder where images are written, and setImageTargetUri() is the prefix the HTML uses to reference them.
- Docx4J.toHTML(): This is the main method. It will:
  - Extract images and save them in the configured image folder.
  - Convert the document content, preserving styles, tables, and lists.
- The conversion is very accurate but can produce more complex HTML than the POI-based converter.
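Both loaders shown so far expect an Office Open XML package, so a legacy .doc file or a mislabelled upload fails with a library-specific exception. A cheap pre-check can reject such input earlier. The sketch below uses only the JDK and assumes the conventional main part name word/document.xml; strictly speaking the main part is located through the package relationships, so treat this as a heuristic. DocxPreCheck, looksLikeDocx, and sampleZip are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxPreCheck {

    /** Heuristic: true if the bytes form a ZIP archive containing word/document.xml. */
    static boolean looksLikeDocx(byte[] fileBytes) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(fileBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return true;
                }
            }
            return false;
        } catch (IOException e) {
            // Corrupt or truncated archive: treat it as "not a DOCX"
            return false;
        }
    }

    /** Builds a tiny in-memory ZIP with a single entry, for demonstration only. */
    static byte[] sampleZip(String entryName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDocx(sampleZip("word/document.xml")));
        System.out.println(looksLikeDocx(sampleZip("xl/workbook.xml")));
        System.out.println(looksLikeDocx("plain text".getBytes(StandardCharsets.UTF_8)));
    }
}
```

In an application, the byte array would typically come from Files.readAllBytes on the uploaded file, and a false result would short-circuit before either library is invoked.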
Method 3: The Two-Step Process (POI + jsoup)
Sometimes, the HTML generated by POI or docx4j is good but not perfectly clean. You might want to remove unnecessary <span> tags, fix attributes, or simplify the structure. This is where jsoup shines.
Step 1: Add jsoup Dependency
Maven (pom.xml):
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.15.4</version>
</dependency>
Step 2: Write the Java Code
This example first converts the DOCX to HTML using Apache POI, then cleans it up using jsoup.
// Requires the XDocReport converter (fr.opensagres.poi.xwpf.converter.xhtml) in addition to POI and jsoup.
import fr.opensagres.poi.xwpf.converter.core.XWPFConverterException;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DocxToHtmlWithJsoup {
    public static void main(String[] args) {
        Path docxPath = Paths.get("input.docx");
        Path rawHtmlPath = Paths.get("raw_output.html");
        Path cleanHtmlPath = Paths.get("clean_output.html");
        try {
            // Step 1: Convert DOCX to HTML using the POI-based converter
            try (InputStream docxInputStream = Files.newInputStream(docxPath);
                 XWPFDocument document = new XWPFDocument(docxInputStream);
                 OutputStream htmlOutputStream = Files.newOutputStream(rawHtmlPath)) {
                XHTMLOptions options = XHTMLOptions.create();
                XHTMLConverter.getInstance().convert(document, htmlOutputStream, options);
            }
            System.out.println("Raw HTML generated by POI.");
            // Step 2: Clean the HTML using jsoup.
            // Decode explicitly (UTF-8 assumed) instead of relying on the platform default charset.
            String rawHtml = new String(Files.readAllBytes(rawHtmlPath), StandardCharsets.UTF_8);
            // Jsoup.clean(html, Safelist) is the tool for sanitizing untrusted input against
            // an allow-list of tags. For tidying trusted output, editing the parsed document
            // directly is simpler:
            Document doc = Jsoup.parse(rawHtml);
            // Remove all <span> tags, keeping their content
            doc.select("span").unwrap();
            // Remove <style> elements and inline style attributes
            doc.select("style").remove();
            doc.select("[style]").removeAttr("style");
            String cleanHtml = doc.html();
            // Save the cleaned HTML
            Files.write(cleanHtmlPath, cleanHtml.getBytes(StandardCharsets.UTF_8));
            System.out.println("HTML cleaned and saved to: " + cleanHtmlPath);
        } catch (IOException | XWPFConverterException e) {
            e.printStackTrace();
        }
    }
}
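The read and write steps between the converter and jsoup are sensitive to character encoding: decoding or encoding with the platform default charset can corrupt accented or non-Latin text on some machines. The sketch below illustrates the effect with JDK classes only, assuming the HTML file is UTF-8. HtmlCharsetSketch, decodeHtml, and encodeHtml are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

public class HtmlCharsetSketch {

    /** Decodes HTML bytes with an explicit charset (UTF-8 is an assumption about the file). */
    static String decodeHtml(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** Encodes HTML text with the same explicit charset before writing it to disk. */
    static byte[] encodeHtml(String html) {
        return html.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "<p>caf\u00e9</p>";
        byte[] onDisk = encodeHtml(original);

        // Matching charsets: the text survives the round trip
        System.out.println(decodeHtml(onDisk).equals(original));

        // Mismatched charsets: the two-byte UTF-8 sequence is read as two Latin-1 characters
        String misread = new String(onDisk, StandardCharsets.ISO_8859_1);
        System.out.println(misread.equals(original));
    }
}
```

The same reasoning applies to the charset declared in the HTML head: it should agree with the encoding actually used when the file is written.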
Which One Should You Choose?
- For most projects: Start with Apache POI plus the XDocReport XHTML converter. POI is widely used and well-documented, and the converter is sufficient for many standard documents. The code is straightforward.
- For high-fidelity, complex documents: Use docx4j. If your DOCX files have complex headers, footers, or intricate formatting that POI struggles with, docx4j will likely produce a more accurate result.
- For clean, simple HTML: Use the POI + jsoup combination. This gives you the power of POI to extract content and the elegance of jsoup to produce clean, web-ready HTML. This is often the best approach for content that will be displayed on a website.
