Converting Word to HTML in Java
Java offers several common ways to convert a Word document (.docx) to HTML. The main approaches are described below:
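All of the approaches here expect a .docx file, which is a ZIP package, whereas a legacy .doc file is an OLE2 compound file. As a minimal sketch, the two can be told apart from the first eight bytes before a converter is called. The class WordFormatSniffer and its names are hypothetical and belong to none of the libraries discussed.

```java
// Hypothetical helper: guesses the container format from the leading bytes of a file.
final class WordFormatSniffer {

    enum Format { ZIP_PACKAGE, OLE2_COMPOUND, UNKNOWN }

    // ZIP local file header signature "PK\003\004" (.docx is a ZIP package).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // OLE2 compound file signature (used by legacy .doc files).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    // head: the first bytes of the file (at least 8 for a reliable answer).
    static Format sniff(byte[] head) {
        if (startsWith(head, ZIP_MAGIC)) {
            return Format.ZIP_PACKAGE;
        }
        if (startsWith(head, OLE2_MAGIC)) {
            return Format.OLE2_COMPOUND;
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A ZIP signature only shows that the file may be a .docx, since other formats such as .xlsx and .jar use the same container. The check is best treated as an early rejection of obviously wrong input.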

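Using Apache POI (recommended)
Apache POI is a widely used library for working with Office documents. It has no built-in .docx-to-HTML converter, but the XDocReport converter module, which builds on POI's XWPF model, fills that gap.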
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class WordToHtmlConverter {

    // Reads the .docx at inputPath and writes its XHTML rendering to outputPath.
    public static void convertDocxToHtml(String inputPath, String outputPath) throws IOException {
        try (InputStream in = new FileInputStream(inputPath);
             XWPFDocument document = new XWPFDocument(in);
             OutputStream out = new FileOutputStream(outputPath)) {
            // Default conversion options.
            XHTMLOptions options = XHTMLOptions.create();
            XHTMLConverter.getInstance().convert(document, out, options);
        }
    }

    public static void main(String[] args) {
        try {
            convertDocxToHtml("input.docx", "output.html");
            System.out.println("Conversion succeeded!");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Maven dependencies:
<dependencies>
    <!-- Apache POI for .docx files -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>5.2.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>5.2.3</version>
    </dependency>
    <!-- XDocReport converter (POI XWPF to XHTML) -->
    <dependency>
        <groupId>fr.opensagres.xdocreport</groupId>
        <artifactId>fr.opensagres.poi.xwpf.converter.core</artifactId>
        <version>2.0.3</version>
    </dependency>
    <dependency>
        <groupId>fr.opensagres.xdocreport</groupId>
        <artifactId>fr.opensagres.poi.xwpf.converter.xhtml</artifactId>
        <version>2.0.3</version>
    </dependency>
</dependencies>
Using docx4j
docx4j is a library dedicated to the Office Open XML formats and can export Word documents to HTML.
import org.docx4j.Docx4J;
import org.docx4j.convert.out.HTMLSettings;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;

public class Docx4jConverter {

    // Loads the .docx at inputPath and exports it as HTML to outputPath.
    public static void convertToHtml(String inputPath, String outputPath) throws Exception {
        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new File(inputPath));

        // Export settings for the HTML output.
        HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
        htmlSettings.setOpcPackage(wordMLPackage);

        try (OutputStream out = new FileOutputStream(outputPath)) {
            Docx4J.toHTML(htmlSettings, out, Docx4J.FLAG_EXPORT_PREFER_XSL);
        }
    }

    public static void main(String[] args) {
        try {
            convertToHtml("input.docx", "output.html");
            System.out.println("Conversion succeeded!");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
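Maven dependencies:
<dependencies>
    <!-- Recent docx4j releases are published as docx4j-JAXB-* artifacts; confirm that this version exists in your repository -->
    <dependency>
        <groupId>org.docx4j</groupId>
        <artifactId>docx4j-JAXB-ReferenceImpl</artifactId>
        <version>11.4.4</version>
    </dependency>
</dependencies>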
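Using iText (commercial)
iText is primarily a PDF library. Its Word-related conversion features are part of the commercial offering and require a license, and it has no direct .docx-to-HTML API.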
import java.io.IOException;

public class ItextConverter {

    // Placeholder only: iText does not convert .docx to HTML directly.
    // The document would first have to be converted to PDF and then processed further,
    // so a real implementation is considerably more involved.
    public static void convertDocxToHtml(String docxPath, String htmlPath) throws IOException {
        throw new UnsupportedOperationException("Direct .docx-to-HTML conversion is not available in iText");
    }
}
Using Jacob (Windows only)
Jacob (Java-COM Bridge) lets Java call Windows COM components, so the conversion can be delegated to an installed copy of Microsoft Word.

import com.jacob.activeX.ActiveXComponent;
import com.jacob.com.Dispatch;
import com.jacob.com.Variant;

public class JacobConverter {

    // Word's WdSaveFormat value for HTML (wdFormatHTML).
    private static final int WD_FORMAT_HTML = 8;

    // Paths should be absolute, because Word resolves relative paths against its own working directory.
    public static void convertToHtml(String docxPath, String htmlPath) {
        ActiveXComponent word = new ActiveXComponent("Word.Application");
        try {
            word.setProperty("Visible", new Variant(false));
            Dispatch documents = word.getProperty("Documents").toDispatch();
            Dispatch document = Dispatch.call(documents, "Open", docxPath).toDispatch();
            try {
                Dispatch.call(document, "SaveAs", htmlPath, WD_FORMAT_HTML);
            } finally {
                // Close without saving changes, even if SaveAs fails.
                Dispatch.call(document, "Close", new Variant(false));
            }
        } finally {
            // Always shut down the Word process.
            word.invoke("Quit", new Variant[0]);
        }
    }
}
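Maven dependency (check that this version is available in your repository):
<dependency>
    <groupId>net.sf.jacob-project</groupId>
    <artifactId>jacob</artifactId>
    <version>1.20</version>
</dependency>
At runtime Jacob also needs its native DLL (the x64 or x86 build matching the JVM) on java.library.path.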
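Notes
- Style fidelity: the libraries differ in how much Word styling they preserve, and extra configuration may be needed.
- Images: images in the converted HTML may need to be handled separately.
- Performance: converting large files may require tuning memory usage.
- Cross-platform support: Jacob works only on Windows and depends on an installed copy of Microsoft Word; the other libraries are cross-platform.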
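For the image point, one option is to embed each image in the HTML as a data: URI so that the output is a single self-contained file. The sketch below shows only the encoding step. The class DataUriImages is a hypothetical helper, and how the image bytes are obtained depends on the conversion library in use.

```java
import java.util.Base64;

// Hypothetical helper: turns raw image bytes into a value usable as an <img> src attribute.
final class DataUriImages {

    // mimeType is e.g. "image/png"; imageBytes is the raw image file content.
    static String toDataUri(String mimeType, byte[] imageBytes) {
        String encoded = Base64.getEncoder().encodeToString(imageBytes);
        return "data:" + mimeType + ";base64," + encoded;
    }

    // Builds a complete <img> element that carries the image inline.
    static String toImgTag(String mimeType, byte[] imageBytes) {
        return "<img src=\"" + toDataUri(mimeType, imageBytes) + "\"/>";
    }
}
```

Base64 encoding grows the data by roughly one third, so for documents with many large images it may be preferable to write the images to separate files and reference them by relative path.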
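Which approach to choose depends on your requirements, your project environment, and the conversion quality you need. For most Java applications, Apache POI combined with the XDocReport XHTMLConverter is a good choice.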
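The platform constraint can be made explicit in code. The following is a rough sketch of one possible selection rule, under the assumption that the Word-based route is used only when the caller asks for it and the JVM runs on Windows. ConverterSelector and its Backend enum are hypothetical names.

```java
import java.util.Locale;

// Hypothetical selection rule: fall back to the cross-platform converter unless
// the Word-based route is both requested and possible on this operating system.
final class ConverterSelector {

    enum Backend { POI_XHTML, JACOB_WORD }

    // osName is the value of the "os.name" system property.
    static Backend choose(String osName, boolean preferNativeWord) {
        boolean windows = osName != null
                && osName.toLowerCase(Locale.ROOT).startsWith("windows");
        return (windows && preferNativeWord) ? Backend.JACOB_WORD : Backend.POI_XHTML;
    }

    // Convenience overload that inspects the running JVM.
    static Backend chooseForCurrentPlatform(boolean preferNativeWord) {
        return choose(System.getProperty("os.name"), preferNativeWord);
    }
}
```

This check only looks at the operating system name. Whether Microsoft Word is actually installed would still have to be verified separately.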

