杰瑞科技汇

How do you convert DOCX to HTML in Java? What tools and methods are available?

Converting a DOCX file to HTML in Java is a common requirement for web applications, document processing, and content management systems. This guide covers the most widely used libraries for the job, with a complete code example for each.


Summary of Options

Library | Pros | Cons | Best For
Apache POI (with the XDocReport converter) | Industry standard for reading DOCX, very powerful, handles complex formatting. | POI itself has no DOCX-to-HTML converter, so an extra library is needed; HTML output can be verbose. | Complex DOCX files where you need fine-grained control over conversion.
docx4j | Excellent for DOCX, handles complex structures (headers, footers, relationships) well. | Another complex library with a steeper learning curve. | High-fidelity conversion of complex Word documents, especially with advanced features.
jsoup | Extremely simple and fast for parsing HTML. | Cannot read DOCX files; the HTML must first be produced by another converter (such as XDocReport's XHTMLConverter or docx4j). | Cleaning or manipulating HTML that has already been extracted. Often used as a second step.

Method 1: Using Apache POI (Recommended & Powerful)

Apache POI is the go-to library for Microsoft Office formats in Java. POI itself only reads and writes DOCX through its XWPF package; it does not ship a DOCX-to-HTML converter. The XHTMLConverter used below comes from the XDocReport project, which builds on POI's XWPFDocument and is good enough for most use cases.

Step 1: Add Dependencies

You'll need the Apache POI core library, its OOXML (Office Open XML) support, and the XDocReport XWPF-to-XHTML converter.

If you're using Maven, add this to your pom.xml:

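<dependencies>
    <!-- Apache POI Core -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>5.2.4</version>
    </dependency>
    <!-- Apache POI for Office Open XML formats (like .docx) -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>5.2.4</version>
    </dependency>
    <!-- XDocReport converter that provides XHTMLConverter and XHTMLOptions.
         Pick a release built against your POI version; 2.0.4 is shown as an example. -->
    <dependency>
        <groupId>fr.opensagres.xdocreport</groupId>
        <artifactId>fr.opensagres.poi.xwpf.converter.xhtml</artifactId>
        <version>2.0.4</version>
    </dependency>
</dependencies>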

If you're using Gradle, add this to your build.gradle:

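implementation 'org.apache.poi:poi:5.2.4'
implementation 'org.apache.poi:poi-ooxml:5.2.4'
// XDocReport converter (provides XHTMLConverter); match the release to your POI version
implementation 'fr.opensagres.xdocreport:fr.opensagres.poi.xwpf.converter.xhtml:2.0.4'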

Step 2: Write the Java Code

POI's XWPFDocument class loads the document; the conversion itself is done by XHTMLConverter, configured through XHTMLOptions, which also controls where embedded images are written and how the HTML refers to them.

Here is a complete, runnable example:

// Requires poi-ooxml plus the XDocReport converter
// (fr.opensagres.xdocreport:fr.opensagres.poi.xwpf.converter.xhtml) on the classpath.
import fr.opensagres.poi.xwpf.converter.core.BasicURIResolver;
import fr.opensagres.poi.xwpf.converter.core.FileImageExtractor;
import fr.opensagres.poi.xwpf.converter.core.XWPFConverterException;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
public class DocxToHtmlConverter {
    public static void main(String[] args) {
        // 1. Define input and output paths
        Path docxPath = Paths.get("input.docx");
        Path htmlPath = Paths.get("output.html");
        Path imageDir = Paths.get("images");
        try {
            // 2. Create the image directory (no-op if it already exists)
            Files.createDirectories(imageDir);
            // 3. Load the DOCX file
            try (InputStream docxInputStream = Files.newInputStream(docxPath);
                 XWPFDocument document = new XWPFDocument(docxInputStream)) {
                // 4. Configure HTML conversion options
                XHTMLOptions options = XHTMLOptions.create();
                // Write images embedded in the DOCX below imageDir
                options.setExtractor(new FileImageExtractor(imageDir.toFile()));
                // Prefix image URLs in the HTML with the same directory name
                options.URIResolver(new BasicURIResolver(imageDir.toString()));
                // 5. Perform the conversion
                try (OutputStream htmlOutputStream = Files.newOutputStream(htmlPath)) {
                    XHTMLConverter.getInstance().convert(document, htmlOutputStream, options);
                }
                System.out.println("Conversion successful! HTML saved to: " + htmlPath);
                System.out.println("Images saved to: " + imageDir);
            } catch (XWPFConverterException e) {
                System.err.println("Error during DOCX to HTML conversion: " + e.getMessage());
            }
        } catch (IOException e) {
            System.err.println("Error reading DOCX file or writing HTML file: " + e.getMessage());
        }
    }
}

Explanation:

  1. Paths: We define the path for the input .docx file and the output .html file. We also create a directory (images) to store any images embedded in the document.
  2. Load Document: new XWPFDocument(docxInputStream) loads the DOCX file into memory.
  3. XHTMLOptions: This class is crucial for customizing the conversion.
    • XHTMLOptions.create() creates a default set of options.
    • options.setExtractor(new FileImageExtractor(imageDir.toFile())) tells the converter to write every image from the DOCX below the specified directory, keeping the path the image has inside the package.
    • options.URIResolver(new BasicURIResolver(imageDir.toString())) prefixes each image URL in the HTML with that directory name, so the HTML references the images by relative path (typically something like images/word/media/image1.png).
  4. XHTMLConverter: This is the core class that performs the conversion. Its convert method takes the XWPFDocument, the output stream for the HTML, and the options.
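
The relative image path mentioned above only resolves if it is computed from the directory that holds the HTML file, which matters once the HTML and the images no longer sit side by side. The following is a minimal JDK-only sketch of that calculation; the class and method names are hypothetical and not part of POI or XDocReport, and it assumes both paths are on the same file system root.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class ImageUriSketch {

    /**
     * Returns the path from the HTML file's directory to the image directory,
     * with forward slashes so it can serve as a URL prefix in src attributes.
     */
    static String relativeImageUri(Path htmlFile, Path imageDir) {
        Path htmlDir = htmlFile.toAbsolutePath().normalize().getParent();
        Path relative = htmlDir.relativize(imageDir.toAbsolutePath().normalize());
        // Path.toString() uses the platform separator; URLs always use '/'.
        return relative.toString().replace('\\', '/');
    }

    public static void main(String[] args) {
        // HTML and image directory share a parent folder
        System.out.println(relativeImageUri(Paths.get("out/report.html"), Paths.get("out/images")));      // images
        // HTML sits one level deeper than the image directory
        System.out.println(relativeImageUri(Paths.get("out/html/report.html"), Paths.get("out/images"))); // ../images
    }
}
```

The returned string is the kind of value you would hand to the converter as the base URL for images, while the absolute image directory is what the extractor writes to.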

Method 2: Using docx4j (Excellent Alternative)

docx4j is another powerful library, often praised for its high-fidelity conversion, especially with complex documents. It also handles headers, footers, and other document parts very well.

Step 1: Add Dependencies

Maven (pom.xml):

<dependencies>
    <!-- docx4j Core Library (includes the HTML exporter) -->
    <dependency>
        <groupId>org.docx4j</groupId>
        <artifactId>docx4j-core</artifactId>
        <version>11.4.4</version>
    </dependency>
    <!-- A JAXB implementation for docx4j; the reference implementation is shown here -->
    <dependency>
        <groupId>org.docx4j</groupId>
        <artifactId>docx4j-JAXB-ReferenceImpl</artifactId>
        <version>11.4.4</version>
    </dependency>
</dependencies>

Step 2: Write the Java Code

docx4j's HTML export transforms the document directly to XHTML, by default through an XSLT stylesheet. XSL-FO is only involved when exporting to PDF, so it plays no part here.

import org.docx4j.Docx4J;
import org.docx4j.convert.out.HTMLSettings;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
public class Docx4jConverter {
    public static void main(String[] args) {
        // 1. Define input and output paths
        String docxPath = "input.docx";
        String htmlPath = "output_docx4j.html";
        String imageDir = "images_docx4j";
        // Create image directory
        new File(imageDir).mkdirs();
        try {
            // 2. Load the DOCX package
            WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new File(docxPath));
            // 3. Configure conversion options
            HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
            htmlSettings.setOpcPackage(wordMLPackage);
            // Where image files are written on disk
            htmlSettings.setImageDirPath(imageDir);
            // The prefix used for image URLs inside the generated HTML
            htmlSettings.setImageTargetUri(imageDir);
            // 4. Perform the conversion (DOCX -> XHTML via XSLT)
            try (OutputStream htmlOutputStream = new FileOutputStream(htmlPath)) {
                Docx4J.toHTML(htmlSettings, htmlOutputStream, Docx4J.FLAG_EXPORT_PREFER_XSL);
            }
            System.out.println("Conversion successful! HTML saved to: " + htmlPath);
            System.out.println("Images saved to: " + imageDir);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Explanation:

  1. Load Package: WordprocessingMLPackage.load() is the entry point for docx4j.
  2. Docx4J.toHTML(): This is the main method. Driven by the HTMLSettings, it will:
    • Write embedded images to the directory given to setImageDirPath and reference them in the HTML through the prefix given to setImageTargetUri.
    • Convert the document content, preserving styles, tables, and lists.
    • Produce output that stays close to the original layout, at the cost of more complex HTML than the POI-based converter generates.
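
Neither docx4j nor POI can load a file that is not an OOXML package, for example a legacy binary .doc that was renamed to .docx. A cheap pre-check with the JDK's own ZIP support, run before WordprocessingMLPackage.load() or new XWPFDocument(...), lets you report that case with a clearer message. The sketch below is a heuristic only: it assumes the main document part sits at word/document.xml, which is where Word writes it, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipFile;

final class DocxSniffer {

    /**
     * Returns true if the file opens as a ZIP archive and contains
     * word/document.xml. Anything unreadable or non-ZIP yields false.
     */
    static boolean looksLikeDocx(Path file) {
        try (ZipFile zip = new ZipFile(file.toFile())) {
            return zip.getEntry("word/document.xml") != null;
        } catch (IOException e) {
            // Covers missing files and ZipException for non-ZIP content
            return false;
        }
    }

    public static void main(String[] args) {
        // Usage: java DocxSniffer file1.docx file2.docx ...
        for (String arg : args) {
            System.out.println(arg + " -> " + looksLikeDocx(Paths.get(arg)));
        }
    }
}
```

A converter entry point could call looksLikeDocx first and skip or reject the file instead of surfacing a library stack trace to the user.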

Method 3: The Two-Step Process (POI + jsoup)

Sometimes, the HTML generated by POI or docx4j is good but not perfectly clean. You might want to remove unnecessary <span> tags, fix attributes, or simplify the structure. This is where jsoup shines.

Step 1: Add jsoup Dependency

Maven (pom.xml):

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.4</version>
</dependency>

Step 2: Write the Java Code

This example first converts the DOCX to HTML using Apache POI, then cleans it up using jsoup.

// Requires jsoup, poi-ooxml, and the XDocReport converter
// (fr.opensagres.xdocreport:fr.opensagres.poi.xwpf.converter.xhtml) on the classpath.
import fr.opensagres.poi.xwpf.converter.core.XWPFConverterException;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
public class DocxToHtmlWithJsoup {
    public static void main(String[] args) {
        Path docxPath = Paths.get("input.docx");
        Path rawHtmlPath = Paths.get("raw_output.html");
        Path cleanHtmlPath = Paths.get("clean_output.html");
        try {
            // Step 1: Convert DOCX to HTML using POI and the XDocReport converter
            try (InputStream docxInputStream = Files.newInputStream(docxPath);
                 XWPFDocument document = new XWPFDocument(docxInputStream);
                 OutputStream htmlOutputStream = Files.newOutputStream(rawHtmlPath)) {
                XHTMLOptions options = XHTMLOptions.create();
                XHTMLConverter.getInstance().convert(document, htmlOutputStream, options);
            }
            System.out.println("Raw HTML generated by POI.");
            // Step 2: Clean the HTML using jsoup
            String rawHtml = new String(Files.readAllBytes(rawHtmlPath));
            // To sanitize untrusted input against an allow-list, use
            // Jsoup.clean(html, Safelist) instead. Here we only tidy trusted
            // converter output, so we parse it and remove the clutter directly.
            Document doc = Jsoup.parse(rawHtml);
            // Remove all <span> tags, keeping their content
            doc.select("span").unwrap();
            // Remove <style> blocks and inline style attributes
            doc.select("style").remove();
            doc.select("[style]").removeAttr("style");
            String cleanHtml = doc.html();
            // Save the cleaned HTML
            Files.write(cleanHtmlPath, cleanHtml.getBytes());
            System.out.println("HTML cleaned and saved to: " + cleanHtmlPath);
        } catch (IOException | XWPFConverterException e) {
            e.printStackTrace();
        }
    }
}

Which One Should You Choose?

  • For most projects: Start with Apache POI plus the XDocReport converter. POI is widely used and well-documented, and the converter is sufficient for many standard documents. The code is straightforward.
  • For high-fidelity, complex documents: Use docx4j. If your DOCX files have complex headers, footers, or intricate formatting that POI struggles with, docx4j will likely produce a more accurate result.
  • For clean, simple HTML: Use the POI + jsoup combination. This gives you the power of POI to extract content and the elegance of jsoup to produce clean, web-ready HTML. This is often the best approach for content that will be displayed on a website.