杰瑞科技汇

How to Convert HTML to a Word Document in Java

Converting HTML to a Word document (.docx) in Java is a common requirement for generating reports, invoices, or any formatted document from web-based content.

This guide covers three Java libraries that can do the job, with a complete code example for each.

Summary of Libraries

| Library | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Apache POI | Industry standard, very powerful, full control over Word document structure. | Steep learning curve, complex API, no built-in HTML import (you map HTML elements to Word elements yourself). | Complex, highly customized Word documents where you need fine-grained control. |
| docx4j | Good XHTML-to-DOCX conversion through its import add-on, supports a useful subset of CSS, easier to use than POI for this task. | Input must be well-formed XHTML, CSS support is partial, can be slower than POI. | Most scenarios. The best choice if your primary goal is converting HTML to a Word document with good fidelity. |
| Flying Saucer (xhtmlrenderer) | Renders XHTML/CSS 2.1 to an image, which you can then embed in a Word doc. Good visual accuracy. | Indirect method (image-based): the result is not real text, so it is not selectable or searchable. No JavaScript or CSS3 layout. | Pages whose CSS layout does not map cleanly onto Word, when a visual snapshot is acceptable. |

Method 1: Using docx4j (Recommended for HTML Conversion)

docx4j has a dedicated importer for this task: the XHTMLImporterImpl class, which ships in the docx4j-ImportXHTML add-on rather than in docx4j-core. It translates well-formed XHTML and a subset of CSS styles into Word's native format.

Add the Dependency

Add the docx4j library to your project. If you're using Maven, add this to your pom.xml:

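<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j-core</artifactId>
    <version>11.4.4</version> <!-- Check for the latest version -->
</dependency>
<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j-JAXB-ReferenceImpl</artifactId>
    <version>11.4.4</version> <!-- docx4j 11.x needs a JAXB implementation at runtime -->
</dependency>
<!-- The HTML importer is not part of docx4j-core. It ships in the separate
     docx4j-ImportXHTML project (groupId org.docx4j), which is versioned independently.
     Look up the artifact and version that match your docx4j line on Maven Central
     and add it here. docx4j-export-fo is only needed for PDF/XSL-FO export. -->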

Java Code Example

This code takes a simple HTML string and converts it into a .docx file.

import java.io.File;
import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
public class HtmlToWordDocx4j {
    public static void main(String[] args) {
        try {
            // 1. Create a new, empty Word document package
            WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
            // 2. Get the main document part
            MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();
            // 3. Define the content. The importer expects well-formed XHTML,
            //    so every tag must be closed (<br/>, not <br>).
            String html = "<html>"
                        + "<head><style>h1 {color: blue;}</style></head>"
                        + "<body>"
                        + "   <h1>Hello from HTML!</h1>"
                        + "   <p>This is a <b>paragraph</b> with some <i>italic</i> text.</p>"
                        + "   <ul>"
                        + "       <li>List item 1</li>"
                        + "       <li>List item 2</li>"
                        + "   </ul>"
                        + "</body>"
                        + "</html>";
            // 4. Convert the XHTML into Word objects and append them to the document.
            //    XHTMLImporterImpl comes from the docx4j-ImportXHTML add-on, which must be
            //    on the classpath. The second argument is the base URL used to resolve
            //    relative links and images; null is enough here because there are none.
            XHTMLImporterImpl importer = new XHTMLImporterImpl(wordMLPackage);
            documentPart.getContent().addAll(importer.convert(html, null));
            // 5. Save the document to a file
            wordMLPackage.save(new File("output.docx"));
            System.out.println("Successfully created output.docx");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

How to Run:

  1. Save the code as HtmlToWordDocx4j.java.
  2. Compile and run it with Maven, or include the JARs in your classpath.
  3. An output.docx file will be created in the directory you run the program from.

Method 2: Using Apache POI (For Full Control)

Apache POI is the most powerful library for manipulating Office documents, but it's more verbose. Converting HTML with POI is a manual process where you essentially parse the HTML and build the Word document element by element.

Add the Dependency

Add the Apache POI library to your pom.xml:

<dependencies>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>5.2.3</version> <!-- Check for the latest version -->
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>5.2.3</version>
    </dependency>
    <!-- You'll need an HTML parser like Jsoup -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.15.3</version>
    </dependency>
</dependencies>

Java Code Example

This example uses Jsoup to parse the HTML and Apache POI to create the Word document.

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import java.io.FileOutputStream;
public class HtmlToWordApachePOI {
    public static void main(String[] args) {
        // 1. Define the HTML content
        String html = "<h1>Hello from POI!</h1>"
                    + "<p>This is a <b>paragraph</b> with some <i>italic</i> text.</p>"
                    + "<ul><li>List item 1</li><li>List item 2</li></ul>";
        // 2. Create a new Word document; try-with-resources closes it and the stream
        try (XWPFDocument document = new XWPFDocument();
             FileOutputStream out = new FileOutputStream("output_poi.docx")) {
            // 3. Parse the HTML using Jsoup (it tolerates non-XHTML markup)
            Document jsoupDoc = Jsoup.parse(html);
            // 4. Recursively process the HTML body and add content to the Word doc.
            //    The first paragraph receives any text that sits directly in <body>.
            processNode(document.createParagraph(), jsoupDoc.body());
            // 5. Save the document
            document.write(out);
            System.out.println("Successfully created output_poi.docx");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    /**
     * Recursively processes a Jsoup node and adds its content to a Word paragraph.
     * Block-level tags start a new paragraph; inline tags add runs to the current one.
     * This is a simplified example and would need to be expanded for full HTML/CSS support.
     */
    private static void processNode(XWPFParagraph paragraph, Node node) {
        for (Node child : node.childNodes()) {
            if (child instanceof Element) {
                Element element = (Element) child;
                switch (element.tagName().toLowerCase()) {
                    case "h1":
                        XWPFParagraph h1Para = paragraph.getDocument().createParagraph();
                        h1Para.getCTP().addNewPPr().addNewShd().setFill("E0E0E0"); // Light grey background
                        XWPFRun h1Run = h1Para.createRun();
                        h1Run.setBold(true);
                        h1Run.setFontSize(20);
                        h1Run.setText(element.text());
                        break;
                    case "p":
                        // Create a new paragraph for each <p> tag
                        XWPFParagraph pPara = paragraph.getDocument().createParagraph();
                        processNode(pPara, element); // Process children of <p>
                        break;
                    case "b":
                    case "strong":
                        // element.text() flattens nested tags, so <b><i>x</i></b> is bold only
                        XWPFRun boldRun = paragraph.createRun();
                        boldRun.setBold(true);
                        boldRun.setText(element.text());
                        break;
                    case "i":
                    case "em":
                        XWPFRun italicRun = paragraph.createRun();
                        italicRun.setItalic(true);
                        italicRun.setText(element.text());
                        break;
                    case "ul":
                        // For simplicity, emit each list item as a paragraph with a literal
                        // bullet character. A full implementation would use Word numbering
                        // definitions to get real bullets and indentation.
                        for (Node li : element.childNodes()) {
                            if (li instanceof Element && li.nodeName().equals("li")) {
                                XWPFParagraph liPara = paragraph.getDocument().createParagraph();
                                XWPFRun liRun = liPara.createRun();
                                liRun.setText("• " + ((Element) li).text());
                            }
                        }
                        break;
                    case "li": // Handled by the 'ul' case for simplicity
                        break;
                    default:
                        // For unknown tags, ignore the tag itself and process its children
                        processNode(paragraph, element);
                }
            } else if (child instanceof TextNode) {
                // Add text content to the current paragraph
                TextNode textNode = (TextNode) child;
                if (!textNode.isBlank()) {
                    paragraph.createRun().setText(textNode.text());
                }
            }
        }
    }
}

How to Run:

  1. Save the code as HtmlToWordApachePOI.java.
  2. Compile and run it with Maven.
  3. An output_poi.docx file will be created.

Method 3: Using Flying Saucer (For Image-Based Conversion)

Flying Saucer is an XHTML/CSS renderer. You can use it to render your HTML to a BufferedImage, and then embed that image into a Word document created with Apache POI.
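
Flying Saucer parses its input as XML, so markup that browsers tolerate (an unclosed <br>, an undeclared entity such as &nbsp;) makes it fail. The following is a minimal sketch of a pre-check using only the JDK's XML parser; the class and method names (XhtmlCheck, isWellFormed) are illustrative and not part of Flying Saucer.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks whether markup is well-formed XML before it is
// handed to an XML-based renderer or importer.
class XhtmlCheck {

    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Every tag is closed
        System.out.println(isWellFormed("<html><body><p>Closed</p></body></html>")); // true
        // <br> is never closed, which is legal HTML but not XML
        System.out.println(isWellFormed("<html><body><p>One<br>Two</p></body></html>")); // false
    }
}
```

If the check fails, the markup needs cleaning up before rendering. Note that the check only covers well-formedness, not whether the renderer supports the CSS in use, and that input carrying a DOCTYPE may cause the parser to try to fetch the referenced DTD.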

Add Dependencies

You'll need Flying Saucer and Apache POI.

<dependencies>
    <!-- Flying Saucer -->
    <dependency>
        <groupId>org.xhtmlrenderer</groupId>
        <artifactId>flying-saucer-core</artifactId> <!-- Java2DRenderer lives in the core module -->
        <version>9.1.22</version> <!-- Check for the latest version -->
    </dependency>
    <!-- Apache POI -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>5.2.3</version>
    </dependency>
</dependencies>

Java Code Example

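import org.apache.poi.util.Units;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.xhtmlrenderer.swing.Java2DRenderer;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
public class HtmlToWordFlyingSaucer {
    public static void main(String[] args) {
        try {
            // 1. Define the HTML content (must be well-formed XHTML)
            String html = "<html>"
                        + "<head><style>body { font-family: sans-serif; }</style></head>"
                        + "<body>"
                        + "   <h1>Hello from Flying Saucer!</h1>"
                        + "   <p>This is a paragraph rendered as an image.</p>"
                        + "</body>"
                        + "</html>";
            // 2. Java2DRenderer loads its input from a URL or file, not from a markup
            //    string, so write the XHTML to a temporary file first.
            File xhtmlFile = File.createTempFile("page", ".xhtml");
            Files.write(xhtmlFile.toPath(), html.getBytes(StandardCharsets.UTF_8));
            Java2DRenderer renderer = new Java2DRenderer(xhtmlFile, 800, 600); // width, height in pixels
            // 3. Render the XHTML to a BufferedImage, then remove the temporary file
            BufferedImage image = renderer.getImage();
            xhtmlFile.delete();
            // 4. Encode the image as PNG in memory
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            ImageIO.write(image, "png", baos);
            // 5. Create the Word document and add the image. addPicture reads from an
            //    InputStream and takes its size in EMU; the height is derived from the
            //    image's aspect ratio so the picture is not stretched.
            try (XWPFDocument document = new XWPFDocument();
                 FileOutputStream out = new FileOutputStream("output_flying_saucer.docx")) {
                XWPFParagraph paragraph = document.createParagraph();
                XWPFRun run = paragraph.createRun();
                double widthPt = 450; // chosen to stay inside typical page margins
                double heightPt = widthPt * image.getHeight() / image.getWidth();
                run.addPicture(new ByteArrayInputStream(baos.toByteArray()), XWPFDocument.PICTURE_TYPE_PNG,
                        "image.png", Units.toEMU(widthPt), Units.toEMU(heightPt));
                // 6. Save the document
                document.write(out);
            }
            System.out.println("Successfully created output_flying_saucer.docx");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}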
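
The picture size passed to addPicture is expressed in EMU (English Metric Units, 914,400 per inch, so 12,700 per point), and Word stretches the image if the width-to-height ratio differs from that of the rendered bitmap. The following is a minimal sketch of the arithmetic in plain Java; PictureSize and fitToWidth are hypothetical names used for illustration, not POI API.

```java
// Hypothetical helper: scales a bitmap to a target width in points and returns
// the size in EMU, keeping the bitmap's aspect ratio.
class PictureSize {

    // 914,400 EMU per inch divided by 72 points per inch
    static final int EMU_PER_POINT = 12700;

    /** Returns {widthEmu, heightEmu} for the given pixel size and target width in points. */
    static int[] fitToWidth(int pixelWidth, int pixelHeight, double targetWidthPt) {
        if (pixelWidth <= 0 || pixelHeight <= 0) {
            throw new IllegalArgumentException("Image dimensions must be positive");
        }
        // Height follows from the width so that width:height stays unchanged
        double targetHeightPt = targetWidthPt * pixelHeight / pixelWidth;
        return new int[] {
            (int) Math.round(targetWidthPt * EMU_PER_POINT),
            (int) Math.round(targetHeightPt * EMU_PER_POINT)
        };
    }

    public static void main(String[] args) {
        // An 800x600 render placed 450 points wide becomes 337.5 points tall
        int[] size = fitToWidth(800, 600, 450);
        System.out.println(size[0] + " x " + size[1]); // 5715000 x 4286250
    }
}
```

The two values can be passed as the width and height arguments of addPicture. Choosing the target width from the page's text area rather than from the bitmap's pixel size keeps a wide render from running past the right margin.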

Which Method Should You Choose?

  • For most use cases: Use docx4j. It's the most direct and effective tool for converting HTML to a .docx file, handling styles and structure reasonably well.
  • For complex, template-based documents: Use Apache POI. If you need to build a Word document from scratch with very specific layouts, tables, headers, and footers, and the HTML is just a source of content, POI gives you the most control.
  • For visual snapshots: Use Flying Saucer. If the layout depends on CSS that is hard to translate to Word, rendering it as an image preserves the look at the cost of selectable text. Keep in mind that Flying Saucer implements CSS 2.1 and does not run JavaScript, so it will not reproduce every modern web page.