How Do You Convert Word to HTML with Java POI?

Reading Word documents (.docx) with Apache POI and converting them to HTML is a common task when generating reports or making document content web-accessible.

Here’s a comprehensive guide covering:

  1. Core Concepts: Why this is a multi-step process.
  2. The Modern Approach: Using the docx4j library (recommended).
  3. The "Pure" POI Approach: Building the converter yourself (more complex).
  4. A Complete, Runnable Example using the modern approach.
  5. Important Considerations and limitations.

Core Concepts: Why It's Not a One-Click Solution

Apache POI's main focus is on the low-level structure of Office files (.docx, .xlsx, .pptx). It can read and write XML parts, but it doesn't have a built-in, high-level function to render a .docx directly to a perfectly formatted HTML file.

The conversion process involves these steps:

  1. Parse the .docx file: A .docx is a ZIP archive containing XML files. POI helps you read these.
  2. Extract Content: You need to read the content from the main document XML (document.xml), styles from styles.xml, and relationships from document.xml.rels.
  3. Map Formatting to HTML: This is the hardest part. You need to translate Word-specific styles (fonts, colors, indentation, headers/footers, tables) into equivalent HTML/CSS.
  4. Generate HTML: Assemble the extracted content and the generated CSS into a valid HTML document.
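
As a minimal sketch of step 1, the following uses only the JDK's java.util.zip (no POI) to show that a .docx really is a ZIP archive of XML parts. The class name DocxZipPeek and the method readXmlParts are illustrative names, not part of any library, and the sketch assumes the XML parts are UTF-8 encoded.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class DocxZipPeek {

    /** Reads the XML and relationship parts of a .docx (ZIP) stream, keyed by entry name. */
    static Map<String, String> readXmlParts(InputStream in) throws IOException {
        Map<String, String> parts = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Skip directories and binary parts such as embedded images.
                if (entry.isDirectory() || !(name.endsWith(".xml") || name.endsWith(".rels"))) {
                    continue;
                }
                // readAllBytes() stops at the end of the current entry.
                parts.put(name, new String(zip.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java DocxZipPeek <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // Print the name and size of each XML part found in the archive.
            readXmlParts(in).forEach((name, xml) ->
                    System.out.println(name + " (" + xml.length() + " chars)"));
        }
    }
}
```

Run against a real document, this should list parts such as word/document.xml and word/styles.xml. POI does the same unpacking for you and then exposes the XML as Java objects.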

Because of this complexity, it's often easier to use a library that has already solved this problem. docx4j is a widely used choice for this.

The Modern & Recommended Approach: Using docx4j

docx4j is a library specifically designed for creating and manipulating Open XML (.docx, .pptx, .xlsx) files. It has excellent built-in support for conversion to various formats, including HTML.

Step 1: Add the docx4j Dependency

You need to add docx4j-core plus one of docx4j's JAXB binding modules (docx4j-JAXB-ReferenceImpl is used here) to your project. HTML export is part of the core library; docx4j-export-fo is only needed for XSL-FO/PDF output. If you use Maven, add this to your pom.xml:

<dependencies>
    <!-- docx4j core library (includes the HTML export) -->
    <dependency>
        <groupId>org.docx4j</groupId>
        <artifactId>docx4j-core</artifactId>
        <version>11.4.4</version> <!-- Check for the latest version -->
    </dependency>
    <!-- JAXB binding used by docx4j to read and write the XML parts -->
    <dependency>
        <groupId>org.docx4j</groupId>
        <artifactId>docx4j-JAXB-ReferenceImpl</artifactId>
        <version>11.4.4</version> <!-- Must match the core version -->
    </dependency>
</dependencies>

Step 2: Write the Java Code

The process is straightforward: load the .docx file and call the conversion method.

import org.docx4j.Docx4J;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
public class DocxToHtmlConverter {
    public static void main(String[] args) {
        String inputPath = "path/to/your/document.docx";
        String outputPath = "path/to/your/output.html";
        try {
            // 1. Load the .docx file
            WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new File(inputPath));
            // 2. Decide where images from the document are saved and how the HTML links to them
            String imageDirPath = outputPath + "_files";
            String imageTargetUri = new File(imageDirPath).getName();
            // 3. Convert to HTML. toHTML does not return a String;
            //    it writes the HTML to the output stream you pass in.
            try (OutputStream os = new FileOutputStream(outputPath)) {
                Docx4J.toHTML(wordMLPackage, imageDirPath, imageTargetUri, os);
            }
            System.out.println("Successfully converted " + inputPath + " to " + outputPath);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

What the code does:

  1. WordprocessingMLPackage.load(...): This loads the entire .docx document into memory, making its content accessible.
  2. imageDirPath / imageTargetUri: These tell docx4j where to save images found in the document and which relative URL the generated HTML should use to reference them.
  3. Docx4J.toHTML(...): This does the conversion. It takes the loaded package, converts the main document part to HTML, and writes the result to the FileOutputStream. Most formatting is carried over as generated CSS.

The "Pure" POI Approach (Advanced & Not Recommended for HTML)

While you can use POI to build a converter, it's a significant amount of work. POI gives you the document's object model; producing the HTML and CSS is up to you. You may see XSL-FO and Apache FOP suggested for this step, but that route does not lead to HTML: FOP renders XSL-FO to print-oriented formats such as PDF and PostScript and has no HTML output.

The process looks like this:

  1. Use POI to read the .docx: Open it as an XWPFDocument and iterate through paragraphs (XWPFParagraph), runs (XWPFRun), and tables (XWPFTable) to extract text and styling information.
  2. Map to HTML/CSS: Translate the POI objects into HTML elements and CSS rules. You have to map every Word feature you care about (bold, italic, font size, table cells, list numbering, etc.) yourself, and escape the text as you go.
  3. Assemble the document: Wrap the generated markup and CSS in a valid HTML page and write it out as UTF-8.

Why this is not ideal:

  • Extremely Complex: You are essentially re-implementing the logic that docx4j provides out-of-the-box.
  • Error-Prone: It's very easy to miss a specific Word feature or edge case, leading to incorrect output.
  • High Maintenance: Every Word feature you want to support is more mapping code that you have to write and test on top of POI.

Conclusion: Unless you have a very specific reason and need fine-grained control over the generated output, avoid this approach and stick with docx4j.
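
To give a sense of the mapping work a hand-written converter involves, here is a minimal sketch that turns the runs of one paragraph into HTML with inline CSS. It deliberately avoids POI so it runs on its own: the nested Run class is a hypothetical stand-in for the values you would read from an XWPFRun (for example via isBold(), isItalic(), and getColor()), and RunToHtml is an illustrative name. Colors such as "auto", underline, tables, and lists are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RunToHtml {

    /** Hypothetical stand-in for the properties of one POI XWPFRun. */
    static final class Run {
        final String text;
        final boolean bold;
        final boolean italic;
        final int fontSizePt;  // <= 0 means "not set on this run"
        final String colorHex; // e.g. "FF0000", or null if not set

        Run(String text, boolean bold, boolean italic, int fontSizePt, String colorHex) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
            this.fontSizePt = fontSizePt;
            this.colorHex = colorHex;
        }
    }

    /** Escapes the characters that are significant in HTML text and attributes. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Maps one run to a styled span, or to plain escaped text if it has no formatting. */
    static String toSpan(Run run) {
        List<String> css = new ArrayList<>();
        if (run.bold) css.add("font-weight: bold");
        if (run.italic) css.add("font-style: italic");
        if (run.fontSizePt > 0) css.add("font-size: " + run.fontSizePt + "pt");
        if (run.colorHex != null) css.add("color: #" + run.colorHex);
        String text = escape(run.text);
        if (css.isEmpty()) return text;
        return "<span style=\"" + String.join("; ", css) + "\">" + text + "</span>";
    }

    /** Wraps the mapped runs of one paragraph in a p element. */
    static String toParagraph(List<Run> runs) {
        StringBuilder html = new StringBuilder("<p>");
        for (Run run : runs) html.append(toSpan(run));
        return html.append("</p>").toString();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("This is ", false, false, 0, null),
                new Run("bold & red", true, false, 12, "FF0000"));
        System.out.println(toParagraph(runs));
    }
}
```

Even this small case needs escaping and a rule per property. A real converter also has to resolve inherited styles, numbering, tables, and images, which is why a ready-made library is usually the better choice.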


Complete, Runnable Example with docx4j

Let's put it all together with a sample Word document.

Sample document.docx

Imagine you have a simple Word file at C:/temp/sample.docx with the following content:

Main Title

This is a paragraph with some bold text and some italic text.

Here is a list:

  • Item 1
  • Item 2

And a simple table:

Header 1 | Header 2
Cell A   | Cell B
Cell C   | Cell D

Java Code

import org.docx4j.Docx4J;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
public class DocxToHtmlExample {
    public static void main(String[] args) {
        // Define input and output file paths
        String inputDocxPath = "C:/temp/sample.docx";
        String outputHtmlPath = "C:/temp/output.html";
        // Images from the document are saved in this folder and linked relative to the HTML file
        String imageDirPath = outputHtmlPath + "_files";
        String imageTargetUri = "output.html_files";
        try {
            // 1. Load the Word document
            System.out.println("Loading Word document: " + inputDocxPath);
            WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new File(inputDocxPath));
            // 2. Convert the document and write the HTML to the output file.
            // toHTML returns nothing; it writes directly to the stream.
            System.out.println("Converting to HTML and saving to: " + outputHtmlPath);
            try (OutputStream htmlFile = new FileOutputStream(outputHtmlPath)) {
                Docx4J.toHTML(wordMLPackage, imageDirPath, imageTargetUri, htmlFile);
            }
            System.out.println("Conversion complete!");
        } catch (Exception e) {
            System.err.println("An error occurred during the conversion process:");
            e.printStackTrace();
        }
    }
}

Resulting output.html

When you run the code, C:/temp/output.html will contain HTML along the following lines. This is a simplified illustration: real docx4j output is more verbose, and its class names and exact markup will differ.

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    <style>
      /* A lot of CSS for styling will be generated here */
      body { font-family: 'Times New Roman'; }
      .p0 { margin-bottom: 0; }
      .p1 { margin-bottom: 0; margin-top: 0; }
      .s0 { font-size: 24pt; font-weight: bold; }
      .s1 { font-weight: bold; }
      .s2 { font-style: italic; }
      /* ... more styles for lists, tables, etc. */
    </style>
  </head>
  <body>
    <p class="p0"><span class="s0">Main Title</span></p>
    <p class="p1">This is a paragraph with some <span class="s1">bold text</span> and some <span class="s2">italic text</span>.</p>
    <p class="p1">Here is a list:</p>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
    </ul>
    <table>
      <tr>
        <th>Header 1</th>
        <th>Header 2</th>
      </tr>
      <tr>
        <td>Cell A</td>
        <td>Cell B</td>
      </tr>
      <tr>
        <td>Cell C</td>
        <td>Cell D</td>
      </tr>
    </table>
  </body>
</html>

Important Considerations and Limitations

  • Images: docx4j handles images well. Depending on the image settings, it either saves them to a directory you specify and links to them from the HTML, or embeds them as base64 data URLs, which makes the HTML file self-contained.
  • Headers and Footers: These belong to Word's page model, which HTML does not have, so don't expect them to look the way they do in Word. Whether and how they appear depends on the docx4j version and settings, so check the output. If you need custom handling, their content lives in separate parts of the package (HeaderPart and FooterPart).
  • Complex Layouts: Advanced Word features like text boxes, columns, or complex floating objects may not be perfectly represented in the simple HTML/CSS model. The conversion might be approximate.
  • Styling: docx4j does a good job of mapping styles, but the result is often a large <style> block or a lot of inline CSS. This preserves fidelity but can make the HTML file large. You might want to post-process the HTML to extract the CSS into a separate file if you need to manage it.
  • Footnotes and Endnotes: These are typically converted to <sup> tags or similar, but their exact behavior can depend on the library version and settings.
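
The post-processing mentioned under Styling can be sketched with the JDK alone. The example below pulls every <style> block out of an HTML string and replaces the first one with a stylesheet link. StyleBlockExtractor and split are illustrative names, and cssHref is assumed to be a plain relative file name. A regular expression is a shortcut that assumes simple, well-formed <style> elements; for anything more complicated, an HTML parser is the safer tool.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StyleBlockExtractor {

    // Matches a whole <style ...>...</style> element, across line breaks.
    private static final Pattern STYLE_BLOCK = Pattern.compile(
            "<style[^>]*>(.*?)</style>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /**
     * Returns a two-element array: [0] is the HTML with its style blocks removed
     * and a stylesheet link put in place of the first one, [1] is the collected CSS.
     */
    static String[] split(String html, String cssHref) {
        Matcher matcher = STYLE_BLOCK.matcher(html);
        StringBuilder css = new StringBuilder();
        StringBuffer out = new StringBuffer();
        boolean first = true;
        while (matcher.find()) {
            css.append(matcher.group(1).trim()).append('\n');
            String replacement = first
                    ? "<link rel=\"stylesheet\" href=\"" + cssHref + "\"/>"
                    : "";
            // quoteReplacement keeps '$' and '\' in the link from being treated as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
            first = false;
        }
        matcher.appendTail(out);
        return new String[] { out.toString(), css.toString() };
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { margin: 0; }</style></head>"
                + "<body><p>Hello</p></body></html>";
        String[] result = split(html, "output.css");
        System.out.println(result[0]); // HTML that now links to output.css
        System.out.println(result[1]); // CSS to save as output.css
    }
}
```

You would then write the second element to the CSS file and the first to the HTML file, both as UTF-8.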