How to Convert Word docx Files to HTML Using Apache POI in Java

I needed to display Word document content in a web application. The documents were stored as .docx files on the server, but my web frontend needed HTML. I tried several approaches before finding a reliable solution with Apache POI and XDocReport.

The Problem

My enterprise application stores reports as Word documents. Users wanted to preview these documents in their browsers without downloading them. I needed a way to convert .docx files to HTML programmatically.

My first attempt used basic file reading, which failed:

Error output
Exception in thread "main" java.io.IOException: Invalid header signature;
read 0x0000000000000000, expected 0xE11AB1A1E011CFD0

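The expected value in that message, 0xE11AB1A1E011CFD0, is the signature of the legacy OLE2 binary format used by .doc files. A .docx file is a ZIP archive containing XML files, neither plain text nor OLE2, so I needed a library that parses that structure.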
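To see the ZIP structure for yourself, you can list the archive entries with nothing but the JDK. The sketch below is an illustration, not part of the converter: `DocxEntryLister` and `fakeDocx` are names I made up, and `fakeDocx` builds a tiny in-memory archive that only mimics the layout of a real document so the example runs without a sample file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class DocxEntryLister {

    /** Returns the entry names of a ZIP archive read from the given stream. */
    static List<String> listEntries(InputStream in) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Hypothetical stand-in for a .docx: same entry layout, but not a valid document. */
    static byte[] fakeDocx() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : new String[] {"[Content_Types].xml", "word/document.xml"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<xml/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Prints [[Content_Types].xml, word/document.xml]
        System.out.println(listEntries(new ByteArrayInputStream(fakeDocx())));
    }
}
```

Pointing `listEntries` at `Files.newInputStream(...)` for a real .docx should show the same kind of layout, with the body text in `word/document.xml`.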

Maven Dependencies

I added Apache POI and XDocReport to my pom.xml:

pom.xml
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>5.5.1</version>
</dependency>
<dependency>
    <groupId>fr.opensagres.xdocreport</groupId>
    <artifactId>fr.opensagres.poi.xwpf.converter.xhtml</artifactId>
    <version>2.1.0</version>
</dependency>

Apache POI’s poi-ooxml parses the .docx package into a document model (XWPFDocument). XDocReport’s XHTMLConverter then walks that model and writes it out as HTML.

Loading the Document

First, I needed to load and validate the .docx file:

DocxLoader.java
import org.apache.poi.xwpf.usermodel.XWPFDocument;

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public XWPFDocument loadDocxFromPath(String path) {
    try {
        Path file = Paths.get(path);
        if (!Files.exists(file)) {
            throw new FileNotFoundException("File not found: " + path);
        }
        XWPFDocument document;
        // The constructor reads the whole package into memory,
        // so the input stream can be closed as soon as it returns.
        try (InputStream in = Files.newInputStream(file)) {
            document = new XWPFDocument(in);
        }
        boolean hasParagraphs = !document.getParagraphs().isEmpty();
        boolean hasTables = !document.getTables().isEmpty();
        if (!hasParagraphs && !hasTables) {
            document.close();
            throw new IllegalArgumentException("Document is empty: " + path);
        }
        // The caller owns the returned document and must close it.
        return document;
    } catch (IOException ex) {
        throw new UncheckedIOException("Cannot load document: " + path, ex);
    }
}

The XWPFDocument class provides a clean API over the ZIP+XML structure of .docx files. I validate that the document has content (paragraphs or tables) before returning it.

Why validate before conversion? Empty documents cause cryptic errors during HTML conversion. It’s better to fail fast with a clear message.

Converting to HTML

The core conversion logic:

DocxToHtmlConverter.java
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
import fr.opensagres.poi.xwpf.converter.xhtml.Base64EmbedImgManager;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public void convertDocxToHtml(String docxPath) throws IOException {
    Path input = Paths.get(docxPath);
    String htmlFileName = input.getFileName().toString().replaceFirst("\\.[^.]+$", "") + ".html";
    Path output = input.resolveSibling(htmlFileName);
    try (XWPFDocument document = loadDocxFromPath(docxPath);
         OutputStream out = Files.newOutputStream(output)) {
        XHTMLOptions options = XHTMLOptions.create();
        // Embed images as base64 for self-contained HTML
        options.setImageManager(new Base64EmbedImgManager());
        XHTMLConverter.getInstance().convert(document, out, options);
    }
}

The try-with-resources ensures both the document and output stream close properly, even if an exception occurs.
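The output path in the method above comes from a regex that strips the last extension, followed by `resolveSibling`. A small standalone sketch of just that derivation shows what it does with a few inputs. `HtmlPathDemo` and `htmlSibling` are hypothetical names for this illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class HtmlPathDemo {

    /** Same derivation as in convertDocxToHtml: swap the last extension for .html, keep the directory. */
    static Path htmlSibling(Path input) {
        String name = input.getFileName().toString().replaceFirst("\\.[^.]+$", "") + ".html";
        return input.resolveSibling(name);
    }

    public static void main(String[] args) {
        // Only the last extension is replaced: reports/q3.summary.docx -> reports/q3.summary.html
        // (the separator is platform-dependent).
        System.out.println(htmlSibling(Paths.get("reports", "q3.summary.docx")));
        // No extension at all: the regex does not match, so .html is simply appended.
        System.out.println(htmlSibling(Paths.get("README")));
    }
}
```

One edge case to keep in mind: a name that consists only of a dot and an extension, such as `.hidden`, matches the regex in full and ends up as `.html`.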

Handling Images

My documents contained embedded images. The default behavior saves images as separate files, which complicates deployment. I chose to embed images as base64:

DocxToHtmlConverter.java
// Option 1: Embed images as base64 (self-contained HTML)
options.setImageManager(new Base64EmbedImgManager());
// Option 2: Save images to a directory (better for large images)
// options.setImageManager(new ImageManager(output.getParent().toFile(), "images"));

When to use each option:

| Method | Pros | Cons |
| --- | --- | --- |
| Base64EmbedImgManager | Self-contained HTML, easy deployment | Larger file size, images cannot be cached separately by the browser |
| ImageManager | Smaller HTML files, browser caching works | Must manage an image directory |

For my use case (document previews), base64 embedding was simpler.
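To get a feel for the size trade-off, here is a rough sketch of what base64 embedding means in general: the image bytes become text inside a `data:` URI, at four characters for every three bytes. This is plain JDK code for illustration only. It is not how Base64EmbedImgManager is implemented, and `DataUriDemo`, `toDataUri`, and `encodedLength` are made-up names.

```java
import java.util.Base64;

final class DataUriDemo {

    /** Builds a data: URI for the given bytes; the MIME type is supplied by the caller. */
    static String toDataUri(String mimeType, byte[] content) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(content);
    }

    /** Length of the padded base64 text for n input bytes: 4 characters per 3-byte group. */
    static long encodedLength(long n) {
        return 4 * ((n + 2) / 3);
    }

    public static void main(String[] args) {
        byte[] pngSignature = {(byte) 0x89, 'P', 'N', 'G'};
        // Prints data:image/png;base64,iVBORw==
        System.out.println(toDataUri("image/png", pngSignature));
        // A 300,000-byte image becomes 400,000 characters of HTML.
        System.out.println(encodedLength(300_000));
    }
}
```

So every embedded image adds about a third on top of its original size to the HTML, which is why the directory-based option is the better fit for image-heavy documents.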

Full Working Example

Here’s the complete converter class:

DocxToHtmlConverter.java
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
import fr.opensagres.poi.xwpf.converter.xhtml.Base64EmbedImgManager;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DocxToHtmlConverter {

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: java DocxToHtmlConverter <docx-file>");
            System.exit(1);
        }
        DocxToHtmlConverter converter = new DocxToHtmlConverter();
        try {
            converter.convertDocxToHtml(args[0]);
            System.out.println("Conversion completed successfully.");
        } catch (IOException | RuntimeException e) {
            // loadDocxFromPath reports failures as unchecked exceptions, so catch those too
            System.err.println("Conversion failed: " + e.getMessage());
            System.exit(1);
        }
    }

    public void convertDocxToHtml(String docxPath) throws IOException {
        Path input = Paths.get(docxPath);
        String htmlFileName = input.getFileName().toString().replaceFirst("\\.[^.]+$", "") + ".html";
        Path output = input.resolveSibling(htmlFileName);
        try (XWPFDocument document = loadDocxFromPath(docxPath);
             OutputStream out = Files.newOutputStream(output)) {
            XHTMLOptions options = XHTMLOptions.create();
            options.setImageManager(new Base64EmbedImgManager());
            XHTMLConverter.getInstance().convert(document, out, options);
        }
    }

    public XWPFDocument loadDocxFromPath(String path) {
        try {
            Path file = Paths.get(path);
            if (!Files.exists(file)) {
                throw new FileNotFoundException("File not found: " + path);
            }
            XWPFDocument document;
            // The constructor reads the whole package into memory,
            // so the input stream can be closed as soon as it returns.
            try (InputStream in = Files.newInputStream(file)) {
                document = new XWPFDocument(in);
            }
            boolean hasParagraphs = !document.getParagraphs().isEmpty();
            boolean hasTables = !document.getTables().isEmpty();
            if (!hasParagraphs && !hasTables) {
                document.close();
                throw new IllegalArgumentException("Document is empty: " + path);
            }
            // The caller owns the returned document and must close it.
            return document;
        } catch (IOException ex) {
            throw new UncheckedIOException("Cannot load document: " + path, ex);
        }
    }
}

Common Mistakes

Mistake 1: Not closing XWPFDocument

Wrong.java
// WRONG: Resource leak
XWPFDocument document = new XWPFDocument(Files.newInputStream(file));
XHTMLConverter.getInstance().convert(document, out, options);
// document never closed!

Always use try-with-resources:

Right.java
// RIGHT: Automatic resource management
try (InputStream in = Files.newInputStream(file);
     XWPFDocument document = new XWPFDocument(in)) {
    XHTMLConverter.getInstance().convert(document, out, options);
}

Mistake 2: Forgetting UTF-8 encoding

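The converter writes UTF-8 by default. If you route the output through a FileWriter created without an explicit charset, the text is encoded with the platform default instead, which is not guaranteed to be UTF-8 before Java 18, and non-ASCII characters get corrupted: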

Wrong.java
// WRONG: Uses system default encoding
Writer writer = new FileWriter(outputFile);

Use OutputStream instead, which preserves the UTF-8 bytes:

Right.java
// RIGHT: Preserves UTF-8 encoding
OutputStream out = Files.newOutputStream(output);
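If you are unsure what a charset mismatch looks like in practice, the following sketch reproduces it without any Word document. It encodes a string with ISO-8859-1 (standing in for a non-UTF-8 platform default) and decodes it as UTF-8, the way a browser would read a page that declares UTF-8. It also shows a writer with an explicit charset for cases where you really need a Writer. `EncodingDemo` and its methods are hypothetical names for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class EncodingDemo {

    /** Encodes with ISO-8859-1, then decodes as UTF-8: the mismatch behind corrupted characters. */
    static String writeLatin1ReadUtf8(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    /** Writes through a Writer with an explicit UTF-8 charset and reads the bytes back as UTF-8. */
    static String roundTripUtf8(String text) {
        try {
            Path tmp = Files.createTempFile("preview", ".html");
            try (Writer writer = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                writer.write(text);
            }
            String back = new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            Files.delete(tmp);
            return back;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café", written with an escape so the source file encoding does not matter
        System.out.println(original.equals(writeLatin1ReadUtf8(original))); // false: the accented character is lost
        System.out.println(original.equals(roundTripUtf8(original)));       // true
    }
}
```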

Mistake 3: Not validating input

Passing a non-existent file or corrupted document produces confusing errors. Always validate:

Right.java
Path file = Paths.get(path);
if (!Files.exists(file)) {
    throw new FileNotFoundException("File not found: " + path);
}
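The existence check can be extended with a cheap signature test that catches legacy .doc files and other non-ZIP input before POI sees them. The sketch below is one possible pre-flight check under my own assumptions, not something POI or XDocReport require. `DocxPreflight`, `looksLikeZip`, and `problemWith` are hypothetical names, and `readNBytes` needs Java 11 or newer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class DocxPreflight {

    /** True if the bytes start with the ZIP local-file-header signature: 'P', 'K', 0x03, 0x04. */
    static boolean looksLikeZip(byte[] head) {
        return head.length >= 4
                && head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4;
    }

    /** Returns null if the file passes the cheap checks, otherwise a human-readable reason. */
    static String problemWith(Path file) {
        if (!Files.isRegularFile(file)) {
            return "Not a regular file: " + file;
        }
        if (!Files.isReadable(file)) {
            return "Not readable: " + file;
        }
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = in.readNBytes(4); // Java 11+
            return looksLikeZip(head) ? null : "Not a ZIP-based document: " + file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A passing check only says the file is a ZIP archive. A corrupted or non-Word archive still fails later when the document is loaded, so this complements the error handling around the loader and does not replace it.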

What Gets Converted

XHTMLConverter handles most common Word elements:

| Element | HTML Output |
| --- | --- |
| Paragraphs | <p> |
| Bold/Italic | <strong>, <em> |
| Tables | <table>, <tr>, <td> |
| Images | <img> (base64 or file reference) |
| Lists | <ul>, <ol>, <li> |
| Headings | <h1> through <h6> |

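Limitations: Complex layouts, floating elements, and some advanced formatting may not convert perfectly. The exact tags and inline styles can also vary with the converter version, so treat the table above as a guide and inspect the generated HTML for your actual documents.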

Summary

Converting .docx to HTML with Apache POI and XDocReport comes down to five steps:

  1. Add poi-ooxml and fr.opensagres.poi.xwpf.converter.xhtml dependencies
  2. Load the document with XWPFDocument and validate content
  3. Configure XHTMLOptions with an ImageManager for embedded images
  4. Call XHTMLConverter.getInstance().convert() to generate HTML
  5. Write output with proper UTF-8 encoding

The two-library approach (POI for parsing, XDocReport for conversion) handles paragraphs, tables, and images automatically. For enterprise applications, add regression tests with real sample documents to ensure layout fidelity.
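One simple way to build such regression tests is a golden-file comparison: convert a sample document, then compare the result against a stored HTML file that you have reviewed by hand. The sketch below covers only the comparison step, under the assumption that whitespace differences between runs should be ignored. `HtmlGolden`, `normalize`, and `matchesGolden` are hypothetical names.

```java
final class HtmlGolden {

    /** Collapses whitespace runs and drops whitespace between tags so layout-only changes do not fail the check. */
    static String normalize(String html) {
        return html.replaceAll("\\s+", " ").replaceAll(">\\s+<", "><").trim();
    }

    /** True if the converted HTML matches the reviewed golden copy after normalization. */
    static boolean matchesGolden(String actual, String golden) {
        return normalize(actual).equals(normalize(golden));
    }

    public static void main(String[] args) {
        String golden = "<p>Hi</p><p>there</p>";
        System.out.println(matchesGolden("<p>Hi</p>\n  <p>there</p>", golden)); // true
        System.out.println(matchesGolden("<p>Hi</p><p>you</p>", golden));       // false
    }
}
```

Dropping whitespace between tags is a deliberate simplification. It also hides a space between two inline elements, so tighten the normalization if your documents depend on that.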

Final Words + More Resources

My intention with this article was to share my knowledge and experience with others. If you want to contact me, you can reach me by email.
