How to Convert Word docx Files to HTML Using Apache POI in Java
I needed to display Word document content in a web application. The documents were stored as .docx files on the server, but my web frontend needed HTML. I tried several approaches before finding a reliable solution with Apache POI and XDocReport.
The Problem
My enterprise application stores reports as Word documents. Users wanted to preview these documents in their browsers without downloading them. I needed a way to convert .docx files to HTML programmatically.
My first attempt used basic file reading, which obviously failed:
    Exception in thread "main" java.io.IOException: Invalid header signature; read 0x0000000000000000, expected 0xE11AB1A1E011CFD0

A .docx file is a ZIP archive containing XML files, not plain text. I needed a proper library to parse its structure.
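To see the ZIP structure for yourself, here is a minimal sketch that uses only the JDK's java.util.zip package to check the file signature and list the archive entries. The class name DocxPeek and its method names are my own, chosen for illustration; with a typical .docx you would expect entries such as word/document.xml in the listing.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration; not part of POI or XDocReport.
final class DocxPeek {

    /** Returns true if the data starts with the ZIP local-file signature "PK\003\004". */
    static boolean looksLikeZip(byte[] data) {
        return data.length >= 4
                && data[0] == 'P' && data[1] == 'K'
                && data[2] == 3 && data[3] == 4;
    }

    /** Lists the entry names of a ZIP archive read from the given stream. */
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java DocxPeek <docx-file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("Looks like a ZIP archive: " + looksLikeZip(data));
        if (looksLikeZip(data)) {
            listEntries(new ByteArrayInputStream(data)).forEach(System.out::println);
        }
    }
}
```

Reading the whole file into a byte array keeps the sketch short; for large documents you would stream instead.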
Maven Dependencies
I added Apache POI and XDocReport to my pom.xml:
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>5.5.1</version>
    </dependency>
    <dependency>
        <groupId>fr.opensagres.xdocreport</groupId>
        <artifactId>fr.opensagres.poi.xwpf.converter.xhtml</artifactId>
        <version>2.1.0</version>
    </dependency>

Apache POI’s poi-ooxml handles the .docx format (which is a ZIP containing XML). XDocReport’s XHTMLConverter transforms the document model to HTML.
Loading the Document
First, I needed to load and validate the .docx file:
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public XWPFDocument loadDocxFromPath(String path) {
        try {
            Path file = Paths.get(path);
            if (!Files.exists(file)) {
                throw new FileNotFoundException("File not found: " + path);
            }
            XWPFDocument document;
            // The document is read fully into memory, so the stream can be closed right away
            try (InputStream in = Files.newInputStream(file)) {
                document = new XWPFDocument(in);
            }
            boolean hasParagraphs = !document.getParagraphs().isEmpty();
            boolean hasTables = !document.getTables().isEmpty();
            if (!hasParagraphs && !hasTables) {
                document.close();
                throw new IllegalArgumentException("Document is empty: " + path);
            }
            return document;
        } catch (IOException ex) {
            throw new UncheckedIOException("Cannot load document: " + path, ex);
        }
    }

The XWPFDocument class provides a clean API over the ZIP+XML structure of .docx files. I validate that the document has content (paragraphs or tables) before returning it, and the caller is responsible for closing the returned document.
Why validate before conversion? Empty documents cause cryptic errors during HTML conversion. It’s better to fail fast with a clear message.
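The same fail-fast idea can be applied before POI even opens the file. The sketch below is an optional pre-check that uses only the JDK. The class DocxPreflight, its method name, and its messages are hypothetical and not part of POI or XDocReport. It only rules out obviously wrong inputs and does not prove the file is a valid Word document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical pre-check for illustration; cheap tests only, no parsing.
final class DocxPreflight {

    /** Returns a description of the first problem found, or empty if the cheap checks pass. */
    static Optional<String> findProblem(Path file) throws IOException {
        if (!Files.isRegularFile(file)) {
            return Optional.of("Not a regular file: " + file);
        }
        if (!Files.isReadable(file)) {
            return Optional.of("File is not readable: " + file);
        }
        if (Files.size(file) == 0) {
            return Optional.of("File is empty: " + file);
        }
        // A .docx is a ZIP archive, and ZIP files start with the bytes 'P' 'K'
        try (InputStream in = Files.newInputStream(file)) {
            int first = in.read();
            int second = in.read();
            if (first != 'P' || second != 'K') {
                return Optional.of("Not a ZIP-based file (possibly a legacy .doc): " + file);
            }
        }
        return Optional.empty();
    }
}
```

A caller could run `DocxPreflight.findProblem(path)` first and surface the message to the user, then fall through to the POI-based loading only when the result is empty.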
Converting to HTML
The core conversion logic:
    import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
    import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
    import fr.opensagres.poi.xwpf.converter.xhtml.Base64EmbedImgManager;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public void convertDocxToHtml(String docxPath) throws IOException {
        Path input = Paths.get(docxPath);
        // Replace the file extension with .html and write next to the input file
        String htmlFileName = input.getFileName().toString().replaceFirst("\\.[^.]+$", "") + ".html";
        Path output = input.resolveSibling(htmlFileName);

        try (XWPFDocument document = loadDocxFromPath(docxPath);
             OutputStream out = Files.newOutputStream(output)) {

            XHTMLOptions options = XHTMLOptions.create();
            // Embed images as base64 for self-contained HTML
            options.setImageManager(new Base64EmbedImgManager());

            XHTMLConverter.getInstance().convert(document, out, options);
        }
    }

The try-with-resources ensures both the document and output stream close properly, even if an exception occurs.
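The closing behavior described above can be observed without POI at all. This small sketch uses a stand-in resource class (Tracked, a name I made up for the demo) to record what happens when the body of a two-resource try-with-resources throws.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: Tracked stands in for XWPFDocument and OutputStream.
final class CloseOrderDemo {

    /** A stand-in resource that records when it is opened and closed. */
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
            log.add("open " + name);
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    /** Opens two resources, fails inside the body, and returns the recorded events. */
    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (Tracked document = new Tracked("document", log);
             Tracked out = new Tracked("out", log)) {
            throw new IllegalStateException("conversion failed");
        } catch (IllegalStateException e) {
            // Both resources are already closed by the time this block runs
            log.add("caught " + e.getMessage());
        }
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Running it prints the events in this order: open document, open out, close out, close document, caught conversion failed. Resources are closed in reverse order of declaration, and before the catch block executes.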
Handling Images
My documents contained embedded images. The default behavior saves images as separate files, which complicates deployment. I chose to embed images as base64:
    // Option 1: Embed images as base64 (self-contained HTML)
    options.setImageManager(new Base64EmbedImgManager());

    // Option 2: Save images to a directory (better for large images)
    // options.setImageManager(new ImageManager(output.getParent().toFile(), "images"));

When to use each option:
| Method | Pros | Cons |
|---|---|---|
| Base64EmbedImgManager | Self-contained HTML, easy deployment | Larger file size, browser caching disabled |
| ImageManager | Smaller files, browser caching works | Must manage image directory |
For my use case (document previews), base64 embedding was simpler.
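To make the size trade-off concrete, here is a sketch of what base64 embedding amounts to: the image bytes become a data URI inside the src attribute. This illustrates the general technique with the JDK's Base64 class. It is not XDocReport's internal code, and DataUriDemo is a name I chose for the example.

```java
import java.util.Base64;

// Illustration of the data-URI technique; not XDocReport's implementation.
final class DataUriDemo {

    /** Builds a data URI of the form data:<mime>;base64,<payload>. */
    static String toDataUri(String mimeType, byte[] bytes) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        byte[] fakeImage = new byte[300]; // placeholder bytes, not a real PNG
        String uri = toDataUri("image/png", fakeImage);
        int payloadLength = uri.length() - "data:image/png;base64,".length();

        System.out.println("<img src=\"" + uri.substring(0, 40) + "...\" />");
        System.out.println("raw bytes: " + fakeImage.length + ", base64 chars: " + payloadLength);
    }
}
```

Base64 encodes every 3 bytes as 4 characters, so the 300 placeholder bytes become 400 characters. That roughly one-third overhead is the "larger file size" cost in the table.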
Full Working Example
Here’s the complete converter class:
    import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
    import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
    import fr.opensagres.poi.xwpf.converter.xhtml.Base64EmbedImgManager;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.UncheckedIOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class DocxToHtmlConverter {

        public static void main(String[] args) {
            if (args.length != 1) {
                System.err.println("Usage: java DocxToHtmlConverter <docx-file>");
                System.exit(1);
            }

            DocxToHtmlConverter converter = new DocxToHtmlConverter();
            try {
                converter.convertDocxToHtml(args[0]);
                System.out.println("Conversion completed successfully.");
            } catch (IOException | UncheckedIOException | IllegalArgumentException e) {
                // loadDocxFromPath reports missing or empty files with unchecked exceptions
                System.err.println("Conversion failed: " + e.getMessage());
                System.exit(1);
            }
        }

        public void convertDocxToHtml(String docxPath) throws IOException {
            Path input = Paths.get(docxPath);
            // Replace the file extension with .html and write next to the input file
            String htmlFileName = input.getFileName().toString().replaceFirst("\\.[^.]+$", "") + ".html";
            Path output = input.resolveSibling(htmlFileName);

            try (XWPFDocument document = loadDocxFromPath(docxPath);
                 OutputStream out = Files.newOutputStream(output)) {

                XHTMLOptions options = XHTMLOptions.create();
                options.setImageManager(new Base64EmbedImgManager());

                XHTMLConverter.getInstance().convert(document, out, options);
            }
        }

        public XWPFDocument loadDocxFromPath(String path) {
            try {
                Path file = Paths.get(path);
                if (!Files.exists(file)) {
                    throw new FileNotFoundException("File not found: " + path);
                }
                XWPFDocument document;
                // The document is read fully into memory, so the stream can be closed right away
                try (InputStream in = Files.newInputStream(file)) {
                    document = new XWPFDocument(in);
                }
                boolean hasParagraphs = !document.getParagraphs().isEmpty();
                boolean hasTables = !document.getTables().isEmpty();
                if (!hasParagraphs && !hasTables) {
                    document.close();
                    throw new IllegalArgumentException("Document is empty: " + path);
                }
                return document;
            } catch (IOException ex) {
                throw new UncheckedIOException("Cannot load document: " + path, ex);
            }
        }
    }

Common Mistakes
Mistake 1: Not closing XWPFDocument
    // WRONG: Resource leak
    XWPFDocument document = new XWPFDocument(Files.newInputStream(file));
    XHTMLConverter.getInstance().convert(document, out, options);
    // document never closed!

Always use try-with-resources:

    // RIGHT: Automatic resource management
    try (XWPFDocument document = new XWPFDocument(Files.newInputStream(file))) {
        XHTMLConverter.getInstance().convert(document, out, options);
    }

Mistake 2: Forgetting UTF-8 encoding
The converter outputs UTF-8 by default. A FileWriter created without an explicit charset uses the platform default encoding, which was not guaranteed to be UTF-8 before Java 18, so writing through it can corrupt non-ASCII characters:

    // WRONG: Uses system default encoding
    Writer writer = new FileWriter(outputFile);

Use OutputStream instead, which preserves the UTF-8 bytes:

    // RIGHT: Preserves UTF-8 encoding
    OutputStream out = Files.newOutputStream(output);

Mistake 3: Not validating input
Passing a non-existent file or corrupted document produces confusing errors. Always validate:
    Path file = Paths.get(path);
    if (!Files.exists(file)) {
        throw new FileNotFoundException("File not found: " + path);
    }

What Gets Converted
XHTMLConverter handles most common Word elements:
| Element | HTML Output |
|---|---|
| Paragraphs | <p> |
| Bold/Italic | <strong>, <em> |
| Tables | <table>, <tr>, <td> |
| Images | <img> (base64 or file reference) |
| Lists | <ul>, <ol>, <li> |
| Headings | <h1> through <h6> |
Limitations: Complex layouts, floating elements, and some advanced formatting may not convert perfectly. Test with your actual documents.
Summary
Converting .docx to HTML with Apache POI and XDocReport comes down to five steps:
- Add poi-ooxml and fr.opensagres.poi.xwpf.converter.xhtml dependencies
- Load the document with XWPFDocument and validate content
- Configure XHTMLOptions with an ImageManager for embedded images
- Call XHTMLConverter.getInstance().convert() to generate HTML
- Write output with proper UTF-8 encoding
The two-library approach (POI for parsing, XDocReport for conversion) handles paragraphs, tables, and images automatically. For enterprise applications, add regression tests with real sample documents to ensure layout fidelity.
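A regression check of the kind mentioned above could be sketched as a golden-file comparison: convert a sample document, then compare the result against a previously reviewed HTML file. The helper below is hypothetical (the class HtmlGolden and its method names are mine) and uses only the JDK. In a real project you would probably call it from your test framework of choice.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical golden-file helper for illustration.
final class HtmlGolden {

    /** Collapses whitespace runs so insignificant formatting changes do not fail the comparison. */
    static String normalize(String html) {
        return html.replaceAll("\\s+", " ").trim();
    }

    /** Returns true if the generated HTML matches the stored golden file after normalization. */
    static boolean matchesGolden(Path generated, Path golden) throws IOException {
        String actual = new String(Files.readAllBytes(generated), StandardCharsets.UTF_8);
        String expected = new String(Files.readAllBytes(golden), StandardCharsets.UTF_8);
        return normalize(actual).equals(normalize(expected));
    }
}
```

Collapsing whitespace is a deliberate simplification. It would hide changes inside whitespace-sensitive elements such as pre, so a stricter comparison may be needed for documents that rely on them.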
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.