How to Handle Images When Converting Word Documents to HTML with Apache POI

I was converting Word documents to HTML using Apache POI when I noticed something troubling. The generated HTML showed broken image placeholders everywhere. The text converted perfectly, but all the embedded images were missing.

After hours of debugging, I realized the conversion wasn’t automatically extracting images from the Word documents. I needed to handle image extraction explicitly. Here’s what I learned.

The Problem with Default Conversion

When I first tried converting a docx file to HTML, I used a simple approach:

BasicConversion.java
XWPFDocument document = new XWPFDocument(new FileInputStream("document.docx"));
XHTMLOptions options = XHTMLOptions.create();
XHTMLConverter.getInstance()
    .convert(document, new FileOutputStream("output.html"), options);

The HTML was generated correctly, but every <img> tag had an empty or broken src attribute. The images embedded in the Word document weren’t being extracted to the filesystem.

This makes sense in hindsight. The converter can’t assume where you want images stored or what URL structure you want to use. You need to tell it explicitly.
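To confirm the symptom on your own output, a small check like the following can list every src that does not point at a real file. This is a minimal sketch using only the JDK; the class name BrokenImageCheck and its method are hypothetical, and the regular expression is a rough pattern rather than a real HTML parser.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BrokenImageCheck {
    // Rough match for src="..." or src='...' inside an <img> tag (not a full HTML parser)
    private static final Pattern IMG_SRC =
            Pattern.compile("<img\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Returns every src value that is empty or does not resolve to a file relative to htmlDir. */
    public static List<String> findBrokenSources(String html, Path htmlDir) {
        List<String> broken = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            String src = m.group(2);
            // Skip data: and http(s): URLs, which have no local file to check
            if (src.matches("[a-zA-Z][a-zA-Z0-9+.-]+:.*")) {
                continue;
            }
            if (src.isEmpty() || !Files.isRegularFile(htmlDir.resolve(src))) {
                broken.add(src);
            }
        }
        return broken;
    }

    public static void main(String[] args) throws IOException {
        Path htmlFile = Path.of(args.length > 0 ? args[0] : "output.html");
        String html = Files.readString(htmlFile, StandardCharsets.UTF_8);
        Path htmlDir = htmlFile.toAbsolutePath().getParent();
        for (String src : findBrokenSources(html, htmlDir)) {
            System.out.println("Broken image reference: \"" + src + "\"");
        }
    }
}
```

Running it against the generated file before and after configuring image handling makes it easy to see whether the fix actually took effect.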

Solution for DOCX Files Using XDocReport

For modern .docx files, XDocReport provides an ImageManager class that handles image extraction automatically. Here’s the proper approach:

DocxImageExtraction.java
import fr.opensagres.poi.xwpf.converter.core.ImageManager;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import java.io.*;
import java.nio.file.*;
// First, create the images directory
Path imagesDir = Paths.get("output/images");
Files.createDirectories(imagesDir);
// Load the document
XWPFDocument document = new XWPFDocument(new FileInputStream("document.docx"));
// Create options with ImageManager
XHTMLOptions options = XHTMLOptions.create();
options.setImageManager(new ImageManager(new File("output"), "images"));
// Convert
XHTMLConverter.getInstance()
    .convert(document, new FileOutputStream("output/index.html"), options);

The ImageManager constructor takes two parameters:

  1. The base output directory where your HTML file will be saved
  2. The subdirectory name for images (relative to the base directory)

After conversion, your output structure looks like this:

Output Directory Structure
output/
├── index.html
└── images/
    ├── image1.png
    ├── image2.jpg
    └── image3.png

The generated HTML contains correct relative paths:

Generated HTML
<img src="images/image1.png" alt="embedded image">

Understanding the ImageManager Parameters

I initially confused the parameters and got broken paths. Let me clarify:

ImageManagerParameters.java
// CORRECT: Base dir is where HTML is saved, images folder is relative
XHTMLOptions options = XHTMLOptions.create();
options.setImageManager(new ImageManager(
    new File("output"), // Base directory (where index.html is saved)
    "images"            // Subfolder name for images
));
// WRONG: Using absolute paths can cause issues
options.setImageManager(new ImageManager(
    new File("/absolute/path/to/output"),
    "/absolute/path/to/images" // This creates broken relative paths!
));

The key insight is that ImageManager generates relative paths. The src attribute in HTML will be images/filename.ext, which is relative to where your HTML file is saved.
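If you ever need to compute such a relative src yourself, for example when the images live somewhere other than a direct subfolder, java.nio can do the work. The sketch below uses only the JDK; RelativeSrc and srcFor are hypothetical names chosen for illustration.

```java
import java.nio.file.Path;

public class RelativeSrc {
    /** Computes the src value for an image, relative to the directory containing the HTML file. */
    public static String srcFor(Path htmlFile, Path imageFile) {
        Path htmlDir = htmlFile.toAbsolutePath().normalize().getParent();
        Path relative = htmlDir.relativize(imageFile.toAbsolutePath().normalize());
        // URLs use forward slashes, even where the filesystem uses backslashes
        return relative.toString().replace('\\', '/');
    }

    public static void main(String[] args) {
        System.out.println(srcFor(Path.of("output/index.html"), Path.of("output/images/image1.png")));
    }
}
```

Because the result depends only on where the two files sit relative to each other, the output folder can be moved or zipped as a whole without breaking the references.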

Alternative: Base64 Embedded Images

If you prefer self-contained HTML without external image files, use Base64EmbedImgManager:

Base64Images.java
XHTMLOptions options = XHTMLOptions.create();
options.setImageManager(new Base64EmbedImgManager());
XHTMLConverter.getInstance()
    .convert(document, new FileOutputStream("output.html"), options);

This embeds images directly in HTML as base64 data URLs. The HTML file becomes larger but has no external dependencies.
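For reference, a base64 data URL is just the MIME type plus the encoded bytes. The following sketch shows the format using only the JDK; it illustrates the URL shape and is not how XDocReport implements it internally. DataUrlSketch and toDataUrl are hypothetical names.

```java
import java.util.Base64;

public class DataUrlSketch {
    /** Builds a data URL of the form data:<mime>;base64,<encoded bytes>. */
    public static String toDataUrl(byte[] imageBytes, String mimeType) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(imageBytes);
    }

    public static void main(String[] args) {
        // First four bytes of the PNG signature, used here only as sample data
        byte[] sample = {(byte) 0x89, 'P', 'N', 'G'};
        System.out.println("<img src=\"" + toDataUrl(sample, "image/png") + "\">");
    }
}
```

Base64 turns every 3 bytes into 4 characters, so embedded images take roughly a third more space than the original files. That is the size cost mentioned above.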

Solution for Legacy DOC Files

For older .doc files (not .docx), you need a different approach using HWPF and WordToHtmlConverter:

DocImageExtraction.java
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.w3c.dom.Document;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.*;
import java.nio.file.*;
// Create images directory first
Path imagesDir = Paths.get("output/images");
Files.createDirectories(imagesDir);
// Load the document
HWPFDocument document = new HWPFDocument(new FileInputStream("document.doc"));
// Create the converter
WordToHtmlConverter converter = new WordToHtmlConverter(
    DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .newDocument()
);
// Set custom PicturesManager to handle image extraction
converter.setPicturesManager((content, pictureType, suggestedName, width, height) -> {
    // Store the image under images/ using the name POI suggests
    String imageName = "images/" + suggestedName;
    // Write image to disk
    try {
        Path imagePath = Paths.get("output", imageName);
        Files.write(imagePath, content);
    } catch (IOException e) {
        throw new RuntimeException("Failed to write image: " + imageName, e);
    }
    // Return the relative path for the HTML src attribute
    return imageName;
});
// Convert
converter.processDocument(document);
// Save HTML output
Document htmlDocument = converter.getDocument();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.METHOD, "html");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
// try-with-resources closes the output file once the HTML is written
try (OutputStream out = new FileOutputStream("output/index.html")) {
    transformer.transform(new DOMSource(htmlDocument), new StreamResult(out));
}

Understanding the PicturesManager Interface

The PicturesManager is a functional interface with one method:

PicturesManager.java
public interface PicturesManager {
    String savePicture(
        byte[] content,          // Raw image bytes
        PictureType pictureType, // Enum value: PNG, JPEG, etc.
        String suggestedName,    // Suggested filename from POI
        float widthInches,       // Image width in inches
        float heightInches       // Image height in inches
    );
}

Your implementation must:

  1. Save the image bytes to disk
  2. Return the path (relative to HTML file) to use in the src attribute
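Those two steps can be isolated in a small helper that has no POI dependency, which also makes them easy to test. This is a sketch under that assumption; ImageSaver and save are hypothetical names, not part of POI.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ImageSaver {
    private final Path htmlDir;  // directory that will contain the HTML file
    private final String subDir; // image folder name, relative to htmlDir

    public ImageSaver(Path htmlDir, String subDir) {
        this.htmlDir = htmlDir;
        this.subDir = subDir;
    }

    /** Writes the bytes to htmlDir/subDir/fileName and returns the value for the src attribute. */
    public String save(byte[] content, String fileName) {
        Path target = htmlDir.resolve(subDir).resolve(fileName);
        try {
            Files.createDirectories(target.getParent());
            Files.write(target, content);
        } catch (IOException e) {
            // Wrapped so it can be thrown from a callback that declares no checked exceptions
            throw new UncheckedIOException("Failed to write image: " + target, e);
        }
        return subDir + "/" + fileName;
    }
}
```

A PicturesManager lambda could then delegate to it, along the lines of (content, type, name, w, h) -> saver.save(content, name), which keeps the filesystem logic out of the converter setup.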

Here’s a more robust implementation:

RobustPicturesManager.java
converter.setPicturesManager((content, pictureType, suggestedName, width, height) -> {
    // Determine file extension based on picture type
    String extension = switch (pictureType) {
        case PNG -> ".png";
        case JPEG -> ".jpg";
        case BMP -> ".bmp";
        case WMF -> ".wmf";
        default -> ".bin";
    };
    // Sanitize the suggested name, or fall back to a timestamp-based one
    String baseName = suggestedName != null ?
        suggestedName.replaceAll("[^a-zA-Z0-9.-]", "_") :
        "image_" + System.currentTimeMillis();
    // The suggested name may already carry the extension; don't append it twice
    String fileName = baseName.toLowerCase().endsWith(extension) ? baseName : baseName + extension;
    String relativePath = "images/" + fileName;
    try {
        // Ensure directory exists
        Path imagePath = Paths.get("output/images", fileName);
        Files.createDirectories(imagePath.getParent());
        // Write image
        Files.write(imagePath, content);
    } catch (IOException e) {
        // savePicture declares no checked exceptions, so rethrow unchecked
        throw new UncheckedIOException("Failed to write image: " + relativePath, e);
    }
    return relativePath;
});
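A timestamp fallback does not guarantee unique names when several images are processed within the same millisecond, or when two pictures sanitize to the same name. One way to handle that is a small allocator that remembers which names it has handed out. This is a JDK-only sketch; UniqueImageNames is a hypothetical helper, not something POI provides.

```java
import java.util.HashSet;
import java.util.Set;

public class UniqueImageNames {
    private final Set<String> used = new HashSet<>();

    /** Sanitizes the suggested name and appends _1, _2, ... if that name was already handed out. */
    public String next(String suggestedName) {
        String safe = (suggestedName == null || suggestedName.isBlank())
                ? "image"
                : suggestedName.replaceAll("[^A-Za-z0-9._-]", "_");
        int dot = safe.lastIndexOf('.');
        String stem = dot > 0 ? safe.substring(0, dot) : safe;
        String ext = dot > 0 ? safe.substring(dot) : "";
        String candidate = safe;
        // Set.add returns false when the name is already taken, so keep trying with a counter
        for (int i = 1; !used.add(candidate); i++) {
            candidate = stem + "_" + i + ext;
        }
        return candidate;
    }
}
```

Create one instance per conversion and call next(suggestedName) inside the PicturesManager callback before building the file path.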

Common Mistakes I Made

Mistake 1: Not Creating the Images Directory

Mistake1.java
// WRONG: Directory doesn't exist yet
options.setImageManager(new ImageManager(new File("output"), "images"));
// Conversion fails with "directory not found" error
// CORRECT: Create directory first
Files.createDirectories(Paths.get("output/images"));
options.setImageManager(new ImageManager(new File("output"), "images"));

Don’t rely on the ImageManager to create directories for you. Create the target directory before conversion; Files.createDirectories() does nothing if it already exists.

Mistake 2: Using Absolute Paths

Mistake2.java
// WRONG: Returns absolute path, breaks when HTML is moved
return "/Users/me/project/output/images/image1.png";
// CORRECT: Returns relative path, works anywhere
return "images/image1.png";

The returned path from PicturesManager goes directly into the HTML src attribute. Use relative paths for portability.

Mistake 3: Not Closing Streams

Mistake3.java
// imagePath is a java.nio.file.Path here
// WRONG: Stream never closed
FileOutputStream fos = new FileOutputStream(imagePath.toFile());
fos.write(content);
// fos.close() never called!
// CORRECT: Use try-with-resources
try (FileOutputStream fos = new FileOutputStream(imagePath.toFile())) {
    fos.write(content);
}

Or better, use Files.write(), which opens, writes, and closes the file in one call:

CorrectFileWrite.java
Files.write(imagePath, content);

Complete Working Example for DOCX

Here’s a complete, ready-to-use solution for docx files:

WordToHtmlConverter.java
import fr.opensagres.poi.xwpf.converter.core.ImageManager;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import java.io.*;
import java.nio.file.*;
public class WordToHtmlConverter {
    public static void convert(Path docxPath, Path outputDir) throws IOException {
        // Create output directories
        Files.createDirectories(outputDir);
        Files.createDirectories(outputDir.resolve("images"));
        // Load document
        try (InputStream is = Files.newInputStream(docxPath);
             XWPFDocument document = new XWPFDocument(is)) {
            // Configure HTML options
            XHTMLOptions options = XHTMLOptions.create();
            options.setImageManager(new ImageManager(
                outputDir.toFile(),
                "images"
            ));
            // Convert
            Path htmlPath = outputDir.resolve("index.html");
            try (OutputStream os = Files.newOutputStream(htmlPath)) {
                XHTMLConverter.getInstance().convert(document, os, options);
            }
        }
    }
    public static void main(String[] args) throws IOException {
        convert(
            Paths.get("input.docx"),
            Paths.get("output")
        );
    }
}

Dependencies Required

For docx conversion with XDocReport:

pom.xml
<dependencies>
    <!-- Apache POI for Word documents -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>5.2.5</version>
    </dependency>
    <!-- XDocReport for docx to HTML conversion -->
    <dependency>
        <groupId>fr.opensagres.xdocreport</groupId>
        <artifactId>fr.opensagres.poi.xwpf.converter.xhtml</artifactId>
        <version>2.0.4</version>
    </dependency>
</dependencies>

For legacy doc conversion:

pom.xml
<dependencies>
    <!-- Apache POI HWPF for .doc files (includes WordToHtmlConverter) -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-scratchpad</artifactId>
        <version>5.2.5</version>
    </dependency>
    <!-- Only needed if you also handle .docx files -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>5.2.5</version>
    </dependency>
</dependencies>

When to Use Each Approach

Use XDocReport with ImageManager for:

  • Modern .docx files
  • Automatic image extraction
  • Clean, maintainable code

Use PicturesManager with WordToHtmlConverter for:

  • Legacy .doc files
  • Full control over image processing
  • Custom naming or transformation logic
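If your input is a mix of both formats, the file extension is not always trustworthy. A hedged alternative is to look at the first bytes of the file: .docx files are ZIP containers and legacy .doc files are OLE2 compound files, and each has a well-known signature. The sketch below uses only the JDK; WordFormatSniffer is a hypothetical helper. Note that a ZIP signature only means the file is probably a .docx, since other Office formats and plain ZIP archives share it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class WordFormatSniffer {
    public enum Format { DOCX, DOC, UNKNOWN }

    /** Classifies a file header: ZIP container (likely .docx) or OLE2 compound file (likely .doc). */
    public static Format detect(byte[] header) {
        if (startsWith(header, 0x50, 0x4B, 0x03, 0x04)) {
            return Format.DOCX; // "PK" followed by 03 04: ZIP local file header
        }
        if (startsWith(header, 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1)) {
            return Format.DOC;  // OLE2 compound document signature
        }
        return Format.UNKNOWN;
    }

    /** Reads the first 8 bytes of the file and classifies them. */
    public static Format detect(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return detect(in.readNBytes(8));
        }
    }

    private static boolean startsWith(byte[] data, int... magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A dispatcher can then route DOCX results to the XDocReport path and DOC results to the WordToHtmlConverter path, and reject everything else early with a clear error.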

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.

The key takeaway: Apache POI doesn’t automatically extract images when converting to HTML. You must use ImageManager for docx files or implement PicturesManager for doc files. Always create the images directory before conversion, and use relative paths for portability.
