How to Handle Images When Converting Word Documents to HTML with Apache POI
I was converting Word documents to HTML using Apache POI when I noticed something troubling. The generated HTML showed broken image placeholders everywhere. The text converted perfectly, but all the embedded images were missing.
After hours of debugging, I realized the conversion wasn’t automatically extracting images from the Word documents. I needed to handle image extraction explicitly. Here’s what I learned.
The Problem with Default Conversion
When I first tried converting a docx file to HTML, I used a simple approach:
    XWPFDocument document = new XWPFDocument(new FileInputStream("document.docx"));
    XHTMLOptions options = XHTMLOptions.create();
    XHTMLConverter.getInstance()
        .convert(document, new FileOutputStream("output.html"), options);
The HTML generated correctly, but every <img> tag had an empty or broken src attribute. The images embedded in the Word document weren’t being extracted to the filesystem.
This makes sense in hindsight. The converter can’t assume where you want images stored or what URL structure you want to use. You need to tell it explicitly.
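To confirm that a generated file really has this problem, it helps to scan it for image references that point nowhere. The following is a minimal sketch using only the JDK. The class name BrokenImageCheck and its regex are my own illustration, not part of POI or XDocReport, and the regex is a rough check rather than a full HTML parser.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists <img> src values that are empty or point to missing files.
final class BrokenImageCheck {

    // Captures the value of a double-quoted src attribute inside an <img> tag.
    private static final Pattern IMG_SRC =
            Pattern.compile("<img\\b[^>]*?\\bsrc\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns every local src value that is empty or does not resolve to an existing file. */
    static List<String> findBrokenSources(Path htmlFile) throws IOException {
        String html = Files.readString(htmlFile);
        Path baseDir = htmlFile.toAbsolutePath().getParent();
        List<String> broken = new ArrayList<>();
        Matcher matcher = IMG_SRC.matcher(html);
        while (matcher.find()) {
            String src = matcher.group(1);
            // Embedded and remote images are outside the scope of this check.
            if (src.startsWith("data:") || src.startsWith("http:") || src.startsWith("https:")) {
                continue;
            }
            // Relative src values are resolved against the folder of the HTML file.
            if (src.isEmpty() || !Files.isRegularFile(baseDir.resolve(src))) {
                broken.add(src);
            }
        }
        return broken;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java BrokenImageCheck <html-file>");
            return;
        }
        for (String src : findBrokenSources(Paths.get(args[0]))) {
            System.out.println("Broken image reference: \"" + src + "\"");
        }
    }
}
```

Running it against the output of the simple conversion above should list the image references that were never written to disk. URL-encoded src values (for example with %20) are not decoded in this sketch.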
Solution for DOCX Files Using XDocReport
For modern .docx files, XDocReport provides an ImageManager class that handles image extraction automatically. Here’s the proper approach:
    import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
    import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
    import fr.opensagres.poi.xwpf.converter.xhtml.base64.Base64EmbedImgManager;
    import fr.opensagres.poi.xwpf.converter.xhtml.core.FileImageManager;

    // First, create the images directory
    Path imagesDir = Paths.get("output/images");
    Files.createDirectories(imagesDir);

    // Load the document
    XWPFDocument document = new XWPFDocument(new FileInputStream("document.docx"));

    // Create options with ImageManager
    XHTMLOptions options = XHTMLOptions.create();
    options.setImageManager(new FileImageManager(new File("output"), "images"));

    // Convert
    XHTMLConverter.getInstance()
        .convert(document, new FileOutputStream("output/index.html"), options);
The FileImageManager constructor takes two parameters:
- The base output directory where your HTML file will be saved
- The subdirectory name for images (relative to the base directory)
After conversion, your output structure looks like this:
    output/
    ├── index.html
    └── images/
        ├── image1.png
        ├── image2.jpg
        └── image3.png
The generated HTML contains correct relative paths:
    <img src="images/image1.png" alt="embedded image">
Understanding the ImageManager Parameters
I initially confused the parameters and got broken paths. Let me clarify:
    // CORRECT: Base dir is where HTML is saved, images folder is relative
    XHTMLOptions options = XHTMLOptions.create();
    options.setImageManager(new FileImageManager(
        new File("output"),  // Base directory (where index.html is saved)
        "images"             // Subfolder name for images
    ));

    // WRONG: Using absolute paths can cause issues
    options.setImageManager(new FileImageManager(
        new File("/absolute/path/to/output"),
        "/absolute/path/to/images"  // This creates broken relative paths!
    ));
The key insight is that ImageManager generates relative paths. The src attribute in HTML will be images/filename.ext, which is relative to where your HTML file is saved.
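To make the "relative to the HTML file" part concrete, here is a small JDK-only sketch of how a relative src value is resolved against the folder that contains the HTML file. The class RelativeSrcDemo and the example paths are hypothetical and only illustrate the path arithmetic. Nothing here calls XDocReport.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo: resolves a relative src the way a browser does for a local file.
final class RelativeSrcDemo {

    /** Resolves src against the directory that contains htmlFile. */
    static Path resolveSrc(Path htmlFile, String src) {
        return htmlFile.resolveSibling(src).normalize();
    }

    public static void main(String[] args) {
        // Same src value, two different locations of index.html.
        Path original = Paths.get("output", "index.html");
        Path moved = Paths.get("site", "docs", "index.html");

        System.out.println(resolveSrc(original, "images/image1.png"));
        System.out.println(resolveSrc(moved, "images/image1.png"));
    }
}
```

The same src value follows the HTML file wherever it goes, which is why moving the whole output folder keeps the images working as long as the images subfolder moves with it.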
Alternative: Base64 Embedded Images
If you prefer self-contained HTML without external image files, use Base64EmbedImgManager:
    XHTMLOptions options = XHTMLOptions.create();
    options.setImageManager(new Base64EmbedImgManager());

    XHTMLConverter.getInstance()
        .convert(document, new FileOutputStream("output.html"), options);
This embeds images directly in HTML as base64 data URLs. The HTML file becomes larger but has no external dependencies.
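For readers who haven’t met data URLs before, this JDK-only sketch shows what such a value looks like and why the file grows. It illustrates the general data URL format and is not a copy of what Base64EmbedImgManager does internally. The class DataUrlDemo and its method names are made up for this example.

```java
import java.util.Base64;

// Hypothetical demo of the data URL format used for inline images.
final class DataUrlDemo {

    /** Builds a data URL such as data:image/png;base64,.... from raw bytes. */
    static String toDataUrl(String mimeType, byte[] content) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(content);
    }

    /** Length of padded base64 text: 4 characters for every started group of 3 bytes. */
    static int base64Length(int byteCount) {
        return 4 * ((byteCount + 2) / 3);
    }

    public static void main(String[] args) {
        // The 8-byte PNG file signature, used here only as sample input.
        byte[] pngSignature = {(byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};
        System.out.println(toDataUrl("image/png", pngSignature));

        // A 300 KB image needs roughly 400 KB of base64 text inside the HTML.
        System.out.println(base64Length(300 * 1024));
    }
}
```

Because base64 uses four characters per three bytes, embedded images add about a third on top of their original size.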
Solution for Legacy DOC Files
For older .doc files (not .docx), you need a different approach using HWPF and WordToHtmlConverter:
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.converter.WordToHtmlConverter;
    import org.w3c.dom.Document;

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import java.io.*;
    import java.nio.file.*;

    // Create images directory first
    Path imagesDir = Paths.get("output/images");
    Files.createDirectories(imagesDir);

    // Load the document
    HWPFDocument document = new HWPFDocument(new FileInputStream("document.doc"));

    // Create the converter
    WordToHtmlConverter converter = new WordToHtmlConverter(
        DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .newDocument());

    // Set custom PicturesManager to handle image extraction
    converter.setPicturesManager((content, pictureType, suggestedName, width, height) -> {
        // Build the image path relative to the HTML file
        String imageName = "images/" + suggestedName;

        // Write image to disk
        try {
            Path imagePath = Paths.get("output", imageName);
            Files.write(imagePath, content);
        } catch (IOException e) {
            throw new RuntimeException("Failed to write image: " + imageName, e);
        }

        // Return the relative path for the HTML src attribute
        return imageName;
    });

    // Convert
    converter.processDocument(document);

    // Save HTML output, closing the stream when done
    Document htmlDocument = converter.getDocument();
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    try (OutputStream out = new FileOutputStream("output/index.html")) {
        transformer.transform(new DOMSource(htmlDocument), new StreamResult(out));
    }
Understanding the PicturesManager Interface
The PicturesManager is an interface with a single method, so it can be implemented with a lambda:
    public interface PicturesManager {
        String savePicture(
            byte[] content,           // Raw image bytes
            PictureType pictureType,  // PNG, JPEG, etc. (org.apache.poi.hwpf.usermodel.PictureType enum)
            String suggestedName,     // Suggested filename
            float widthInches,        // Image width in inches
            float heightInches        // Image height in inches
        );
    }
Your implementation must:
- Save the image bytes to disk
- Return the path (relative to the HTML file) to use in the src attribute
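Those two requirements can be sketched without any POI types, which also makes the logic easy to test on its own. The class ImageSaver below is a hypothetical helper of mine, not a POI class. It assumes the file name passed in is already safe to use on disk.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes image bytes and returns the src value for the HTML.
final class ImageSaver {

    private final Path htmlDir;   // folder that will contain the HTML file
    private final Path imagesDir; // image folder below htmlDir

    ImageSaver(Path htmlDir, String imagesSubdir) {
        this.htmlDir = htmlDir;
        this.imagesDir = htmlDir.resolve(imagesSubdir);
    }

    /** Saves the bytes and returns the path relative to the HTML folder, using forward slashes. */
    String save(byte[] content, String fileName) {
        try {
            Files.createDirectories(imagesDir);
            Path target = imagesDir.resolve(fileName);
            Files.write(target, content);
            // HTML expects forward slashes, even when the code runs on Windows.
            return htmlDir.relativize(target).toString().replace('\\', '/');
        } catch (IOException e) {
            // Unchecked, so the method can be called from a lambda without a throws clause.
            throw new UncheckedIOException("Failed to write image: " + fileName, e);
        }
    }
}
```

With such a helper in place, the lambda passed to setPicturesManager could shrink to a single call like saver.save(content, suggestedName), where saver is an ImageSaver created for the output folder.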
Here’s a more robust implementation:
    converter.setPicturesManager((content, pictureType, suggestedName, width, height) -> {
        // Determine file extension based on picture type (an enum, so case labels are unqualified)
        String extension = switch (pictureType) {
            case PNG -> ".png";
            case JPEG -> ".jpg";
            case BMP -> ".bmp";
            case WMF -> ".wmf";
            default -> ".bin";
        };

        // Create a filesystem-safe filename; append the extension only if it is missing
        String baseName = suggestedName != null
            ? suggestedName.replaceAll("[^a-zA-Z0-9.-]", "_")
            : "image_" + System.currentTimeMillis();
        String fileName = baseName.endsWith(extension) ? baseName : baseName + extension;
        String relativePath = "images/" + fileName;

        try {
            // Ensure directory exists
            Path imagePath = Paths.get("output/images", fileName);
            Files.createDirectories(imagePath.getParent());

            // Write image
            Files.write(imagePath, content);
        } catch (IOException e) {
            // savePicture declares no checked exceptions, so wrap the failure
            throw new UncheckedIOException("Failed to write image: " + fileName, e);
        }

        return relativePath;
    });
Common Mistakes I Made
Mistake 1: Not Creating the Images Directory
    // WRONG: Directory doesn't exist yet
    options.setImageManager(new FileImageManager(new File("output"), "images"));
    // Conversion fails with "directory not found" error

    // CORRECT: Create directory first
    Files.createDirectories(Paths.get("output/images"));
    options.setImageManager(new FileImageManager(new File("output"), "images"));
The ImageManager doesn’t create directories automatically. You must create the target directory before conversion.
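One detail worth knowing is that Files.createDirectories is safe to call on every run: it creates missing parent folders and does nothing if the directory already exists. A minimal sketch, with a hypothetical helper class OutputDirs:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: prepares the output folder layout before a conversion.
final class OutputDirs {

    /** Creates outputDir and its images subfolder if needed and returns the images folder. */
    static Path prepare(Path outputDir, String imagesSubdir) throws IOException {
        Path imagesDir = outputDir.resolve(imagesSubdir);
        // Creates all missing parents; no exception if the directory already exists.
        Files.createDirectories(imagesDir);
        return imagesDir;
    }

    public static void main(String[] args) throws IOException {
        Path imagesDir = prepare(Paths.get("output"), "images");
        System.out.println("Images will be written to " + imagesDir.toAbsolutePath());
    }
}
```

Calling prepare before every conversion avoids having to check for the folder by hand.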
Mistake 2: Using Absolute Paths
    // WRONG: Returns absolute path, breaks when HTML is moved
    return "/Users/me/project/output/images/image1.png";

    // CORRECT: Returns relative path, works anywhere
    return "images/image1.png";
The returned path from PicturesManager goes directly into the HTML src attribute. Use relative paths for portability.
Mistake 3: Not Closing Streams
    // WRONG: Stream never closed
    FileOutputStream fos = new FileOutputStream(imagePath.toFile());
    fos.write(content);
    // fos.close() never called!

    // CORRECT: Use try-with-resources
    try (FileOutputStream fos = new FileOutputStream(imagePath.toFile())) {
        fos.write(content);
    }
Or better, use Files.write(), which opens, writes, and closes the file in one call:
    Files.write(imagePath, content);
Complete Working Example for DOCX
Here’s a complete, ready-to-use solution for docx files:
    import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
    import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
    import fr.opensagres.poi.xwpf.converter.xhtml.core.FileImageManager;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    import java.io.*;
    import java.nio.file.*;

    public class WordToHtmlConverter {

        public static void convert(Path docxPath, Path outputDir) throws IOException {
            // Create output directories
            Files.createDirectories(outputDir);
            Files.createDirectories(outputDir.resolve("images"));

            // Load document
            try (InputStream is = Files.newInputStream(docxPath);
                 XWPFDocument document = new XWPFDocument(is)) {

                // Configure HTML options
                XHTMLOptions options = XHTMLOptions.create();
                options.setImageManager(new FileImageManager(
                    outputDir.toFile(),
                    "images"
                ));

                // Convert
                Path htmlPath = outputDir.resolve("index.html");
                try (OutputStream os = Files.newOutputStream(htmlPath)) {
                    XHTMLConverter.getInstance().convert(document, os, options);
                }
            }
        }

        public static void main(String[] args) throws IOException {
            convert(
                Paths.get("input.docx"),
                Paths.get("output")
            );
        }
    }
Dependencies Required
For docx conversion with XDocReport:
    <dependencies>
        <!-- Apache POI for Word documents -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>5.2.5</version>
        </dependency>

        <!-- XDocReport for docx to HTML conversion -->
        <dependency>
            <groupId>fr.opensagres.xdocreport</groupId>
            <artifactId>fr.opensagres.poi.xwpf.converter.xhtml</artifactId>
            <version>2.0.4</version>
        </dependency>
    </dependencies>
For legacy doc conversion:
    <dependencies>
        <!-- Apache POI HWPF for .doc files -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>5.2.5</version>
        </dependency>

        <!-- For HTML output -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>5.2.5</version>
        </dependency>
    </dependencies>
When to Use Each Approach
Use XDocReport with ImageManager for:
- Modern .docx files
- Automatic image extraction
- Clean, maintainable code
Use PicturesManager with WordToHtmlConverter for:
- Legacy .doc files
- Full control over image processing
- Custom naming or transformation logic
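File extensions are not always trustworthy, so when a program has to pick between the two approaches, it can look at the first bytes of the file instead. A .docx file is a ZIP archive and starts with the bytes 50 4B 03 04, while a legacy .doc file is an OLE2 compound document and starts with D0 CF 11 E0 A1 B1 1A E1. The sketch below uses only the JDK. The class WordFormatSniffer and its enum are hypothetical names, and the check identifies the container only, so a spreadsheet in the same container format would match as well.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: guesses which conversion path a file needs from its first bytes.
final class WordFormatSniffer {

    enum Format { OOXML_ZIP, OLE2, UNKNOWN }

    /** Classifies the container format from the leading bytes of a file. */
    static Format detect(byte[] header) {
        if (startsWith(header, 0x50, 0x4B, 0x03, 0x04)) {
            return Format.OOXML_ZIP; // ZIP container, as used by .docx
        }
        if (startsWith(header, 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1)) {
            return Format.OLE2; // OLE2 compound document, as used by .doc
        }
        return Format.UNKNOWN;
    }

    /** Reads the first 8 bytes of the file and classifies them. */
    static Format detect(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return detect(in.readNBytes(8));
        }
    }

    // Compares the leading bytes (as unsigned values) with the given signature.
    private static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller could then route OOXML_ZIP files to the XDocReport converter and OLE2 files to WordToHtmlConverter, and reject everything else early.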
Final Words
The key takeaway: Apache POI doesn’t automatically extract images when converting to HTML. You must use ImageManager for docx files or implement PicturesManager for doc files. Always create the images directory before conversion, and use relative paths for portability.