How to Convert Legacy Word .doc Files to HTML Using Apache POI HWPF

Mar 26, 2026

Problem

When I tried to convert a legacy .doc file to HTML using the XWPF APIs I was familiar with, I got this error:

Exception in thread "main" java.lang.IllegalArgumentException: Your file is not a valid OOXML file
    at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:141)
    at com.example.DocConverter.convert(DocConverter.java:25)

I was confused. The file opened fine in Microsoft Word, so why was Apache POI rejecting it?

Environment

Java 17
Apache POI 5.5.1
Maven 3.8.1
Windows 10 / macOS 14

What happened?

I was building a document management system for a client who had thousands of archived Word documents. Some were created with Word 2010 (docx format), others with Word 2003 (doc format). My code worked perfectly for docx files:

public void convertDocxToHtml(String docxPath) throws Exception {
    try (InputStream in = Files.newInputStream(Paths.get(docxPath))) {
        XWPFDocument document = new XWPFDocument(in);
        // conversion logic...
    }
}

But when I pointed it at the older files, everything broke. I spent hours debugging before I realized the fundamental issue:

XWPF is for docx files. HWPF is for doc files.

These are completely different APIs because the underlying file formats are radically different:

.docx is an XML-based format (Office Open XML)
.doc is a binary format (OLE2 Compound Document)

So XWPF couldn’t read my binary doc files at all. I needed to use HWPF (Horrible Word Processor Format - yes, that’s really what it stands for).
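
Because the two formats differ at the byte level, you can tell them apart before choosing an API. Below is a minimal sketch using only the JDK. The class name WordFormatSniffer and its enum are my own illustration, not part of POI. It checks the well-known OLE2 signature and the ZIP local-file-header signature. Note that a ZIP signature only says "this is a ZIP container", which is how .docx is packaged; it does not prove the file is a Word document. POI also ships its own detection utility (org.apache.poi.poifs.filesystem.FileMagic), which is probably the better choice in production code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper: guesses the container format from the first bytes of a file. */
final class WordFormatSniffer {

    enum Format { OLE2_BINARY, ZIP_CONTAINER, UNKNOWN }

    // Signature that opens every OLE2 compound file (legacy .doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature of a ZIP archive (.docx is a ZIP of XML parts).
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Classifies an already-read header; short or empty input yields UNKNOWN. */
    static Format detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Format.OLE2_BINARY;
        if (startsWith(header, ZIP_MAGIC)) return Format.ZIP_CONTAINER;
        return Format.UNKNOWN;
    }

    /** Reads only the first eight bytes of the file and classifies them. */
    static Format detect(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return detect(in.readNBytes(OLE2_MAGIC.length));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(data, 0, prefix.length, prefix, 0, prefix.length);
    }
}
```

A check like this would have told me right away that my "broken" files were OLE2 and needed HWPF, regardless of what the file extension claimed.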

How to solve it?

First, I added the HWPF dependency. It’s not included in the core POI package, so you need poi-scratchpad:

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>5.5.1</version>
</dependency>

Then I wrote a new converter specifically for legacy doc files:

import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.WordToHtmlUtils;
import org.w3c.dom.Document;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public void convertDocToHtml(String docPath) throws Exception {
    Path input = Paths.get(docPath);
    String htmlFileName = input.getFileName().toString().replaceFirst("\\.[^.]+$", "") + ".html";
    Path output = input.resolveSibling(htmlFileName);
    Path imagesDir = input.resolveSibling("images");

    // Create images directory BEFORE processing
    Files.createDirectories(imagesDir);

    try (InputStream in = Files.newInputStream(input);
         OutputStream out = Files.newOutputStream(output)) {

        // Load the legacy doc file
        HWPFDocumentCore document = WordToHtmlUtils.loadDoc(in);

        // Create a DOM document for the HTML output
        Document htmlDocument = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .newDocument();

        // Create the converter
        WordToHtmlConverter converter = new WordToHtmlConverter(htmlDocument);

        // Handle embedded images
        converter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
            Path imageFile = imagesDir.resolve(suggestedName);
            try {
                Files.write(imageFile, content);
            } catch (IOException e) {
                throw new RuntimeException("Failed to write image: " + suggestedName, e);
            }
            return "images/" + suggestedName;
        });

        // Process the document
        converter.processDocument(document);

        // Write HTML with proper encoding
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.transform(
            new DOMSource(converter.getDocument()),
            new StreamResult(out)
        );
    }
}
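
The output path logic in the converter is easy to check on its own. Here is a small sketch that isolates it; OutputNames is a hypothetical helper name, and the regular expression is the same one used in the method above. It strips only the last extension, so a name with several dots keeps the earlier ones.

```java
import java.nio.file.Path;

/** Hypothetical helper isolating the file-name handling from the converter. */
final class OutputNames {

    /** Replaces the last extension (if any) with ".html". */
    static String htmlNameFor(String fileName) {
        return fileName.replaceFirst("\\.[^.]+$", "") + ".html";
    }

    /** Returns the HTML path that sits next to the given input file. */
    static Path htmlSiblingOf(Path input) {
        return input.resolveSibling(htmlNameFor(input.getFileName().toString()));
    }
}
```

One consequence worth knowing: report.doc and report.docx in the same folder both map to report.html, so a mixed archive may need a different naming scheme to avoid overwriting results.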

With the method wrapped in a com.example.LegacyDocConverter class that has a main method, I ran it:

mvn compile exec:java -Dexec.mainClass="com.example.LegacyDocConverter"

The conversion worked. The HTML file and the images folder are written next to the input file, which in my case gave:

output/
  legacy-document.html
  images/
    image001.png
    image002.jpg

The key differences between doc and docx conversion

After getting this working, I documented the differences:

Aspect	doc (HWPF)	docx (XWPF)
File format	Binary (OLE2)	XML (OOXML)
Loading	`WordToHtmlUtils.loadDoc()`	`new XWPFDocument()`
Converter	`WordToHtmlConverter`	`XHTMLConverter` (third-party XDocReport library, not part of POI)
Complexity	More complex	Simpler
Documentation	Limited	Better documented

Common mistakes I made

Mistake 1: Using XWPF for .doc files

// This will NEVER work for .doc files
XWPFDocument document = new XWPFDocument(inputStream);

The XWPF classes only understand OOXML format. For binary doc files, you must use HWPF.

Mistake 2: Forgetting to create the images directory

// Images won't save if directory doesn't exist
Path imagesDir = input.resolveSibling("images");
// Files.write will throw NoSuchFileException!

Always create the directory first:

Path imagesDir = input.resolveSibling("images");
Files.createDirectories(imagesDir);  // Create BEFORE writing images
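
The image-saving step can also be pulled into a small helper that creates the directory itself, so the order of calls no longer matters. This is a sketch under my own assumptions: ImageSaver is a hypothetical name, and stripping the suggested name down to its last path segment is an extra precaution I added, not something the converter requires.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: stores one embedded picture and returns its relative link. */
final class ImageSaver {

    static String save(Path imagesDir, String suggestedName, byte[] content) {
        // Keep only the last path segment so a name like "../x.png" stays inside imagesDir.
        String safeName = Path.of(suggestedName).getFileName().toString();
        try {
            // createDirectories is a no-op when the directory already exists.
            Files.createDirectories(imagesDir);
            Files.write(imagesDir.resolve(safeName), content);
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to write image: " + safeName, e);
        }
        // Relative link for the img src attribute, e.g. "images/image001.png".
        return imagesDir.getFileName() + "/" + safeName;
    }
}
```

Inside the pictures manager lambda, the body would then shrink to a single call such as ImageSaver.save(imagesDir, suggestedName, content).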

Mistake 3: Missing UTF-8 encoding

// Relies on the transformer's default encoding and output method
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(new DOMSource(doc), new StreamResult(out));

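The result then depends on the defaults of whichever transformer implementation is on the classpath, and non-ASCII characters can end up encoded differently from what the page declares. Set the encoding and output method explicitly: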

transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty(OutputKeys.METHOD, "html");
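
To see these settings in isolation, here is a self-contained sketch that serializes a tiny hand-built DOM to UTF-8 bytes with the JDK's default transformer. HtmlSerializer is a hypothetical name, and the sample document stands in for the one WordToHtmlConverter would produce. How individual characters are written (raw or as entities) can vary between transformer implementations, so I have not included expected output.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper: serializes a DOM tree as HTML with explicit output settings. */
final class HtmlSerializer {

    static byte[] toUtf8Html(Document dom) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Build a minimal document containing a non-ASCII character (e with acute accent).
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = dom.createElement("html");
        Element body = dom.createElement("body");
        Element paragraph = dom.createElement("p");
        paragraph.setTextContent("Caf\u00e9 menu from the archive");
        body.appendChild(paragraph);
        html.appendChild(body);
        dom.appendChild(html);

        // Decode with the same charset that was requested for the output.
        System.out.println(new String(toUtf8Html(dom), StandardCharsets.UTF_8));
    }
}
```

Writing to a byte stream rather than a Writer matters here: with a Writer, the Writer's own charset decides the bytes on disk, whatever the ENCODING property says.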

Why this matters

Many enterprises still have archives of older Word documents created before Office 2007. These documents need to be:

Displayed on web portals
Migrated to content management systems
Converted for archival purposes
Indexed for search engines

Understanding both HWPF and XWPF lets you build robust document processing pipelines that handle the full history of Word formats.
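
For a mixed archive, the first step of such a pipeline is usually deciding which files go to which API. The sketch below walks a folder and groups files by route using only the JDK. ArchivePlanner and its Route enum are hypothetical names, and routing by file extension is a simplification: an extension can be wrong, so checking the file content is more reliable when it matters.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: decides which POI module each archived file should go to. */
final class ArchivePlanner {

    enum Route { HWPF, XWPF, SKIP }

    /** Routes by extension, ignoring case; ".docx" is tested before ".doc". */
    static Route routeFor(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        if (name.endsWith(".docx")) return Route.XWPF;
        if (name.endsWith(".doc")) return Route.HWPF;
        return Route.SKIP;
    }

    /** Walks the archive recursively and groups regular files by route. */
    static Map<Route, List<Path>> plan(Path archiveRoot) throws IOException {
        try (Stream<Path> files = Files.walk(archiveRoot)) {
            return files
                .filter(Files::isRegularFile)
                .sorted()
                .collect(Collectors.groupingBy(
                    ArchivePlanner::routeFor, TreeMap::new, Collectors.toList()));
        }
    }
}
```

Routes that match no files are simply absent from the returned map, so callers should use getOrDefault rather than get when reading it.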

Summary

In this post, I showed how to convert legacy .doc files to HTML using Apache POI’s HWPF module. The key points are:

Use poi-scratchpad dependency for HWPF support
Load doc files with WordToHtmlUtils.loadDoc(), not new XWPFDocument()
Create the images directory before saving embedded images
Always set UTF-8 encoding on the Transformer output
Remember: HWPF is for .doc (binary), XWPF is for .docx (XML)

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 Apache POI HWPF Documentation
👨‍💻 Apache POI Scratchpad

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!