
How to Convert Legacy Word doc Files to HTML Using Apache POI HWPF

Problem

When I tried to convert a legacy .doc file to HTML using the XWPF APIs I was familiar with, I got this error:

Error message
Exception in thread "main" java.lang.IllegalArgumentException: Your file is not a valid OOXML file
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:141)
at com.example.DocConverter.convert(DocConverter.java:25)

I was confused. The file opened fine in Microsoft Word, so why was Apache POI rejecting it?

Environment

  • Java 17
  • Apache POI 5.5.1
  • Maven 3.8.1
  • Windows 10 / macOS 14

What happened?

I was building a document management system for a client who had thousands of archived Word documents. Some were created with Word 2010 (docx format), others with Word 2003 (doc format). My code worked perfectly for docx files:

DocxConverter.java
public void convertDocxToHtml(String docxPath) throws Exception {
    try (InputStream in = Files.newInputStream(Paths.get(docxPath))) {
        XWPFDocument document = new XWPFDocument(in);
        // conversion logic...
    }
}

But when I pointed it at the older files, everything broke. I spent hours debugging before I realized the fundamental issue:

XWPF is for docx files. HWPF is for doc files.

These are completely different APIs because the underlying file formats are radically different:

  • .docx is an XML-based format (Office Open XML)
  • .doc is a binary format (OLE2 Compound Document)

So XWPF couldn’t read my binary doc files at all. I needed to use HWPF (Horrible Word Processor Format - yes, that’s really what it stands for).
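Because the two formats differ at the byte level, a mixed archive can be sorted before any POI class is involved. The following is a minimal sketch that uses only the JDK: it reads the first eight bytes and compares them with the OLE2 and ZIP signatures. The class name `WordFormatSniffer` and its enum are hypothetical names chosen for illustration. Note that the check only identifies the container: other OLE2 files (such as .xls) and any ZIP archive would match too. POI also has its own `FileMagic` class in `org.apache.poi.poifs.filesystem` that can perform this kind of check.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Apache POI.
public class WordFormatSniffer {

    public enum WordFormat { DOC_BINARY, DOCX_ZIP, UNKNOWN }

    // OLE2 compound documents start with these eight bytes.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // OOXML files are ZIP archives, which start with "PK" followed by 0x03 0x04.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Classify a file by the leading bytes of its content.
    public static WordFormat sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return WordFormat.DOC_BINARY;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return WordFormat.DOCX_ZIP;
        }
        return WordFormat.UNKNOWN;
    }

    // Convenience overload that reads only the first eight bytes of the file.
    public static WordFormat sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sniff(in.readNBytes(8));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller could route `DOC_BINARY` files to the HWPF path and `DOCX_ZIP` files to the XWPF path, which avoids trusting the file extension.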

How to solve it?

First, I added the HWPF dependency. It’s not included in the core POI package, so you need poi-scratchpad:

pom.xml
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-scratchpad</artifactId>
<version>5.5.1</version>
</dependency>

Then I wrote a new converter specifically for legacy doc files:

LegacyDocConverter.java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.WordToHtmlUtils;
import org.w3c.dom.Document;

public class LegacyDocConverter {

    public void convertDocToHtml(String docPath) throws Exception {
        Path input = Paths.get(docPath);
        // Swap the last extension for ".html" and write next to the input file
        String htmlFileName = input.getFileName().toString().replaceFirst("\\.[^.]+$", "") + ".html";
        Path output = input.resolveSibling(htmlFileName);
        Path imagesDir = input.resolveSibling("images");

        // Create images directory BEFORE processing
        Files.createDirectories(imagesDir);

        try (InputStream in = Files.newInputStream(input);
             OutputStream out = Files.newOutputStream(output)) {
            // Load the legacy doc file
            HWPFDocumentCore document = WordToHtmlUtils.loadDoc(in);

            // Create a DOM document for the HTML output
            Document htmlDocument = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();

            // Create the converter
            WordToHtmlConverter converter = new WordToHtmlConverter(htmlDocument);

            // Handle embedded images: save each one and return the path used in the <img> tag
            converter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
                Path imageFile = imagesDir.resolve(suggestedName);
                try {
                    Files.write(imageFile, content);
                } catch (IOException e) {
                    throw new RuntimeException("Failed to write image: " + suggestedName, e);
                }
                return "images/" + suggestedName;
            });

            // Process the document
            converter.processDocument(document);

            // Write HTML with proper encoding
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.transform(
                    new DOMSource(converter.getDocument()),
                    new StreamResult(out)
            );
        }
    }
}
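The output path in the converter comes from a small regex that strips the last extension. Here is that step in isolation as a JDK-only sketch, so its behavior on names with several dots or no dot at all is easy to check. `OutputNames`, `htmlSibling`, and `imagesDirFor` are hypothetical names. The second method is my own variation, not what the converter above does: it gives each document its own image folder, because a shared `images` directory means two documents in the same folder could overwrite each other's pictures.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper that mirrors the naming logic used in the converter.
public class OutputNames {

    // Strip the last extension, if any ("report.v2.doc" becomes "report.v2").
    static String baseName(Path input) {
        return input.getFileName().toString().replaceFirst("\\.[^.]+$", "");
    }

    // Same rule as the converter: <base>.html next to the input file.
    public static Path htmlSibling(Path input) {
        return input.resolveSibling(baseName(input) + ".html");
    }

    // Assumption, not in the original converter: one image folder per document.
    public static Path imagesDirFor(Path input) {
        return input.resolveSibling(baseName(input) + "_images");
    }

    public static void main(String[] args) {
        Path input = Paths.get("archive", "report.v2.doc");
        System.out.println(htmlSibling(input));
        System.out.println(imagesDirFor(input));
    }
}
```

If you adopt the per-document folder, the string returned from the pictures manager has to use the same folder name so the image links in the HTML still resolve.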

Let me test it:

Terminal
mvn compile exec:java -Dexec.mainClass="com.example.LegacyDocConverter"

The conversion worked. I got:

Output structure
output/
  legacy-document.html
  images/
    image001.png
    image002.jpg

The key differences between doc and docx conversion

After getting this working, I documented the differences:

| Aspect | doc (HWPF) | docx (XWPF) |
|---|---|---|
| File format | Binary (OLE2) | XML (OOXML) |
| Loading | WordToHtmlUtils.loadDoc() | new XWPFDocument() |
| Converter | WordToHtmlConverter | XHTMLConverter (third-party, from XDocReport) |
| Complexity | More complex | Simpler |
| Documentation | Limited | Better documented |

Common mistakes I made

Mistake 1: Using XWPF for .doc files

WrongApproach.java
// This will NEVER work for .doc files
XWPFDocument document = new XWPFDocument(inputStream);

The XWPF classes only understand OOXML format. For binary doc files, you must use HWPF.

Mistake 2: Forgetting to create the images directory

BrokenImageHandling.java
// Images won't save if directory doesn't exist
Path imagesDir = input.resolveSibling("images");
// Files.write will throw NoSuchFileException!

Always create the directory first:

CorrectImageHandling.java
Path imagesDir = input.resolveSibling("images");
Files.createDirectories(imagesDir); // Create BEFORE writing images
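To see the failure and the fix side by side without a Word file, here is a small JDK-only sketch. The class and method names (`ImageDirDemo`, `writeFailsWithoutDir`, `writeWithDir`) are hypothetical, and the byte arrays stand in for real image content.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical demo class; it only exercises java.nio.file, not POI.
public class ImageDirDemo {

    // Returns true if writing into a missing "images" directory fails
    // with NoSuchFileException, which is the failure described above.
    public static boolean writeFailsWithoutDir(Path root) throws IOException {
        Path target = root.resolve("images").resolve("image001.png");
        try {
            Files.write(target, new byte[] {1, 2, 3});
            return false;
        } catch (NoSuchFileException e) {
            return true;
        }
    }

    // Creates the directory first, then writes the file.
    public static Path writeWithDir(Path root, String name, byte[] content) throws IOException {
        Path imagesDir = root.resolve("images");
        Files.createDirectories(imagesDir); // does nothing if the directory already exists
        return Files.write(imagesDir.resolve(name), content);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("hwpf-demo");
        System.out.println("fails without dir: " + writeFailsWithoutDir(root));
        Path written = writeWithDir(root, "image001.png", new byte[] {1, 2, 3});
        System.out.println("wrote " + Files.size(written) + " bytes to " + written);
    }
}
```

`Files.createDirectories` is safe to call on every conversion because it does not throw when the directory is already there, unlike `Files.createDirectory`.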

Mistake 3: Missing UTF-8 encoding

EncodingIssue.java
// Without explicit encoding, special characters break
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(new DOMSource(doc), new StreamResult(out));

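Without these properties, the output encoding and serialization method fall back to the transformer's defaults, which may not match what your pages expect. Always set them explicitly: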

CorrectEncoding.java
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty(OutputKeys.METHOD, "html");
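The two properties can be tried on a hand-built DOM, with no Word file involved. This is a minimal sketch using only the JDK's XML APIs. `Utf8HtmlWriter`, `serialize`, and `sampleDocument` are hypothetical names. It writes to an `OutputStream` rather than a `Writer`, so the encoding set on the transformer decides which bytes are produced.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper for experimenting with the Transformer output settings.
public class Utf8HtmlWriter {

    // Serialize a DOM tree as HTML, encoded as UTF-8 bytes.
    public static byte[] serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    // Build a tiny <html><body><p>text</p></body></html> document for testing.
    public static Document sampleDocument(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element p = doc.createElement("p");
        p.setTextContent(text);
        body.appendChild(p);
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // Non-ASCII sample text, written with Unicode escapes to keep the source ASCII-only.
        byte[] bytes = serialize(sampleDocument("Hello \u65E5\u672C\u8A9E"));
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

When reading the result back, decode it as UTF-8 explicitly, as `main` does. Decoding with the platform default charset would bring the original problem back on the reading side.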

Why this matters

Many enterprises still have archives of older Word documents created before Office 2007. These documents need to be:

  • Displayed on web portals
  • Migrated to content management systems
  • Converted for archival purposes
  • Indexed for search engines

Understanding both HWPF and XWPF lets you build robust document processing pipelines that handle the full history of Word formats.

Summary

In this post, I showed how to convert legacy .doc files to HTML using Apache POI’s HWPF module. The key points are:

  1. Use the poi-scratchpad dependency for HWPF support
  2. Load doc files with WordToHtmlUtils.loadDoc(), not new XWPFDocument()
  3. Create the images directory before saving embedded images
  4. Always set UTF-8 encoding on the Transformer output
  5. Remember: HWPF is for .doc (binary), XWPF is for .docx (XML)

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
