How to Convert Legacy Word doc Files to HTML Using Apache POI HWPF
Problem
When I tried to convert a legacy .doc file to HTML using the XWPF APIs I was familiar with, I got this error:
```
Exception in thread "main" java.lang.IllegalArgumentException: Your file is not a valid OOXML file
    at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:141)
    at com.example.DocConverter.convert(DocConverter.java:25)
```

I was confused. The file opened fine in Microsoft Word, so why was Apache POI rejecting it?
Environment
- Java 17
- Apache POI 5.5.1
- Maven 3.8.1
- Windows 10 / macOS 14
What happened?
I was building a document management system for a client who had thousands of archived Word documents. Some were created with Word 2010 (docx format), others with Word 2003 (doc format). My code worked perfectly for docx files:
```java
public void convertDocxToHtml(String docxPath) throws Exception {
    try (InputStream in = Files.newInputStream(Paths.get(docxPath))) {
        XWPFDocument document = new XWPFDocument(in);
        // conversion logic...
    }
}
```

But when I pointed it at the older files, everything broke. I spent hours debugging before I realized the fundamental issue:
XWPF is for docx files. HWPF is for doc files.
These are completely different APIs because the underlying file formats are radically different:
- .docx is an XML-based format (Office Open XML)
- .doc is a binary format (OLE2 Compound Document)
So XWPF couldn’t read my binary doc files at all. I needed to use HWPF (Horrible Word Processor Format - yes, that’s really what it stands for).
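Because the two formats differ from the very first byte, a file can be classified before any POI class touches it. The following is a minimal sketch using only the JDK, not something from my original converter: the class name WordFormatSniffer and its Format enum are hypothetical names chosen for illustration. It relies on two well-known signatures: OLE2 compound files start with the bytes D0 CF 11 E0 A1 B1 1A E1, and a .docx is a ZIP package, so it starts with the ZIP local file header "PK\3\4".

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper (not part of Apache POI): classifies a stream by its leading bytes.
class WordFormatSniffer {

    /** Container formats this sketch can tell apart. */
    enum Format { OLE2, ZIP, UNKNOWN }

    // First eight bytes of every OLE2 compound file (.doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature at the start of a non-empty ZIP archive ("PK\3\4").
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Reads up to eight bytes from the stream and compares them with the known signatures. */
    static Format sniff(InputStream in) throws IOException {
        byte[] header = in.readNBytes(OLE2_MAGIC.length);
        if (startsWith(header, OLE2_MAGIC)) {
            return Format.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Format.ZIP;
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(data, 0, prefix.length, prefix, 0, prefix.length);
    }
}
```

Note that sniff consumes the bytes it reads, so open a fresh stream (or use mark/reset on a buffered one) before handing the file to a converter. A ZIP result only means "possibly OOXML": any other ZIP archive looks the same at this level, and an OLE2 result could equally be an old .xls or .ppt. As far as I know, POI ships a comparable check of its own (FileMagic in org.apache.poi.poifs.filesystem), which is probably the better choice in production code.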
How to solve it?
First, I added the HWPF dependency. It’s not included in the core POI package, so you need poi-scratchpad:
```xml
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>5.5.1</version>
</dependency>
```

Then I wrote a new converter specifically for legacy doc files:
```java
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.WordToHtmlUtils;
import org.w3c.dom.Document;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public void convertDocToHtml(String docPath) throws Exception {
    Path input = Paths.get(docPath);
    String htmlFileName = input.getFileName().toString().replaceFirst("\\.[^.]+$", "") + ".html";
    Path output = input.resolveSibling(htmlFileName);
    Path imagesDir = input.resolveSibling("images");

    // Create images directory BEFORE processing
    Files.createDirectories(imagesDir);

    try (InputStream in = Files.newInputStream(input);
         OutputStream out = Files.newOutputStream(output)) {

        // Load the legacy doc file
        HWPFDocumentCore document = WordToHtmlUtils.loadDoc(in);

        // Create a DOM document for the HTML output
        Document htmlDocument = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();

        // Create the converter
        WordToHtmlConverter converter = new WordToHtmlConverter(htmlDocument);

        // Handle embedded images
        converter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
            Path imageFile = imagesDir.resolve(suggestedName);
            try {
                Files.write(imageFile, content);
            } catch (IOException e) {
                throw new RuntimeException("Failed to write image: " + suggestedName, e);
            }
            return "images/" + suggestedName;
        });

        // Process the document
        converter.processDocument(document);

        // Write HTML with proper encoding
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.transform(
                new DOMSource(converter.getDocument()),
                new StreamResult(out)
        );
    }
}
```

Let me test it:
```
mvn compile exec:java -Dexec.mainClass="com.example.LegacyDocConverter"
```

The conversion worked. I got:
```
output/
  legacy-document.html
  images/
    image001.png
    image002.jpg
```

The key differences between doc and docx conversion
After getting this working, I documented the differences:
| Aspect | doc (HWPF) | docx (XWPF) |
|---|---|---|
| File format | Binary (OLE2) | XML (OOXML) |
| Loading | WordToHtmlUtils.loadDoc() | new XWPFDocument() |
| Converter | WordToHtmlConverter | XHTMLConverter (third-party, from the XDocReport project) |
| Complexity | More complex | Simpler |
| Documentation | Limited | Better documented |
Common mistakes I made
Mistake 1: Using XWPF for .doc files
```java
// This will NEVER work for .doc files
XWPFDocument document = new XWPFDocument(inputStream);
```

The XWPF classes only understand OOXML format. For binary doc files, you must use HWPF.
Mistake 2: Forgetting to create the images directory
```java
// Images won't save if directory doesn't exist
Path imagesDir = input.resolveSibling("images");
// Files.write will throw NoSuchFileException!
```

Always create the directory first:
```java
Path imagesDir = input.resolveSibling("images");
Files.createDirectories(imagesDir); // Create BEFORE writing images
```

Mistake 3: Missing UTF-8 encoding
```java
// Without explicit encoding, special characters break
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(new DOMSource(doc), new StreamResult(out));
```

The output HTML might have encoding issues. Always set UTF-8:
```java
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty(OutputKeys.METHOD, "html");
```

Why this matters
Many enterprises still have archives of older Word documents created before Office 2007. These documents need to be:
- Displayed on web portals
- Migrated to content management systems
- Converted for archival purposes
- Indexed for search engines
Understanding both HWPF and XWPF lets you build robust document processing pipelines that handle the full history of Word formats.
Summary
In this post, I showed how to convert legacy .doc files to HTML using Apache POI’s HWPF module. The key points are:
- Use the poi-scratchpad dependency for HWPF support
- Load doc files with WordToHtmlUtils.loadDoc(), not new XWPFDocument()
- Create the images directory before saving embedded images
- Always set UTF-8 encoding on the Transformer output
- Remember: HWPF is for .doc (binary), XWPF is for .docx (XML)
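The point about creating the images directory first is easy to reproduce without POI at all. Here is a small JDK-only sketch, assuming nothing beyond java.nio.file: the class ImageDirDemo and the helper tryWrite are hypothetical names for illustration, and the "image" is just placeholder bytes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical demo class: shows why the target directory must exist before writing into it.
class ImageDirDemo {

    /** Writes content to dir/name; returns false if the write fails because the directory is missing. */
    static boolean tryWrite(Path dir, String name, byte[] content) throws IOException {
        try {
            Files.write(dir.resolve(name), content);
            return true;
        } catch (NoSuchFileException e) {
            // Files.write creates the file, but never its parent directories.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("hwpf-demo");
        Path imagesDir = base.resolve("images");
        byte[] placeholder = "not a real image".getBytes(StandardCharsets.UTF_8);

        // The images directory does not exist yet, so this write fails.
        System.out.println(tryWrite(imagesDir, "image001.png", placeholder)); // prints false

        // createDirectories is safe to call even if the directory already exists.
        Files.createDirectories(imagesDir);
        System.out.println(tryWrite(imagesDir, "image001.png", placeholder)); // prints true
    }
}
```

Inside a PicturesManager callback the same failure surfaces in the middle of processDocument, which is why the directory is created up front in the converter rather than lazily per image.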