doc vs docx: Understanding Word Document Format Differences
I inherited a legacy document processing system. The codebase had separate handlers for .doc and .docx files, and I needed to understand why. When I tried using the same API for both formats, I got cryptic errors. It turns out the two formats are fundamentally different.
The Problem
My application needed to extract text from Word documents uploaded by users. Some uploaded old .doc files, others uploaded newer .docx files. I initially assumed they were similar formats that could be handled the same way.
I was wrong.
```
org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x504B0304, expected 0xE11AB1A1E011CFD0
```
This error appeared when I tried to read a .docx file with the .doc API. The reverse also failed. I needed to understand the actual differences.
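The two hex values in that message are file signatures: 0x504B0304 is the ZIP local file header ("PK" followed by 03 04), and the expected value is the OLE2 signature, whose bytes on disk are D0 CF 11 E0 A1 B1 1A E1. As a rough illustration of what a signature check looks like, here is a minimal sketch using only the JDK. The class `MagicSniffer` and its `Kind` enum are hypothetical names of mine, not part of Apache POI.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.util.Arrays;

/** Hypothetical helper (not a POI class): classifies data by its leading bytes. */
final class MagicSniffer {
    enum Kind { OLE2, ZIP, UNKNOWN }

    // OLE2 compound files (.doc and other legacy Office formats) start with these 8 bytes.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header: "PK" followed by 0x03 0x04.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Classifies a header that has already been read into memory. */
    static Kind sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Kind.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Kind.ZIP;
        return Kind.UNKNOWN;
    }

    /** Peeks at the first 8 bytes, then rewinds so a parser can still read from the start. */
    static Kind sniff(BufferedInputStream in) throws IOException {
        in.mark(8);
        byte[] header = in.readNBytes(8);
        in.reset();
        return sniff(header);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(data, 0, prefix.length, prefix, 0, prefix.length);
    }
}
```

A `ZIP` result only says the file is a ZIP archive. It could just as well be an .xlsx or a .jar, so a real check would still need to look inside for `word/document.xml`.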
The Core Difference
The fundamental difference is the file structure:
| Aspect | .doc (Legacy) | .docx (Modern) |
|---|---|---|
| Format | Binary | Office Open XML |
| Standard | Proprietary | ECMA-376, ISO/IEC 29500 |
| Structure | Single binary file | ZIP of XML files |
| Interoperability | Limited | Excellent |
| Recovery | Difficult | Possible (edit XML) |
| Version Control | Binary blob | Text-based XML diffs |
| Size | Often larger | Usually smaller |

What this means:
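.doc files are proprietary binary files stored in an OLE2 container, which is why the error above complains about an OLE2 header signature. To read them you need Microsoft Word or a library that implements the binary format, such as Apache POI's HWPF.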
.docx files are ZIP archives containing XML files. You can rename a .docx to .zip, extract it, and read the content directly.
The .docx Structure
When I unzip a .docx file, I see this structure:
```
document.docx
├── [Content_Types].xml
├── _rels/
│   └── .rels
├── docProps/
│   ├── app.xml
│   └── core.xml
└── word/
    ├── document.xml   <- Main content here
    ├── styles.xml
    ├── numbering.xml
    └── media/
        └── image1.png
```
The word/document.xml file contains the actual text and formatting:
```xml
<w:document>
  <w:body>
    <w:p>
      <w:r>
        <w:t>Hello World</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>
```
This transparency enables:
- Version control diffs (text-based XML)
- Recovery from corruption (edit XML directly)
- Processing without Word installed (any XML parser)
- Easier debugging (read the source)
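To make the "any XML parser" point concrete, here is a minimal sketch that pulls paragraph text out of a .docx using only the JDK's ZIP and DOM classes, with no POI involved. `PlainDocxReader` is a hypothetical name of mine. The sketch assumes the main part lives at `word/document.xml` as in the tree above, and it ignores tables of contents, headers, footnotes, and nested text boxes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: reads paragraph text from a .docx using only the JDK. */
final class PlainDocxReader {
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns one string per w:p element found in word/document.xml. */
    static List<String> paragraphs(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
                if (e.getName().equals("word/document.xml")) {
                    // readAllBytes stops at the end of the current ZIP entry.
                    return parse(zip.readAllBytes());
                }
            }
        }
        throw new IOException("word/document.xml not found");
    }

    private static List<String> parse(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations: uploaded files are untrusted input.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);

        NodeList paragraphs = factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml))
            .getElementsByTagNameNS(W_NS, "p");

        List<String> out = new ArrayList<>();
        for (int i = 0; i < paragraphs.getLength(); i++) {
            // A paragraph's text is the concatenation of its w:t runs.
            NodeList texts = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W_NS, "t");
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < texts.getLength(); j++) {
                sb.append(texts.item(j).getTextContent());
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```

For production use I would still reach for a real library, since it handles the many parts and edge cases this sketch skips. The point is only that the format is open enough to read with standard tools.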
Different APIs Required
In Java with Apache POI, I need completely different APIs for each format:
```java
import org.apache.poi.hwpf.HWPFDocument;            // For .doc
import org.apache.poi.xwpf.usermodel.XWPFDocument;  // For .docx
```

```java
// WRONG: Cannot mix APIs
public String extractText(File file) throws IOException {
    // This fails if file is wrong format
    HWPFDocument doc = new HWPFDocument(new FileInputStream(file));
    return doc.getDocumentText();
}
```
The correct approach detects format first:
```java
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.poifs.filesystem.FileMagic;
import java.io.InputStream;
import java.io.IOException;

public class DocumentLoader {

    public String extractText(InputStream input) throws IOException {
        // Detect format before processing; the wrapper supports mark/reset
        // so the bytes read for detection are not lost
        InputStream buffered = FileMagic.prepareToCheckMagic(input);
        FileMagic magic = FileMagic.valueOf(buffered);

        return switch (magic) {
            case OLE2 -> extractFromDoc(buffered);    // .doc format
            case OOXML -> extractFromDocx(buffered);  // .docx format
            default -> throw new IllegalArgumentException(
                "Unsupported format: " + magic
            );
        };
    }

    private String extractFromDoc(InputStream input) throws IOException {
        // Both document classes are Closeable, so release them when done
        try (HWPFDocument doc = new HWPFDocument(input)) {
            return doc.getDocumentText();
        }
    }

    private String extractFromDocx(InputStream input) throws IOException {
        try (XWPFDocument doc = new XWPFDocument(input)) {
            StringBuilder text = new StringBuilder();
            doc.getParagraphs().forEach(p -> text.append(p.getText()).append("\n"));
            return text.toString();
        }
    }
}
```
Key API differences:
| Feature | HWPF (for .doc) | XWPF (for .docx) |
|---|---|---|
| Document class | HWPFDocument | XWPFDocument |
| Text extraction | getDocumentText() | Iterate paragraphs |
| Image handling | getPicturesTable() | getAllPictures() |
| Table access | getRange().getTable() | getTables() |
| Maintenance | Minimal (legacy) | Active development |
Why Microsoft Switched
Microsoft introduced .docx in Office 2007. The reasons were:
- Interoperability: XML is a standard format. Any application can parse it.
- File size: XML with ZIP compression typically produces smaller files than binary.
- Corruption recovery: If one XML part is damaged, the rest may be recoverable.
- Standards compliance: Office Open XML is an ISO standard (ISO/IEC 29500).
- Security: ZIP structure allows scanning individual XML files for threats.
```
sample.doc  (binary)  : 245 KB
sample.docx (XML+ZIP) : 189 KB
```
The .docx file is often 20-30% smaller due to ZIP compression.
Version Control Benefits
One practical benefit I discovered: .docx works better with Git.
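For a .doc file, Git can only report that something changed:

```
Binary files a/report.doc and b/report.doc differ
```

For a .docx file, it can show the changed XML:

```diff
diff --git a/word/document.xml b/word/document.xml
- <w:t>Old content</w:t>
+ <w:t>New content</w:t>
```
To enable this, I configure Git: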
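```
# Add to .gitattributes
*.docx diff=docx
```

```shell
# Configure diff driver. Git appends the file path to the textconv command,
# so wrap unzip in sh -c to print only the main document part.
git config diff.docx.textconv 'sh -c "unzip -p \"\$0\" word/document.xml"'
git config diff.docx.cachetextconv true
```
Now I can see actual content changes in commits.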
Common Mistakes
Mistake 1: Assuming the same API works for both
```java
// WRONG: Using XWPF for .doc files
XWPFDocument doc = new XWPFDocument(stream); // Fails for .doc
```
Mistake 2: Not detecting format before processing
```java
// WRONG: No format detection
public void process(String filename) {
    if (filename.endsWith(".doc")) {
        // What if someone renamed .docx to .doc?
        processDoc(filename);
    }
}
```
Always detect format by content, not extension:
```java
// RIGHT: Detect by content
FileMagic magic = FileMagic.valueOf(bufferedStream);
```
Mistake 3: Assuming .docx requires Microsoft Word
```shell
# Extract text without any Office installation
unzip -p document.docx word/document.xml | xmllint --format -
```
Mistake 4: Ignoring legacy documents in migrations
When migrating systems, I need to handle both:
```java
public void migrateDocuments(Path sourceDir) throws IOException {
    // Files.walk keeps directory handles open, so close the stream when done
    try (Stream<Path> paths = Files.walk(sourceDir)) {
        paths.filter(p -> p.toString().endsWith(".doc") || p.toString().endsWith(".docx"))
             .forEach(this::migrateDocument);
    }
}

private void migrateDocument(Path file) {
    try (InputStream in = Files.newInputStream(file)) {
        String text = documentLoader.extractText(in);
        // Store extracted text for search/indexing
        textIndexService.index(file.getFileName().toString(), text);
    } catch (IOException | IllegalArgumentException e) {
        // IllegalArgumentException: content is neither OLE2 nor OOXML
        log.error("Failed to migrate: {}", file, e);
    }
}
```
Recovery Scenarios
When a .docx file gets corrupted, recovery options exist:
```shell
# 1. Rename to .zip
cp damaged.docx damaged.zip

# 2. Try to extract
unzip damaged.zip -d recovered/

# 3. Find content in XML
grep -r "your search term" recovered/

# 4. Edit XML if needed
vim recovered/word/document.xml

# 5. Re-package
cd recovered && zip -r ../repaired.docx .
```
For .doc files, this approach doesn’t work. The binary format is opaque.
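For the .docx case, the "try to extract, then find the damaged part" steps can also be scripted. The following is a minimal sketch under my own assumptions: `DocxTriage` is a hypothetical helper, it only checks whether each XML part is well-formed (not whether Word would accept it), and it assumes the ZIP container itself is still readable.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: reports which XML parts of a .docx still parse. */
final class DocxTriage {

    /** Maps each .xml/.rels part name to true if it is well-formed XML. */
    static Map<String, Boolean> checkParts(InputStream docx) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations: damaged or uploaded files are untrusted.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);

        Map<String, Boolean> result = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
                String name = e.getName();
                // Skip media and other non-XML parts.
                if (!name.endsWith(".xml") && !name.endsWith(".rels")) continue;
                result.put(name, isWellFormed(factory, zip.readAllBytes()));
            }
        }
        return result;
    }

    private static boolean isWellFormed(DocumentBuilderFactory factory, byte[] xml) {
        try {
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml));
            return true;
        } catch (Exception ex) {
            return false;
        }
    }
}
```

Any part mapped to `false` is a candidate for the manual editing in step 4. If the archive itself is damaged, `ZipInputStream` will throw partway through and this sketch will not help.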
When to Use Each Format
I recommend .docx for:
- New documents
- Documents requiring version control
- Automated processing systems
- Archival storage
I keep .doc support only for:
- Legacy document libraries
- User-uploaded legacy files
- Systems that cannot upgrade
Summary
The difference between .doc and .docx is fundamental:
- .doc: Proprietary binary, limited interoperability, requires specific APIs (HWPF)
- .docx: Open XML standard, ZIP of XML files, better tooling (XWPF), easier recovery
When building document processing systems:
- Detect format by content, not file extension
- Use the correct API for each format (HWPF vs XWPF)
- Prefer .docx for new documents and archival
- Plan migration paths for legacy .doc files
Understanding this distinction prevents confusing errors and enables proper handling of both formats in production systems.