
doc vs docx: Understanding Word Document Format Differences

I inherited a legacy document processing system. The codebase had separate handlers for .doc and .docx files, and I needed to understand why. When I tried using the same API for both formats, I got cryptic errors. It turns out the two formats are fundamentally different.

The Problem

My application needed to extract text from Word documents uploaded by users. Some uploaded old .doc files, others uploaded newer .docx files. I initially assumed they were similar formats that could be handled the same way.

I was wrong.

Error when using wrong API
org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature;
read 0x504B0304, expected 0xE11AB1A1E011CFD0

This error appeared when I tried to read a .docx file with the .doc API. The reverse also failed. I needed to understand the actual differences.
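The two hex values in that message are file signatures, which suggests a way to tell the containers apart without any library. Here is a minimal sketch using only the JDK. `MagicSniffer` and its `Kind` enum are hypothetical names of mine, not part of Apache POI:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper (not a POI class): classifies a stream by its first bytes. */
final class MagicSniffer {
    enum Kind { OLE2, ZIP, UNKNOWN }

    // OLE2 compound file signature. Read as a little-endian long it is
    // 0xE11AB1A1E011CFD0, the "expected" value in the error message.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature "PK\003\004", i.e. 0x504B0304.
    private static final byte[] ZIP = { 0x50, 0x4B, 0x03, 0x04 };

    static Kind sniff(InputStream in) throws IOException {
        byte[] head = in.readNBytes(8); // may return fewer bytes for very short input
        if (startsWith(head, OLE2)) return Kind.OLE2;
        if (startsWith(head, ZIP)) return Kind.ZIP;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(data, 0, prefix.length, prefix, 0, prefix.length);
    }

    public static void main(String[] args) throws IOException {
        byte[] zipLike = { 0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x00, 0x00 };
        System.out.println(sniff(new ByteArrayInputStream(zipLike)));
    }
}
```

Note that a ZIP signature only identifies the container. The file could still be an .xlsx or an ordinary archive, so this check narrows things down rather than proving the file is a Word document.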

The Core Difference

The fundamental difference is the file structure:

Format comparison
+------------------+--------------------+-------------------------+
| Aspect           | .doc (Legacy)      | .docx (Modern)          |
+------------------+--------------------+-------------------------+
| Format           | Binary             | Office Open XML         |
| Standard         | Proprietary        | ECMA-376, ISO/IEC 29500 |
| Structure        | Single binary file | ZIP of XML files        |
| Interoperability | Limited            | Excellent               |
| Recovery         | Difficult          | Possible (edit XML)     |
| Version Control  | Binary blob        | Text-based XML diffs    |
| Size             | Often larger       | Usually smaller         |
+------------------+--------------------+-------------------------+

What this means:

.doc files are proprietary binary blobs. Microsoft has published the specification ([MS-DOC]), but the format is complex enough that in practice you need Microsoft Word or a dedicated library to read them.

.docx files are ZIP archives containing XML files. You can rename a .docx to .zip, extract it, and read the content directly.

The .docx Structure

When I unzip a .docx file, I see this structure:

docx internal structure
document.docx
├── [Content_Types].xml
├── _rels/
│   └── .rels
├── docProps/
│   ├── app.xml
│   └── core.xml
└── word/
    ├── document.xml   <- Main content here
    ├── styles.xml
    ├── numbering.xml
    └── media/
        └── image1.png
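The same listing can be produced in code with nothing but the JDK's ZIP support. This is a rough sketch, not a full package reader. `DocxParts` is a name I made up, and `sampleArchive()` builds a small stand-in archive with placeholder content so the example runs without a real .docx:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: lists the parts inside a .docx package. */
final class DocxParts {

    /** Returns the names of all parts in the archive, in archive order. */
    static List<String> listParts(InputStream docx) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }

    /** Stand-in for a real file: a tiny archive with placeholder parts. */
    static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : List.of("[Content_Types].xml", "_rels/.rels", "word/document.xml")) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<placeholder/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        listParts(new ByteArrayInputStream(sampleArchive())).forEach(System.out::println);
    }
}
```

For a real file, `listParts(Files.newInputStream(path))` should work the same way.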

The word/document.xml file contains the actual text and formatting:

document.xml (simplified)
<w:document>
  <w:body>
    <w:p>
      <w:r>
        <w:t>Hello World</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>

This transparency enables:

  • Version control diffs (text-based XML)
  • Recovery from corruption (edit XML directly)
  • Processing without Word installed (any XML parser)
  • Easier debugging (read the source)
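To illustrate the "any XML parser" point, here is a sketch that pulls text out of a document.xml stream with the JDK's built-in DOM parser. `DocumentXmlText` is a hypothetical name. The sketch assumes the standard WordprocessingML namespace for the `w:` prefix, and it ignores tabs, line breaks, and nested text boxes, so treat it as a starting point rather than a complete extractor:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: plain-text extraction from word/document.xml. */
final class DocumentXmlText {
    // Namespace bound to the w: prefix in WordprocessingML documents.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Concatenates the w:t runs of each w:p paragraph, one paragraph per line. */
    static String extract(InputStream documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Uploaded files are untrusted input, so refuse DOCTYPE declarations.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);

        NodeList paragraphs = factory.newDocumentBuilder()
            .parse(documentXml)
            .getElementsByTagNameNS(W_NS, "p");

        StringBuilder text = new StringBuilder();
        for (int i = 0; i < paragraphs.getLength(); i++) {
            NodeList runs = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < runs.getLength(); j++) {
                text.append(runs.item(j).getTextContent());
            }
            text.append('\n');
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
            + "<w:p><w:r><w:t>Hello World</w:t></w:r></w:p>"
            + "</w:body></w:document>";
        System.out.print(extract(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```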

Different APIs Required

In Java with Apache POI, I need completely different APIs for each format:

DocumentLoader.java
import org.apache.poi.hwpf.HWPFDocument;            // For .doc
import org.apache.poi.xwpf.usermodel.XWPFDocument;  // For .docx
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// WRONG: Cannot mix APIs
public String extractText(File file) throws IOException {
    // This fails if the file is actually a .docx
    HWPFDocument doc = new HWPFDocument(new FileInputStream(file));
    return doc.getDocumentText();
}

The correct approach detects format first:

DocumentLoader.java
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.poifs.filesystem.FileMagic;
import java.io.InputStream;
import java.io.IOException;

public class DocumentLoader {

    public String extractText(InputStream input) throws IOException {
        // Detect the container format before processing.
        // prepareToCheckMagic wraps the stream so the header bytes can be re-read.
        InputStream buffered = FileMagic.prepareToCheckMagic(input);
        FileMagic magic = FileMagic.valueOf(buffered);
        return switch (magic) {
            case OLE2 -> extractFromDoc(buffered);   // OLE2 container: .doc (also .xls, .ppt)
            case OOXML -> extractFromDocx(buffered); // ZIP container: .docx (also .xlsx, .pptx)
            default -> throw new IllegalArgumentException(
                "Unsupported format: " + magic
            );
        };
    }

    private String extractFromDoc(InputStream input) throws IOException {
        // Both document classes are Closeable, so release them when done
        try (HWPFDocument doc = new HWPFDocument(input)) {
            return doc.getDocumentText();
        }
    }

    private String extractFromDocx(InputStream input) throws IOException {
        try (XWPFDocument doc = new XWPFDocument(input)) {
            StringBuilder text = new StringBuilder();
            doc.getParagraphs().forEach(p -> text.append(p.getText()).append("\n"));
            return text.toString();
        }
    }
}

Key API differences:

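+-----------------+-----------------------+--------------------+
| Feature         | HWPF (for .doc)       | XWPF (for .docx)   |
+-----------------+-----------------------+--------------------+
| Document class  | HWPFDocument          | XWPFDocument       |
| Text extraction | getDocumentText()     | Iterate paragraphs |
| Image handling  | getPicturesTable()    | getAllPictures()   |
| Table access    | getRange().getTable() | getTables()        |
| Maintenance     | Minimal (legacy)      | Active development |
+-----------------+-----------------------+--------------------+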

Why Microsoft Switched

Microsoft introduced .docx in Office 2007. The reasons were:

  1. Interoperability: XML is a standard format. Any application can parse it.

  2. File size: XML with ZIP compression typically produces smaller files than binary.

  3. Corruption recovery: If one XML part is damaged, the rest may be recoverable.

  4. Standards compliance: Office Open XML is an ISO standard (ISO/IEC 29500).

  5. Security: Each part of the ZIP package can be scanned individually, and a .docx cannot carry VBA macros (macro-enabled documents use the separate .docm extension).

File size comparison (typical)
sample.doc (binary) : 245 KB
sample.docx (XML+ZIP) : 189 KB

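In this sample the .docx file is about 23% smaller, thanks to ZIP compression. The savings vary with content: text-heavy documents shrink the most, while image-heavy ones gain less because the embedded images are already compressed.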
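To see where the savings come from, here is a small sketch that zips a repetitive chunk of WordprocessingML and compares sizes. The class name `XmlCompressionDemo` and the sample markup are my own. Real documents will compress differently, so the numbers it prints are only illustrative:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical demo: how much does ZIP shrink repetitive XML? */
final class XmlCompressionDemo {

    /** Size in bytes of a ZIP archive holding the given content as one entry. */
    static int zippedSize(byte[] content) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(content);
            zip.closeEntry();
        }
        return out.size();
    }

    /** Percentage saved going from before to after. */
    static double percentSmaller(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) throws IOException {
        // XML markup repeats the same tags over and over, which compresses well.
        byte[] raw = "<w:p><w:r><w:t>Hello World</w:t></w:r></w:p>"
            .repeat(500)
            .getBytes(StandardCharsets.UTF_8);
        int zipped = zippedSize(raw);
        System.out.printf("raw=%d bytes, zipped=%d bytes, %.1f%% smaller%n",
            raw.length, zipped, percentSmaller(raw.length, zipped));

        // The sample figures above: 245 KB down to 189 KB.
        System.out.printf("sample files: %.1f%% smaller%n", percentSmaller(245, 189));
    }
}
```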

Version Control Benefits

One practical benefit I discovered: .docx works better with Git.

git diff with .doc (binary)
Binary files a/report.doc and b/report.doc differ

git diff with .docx (text diff in XML)
diff --git a/report.docx b/report.docx
- <w:t>Old content</w:t>
+ <w:t>New content</w:t>

To enable this, I configure Git:

Git configuration
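# Add to .gitattributes
*.docx diff=docx
# Configure diff driver. Git appends the file path as the last argument,
# and unzip cannot read an archive from stdin, so pass the path explicitly.
git config diff.docx.textconv 'sh -c "unzip -p \"\$0\" word/document.xml"'
git config diff.docx.cachetextconv true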

Now I can see actual content changes in commits.

Common Mistakes

Mistake 1: Assuming the same API works for both

Wrong.java
// WRONG: Using XWPF for .doc files
XWPFDocument doc = new XWPFDocument(stream); // Fails for .doc

Mistake 2: Not detecting format before processing

Wrong.java
// WRONG: No format detection
public void process(String filename) {
    if (filename.endsWith(".doc")) {
        // What if someone renamed .docx to .doc?
        processDoc(filename);
    }
}

Always detect format by content, not extension:

Right.java
// RIGHT: Detect by content
FileMagic magic = FileMagic.valueOf(bufferedStream);
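One caveat: the signature check identifies the container, not the application. An .xlsx or a plain ZIP starts with the same bytes as a .docx. As a follow-up check, I sometimes look for the main document part inside the archive. This is a heuristic sketch using only the JDK: `DocxCheck` is a hypothetical name, and it assumes the main part lives at word/document.xml, which is what Word writes but is not strictly guaranteed by the packaging rules:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: does this ZIP stream look like a Word document? */
final class DocxCheck {

    /** Heuristic: true if the archive contains a word/document.xml part. */
    static boolean looksLikeDocx(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (e.getName().equals("word/document.xml")) {
                    return true;
                }
            }
        }
        return false; // not a ZIP at all, or a ZIP without the Word main part
    }

    /** Demo helper: an in-memory ZIP with empty entries of the given names. */
    static byte[] zipOf(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] wordLike = zipOf("[Content_Types].xml", "word/document.xml");
        byte[] sheetLike = zipOf("[Content_Types].xml", "xl/workbook.xml");
        System.out.println(looksLikeDocx(new ByteArrayInputStream(wordLike)));
        System.out.println(looksLikeDocx(new ByteArrayInputStream(sheetLike)));
    }
}
```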

Mistake 3: Assuming .docx requires Microsoft Word

No Word needed
# Inspect the document XML without any Office installation
unzip -p document.docx word/document.xml | xmllint --format -

Mistake 4: Ignoring legacy documents in migrations

When migrating systems, I need to handle both:

MigrationHandler.java
public void migrateDocuments(Path sourceDir) throws IOException {
    // Files.walk keeps directory handles open, so close the stream when done
    try (Stream<Path> paths = Files.walk(sourceDir)) {
        paths.filter(p -> p.toString().endsWith(".doc") ||
                          p.toString().endsWith(".docx"))
             .forEach(this::migrateDocument);
    }
}

private void migrateDocument(Path file) {
    try (InputStream in = Files.newInputStream(file)) {
        String text = documentLoader.extractText(in);
        // Store extracted text for search/indexing
        textIndexService.index(file.getFileName().toString(), text);
    } catch (IOException | IllegalArgumentException e) {
        // IllegalArgumentException: content is neither OLE2 nor OOXML
        log.error("Failed to migrate: {}", file, e);
    }
}

Recovery Scenarios

When a .docx file gets corrupted, recovery options exist:

Recovery attempt
# 1. Rename to .zip
cp damaged.docx damaged.zip
# 2. Try to extract
unzip damaged.zip -d recovered/
# 3. Find content in XML
grep -r "your search term" recovered/
# 4. Edit XML if needed
vim recovered/word/document.xml
# 5. Re-package
cd recovered && zip -r ../repaired.docx .

For .doc files, this approach doesn’t work. The binary format is opaque.
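The .docx salvage steps can also be scripted. The sketch below relies on the fact that `java.util.zip.ZipInputStream` reads entries sequentially from their local headers instead of the central directory at the end of the file, so it can often read the leading parts of a truncated archive. `DocxSalvage` and `truncatedSample()` are names I made up for the demo. It keeps only parts that were read completely and makes no attempt to repair anything:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: recovers whatever parts of a damaged .docx still read cleanly. */
final class DocxSalvage {

    /** Reads parts in archive order and keeps every part that was read completely. */
    static Map<String, byte[]> salvage(InputStream damaged) {
        Map<String, byte[]> parts = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(damaged)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                parts.put(e.getName(), zip.readAllBytes());
            }
        } catch (IOException stop) {
            // Truncated or corrupt data: stop here and keep what was already recovered.
        }
        return parts;
    }

    /** Demo helper: a two-part archive with its last 2000 bytes cut off. */
    static byte[] truncatedSample() throws IOException {
        byte[] image = new byte[4096];
        new Random(42).nextBytes(image); // random bytes barely compress, so this part stays large
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<w:document/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/media/image1.png"));
            zip.write(image);
            zip.closeEntry();
        }
        byte[] full = bytes.toByteArray();
        return Arrays.copyOf(full, full.length - 2000);
    }

    public static void main(String[] args) throws IOException {
        salvage(new ByteArrayInputStream(truncatedSample()))
            .forEach((name, data) -> System.out.println(name + ": " + data.length + " bytes"));
    }
}
```

In the demo, the cut lands inside the image part, so the text part before it is still recoverable. Damage near the start of the file would leave much less to salvage.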

When to Use Each Format

I recommend .docx for:

  • New documents
  • Documents requiring version control
  • Automated processing systems
  • Archival storage

I keep .doc support only for:

  • Legacy document libraries
  • User-uploaded legacy files
  • Systems that cannot upgrade

Summary

The difference between .doc and .docx is fundamental:

  • .doc: Proprietary binary, limited interoperability, requires specific APIs (HWPF)
  • .docx: Open XML standard, ZIP of XML files, better tooling (XWPF), easier recovery

When building document processing systems:

  1. Detect format by content, not file extension
  2. Use the correct API for each format (HWPF vs XWPF)
  3. Prefer .docx for new documents and archival
  4. Plan migration paths for legacy .doc files

Understanding this distinction prevents confusing errors and enables proper handling of both formats in production systems.
