doc vs docx: Understanding Word Document Format Differences

Mar 26, 2026

I inherited a legacy document processing system. The codebase had separate handlers for .doc and .docx files, and I needed to understand why. When I tried using the same API for both formats, I got cryptic errors. It turns out the two formats are fundamentally different.

The Problem

My application needed to extract text from Word documents uploaded by users. Some uploaded old .doc files, others uploaded newer .docx files. I initially assumed they were similar formats that could be handled the same way.

I was wrong.

org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature;
read 0x504B0304, expected 0xE11AB1A1E011CFD0

This error appeared when I tried to read a .docx file with the .doc API. The reverse also failed. I needed to understand the actual differences.
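
The two numbers in that message are file signatures. 0x504B0304 is the ZIP local file header ("PK" followed by 0x03 0x04), and the expected value is the OLE2 signature D0 CF 11 E0 A1 B1 1A E1 printed as a little-endian 64-bit number. Here is a minimal sketch of the same check using only the JDK. MagicSniffer and its Kind enum are hypothetical names of mine, not part of Apache POI:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper (not part of Apache POI): classifies a stream by its leading bytes. */
final class MagicSniffer {

    enum Kind { OLE2, ZIP_BASED, UNKNOWN }

    // OLE2 compound file signature in file order. Read as a little-endian
    // 64-bit value, this is the 0xE11AB1A1E011CFD0 from the error message.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature: "PK" followed by 0x03 0x04.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Consumes up to 8 bytes from the stream and classifies them. */
    static Kind sniff(InputStream in) throws IOException {
        byte[] head = in.readNBytes(8); // shorter for inputs under 8 bytes
        if (startsWith(head, OLE2_MAGIC)) return Kind.OLE2;
        if (startsWith(head, ZIP_MAGIC)) return Kind.ZIP_BASED;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(data, 0, prefix.length, prefix, 0, prefix.length);
    }
}
```

ZIP_BASED only says the file is a ZIP archive. It could just as well be an .xlsx or a .jar, so this is a first filter rather than proof that the upload is a Word document.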

The Core Difference

The fundamental difference is the file structure:

+------------------+-------------------+------------------------+
| Aspect           | .doc (Legacy)     | .docx (Modern)         |
+------------------+-------------------+------------------------+
| Format           | Binary            | Office Open XML        |
| Standard         | Proprietary       | ECMA-376, ISO/IEC 29500|
| Structure        | Single binary file| ZIP of XML files       |
| Interoperability | Limited           | Excellent              |
| Recovery         | Difficult         | Possible (edit XML)    |
| Version Control  | Binary blob       | Text-based XML diffs   |
| Size             | Often larger      | Usually smaller        |
+------------------+-------------------+------------------------+

What this means:

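.doc files are proprietary binary files stored in an OLE2 compound-file container. Microsoft has published the format specification ([MS-DOC]), but in practice you still need Microsoft Word or a dedicated library such as Apache POI's HWPF to read them.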

.docx files are ZIP archives containing XML files. You can rename a .docx to .zip, extract it, and read the content directly.
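
The same inspection works from code. As a rough sketch, the JDK's own ZIP classes are enough to list the parts inside a .docx. DocxPartLister is a hypothetical name I made up for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: lists the part names stored inside a .docx package. */
final class DocxPartLister {

    static List<String> listParts(InputStream docx) throws IOException {
        List<String> names = new ArrayList<>();
        // A .docx is an ordinary ZIP archive, so ZipInputStream can walk it
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }
}
```

Calling listParts on a stream opened with Files.newInputStream returns entry names such as word/document.xml, in the order they are stored in the archive.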

The .docx Structure

When I unzip a .docx file, I see this structure:

document.docx
├── [Content_Types].xml
├── _rels/
│   └── .rels
├── docProps/
│   ├── app.xml
│   └── core.xml
└── word/
    ├── document.xml      <- Main content here
    ├── styles.xml
    ├── numbering.xml
    └── media/
        └── image1.png

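The word/document.xml file contains the actual text and formatting. This is a simplified view: the real file declares the w: namespace on the root element and carries many more attributes.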

<w:document>
  <w:body>
    <w:p>
      <w:r>
        <w:t>Hello World</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>

This transparency enables:

Version control diffs (text-based XML)
Recovery from corruption (edit XML directly)
Processing without Word installed (any XML parser)
Easier debugging (read the source)
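
To illustrate the "any XML parser" point, here is a minimal sketch that pulls paragraph text out of word/document.xml with only the JDK's ZIP and DOM classes. PlainDocxText is a hypothetical name. The sketch assumes a well-formed package and would count text twice for paragraphs nested inside text boxes, so treat it as a demonstration rather than a replacement for XWPF:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical helper: reads paragraph text straight from word/document.xml. */
final class PlainDocxText {

    // Namespace bound to the w: prefix in WordprocessingML
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static String extract(InputStream docx)
            throws IOException, ParserConfigurationException, SAXException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    return textOf(zip);
                }
            }
        }
        throw new IOException("word/document.xml not found");
    }

    private static String textOf(InputStream xml)
            throws IOException, ParserConfigurationException, SAXException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Uploaded files are untrusted input: refuse DOCTYPE declarations
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document dom = factory.newDocumentBuilder().parse(xml);

        StringBuilder text = new StringBuilder();
        NodeList paragraphs = dom.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            // Concatenate every w:t run inside this w:p, one line per paragraph
            NodeList runs = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < runs.getLength(); j++) {
                text.append(runs.item(j).getTextContent());
            }
            text.append('\n');
        }
        return text.toString();
    }
}
```

There is no equivalent shortcut for .doc, because its text lives in binary structures rather than in markup.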

Different APIs Required

In Java with Apache POI, I need completely different APIs for each format:

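import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.hwpf.HWPFDocument;  // For .doc
import org.apache.poi.xwpf.usermodel.XWPFDocument;  // For .docx

// WRONG: assumes every upload is a .doc
public String extractText(File file) throws IOException {
    // HWPFDocument throws if the file is actually a .docx
    try (FileInputStream in = new FileInputStream(file);
         HWPFDocument doc = new HWPFDocument(in)) {
        return doc.getDocumentText();
    }
}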

The correct approach detects format first:

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.poifs.filesystem.FileMagic;
import java.io.InputStream;
import java.io.IOException;

public class DocumentLoader {

    public String extractText(InputStream input) throws IOException {
        // Detect format before processing
        InputStream buffered = FileMagic.prepareToCheckMagic(input);
        FileMagic magic = FileMagic.valueOf(buffered);

        return switch (magic) {
            case OLE2 -> extractFromDoc(buffered);   // OLE2 container: .doc (also .xls, .ppt)
            case OOXML -> extractFromDocx(buffered); // OOXML package: .docx (also .xlsx, .pptx)
            default -> throw new IllegalArgumentException(
                "Unsupported format: " + magic
            );
        };
    }

    private String extractFromDoc(InputStream input) throws IOException {
        // try-with-resources releases the document's underlying resources
        try (HWPFDocument doc = new HWPFDocument(input)) {
            return doc.getDocumentText();
        }
    }

    private String extractFromDocx(InputStream input) throws IOException {
        try (XWPFDocument doc = new XWPFDocument(input)) {
            StringBuilder text = new StringBuilder();
            // Body paragraphs only; text inside tables is not included
            doc.getParagraphs().forEach(p -> text.append(p.getText()).append("\n"));
            return text.toString();
        }
    }
}

Key API differences:

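+-----------------+-----------------------+--------------------+
| Feature         | HWPF (for .doc)       | XWPF (for .docx)   |
+-----------------+-----------------------+--------------------+
| Document class  | HWPFDocument          | XWPFDocument       |
| Text extraction | getDocumentText()     | Iterate paragraphs |
| Image handling  | getPicturesTable()    | getAllPictures()   |
| Table access    | getRange().getTable() | getTables()        |
| Maintenance     | Minimal (legacy)      | Active development |
+-----------------+-----------------------+--------------------+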

Why Microsoft Switched

Microsoft introduced .docx in Office 2007. The reasons were:

Interoperability: XML is a standard format. Any application can parse it.
File size: XML with ZIP compression typically produces smaller files than binary.
Corruption recovery: If one XML part is damaged, the rest may be recoverable.
Standards compliance: Office Open XML is an ISO standard (ISO/IEC 29500).
Security: ZIP structure allows scanning individual XML files for threats, and a .docx cannot carry VBA macros (macro-enabled documents use the separate .docm extension).

sample.doc  (binary)   : 245 KB
sample.docx (XML+ZIP)  : 189 KB

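In this sample the .docx is about 23% smaller, and savings of 20-30% are common. The result depends on content: text-heavy documents compress well, while documents dominated by already-compressed images (JPEG, PNG) shrink far less.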
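
For the sample above, the saving works out to (245 - 189) / 245, or roughly 23%. To see how much of a given .docx is down to ZIP compression, the stored and uncompressed sizes of its parts can be compared. This is a small sketch using the JDK's ZipFile, and DocxSizeReport is a hypothetical name:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper: sums the raw and compressed sizes of all parts in a .docx. */
final class DocxSizeReport {

    /** Returns {uncompressedBytes, compressedBytes} summed over all entries. */
    static long[] totals(Path docx) throws IOException {
        long raw = 0;
        long packed = 0;
        // ZipFile reads sizes from the central directory, so no data is inflated
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                raw += entry.getSize();
                packed += entry.getCompressedSize();
            }
        }
        return new long[] { raw, packed };
    }

    /** Percentage saved when going from 'before' to 'after' (same unit for both). */
    static double savedPercent(long before, long after) {
        return 100.0 * (before - after) / before;
    }
}
```

Note that this compares the XML parts against their own compressed form. It says nothing about how large the same document would be as a .doc, which can only be measured by saving it in both formats.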

Version Control Benefits

One practical benefit I discovered: .docx works better with Git.

# git diff on a .doc
Binary files a/report.doc and b/report.doc differ

# git diff on a .docx, once a text conversion driver is configured
diff --git a/report.docx b/report.docx
- <w:t>Old content</w:t>
+ <w:t>New content</w:t>

To enable this, I configure Git:

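# Add to .gitattributes
*.docx diff=docx

# Configure the diff driver. Git appends the file path to the command,
# so a small sh wrapper hands it to unzip as $0. Pretty-printing with
# xmllint puts each element on its own line, which keeps diffs readable.
git config diff.docx.textconv "sh -c 'unzip -p \"\$0\" word/document.xml | xmllint --format -'"
git config diff.docx.cachetextconv true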

Now I can see actual content changes in commits.

Common Mistakes

Mistake 1: Assuming the same API works for both

// WRONG: Using XWPF for .doc files
XWPFDocument doc = new XWPFDocument(stream); // Fails for .doc

Mistake 2: Not detecting format before processing

// WRONG: No format detection
public void process(String filename) {
    if (filename.endsWith(".doc")) {
        // What if someone renamed .docx to .doc?
        processDoc(filename);
    }
}

Always detect format by content, not extension:

// RIGHT: Detect by content
FileMagic magic = FileMagic.valueOf(bufferedStream);

Mistake 3: Assuming .docx requires Microsoft Word

# Inspect the document XML (text plus markup) without any Office installation
unzip -p document.docx word/document.xml | xmllint --format -

Mistake 4: Ignoring legacy documents in migrations

When migrating systems, I need to handle both:

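// Needs: java.io.InputStream, java.util.Locale, java.util.stream.Stream

public void migrateDocuments(Path sourceDir) throws IOException {
    // Files.walk keeps directory handles open, so close the stream when done
    try (Stream<Path> paths = Files.walk(sourceDir)) {
        paths.filter(Files::isRegularFile)
             .filter(p -> {
                 // The extension only selects candidates (case-insensitively)
                 String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
                 return name.endsWith(".doc") || name.endsWith(".docx");
             })
             .forEach(this::migrateDocument);
    }
}

private void migrateDocument(Path file) {
    // extractText still detects the real format from the content
    try (InputStream in = Files.newInputStream(file)) {
        String text = documentLoader.extractText(in);
        // Store extracted text for search/indexing
        textIndexService.index(file.getFileName().toString(), text);
    } catch (IOException | RuntimeException e) {
        // One unreadable or unsupported file should not abort the whole run
        log.error("Failed to migrate: {}", file, e);
    }
}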

Recovery Scenarios

When a .docx file gets corrupted, recovery options exist:

# 1. Copy to .zip (keeps the original untouched)
cp damaged.docx damaged.zip

# 2. Try to extract
unzip damaged.zip -d recovered/

# 3. Find content in XML
grep -r "your search term" recovered/

# 4. Edit XML if needed
vim recovered/word/document.xml

# 5. Re-package
cd recovered && zip -r ../repaired.docx .

For .doc files, this approach doesn’t work. The binary format is opaque.
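
The .docx steps can also be scripted. The following is only a sketch under a strong assumption: the archive's central directory is still intact, because java.util.zip.ZipFile cannot open the file otherwise. It also does not verify checksums, so a part that inflates without error may still contain damaged bytes. DocxSalvage is a hypothetical name:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper: collects every part of a .docx that can still be read. */
final class DocxSalvage {

    /** Maps part name to content for each entry that decompresses without error. */
    static Map<String, byte[]> readableParts(Path docx) throws IOException {
        Map<String, byte[]> parts = new LinkedHashMap<>();
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    parts.put(entry.getName(), in.readAllBytes());
                } catch (IOException damaged) {
                    // Skip the broken part and keep salvaging the rest
                    System.err.println("Unreadable part: " + entry.getName());
                }
            }
        }
        return parts;
    }
}
```

If word/document.xml is among the returned parts, the text is usually recoverable even when images or styles are lost.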

When to Use Each Format

I recommend .docx for:

New documents
Documents requiring version control
Automated processing systems
Archival storage

I keep .doc support only for:

Legacy document libraries
User-uploaded legacy files
Systems that cannot upgrade

Summary

The difference between .doc and .docx is fundamental:

.doc: Proprietary binary, limited interoperability, requires specific APIs (HWPF)
.docx: Open XML standard, ZIP of XML files, better tooling (XWPF), easier recovery

When building document processing systems:

Detect format by content, not file extension
Use the correct API for each format (HWPF vs XWPF)
Prefer .docx for new documents and archival
Plan migration paths for legacy .doc files

Understanding this distinction prevents confusing errors and enables proper handling of both formats in production systems.
