How to Handle Large Word Documents When Converting to HTML in Java

The Problem

I was building a document conversion service that converts Word documents (.docx) to HTML. Everything worked fine with small files, but when I tried to convert a 50MB Word document, I got this:

Error Output
java.lang.OutOfMemoryError: Java heap space
at org.apache.xmlbeans.impl.store.Saver$TextSaver.emit(Saver.java:2045)
at org.apache.xmlbeans.impl.store.Saver$TextSaver.preContent(Saver.java:1038)
at org.apache.xmlbeans.impl.store.Saver.process(Saver.java:456)
at org.apache.poi.xwpf.usermodel.XWPFDocument.write(XWPFDocument.java:582)

Even worse, when I tried to process multiple large documents in a web application, the server became unresponsive:

Server Log
INFO [http-nio-8080-exec-1] c.e.d.DocumentController : Starting conversion for document.docx
INFO [http-nio-8080-exec-2] c.e.d.DocumentController : Starting conversion for report.docx
... (30 seconds of silence)
ERROR [http-nio-8080-exec-1] o.a.c.c.C.[.[localhost].[/].[dispatcherServlet] : Servlet.service() threw exception
java.util.concurrent.TimeoutException: Request timed out after 30000ms

The HTTP request thread was blocked, waiting for the conversion to complete, and eventually timed out.

My Environment

  • Java 17
  • Apache POI 5.2.3
  • Spring Boot 3.1.0
  • Maven 3.9.0

The Initial Code

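Here was my original document conversion code. XHTMLConverter, XHTMLOptions and Base64EmbedImgManager come from the XDocReport XWPF-to-XHTML converter (package fr.opensagres.poi.xwpf.converter.xhtml), which sits on top of Apache POI: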

DocumentConverter.java
@Service
public class DocumentConverter {
    public String convertToHtml(Path docxPath) throws IOException {
        FileInputStream fis = new FileInputStream(docxPath.toFile());
        XWPFDocument document = new XWPFDocument(fis);
        XHTMLOptions options = XHTMLOptions.create();
        options.setImageManager(new Base64EmbedImgManager());
        StringWriter writer = new StringWriter();
        XHTMLConverter.getInstance()
            .convert(document, writer, options);
        fis.close(); // I thought this was enough
        return writer.toString();
    }
}

I thought I was doing everything right - opening the file, converting, and closing it. But there were several problems:

  1. Resource leak: If an exception occurred, fis.close() was never called, and the XWPFDocument itself was never closed at all
  2. Blocking thread: The conversion ran on the HTTP request thread
  3. No resource limits: Any size document could be loaded into memory

First Attempt: try-with-resources

I knew I should use try-with-resources to ensure proper cleanup:

DocumentConverter.java
@Service
public class DocumentConverter {
    public String convertToHtml(Path docxPath) throws IOException {
        try (FileInputStream fis = new FileInputStream(docxPath.toFile());
             XWPFDocument document = new XWPFDocument(fis)) {
            XHTMLOptions options = XHTMLOptions.create();
            options.setImageManager(new Base64EmbedImgManager());
            StringWriter writer = new StringWriter();
            XHTMLConverter.getInstance()
                .convert(document, writer, options);
            return writer.toString();
        }
    }
}

This fixed the resource leak: the stream and the document are now closed as soon as conversion finishes, even if an exception occurs. It does not lower peak memory use, though, because the whole document is still loaded onto the heap.
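To see the cleanup guarantee in isolation, here is a minimal sketch that uses only the JDK. TrackedResource and TryWithResourcesDemo are hypothetical names invented for this illustration; they stand in for the stream and the document and simply log when they are opened and closed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a stream or document; records open/close events.
class TrackedResource implements AutoCloseable {
    private final String name;
    private final List<String> log;

    TrackedResource(String name, List<String> log) {
        this.name = name;
        this.log = log;
        log.add("open " + name);
    }

    @Override
    public void close() {
        log.add("close " + name);
    }
}

class TryWithResourcesDemo {
    // Opens two resources, fails in the middle of the "conversion",
    // and returns the ordered list of events that occurred.
    static List<String> runFailingConversion() {
        List<String> log = new ArrayList<>();
        try (TrackedResource stream = new TrackedResource("stream", log);
             TrackedResource document = new TrackedResource("document", log)) {
            throw new IllegalStateException("conversion failed");
        } catch (IllegalStateException e) {
            // Both resources have already been closed when this block runs.
            log.add("caught " + e.getMessage());
        }
        return log;
    }
}
```

Calling TryWithResourcesDemo.runFailingConversion() returns the events open stream, open document, close document, close stream, caught conversion failed. Resources are closed in the reverse order of their declaration, and before the catch block runs.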

But I still had the timeout problem for large files. The HTTP request thread was still blocked.

Second Attempt: Async Processing

I needed to run conversions asynchronously. Here’s my updated service:

DocumentConverter.java
@Service
public class DocumentConverter {
    private final ExecutorService executorService;

    public DocumentConverter() {
        this.executorService = Executors.newFixedThreadPool(4);
    }

    public CompletableFuture<String> convertToHtmlAsync(Path docxPath) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                return convertToHtml(docxPath);
            } catch (IOException e) {
                throw new CompletionException(e);
            }
        }, executorService);
    }

    private String convertToHtml(Path docxPath) throws IOException {
        try (FileInputStream fis = new FileInputStream(docxPath.toFile());
             XWPFDocument document = new XWPFDocument(fis)) {
            XHTMLOptions options = XHTMLOptions.create();
            options.setImageManager(new Base64EmbedImgManager());
            StringWriter writer = new StringWriter();
            XHTMLConverter.getInstance()
                .convert(document, writer, options);
            return writer.toString();
        }
    }
}

Now the controller could return immediately:

DocumentController.java
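@RestController
@RequestMapping("/api/documents")
public class DocumentController {
    private final DocumentConverter converter;
    private final ConversionRepository repository;

    // Constructor injection: the final fields must be assigned somewhere
    public DocumentController(DocumentConverter converter,
                              ConversionRepository repository) {
        this.converter = converter;
        this.repository = repository;
    }

    @PostMapping("/convert")
    public ResponseEntity<ConversionResponse> startConversion(
            @RequestParam("file") MultipartFile file) {
        String jobId = UUID.randomUUID().toString();
        // saveTempFile (not shown) writes the upload to a temporary file
        Path tempPath = saveTempFile(file);
        converter.convertToHtmlAsync(tempPath)
            .thenAccept(html -> repository.saveResult(jobId, html))
            .exceptionally(ex -> {
                repository.saveError(jobId, ex.getMessage());
                return null;
            });
        // Respond with 202 Accepted while the conversion runs in the background
        return ResponseEntity.accepted()
            .body(new ConversionResponse(jobId, "PROCESSING"));
    }
}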

The API now returns immediately with a job ID, and clients can poll for results.
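The polling side is not shown in this article, so here is a rough sketch of what a result store behind it could look like. InMemoryJobStore, its Status enum and its Job record are hypothetical names made up for this example, not the real ConversionRepository. It uses only the JDK and keeps everything in memory, which would not survive a restart.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for a conversion result repository.
class InMemoryJobStore {
    enum Status { PROCESSING, DONE, FAILED }

    // Immutable snapshot of a job; payload holds the HTML or the error message.
    record Job(Status status, String payload) {}

    // ConcurrentHashMap because worker threads write while request threads read.
    private final Map<String, Job> jobs = new ConcurrentHashMap<>();

    void markProcessing(String jobId) {
        jobs.put(jobId, new Job(Status.PROCESSING, ""));
    }

    void saveResult(String jobId, String html) {
        jobs.put(jobId, new Job(Status.DONE, html));
    }

    void saveError(String jobId, String message) {
        // String.valueOf guards against exceptions that carry a null message.
        jobs.put(jobId, new Job(Status.FAILED, String.valueOf(message)));
    }

    // Empty means the job ID is unknown; a polling endpoint might answer 404.
    Optional<Job> find(String jobId) {
        return Optional.ofNullable(jobs.get(jobId));
    }
}
```

A polling endpoint would call find(jobId) and translate the status into a response. Registering the job as PROCESSING before the conversion is submitted avoids a window in which a client polls an ID the store has not seen yet.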

Third Attempt: Add Timeout and Validation

For production use, I needed timeouts and input validation:

DocumentConverter.java
@Service
public class DocumentConverter {
    private static final long MAX_FILE_SIZE = 100 * 1024 * 1024; // 100MB
    private static final int CONVERSION_TIMEOUT_MINUTES = 10;
    private final ExecutorService executorService;

    public DocumentConverter() {
        this.executorService = Executors.newFixedThreadPool(4);
    }

    public CompletableFuture<String> convertToHtmlAsync(Path docxPath) {
        // Validate file size first; this fails fast on the caller's thread
        validateDocument(docxPath);
        return CompletableFuture.supplyAsync(() -> {
            try {
                return convertToHtml(docxPath);
            } catch (IOException e) {
                throw new CompletionException(e);
            }
        }, executorService)
            // Fails the future with a TimeoutException after the limit.
            // It does not interrupt the worker thread, so a conversion
            // that is already running keeps going in the background.
            .orTimeout(CONVERSION_TIMEOUT_MINUTES, TimeUnit.MINUTES);
    }

    private void validateDocument(Path path) {
        if (!Files.exists(path)) {
            throw new IllegalArgumentException("File not found: " + path);
        }
        try {
            long size = Files.size(path);
            if (size > MAX_FILE_SIZE) {
                throw new IllegalArgumentException(
                    "File too large: " + size + " bytes. Maximum: " + MAX_FILE_SIZE
                );
            }
        } catch (IOException e) {
            throw new IllegalArgumentException("Cannot read file size", e);
        }
    }

    private String convertToHtml(Path docxPath) throws IOException {
        try (FileInputStream fis = new FileInputStream(docxPath.toFile());
             XWPFDocument document = new XWPFDocument(fis)) {
            XHTMLOptions options = XHTMLOptions.create();
            options.setImageManager(new Base64EmbedImgManager());
            StringWriter writer = new StringWriter();
            XHTMLConverter.getInstance()
                .convert(document, writer, options);
            return writer.toString();
        }
    }

    // Stop accepting new work when the application context shuts down
    @PreDestroy
    public void shutdown() {
        executorService.shutdown();
    }
}
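One caveat about orTimeout: it only completes the future exceptionally, and the task behind it is not stopped. If you also want the worker thread to be asked to stop, one possible approach is a watchdog that cancels the underlying task. The sketch below is JDK-only, and CancellableTimeout is a hypothetical helper name, not part of any library. Interruption is cooperative, so a CPU-bound conversion that never checks the interrupt flag may still run to completion.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a task on `workers` and interrupts it on timeout.
class CancellableTimeout {
    static <T> CompletableFuture<T> submit(ExecutorService workers,
                                           ScheduledExecutorService timer,
                                           Callable<T> task,
                                           long timeout, TimeUnit unit) {
        CompletableFuture<T> result = new CompletableFuture<>();

        // Run the task and publish its outcome through `result`.
        Runnable job = () -> {
            try {
                result.complete(task.call());
            } catch (Throwable t) {
                result.completeExceptionally(t);
            }
        };
        Future<?> running = workers.submit(job);

        // If the timeout wins the race, fail the future and interrupt the worker.
        Runnable onTimeout = () -> {
            if (result.completeExceptionally(new TimeoutException("conversion timed out"))) {
                running.cancel(true);
            }
        };
        ScheduledFuture<?> watchdog = timer.schedule(onTimeout, timeout, unit);

        // Once a result exists, the watchdog is no longer needed.
        result.whenComplete((value, error) -> watchdog.cancel(false));
        return result;
    }
}
```

In the service above, this would replace the supplyAsync(...).orTimeout(...) chain, with a small ScheduledExecutorService acting as the timer. That wiring is an assumption on my part and is not shown here.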

Why This Matters

Apache POI loads the entire document into memory. When you have a 50MB Word document, it can easily consume 200-300MB of heap space due to:

  • XML parsing overhead (the .docx format is a ZIP of XML files)
  • Object model creation (every paragraph, run, table becomes an object)
  • Image handling and base64 encoding
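Because a .docx is a ZIP archive, the size on disk understates how much XML has to be parsed. A rough way to see the difference is to sum the uncompressed sizes recorded in the archive. The sketch below uses only java.util.zip; DocxSizeEstimator is a hypothetical name. The sizes are whatever the archive declares, so a malicious file could lie about them. Treat the number as an estimate rather than a security check.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: estimates how large a .docx is once unzipped.
class DocxSizeEstimator {
    // Sums the declared uncompressed sizes of all entries in the archive.
    // Entries with an unknown size (-1) are skipped, so the result is a lower bound.
    static long uncompressedBytes(Path docx) throws IOException {
        long total = 0;
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                long size = entries.nextElement().getSize();
                if (size > 0) {
                    total += size;
                }
            }
        }
        return total;
    }
}
```

Comparing this figure with Files.size(path) gives a feel for the compression ratio of a particular document. The in-memory object model is then larger again than the raw XML.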

The key insights:

  1. Resource lifecycle matters: Unclosed streams and documents leak file handles and memory, and the leaks accumulate over time
  2. Thread blocking matters: Large document processing on HTTP threads causes timeouts
  3. Input validation matters: Processing a 500MB file without limits can exhaust the heap and crash your application
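A fixed file-size limit ignores how much heap is actually free at the moment a request arrives. As a complement, here is a small sketch of a headroom check built on Runtime. HeapGuard and its expansion factor of 6 are assumptions for illustration only; the factor is loosely derived from the 50MB to 200-300MB observation above and should be measured for your own documents.

```java
// Hypothetical pre-flight check: is there enough heap for this document?
class HeapGuard {
    // Assumed ratio of heap use to file size. This is a guess, not a measured constant.
    static final long ASSUMED_EXPANSION_FACTOR = 6;

    // Pure check, written with division so huge file sizes cannot overflow.
    static boolean fits(long fileBytes, long availableHeapBytes) {
        return fileBytes <= availableHeapBytes / ASSUMED_EXPANSION_FACTOR;
    }

    // Heap the JVM could still hand out: max heap minus what is currently in use.
    static long availableHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    static boolean canConvert(long fileBytes) {
        return fits(fileBytes, availableHeapBytes());
    }
}
```

The check is a snapshot, and it does not account for other conversions that are running at the same moment. With a pool of four worker threads, the worst case is four large documents in memory at once, so the pool size belongs in the same calculation.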

Summary

When converting large Word documents to HTML in Java:

  1. Always use try-with-resources to ensure streams and document objects are released immediately
  2. Process asynchronously to prevent blocking HTTP request threads
  3. Validate input size before loading documents into memory
  4. Set reasonable timeouts for conversion operations
  5. Monitor heap usage for very large files

The combination of proper resource management and async processing makes document conversion reliable and scalable.
