How to Handle Large Word Documents When Converting to HTML in Java
The Problem
I was building a document conversion service that converts Word documents (.docx) to HTML. Everything worked fine with small files, but when I tried to convert a 50MB Word document, I got this:
java.lang.OutOfMemoryError: Java heap space
    at org.apache.xmlbeans.impl.store.Saver$TextSaver.emit(Saver.java:2045)
    at org.apache.xmlbeans.impl.store.Saver$TextSaver.preContent(Saver.java:1038)
    at org.apache.xmlbeans.impl.store.Saver.process(Saver.java:456)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.write(XWPFDocument.java:582)

Even worse, when I tried to process multiple large documents in a web application, the server became unresponsive:
INFO [http-nio-8080-exec-1] c.e.d.DocumentController : Starting conversion for document.docx
INFO [http-nio-8080-exec-2] c.e.d.DocumentController : Starting conversion for report.docx
... (30 seconds of silence)
ERROR [http-nio-8080-exec-1] o.a.c.c.C.[.[localhost].[/].[dispatcherServlet] : Servlet.service() threw exception
java.util.concurrent.TimeoutException: Request timed out after 30000ms

The HTTP request thread was blocked, waiting for the conversion to complete, and eventually timed out.
My Environment
- Java 17
- Apache POI 5.2.3
- Spring Boot 3.1.0
- Maven 3.9.0
The Initial Code
Here was my original document conversion code:
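// Assumption: XHTMLConverter, XHTMLOptions and Base64EmbedImgManager come from the
// XDocReport converter (fr.opensagres.poi.xwpf.converter.xhtml), not from POI itself.
import fr.opensagres.poi.xwpf.converter.xhtml.Base64EmbedImgManager;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.StringWriter;
import java.nio.file.Path;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.springframework.stereotype.Service;

@Service
public class DocumentConverter {

    public String convertToHtml(Path docxPath) throws IOException {
        FileInputStream fis = new FileInputStream(docxPath.toFile());
        XWPFDocument document = new XWPFDocument(fis);

        XHTMLOptions options = XHTMLOptions.create();
        options.setImageManager(new Base64EmbedImgManager());

        StringWriter writer = new StringWriter();
        XHTMLConverter.getInstance()
                .convert(document, writer, options);

        fis.close(); // I thought this was enough
        return writer.toString();
    }
}

I thought I was doing everything right: opening the file, converting, and closing it. But there were several problems:
- Resource leak: If an exception occurred, fis.close() was never called, and the XWPFDocument itself was never closed at all
- Blocking thread: The conversion ran on the HTTP request thread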
- No resource limits: Any size document could be loaded into memory
First Attempt: try-with-resources
I knew I should use try-with-resources to ensure proper cleanup:
@Service
public class DocumentConverter {

    public String convertToHtml(Path docxPath) throws IOException {
        // Both resources are closed automatically, in reverse order of declaration
        try (FileInputStream fis = new FileInputStream(docxPath.toFile());
             XWPFDocument document = new XWPFDocument(fis)) {

            XHTMLOptions options = XHTMLOptions.create();
            options.setImageManager(new Base64EmbedImgManager());

            StringWriter writer = new StringWriter();
            XHTMLConverter.getInstance()
                    .convert(document, writer, options);

            return writer.toString();
        }
    }
}

This fixed the leak: the stream and the document are now closed as soon as the conversion finishes, even if an exception occurs. It does not reduce the peak memory needed while a document is loaded, though.
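To make the cleanup guarantee concrete, here is a minimal sketch that uses only the JDK. TrackedResource is a hypothetical stand-in for the stream and the document (it is not a POI class); it just records when it is opened and closed, so the order of events becomes visible when the body throws.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a stream or document; records when it is closed.
class TrackedResource implements AutoCloseable {
    private final String name;
    private final List<String> log;

    TrackedResource(String name, List<String> log) {
        this.name = name;
        this.log = log;
        log.add("open " + name);
    }

    @Override
    public void close() {
        log.add("close " + name);
    }
}

class TryWithResourcesDemo {
    // Opens two resources, fails inside the body, and returns the recorded events.
    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (TrackedResource stream = new TrackedResource("stream", log);
             TrackedResource document = new TrackedResource("document", log)) {
            throw new IllegalStateException("conversion failed");
        } catch (IllegalStateException e) {
            // Both resources are already closed by the time the catch block runs.
            log.add("caught " + e.getMessage());
        }
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The recorded events show the document being closed before the stream, and both before the exception is handled, which is the behavior the converter relies on.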
But I still had the timeout problem for large files. The HTTP request thread was still blocked.
Second Attempt: Async Processing
I needed to run conversions asynchronously. Here's my updated service:
@Service
public class DocumentConverter {

    private final ExecutorService executorService;

    public DocumentConverter() {
        this.executorService = Executors.newFixedThreadPool(4);
    }

    public CompletableFuture<String> convertToHtmlAsync(Path docxPath) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                return convertToHtml(docxPath);
            } catch (IOException e) {
                throw new CompletionException(e);
            }
        }, executorService);
    }

    private String convertToHtml(Path docxPath) throws IOException {
        try (FileInputStream fis = new FileInputStream(docxPath.toFile());
             XWPFDocument document = new XWPFDocument(fis)) {

            XHTMLOptions options = XHTMLOptions.create();
            options.setImageManager(new Base64EmbedImgManager());

            StringWriter writer = new StringWriter();
            XHTMLConverter.getInstance()
                    .convert(document, writer, options);

            return writer.toString();
        }
    }
}

Now the controller could return immediately:
@RestController
@RequestMapping("/api/documents")
public class DocumentController {

    private final DocumentConverter converter;
    private final ConversionRepository repository;

    // Constructor injection so the final fields are initialized
    public DocumentController(DocumentConverter converter, ConversionRepository repository) {
        this.converter = converter;
        this.repository = repository;
    }

    @PostMapping("/convert")
    public ResponseEntity<ConversionResponse> startConversion(
            @RequestParam("file") MultipartFile file) {

        String jobId = UUID.randomUUID().toString();

        // saveTempFile is a helper that is not shown in this article
        Path tempPath = saveTempFile(file);

        converter.convertToHtmlAsync(tempPath)
                .thenAccept(html -> {
                    repository.saveResult(jobId, html);
                })
                .exceptionally(ex -> {
                    repository.saveError(jobId, ex.getMessage());
                    return null;
                });

        return ResponseEntity.accepted()
                .body(new ConversionResponse(jobId, "PROCESSING"));
    }
}

The API now returns immediately with a job ID, and clients can poll for results.
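The article does not show what sits behind repository.saveResult and repository.saveError, so the following is only a sketch of one possible shape, assuming jobs can live in memory. InMemoryJobStore, its Status enum, and the Job record are hypothetical names introduced here; a real service would more likely persist jobs so they survive a restart.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory job store; not the ConversionRepository from the article.
class InMemoryJobStore {
    enum Status { PROCESSING, DONE, FAILED }

    // Immutable snapshot of a job; payload holds the HTML or the error message.
    record Job(Status status, String payload) {}

    private final Map<String, Job> jobs = new ConcurrentHashMap<>();

    // Called when the conversion is accepted, before any result exists.
    void markProcessing(String jobId) {
        jobs.put(jobId, new Job(Status.PROCESSING, ""));
    }

    void saveResult(String jobId, String html) {
        jobs.put(jobId, new Job(Status.DONE, html));
    }

    void saveError(String jobId, String message) {
        // String.valueOf guards against a null exception message.
        jobs.put(jobId, new Job(Status.FAILED, String.valueOf(message)));
    }

    // Empty when the job ID is unknown, which a polling endpoint could map to 404.
    Optional<Job> find(String jobId) {
        return Optional.ofNullable(jobs.get(jobId));
    }
}
```

A polling endpoint could call find(jobId) and translate PROCESSING, DONE, and FAILED into whatever response format the client expects.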
Third Attempt: Add Timeout and Validation
For production use, I needed timeouts and input validation:
@Service
public class DocumentConverter {

    private static final long MAX_FILE_SIZE = 100 * 1024 * 1024; // 100MB
    private static final int CONVERSION_TIMEOUT_MINUTES = 10;

    private final ExecutorService executorService;

    public DocumentConverter() {
        this.executorService = Executors.newFixedThreadPool(4);
    }

    public CompletableFuture<String> convertToHtmlAsync(Path docxPath) {
        // Validate file size first; this throws on the calling thread,
        // before any future is created
        validateDocument(docxPath);

        return CompletableFuture.supplyAsync(() -> {
            try {
                return convertToHtml(docxPath);
            } catch (IOException e) {
                throw new CompletionException(e);
            }
        }, executorService)
        // orTimeout fails the future with a TimeoutException, but it does not
        // interrupt the conversion that is still running on the pool thread
        .orTimeout(CONVERSION_TIMEOUT_MINUTES, TimeUnit.MINUTES);
    }

    private void validateDocument(Path path) {
        if (!Files.exists(path)) {
            throw new IllegalArgumentException("File not found: " + path);
        }

        try {
            long size = Files.size(path);
            if (size > MAX_FILE_SIZE) {
                throw new IllegalArgumentException(
                        "File too large: " + size + " bytes. Maximum: " + MAX_FILE_SIZE
                );
            }
        } catch (IOException e) {
            throw new IllegalArgumentException("Cannot read file size", e);
        }
    }

    private String convertToHtml(Path docxPath) throws IOException {
        try (FileInputStream fis = new FileInputStream(docxPath.toFile());
             XWPFDocument document = new XWPFDocument(fis)) {

            XHTMLOptions options = XHTMLOptions.create();
            options.setImageManager(new Base64EmbedImgManager());

            StringWriter writer = new StringWriter();
            XHTMLConverter.getInstance()
                    .convert(document, writer, options);

            return writer.toString();
        }
    }

    @PreDestroy
    public void shutdown() {
        executorService.shutdown();
    }
}

Why This Matters
Apache POI loads the entire document into memory. When you have a 50MB Word document, it can easily consume 200-300MB of heap space due to:
- XML parsing overhead (the .docx format is a ZIP of XML files)
- Object model creation (every paragraph, run, table becomes an object)
- Image handling and base64 encoding
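The 200-300MB figure for a 50MB file works out to an expansion factor of roughly 4 to 6. As a rough sketch, that arithmetic can be turned into a pre-check. HeapBudget and its method names are hypothetical, and the factor is an assumption taken from the numbers above, not a measured constant; real documents will vary.

```java
// Hypothetical helper: estimates heap demand from file size and an assumed factor.
class HeapBudget {
    // Rough heap needed to load a file, given an assumed expansion factor.
    // Math.multiplyExact throws ArithmeticException instead of silently overflowing.
    static long estimateHeapBytes(long fileSizeBytes, int expansionFactor) {
        return Math.multiplyExact(fileSizeBytes, (long) expansionFactor);
    }

    // True when the estimate fits into the given amount of available heap.
    static boolean fits(long fileSizeBytes, int expansionFactor, long availableHeapBytes) {
        return estimateHeapBytes(fileSizeBytes, expansionFactor) <= availableHeapBytes;
    }

    public static void main(String[] args) {
        long fiftyMb = 50L * 1024 * 1024;
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("Estimated bytes: " + estimateHeapBytes(fiftyMb, 6));
        System.out.println("Fits in max heap: " + fits(fiftyMb, 6, maxHeap));
    }
}
```

With four conversions running in parallel, the budget per conversion is closer to a quarter of the heap, so the size limit is worth choosing with the pool size in mind.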
The key insights:
- Resource lifecycle matters: Unclosed streams lead to memory leaks that accumulate over time
- Thread blocking matters: Large document processing on HTTP threads causes timeouts
- Input validation matters: Processing a 500MB file without limits will crash your application
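One detail about timeouts is easy to miss: CompletableFuture.orTimeout only fails the future, while the task itself keeps running on its pool thread. The sketch below is one possible way to also request cancellation, assuming the work is submitted as a plain Callable. CancellableTimeout and submitWithTimeout are hypothetical names, and interruption only helps if the running code reacts to it, which is not guaranteed for a library-driven conversion.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: a timeout that also interrupts the worker thread.
class CancellableTimeout {
    static <T> CompletableFuture<T> submitWithTimeout(
            ExecutorService executor, Callable<T> task, long timeout, TimeUnit unit) {
        CompletableFuture<T> result = new CompletableFuture<>();

        // Keep the executor's Future so the running task can be interrupted later.
        Future<?> running = executor.submit(() -> {
            try {
                result.complete(task.call());
            } catch (Throwable t) {
                // Covers errors such as OutOfMemoryError as well as exceptions;
                // a no-op if the future has already timed out.
                result.completeExceptionally(t);
            }
        });

        return result
                .orTimeout(timeout, unit)
                .whenComplete((value, error) -> {
                    if (error instanceof TimeoutException) {
                        running.cancel(true); // request interruption of the worker
                    }
                });
    }
}
```

The returned future still completes exceptionally on timeout, so callers that already use exceptionally or handle do not need to change.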
Common Mistakes to Avoid
Summary
When converting large Word documents to HTML in Java:
- Always use try-with-resources to ensure streams and document objects are released immediately
- Process asynchronously to prevent blocking HTTP request threads
- Validate input size before loading documents into memory
- Set reasonable timeouts for conversion operations
- Monitor heap usage for very large files
The combination of proper resource management and async processing makes document conversion reliable and scalable.
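For the heap monitoring point, a small sketch using the JDK's MemoryMXBean might look like this. HeapMonitor and the threshold parameter are hypothetical; where the reading goes (a log line, a metric, a decision to reject new jobs) is left open.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper around the JDK's heap usage figures.
class HeapMonitor {
    // Current heap usage as reported by the JVM.
    static MemoryUsage currentHeap() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    // Fraction of the maximum heap in use, or -1 if the maximum is undefined.
    static double usedFraction(MemoryUsage usage) {
        long max = usage.getMax();
        if (max <= 0) {
            return -1.0;
        }
        return (double) usage.getUsed() / max;
    }

    // True when usage is known and at or above the given threshold (0.0 to 1.0).
    static boolean isUnderPressure(MemoryUsage usage, double threshold) {
        double fraction = usedFraction(usage);
        return fraction >= 0 && fraction >= threshold;
    }

    public static void main(String[] args) {
        MemoryUsage heap = currentHeap();
        System.out.printf("Heap used: %d of %d bytes%n", heap.getUsed(), heap.getMax());
    }
}
```

A service could check isUnderPressure(currentHeap(), 0.8) before accepting another large document; the 0.8 value is only an example.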
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Apache POI - the Java API for Microsoft Documents
- Apache POI XWPF Usermodel Documentation
- Java CompletableFuture Documentation
Oh, and if you found these resources useful, don't forget to support me by starring the repo on GitHub!