Can Local LLMs Reliably Process Financial Documents?
Problem
I needed to process hundreds of receipts and invoices for my accounting workflow. The problem: I wanted to extract totals, taxes, and line items without sending sensitive financial data to cloud APIs like OpenAI or Anthropic.
Privacy was my main concern. My receipts contain vendor names, purchase dates, amounts, and sometimes partial payment information. I didn’t want this data leaving my local machine.
I asked myself: Can local LLMs reliably process financial documents without cloud APIs?
What I tried first
My first attempt was naive. I tried feeding receipt images directly to a local LLM:
# This approach failed
from llama_cpp import Llama

llm = Llama(model_path="llama-7b.q4_k_m.gguf")
result = llm("Extract the total from this receipt image: receipt.jpg")
# Output: gibberish, couldn't handle images

This failed because most local LLMs are text-only models. They can’t “see” the receipt image. I needed a vision-language model.
Discovery: Reddit user’s real-world experience
I found a Reddit discussion where user “Awkward-Customer” (score 22) shared their actual production setup:
“I use qwen VL model to extract totals, taxes from receipts/invoices. Outputs to spreadsheet. Uses paperless-ngx with qwen3 next for document metadata.”
Their workflow caught my attention because it wasn’t just a demo—it was a working system. Here’s what they described:
Hardware: 24GB VRAM for Qwen3-Next-80B-A3B-Instruct
Stage 1: Docling for text extraction
Stage 2: LLM + JSON for structured extraction
Stage 3: Sanity checks (math validation)
Stage 4: Confidence scoring
Stage 5: Recursive validation (optional)

The key insight: They didn’t just run an LLM and trust the output. They built a multi-stage validation pipeline to catch and handle hallucinations.
Environment setup
Based on the Reddit user’s experience, I set up my environment:
Hardware:
- GPU: NVIDIA RTX 3090 (24GB VRAM)
- RAM: 64GB system memory
- Storage: 2TB NVMe SSD

Software:
- Python 3.11
- CUDA 12.1
- PyTorch 2.1

Models:
- Qwen2-VL-7B-Instruct (for image understanding)
- Qwen2.5-14B-Instruct (for JSON extraction)

You don’t need 24GB VRAM to start. The 7B and 14B models work well on consumer hardware.
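To get a rough feel for which models fit on which card, here is a minimal back-of-the-envelope sketch. It only estimates the memory taken by the weights. The bits-per-weight figures are my assumptions (quantized files vary by format), and the KV cache and runtime overhead come on top, so treat the numbers as a lower bound rather than a measurement.

```python
def estimate_weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# Assumed bits-per-weight values: 16 for unquantized half precision,
# 4.5 as a stand-in for a typical 4-bit quantization.
for name, params in [("7B", 7), ("14B", 14)]:
    for label, bits in [("16-bit", 16), ("4-bit quantized (assumed 4.5 bits)", 4.5)]:
        size = estimate_weight_memory_gb(params, bits)
        print(f"{name} at {label}: ~{size:.1f} GB for weights")
```

By this estimate, a quantized 14B model leaves headroom on a 24GB card, while the unquantized version would not fit.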
Stage 1: Text extraction with Docling
The first stage extracts text from document images. Docling is an open-source document processing library that handles PDFs and images well.
from docling.document_converter import DocumentConverter

def extract_text_from_document(file_path: str) -> str:
    """Extract text from PDF or image using Docling"""
    converter = DocumentConverter()
    result = converter.convert(file_path)

    # Export to markdown format
    text = result.document.export_to_markdown()
    return text

# Example usage
receipt_text = extract_text_from_document("receipt_001.pdf")
print(receipt_text)

When I ran this on a sample receipt:
# Receipt Output

WALGREENS
123 Main Street
Store #45678

Milk 2% Gal $4.29
Bread Wheat $3.49
Eggs Large Doz $5.99
Butter Salted $4.99
----------------------
Subtotal $18.76
Tax (8.25%) $1.55
----------------------
TOTAL $20.31

Payment: Visa ****1234
Date: 03/14/2026

Docling extracted the text, but I needed structured data, not just raw text.
Stage 2: Structured extraction with LLM
I used a local LLM to convert the extracted text into structured JSON. The key is defining a clear schema.
import json
from llama_cpp import Llama

# Define the expected output schema.
# Literal braces are doubled so str.format() only substitutes {text}.
EXTRACTION_PROMPT = """Extract financial data from this receipt text and return JSON.

Required fields:
- vendor_name: string
- date: string (YYYY-MM-DD format)
- subtotal: number
- tax: number
- total: number
- line_items: array of {{name, price}}

Receipt text:
{text}

Return only valid JSON, no explanation."""

def extract_structured_data(text: str, model_path: str) -> dict:
    """Use local LLM to extract structured JSON from text"""
    llm = Llama(
        model_path=model_path,
        n_ctx=4096,
        n_gpu_layers=-1  # Use all GPU layers
    )

    prompt = EXTRACTION_PROMPT.format(text=text)
    response = llm(prompt, max_tokens=512, temperature=0.1)

    # Parse JSON from response
    raw_output = response['choices'][0]['text']

    # Find JSON in response (handle markdown code blocks)
    if '```json' in raw_output:
        json_str = raw_output.split('```json')[1].split('```')[0]
    elif '```' in raw_output:
        json_str = raw_output.split('```')[1].split('```')[0]
    else:
        json_str = raw_output

    return json.loads(json_str.strip())

# Run extraction
result = extract_structured_data(
    receipt_text,
    "qwen2.5-14b-instruct-q4_k_m.gguf"
)
print(json.dumps(result, indent=2))

The output was mostly correct:
{
  "vendor_name": "WALGREENS",
  "date": "2026-03-14",
  "subtotal": 18.76,
  "tax": 1.55,
  "total": 20.31,
  "line_items": [
    {"name": "Milk 2% Gal", "price": 4.29},
    {"name": "Bread Wheat", "price": 3.49},
    {"name": "Eggs Large Doz", "price": 5.99},
    {"name": "Butter Salted", "price": 4.99}
  ]
}

But I noticed something concerning. Sometimes the LLM would make up values or get the math wrong. This is where the validation pipeline becomes critical.
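Before checking the math, it can help to confirm that the parsed JSON even has the expected shape. The following is a minimal sketch of such a check using only the standard library. The function name `check_schema` and the `REQUIRED_FIELDS` table are my own illustration, not part of the Reddit user’s setup, and the field list simply mirrors the schema from the prompt.

```python
import json
from typing import Any

# Field names follow the extraction prompt; the type table itself is an assumption.
REQUIRED_FIELDS = {
    "vendor_name": str,
    "date": str,
    "subtotal": (int, float),
    "tax": (int, float),
    "total": (int, float),
    "line_items": list,
}

def check_schema(data: Any) -> list[str]:
    """Return a list of schema problems; an empty list means the shape looks right."""
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif isinstance(data[field], bool) or not isinstance(data[field], expected):
            problems.append(f"wrong type for {field}: {type(data[field]).__name__}")
    items = data.get("line_items")
    if isinstance(items, list):
        for i, item in enumerate(items):
            if not isinstance(item, dict) or "name" not in item or "price" not in item:
                problems.append(f"line_items[{i}] lacks name/price")
            elif isinstance(item["price"], bool) or not isinstance(item["price"], (int, float)):
                problems.append(f"line_items[{i}].price is not a number")
    return problems

good = json.loads('{"vendor_name": "WALGREENS", "date": "2026-03-14", "subtotal": 18.76, '
                  '"tax": 1.55, "total": 20.31, "line_items": [{"name": "Milk 2% Gal", "price": 4.29}]}')
bad = json.loads('{"vendor_name": "WALGREENS", "total": "20.31", "line_items": [{"name": "Milk"}]}')
print(check_schema(good))
for problem in check_schema(bad):
    print(" -", problem)
```

A failed shape check is a cheap signal to retry or flag the document before any later stage touches the values.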
Stage 3: Sanity checks
The Reddit user emphasized sanity checks—validating the math in extracted data. This catches LLM hallucinations.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LineItem:
    name: str
    price: float

@dataclass
class ReceiptData:
    vendor_name: str
    date: str
    subtotal: float
    tax: float
    total: float
    line_items: List[LineItem]

def validate_receipt_math(data: ReceiptData) -> tuple[bool, List[str]]:
    """Validate mathematical consistency of receipt data"""
    errors = []

    # Check 1: Line items sum equals subtotal
    items_sum = sum(item.price for item in data.line_items)
    if abs(items_sum - data.subtotal) > 0.01:
        errors.append(
            f"Line items sum ({items_sum:.2f}) != subtotal ({data.subtotal:.2f})"
        )

    # Check 2: Subtotal + tax equals total
    calculated_total = data.subtotal + data.tax
    if abs(calculated_total - data.total) > 0.01:
        errors.append(
            f"Subtotal + tax ({calculated_total:.2f}) != total ({data.total:.2f})"
        )

    # Check 3: Tax rate is reasonable (0-25%)
    if data.subtotal > 0:
        tax_rate = data.tax / data.subtotal
        if tax_rate > 0.25:
            errors.append(f"Tax rate ({tax_rate:.1%}) seems unreasonably high")
        if tax_rate < 0:
            errors.append(f"Tax rate ({tax_rate:.1%}) is negative")

    # Check 4: All values are positive
    if data.subtotal < 0 or data.tax < 0 or data.total < 0:
        errors.append("Negative values found in financial data")

    return len(errors) == 0, errors

# Run validation
receipt = ReceiptData(
    vendor_name="WALGREENS",
    date="2026-03-14",
    subtotal=18.76,
    tax=1.55,
    total=20.31,
    line_items=[
        LineItem("Milk 2% Gal", 4.29),
        LineItem("Bread Wheat", 3.49),
        LineItem("Eggs Large Doz", 5.99),
        LineItem("Butter Salted", 4.99)
    ]
)

is_valid, errors = validate_receipt_math(receipt)
print(f"Valid: {is_valid}")
if errors:
    for error in errors:
        print(f"  - {error}")

When I tested this with a deliberately wrong extraction:
Valid: False
  - Line items sum (18.76) != subtotal (18.50)
  - Subtotal + tax (20.05) != total (21.00)

The sanity checks caught the errors immediately.
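The checks above compare floats with a one-cent tolerance. If you would rather compare exact cents, one alternative is Python’s `decimal` module. This is a sketch of that variant, not what the Reddit user described; the helper names `to_cents` and `totals_consistent` are mine.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def to_cents(value) -> Decimal:
    """Convert a number or numeric string to a Decimal rounded to whole cents."""
    return Decimal(str(value)).quantize(CENT, rounding=ROUND_HALF_UP)

def totals_consistent(prices, subtotal, tax, total) -> bool:
    """Exact-cent version of the two sum checks: items vs subtotal, subtotal + tax vs total."""
    items_sum = sum((to_cents(p) for p in prices), Decimal("0"))
    return (
        items_sum == to_cents(subtotal)
        and to_cents(subtotal) + to_cents(tax) == to_cents(total)
    )

# The sample receipt, then the deliberately wrong extraction
print(totals_consistent([4.29, 3.49, 5.99, 4.99], 18.76, 1.55, 20.31))  # True
print(totals_consistent([4.29, 3.49, 5.99, 4.99], 18.50, 1.55, 21.00))  # False
```

The trade-off is strictness: an exact comparison rejects receipts that are off by a single cent due to vendor-side rounding, which the tolerance-based version would accept.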
Stage 4: Confidence scoring
The Reddit user also implemented confidence scoring to flag uncertain extractions.
from typing import Dict, Any
import re

def calculate_confidence(data: Dict[str, Any], raw_text: str) -> float:
    """Calculate confidence score for extracted data"""
    confidence = 1.0

    # Reduce confidence for missing fields
    required_fields = ['vendor_name', 'date', 'subtotal', 'tax', 'total']
    for field in required_fields:
        if field not in data or data[field] is None:
            confidence -= 0.2

    # Reduce confidence if values not found in original text
    # (skip when total is missing or None, which is already penalized above)
    if data.get('total') is not None:
        # Check if total appears in original text
        total_str = f"${data['total']:.2f}"
        alt_total_str = f"{data['total']:.2f}"
        if total_str not in raw_text and alt_total_str not in raw_text:
            confidence -= 0.15

    # Reduce confidence for suspicious values
    if 'date' in data:
        # Check date format (the whole string must be YYYY-MM-DD)
        if not re.fullmatch(r'\d{4}-\d{2}-\d{2}', str(data['date'])):
            confidence -= 0.1

    # Reduce confidence for very long vendor names (possible hallucination)
    if 'vendor_name' in data and len(str(data['vendor_name'])) > 50:
        confidence -= 0.15

    return max(0.0, min(1.0, confidence))

# Example usage
confidence = calculate_confidence(result, receipt_text)
print(f"Confidence: {confidence:.0%}")

# Flag low confidence extractions
if confidence < 0.7:
    print("WARNING: Low confidence extraction - manual review recommended")

This gives you a quick way to prioritize which documents need human review.
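The “value appears in the original text” idea can be extended from the total to every extracted amount. The sketch below is my own extension, not something the Reddit user described. It assumes amounts are printed with two decimals and that commas are only used as thousands separators.

```python
import re

def ungrounded_amounts(data: dict, raw_text: str) -> list[float]:
    """Return extracted amounts whose two-decimal form never appears in the source text."""
    # Drop thousands separators so "1,234.56" is seen as "1234.56" (assumed format).
    normalized = re.sub(r"(?<=\d),(?=\d{3})", "", raw_text)
    seen = set(re.findall(r"\d+\.\d{2}", normalized))

    amounts = [data.get(key) for key in ("subtotal", "tax", "total")]
    amounts += [
        item.get("price")
        for item in data.get("line_items", [])
        if isinstance(item, dict)
    ]
    return [
        amount for amount in amounts
        if isinstance(amount, (int, float))
        and not isinstance(amount, bool)
        and f"{amount:.2f}" not in seen
    ]

# Toy example: the second price has two digits transposed (3.49 -> 3.94)
text = "Milk 2% Gal $4.29\nBread Wheat $3.49\nSubtotal $7.78\nTax (8.25%) $0.64\nTOTAL $8.42"
data = {
    "subtotal": 7.78, "tax": 0.64, "total": 8.42,
    "line_items": [
        {"name": "Milk 2% Gal", "price": 4.29},
        {"name": "Bread Wheat", "price": 3.94},
    ],
}
print(ungrounded_amounts(data, text))  # [3.94]
```

Each ungrounded amount could subtract from the confidence score in the same way the missing total does.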
Stage 5: Recursive validation
For critical documents, the Reddit user implemented recursive validation—retrying extraction with different approaches.
from typing import Dict, Any, List, Optional

def extract_with_retry(
    text: str,
    model_paths: list[str],
    max_attempts: int = 3
) -> tuple[Optional[Dict], float, List[str]]:
    """Try extraction with multiple models until confidence threshold met"""
    min_confidence = 0.8
    all_errors = []

    for attempt in range(max_attempts):
        # Cycle through the available models
        model_path = model_paths[attempt % len(model_paths)]

        try:
            # Extract with current model
            data = extract_structured_data(text, model_path)

            # Validate math
            receipt = ReceiptData(
                vendor_name=data.get('vendor_name', ''),
                date=data.get('date', ''),
                subtotal=data.get('subtotal', 0),
                tax=data.get('tax', 0),
                total=data.get('total', 0),
                line_items=[
                    LineItem(item['name'], item['price'])
                    for item in data.get('line_items', [])
                ]
            )
            math_valid, math_errors = validate_receipt_math(receipt)

            # Calculate confidence
            confidence = calculate_confidence(data, text)

            # Check if we meet thresholds
            if math_valid and confidence >= min_confidence:
                return data, confidence, []

            all_errors.extend(math_errors)

        except Exception as e:
            all_errors.append(f"Attempt {attempt + 1} failed: {str(e)}")

    # All attempts failed
    return None, 0.0, all_errors

# Run with multiple models
models = [
    "qwen2.5-14b-instruct-q4_k_m.gguf",
    "qwen2.5-7b-instruct-q4_k_m.gguf",
]

result, confidence, errors = extract_with_retry(receipt_text, models)

if result:
    print(f"Success with {confidence:.0%} confidence")
else:
    print("All extraction attempts failed:")
    for error in errors:
        print(f"  - {error}")

Complete pipeline
Here’s the complete validation pipeline:
from dataclasses import dataclass
from typing import List, Dict, Any, Optional
from enum import Enum

class ExtractionStatus(Enum):
    SUCCESS = "success"
    LOW_CONFIDENCE = "low_confidence"
    VALIDATION_FAILED = "validation_failed"
    EXTRACTION_FAILED = "extraction_failed"

@dataclass
class ExtractionResult:
    status: ExtractionStatus
    data: Optional[Dict[str, Any]]
    confidence: float
    errors: List[str]

def process_financial_document(
    file_path: str,
    model_paths: List[str],
    min_confidence: float = 0.8
) -> ExtractionResult:
    """Complete pipeline for financial document processing"""
    errors = []

    # Stage 1: Extract text with Docling
    try:
        text = extract_text_from_document(file_path)
    except Exception as e:
        return ExtractionResult(
            status=ExtractionStatus.EXTRACTION_FAILED,
            data=None,
            confidence=0.0,
            errors=[f"Text extraction failed: {str(e)}"]
        )

    # Stage 2: Extract structured data with LLM
    for model_path in model_paths:
        try:
            data = extract_structured_data(text, model_path)
        except Exception as e:
            errors.append(f"Model {model_path} failed: {str(e)}")
            continue

        # Stage 3: Validate math
        receipt = ReceiptData(
            vendor_name=data.get('vendor_name', ''),
            date=data.get('date', ''),
            subtotal=data.get('subtotal', 0),
            tax=data.get('tax', 0),
            total=data.get('total', 0),
            line_items=[
                LineItem(item['name'], item['price'])
                for item in data.get('line_items', [])
            ]
        )
        math_valid, math_errors = validate_receipt_math(receipt)

        if not math_valid:
            errors.extend(math_errors)
            continue

        # Stage 4: Calculate confidence
        confidence = calculate_confidence(data, text)

        if confidence >= min_confidence:
            return ExtractionResult(
                status=ExtractionStatus.SUCCESS,
                data=data,
                confidence=confidence,
                errors=[]
            )
        else:
            return ExtractionResult(
                status=ExtractionStatus.LOW_CONFIDENCE,
                data=data,
                confidence=confidence,
                errors=[f"Confidence {confidence:.0%} below threshold {min_confidence:.0%}"]
            )

    return ExtractionResult(
        status=ExtractionStatus.VALIDATION_FAILED,
        data=None,
        confidence=0.0,
        errors=errors
    )

# Usage example
result = process_financial_document(
    "receipt_001.pdf",
    ["qwen2.5-14b-instruct-q4_k_m.gguf"]
)

print(f"Status: {result.status.value}")
print(f"Confidence: {result.confidence:.0%}")
if result.data:
    print(f"Vendor: {result.data['vendor_name']}")
    print(f"Total: ${result.data['total']:.2f}")

Integration with Paperless-ngx
The Reddit user also mentioned using Paperless-ngx for document management with LLM integration.
# paperless-ngx configuration for local LLM
PAPERLESS_OCR_LANGUAGE: eng

# Custom metadata extraction with local LLM
PAPERLESS_CONSUMER_RECURSIVE: true
PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS: true

# LLM-based tagging (requires plugin)
# Extracts vendor, category, and tags automatically

The workflow integrates like this:

1. Upload document to Paperless-ngx
2. Paperless triggers webhook to local LLM service
3. LLM extracts metadata (vendor, date, total, category)
4. Metadata written back to Paperless
5. Document auto-tagged and organized

Results from real-world usage
Based on the Reddit user’s experience and my own testing:
Document Type       | Accuracy | Notes
--------------------|----------|----------------------------------------
Simple receipts     | 95%+     | Clear text, standard format
Complex invoices    | 85-90%   | Multiple pages, tables
Handwritten notes   | 60-70%   | Variable handwriting quality
Scanned documents   | 80-85%   | Depends on scan quality
Photographs         | 70-80%   | Lighting, angle affect accuracy

The key factors affecting accuracy:
- Document quality: Clear, high-resolution images work best
- Format consistency: Standard receipt formats are easier to parse
- Model size: Larger models (14B+) perform better on complex documents
- Validation strictness: More validation catches more errors
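If you want to measure accuracy on your own documents, one simple approach is field-level exact match against a small hand-labeled set. The sketch below shows that idea. It is not necessarily how the figures in the table were obtained, and the sample data is a toy illustration, not a real result.

```python
def field_accuracy(
    predictions: list[dict],
    ground_truth: list[dict],
    fields: list[str],
) -> dict[str, float]:
    """Fraction of documents where each extracted field exactly matches the labeled value."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not predictions:
        raise ValueError("need at least one labeled document")
    scores = {}
    for field in fields:
        correct = sum(
            1 for pred, truth in zip(predictions, ground_truth)
            if pred.get(field) == truth.get(field)
        )
        scores[field] = correct / len(predictions)
    return scores

# Toy data: two hand-labeled receipts and the matching extractions
truth = [
    {"vendor_name": "WALGREENS", "total": 20.31},
    {"vendor_name": "ACME", "total": 5.00},
]
preds = [
    {"vendor_name": "WALGREENS", "total": 20.31},
    {"vendor_name": "ACME", "total": 5.50},
]
for field, score in field_accuracy(preds, truth, ["vendor_name", "total"]).items():
    print(f"{field}: {score:.0%}")
```

Reporting accuracy per field makes it easier to see whether errors cluster in one place, such as totals on photographed receipts.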
Hardware requirements
Here’s what you need for different scales:
# Entry level (testing, occasional use)
GPU: 8GB VRAM (RTX 3070, RTX 4060)
Model: Qwen2-VL-7B, Qwen2.5-7B
Speed: 10-20 documents/hour

# Mid range (regular use, small business)
GPU: 12-16GB VRAM (RTX 3080, RTX 4070 Ti)
Model: Qwen2-VL-7B + Qwen2.5-14B
Speed: 30-50 documents/hour

# High end (production, heavy use)
GPU: 24GB VRAM (RTX 3090, RTX 4090)
Model: Qwen2-VL-7B + Qwen2.5-14B or Qwen3-Next-80B
Speed: 100+ documents/hour

What I learned
Local LLMs can work, but need guardrails
The raw LLM output isn’t reliable enough for financial data. The multi-stage validation pipeline is essential:
- Docling handles text extraction reliably
- LLM provides flexible parsing (better than regex)
- Sanity checks catch mathematical errors
- Confidence scoring identifies uncertain extractions
- Recursive validation provides fallback options
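Since the Reddit user’s workflow ends in a spreadsheet, one way to surface these guardrails is a review column in the output. The following is a minimal sketch using the standard `csv` module. The column names and the shape of each result dict are my assumptions, chosen for illustration.

```python
import csv
import io

# Hypothetical column layout for the review sheet
COLUMNS = ["file", "vendor_name", "date", "total", "confidence", "needs_review"]

def write_review_sheet(results: list[dict], out, min_confidence: float = 0.8) -> None:
    """Write one CSV row per document and flag rows that need a human look."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for r in results:
        data = r.get("data") or {}
        # Flag anything with validation errors or a confidence below the threshold
        flagged = bool(r.get("errors")) or r.get("confidence", 0.0) < min_confidence
        writer.writerow({
            "file": r.get("file", ""),
            "vendor_name": data.get("vendor_name", ""),
            "date": data.get("date", ""),
            "total": data.get("total", ""),
            "confidence": f"{r.get('confidence', 0.0):.2f}",
            "needs_review": "yes" if flagged else "no",
        })

# Hypothetical results, shaped like the output of the stages above
results = [
    {"file": "receipt_001.pdf", "confidence": 1.0, "errors": [],
     "data": {"vendor_name": "WALGREENS", "date": "2026-03-14", "total": 20.31}},
    {"file": "receipt_002.pdf", "confidence": 0.65,
     "errors": ["Subtotal + tax (20.05) != total (21.00)"],
     "data": {"vendor_name": "WALGREENS", "date": "2026-03-14", "total": 21.00}},
]
buffer = io.StringIO()  # swap for open("receipts.csv", "w", newline="") to write a file
write_review_sheet(results, buffer)
print(buffer.getvalue())
```

Sorting or filtering on `needs_review` then gives a short list of documents to check by hand.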
Privacy without sacrificing accuracy
I achieved 90%+ accuracy on standard receipts without sending any data to cloud APIs. For privacy-sensitive financial documents, this is a good trade-off.
Cost savings add up
Cloud API processing (1000 receipts):
- OpenAI GPT-4 Vision: ~$100-200
- Anthropic Claude 3: ~$80-150
- Google Gemini: ~$50-100

Local processing:
- Electricity: ~$5-10/month
- Hardware amortization: ~$20-30/month
- Total: ~$25-40/month for unlimited processing

For processing hundreds of documents monthly, local LLMs become cost-effective quickly.
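To make “quickly” concrete, here is the break-even arithmetic using only the rough estimates above. The function name is mine, and the result is only as good as those estimates.

```python
def break_even_docs_per_month(local_monthly_cost: float, cloud_cost_per_1000: float) -> float:
    """Documents per month at which the local monthly cost equals the cloud cost."""
    cloud_cost_per_doc = cloud_cost_per_1000 / 1000
    return local_monthly_cost / cloud_cost_per_doc

# Cheapest local estimate vs most expensive cloud estimate ($25/month vs $200 per 1000)
best_case = break_even_docs_per_month(25, 200)
# Most expensive local estimate vs cheapest cloud estimate ($40/month vs $50 per 1000)
worst_case = break_even_docs_per_month(40, 50)

print(f"Break-even between {best_case:.0f} and {worst_case:.0f} documents per month")
# Break-even between 125 and 800 documents per month
```

So depending on which cloud service you compare against, local processing pays off somewhere between roughly 125 and 800 documents a month.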
Summary
In this post, I explored whether local LLMs can reliably process financial documents. The answer is yes—with proper validation pipelines.
The key components:
- Vision-language models like Qwen VL for image understanding
- Docling for reliable text extraction
- llama.cpp for running local LLMs efficiently
- Multi-stage validation to catch hallucinations
- Paperless-ngx for document management integration
The Reddit user’s production experience suggests this isn’t just theoretical: they describe a working system that handles real financial documents.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Reddit: Local LLMs for Financial Document Processing
- 👨💻 Docling Document Processing
- 👨💻 Qwen VL Model
- 👨💻 llama.cpp
- 👨💻 Paperless-ngx
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!