How to Use Claude Haiku for Text Classification and Structured Extraction
Problem
When I tried to build a text classification pipeline for invoice processing, I ran into a familiar problem: traditional NLP models require extensive training data and struggle with inconsistent formatting.
My invoices came from different vendors, each with their own layout. Rule-based extraction broke when facing new formats. I needed a solution that:
- Understands context without training
- Handles messy, real-world data
- Scales cost-effectively
- Produces consistent structured output
Environment
- Claude 3.5 Haiku via the Anthropic API
- Python 3.11
- Pydantic for data validation
- Async processing for batch operations
What Happened?
I was processing invoices from multiple vendors. Each had different layouts, terminology, and formatting quirks. Traditional regex-based extraction was brittle and required constant maintenance.
A Reddit discussion about Claude Haiku’s capabilities caught my attention:
“Structured extraction — pulling specific fields from messy text (invoices, emails, forms). Haiku handles this very well”
Another user shared:
“context-aware Text classification. like a regular text classification model but 100x smarter. Find patterns in text, group text into parts, extract sentences that discuss a certain topic. wire it all into automated pipelines and run 100x haiku instance at once”
This sounded like exactly what I needed.
How to Solve It?
I built a structured extraction pipeline using Haiku. Here’s how it works.
Step 1: Define Your Extraction Schema
First, I defined the structure I wanted to extract:
from pydantic import BaseModel
from typing import Optional


class LineItem(BaseModel):
    description: str
    amount: float


class InvoiceData(BaseModel):
    vendor_name: str
    invoice_number: str
    date: str
    total_amount: float
    line_items: list[LineItem]
    due_date: Optional[str] = None

Step 2: Create the Extraction Function
Then I built the extraction logic:
import json

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def extract_invoice_data(invoice_text: str) -> dict:
    """Extract structured data from messy invoice text."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Extract invoice data from this text and return as JSON.

Match this schema:
{{
  "vendor_name": "string",
  "invoice_number": "string",
  "date": "YYYY-MM-DD",
  "total_amount": float,
  "line_items": [{{"description": "string", "amount": float}}],
  "due_date": "YYYY-MM-DD or null"
}}

Invoice text:
{invoice_text}"""
        }]
    )
    # The reply is a list of content blocks; the first one holds the JSON text
    return json.loads(response.content[0].text)

Step 3: Test with Real Messy Data
I tested it with a messy real-world invoice:
invoice = """
ACME Corp
Invoice # INV-2024-0892

Billed to: Customer X

Services rendered:
- Consulting services $2,500.00
- Software license $800.00
- Support package $450.00

Date: March 15, 2024
Payment due: April 14, 2024
Total Due: $3,750.00
"""

result = extract_invoice_data(invoice)
print(json.dumps(result, indent=2))

Output:

{
  "vendor_name": "ACME Corp",
  "invoice_number": "INV-2024-0892",
  "date": "2024-03-15",
  "total_amount": 3750.00,
  "line_items": [
    {"description": "Consulting services", "amount": 2500.00},
    {"description": "Software license", "amount": 800.00},
    {"description": "Support package", "amount": 450.00}
  ],
  "due_date": "2024-04-14"
}

It worked. Haiku extracted clean structured data from messy text.
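One clean run does not guarantee that every reply is bare JSON. A reply can arrive wrapped in a Markdown code fence or with a sentence of prose around the object, and json.loads then fails. Here is a minimal sketch of a more forgiving parser, using only the standard library. The name parse_json_reply is hypothetical and not part of the Anthropic SDK:

```python
import json

# Built at runtime so this example contains no literal code fence.
FENCE = "`" * 3


def parse_json_reply(reply_text: str) -> dict:
    """Parse a JSON object from a model reply, tolerating fences and stray prose."""
    text = reply_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (for example FENCE + "json").
        first_newline = text.find("\n")
        text = text[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence if there is one.
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[: -len(FENCE)]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span in case prose surrounds the object.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(text[start:end + 1])
```

With this helper, the last line of extract_invoice_data could become return parse_json_reply(response.content[0].text). It still raises json.JSONDecodeError when no JSON object can be recovered, so the caller decides whether to retry or log.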
Step 4: Add Validation
I added Pydantic validation to ensure data quality:
def extract_and_validate(invoice_text: str) -> InvoiceData:
    """Extract with validation."""
    data = extract_invoice_data(invoice_text)
    return InvoiceData(**data)


# Raises pydantic.ValidationError if the extracted data does not match the schema
validated = extract_and_validate(invoice)

Batch Processing at Scale
The real power comes from parallel processing. I can fan out 100+ Haiku requests at once, with a semaphore capping how many are in flight (10 in this example):
import asyncio

from anthropic import AsyncAnthropic

# Separate name so it does not shadow the synchronous client defined earlier
async_client = AsyncAnthropic()


async def classify_one(text: str, categories: list[str]) -> dict:
    response = await async_client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": f"Classify into one category from {categories}. Return JSON with 'category' and 'confidence'. Text: {text}"
        }]
    )
    return {"text": text[:50], "result": response.content[0].text}


async def classify_batch(texts: list[str], categories: list[str]) -> list[dict]:
    """Process multiple texts in parallel with rate limiting."""
    semaphore = asyncio.Semaphore(10)  # at most 10 requests in flight

    async def limited_classify(text):
        async with semaphore:
            await asyncio.sleep(0.1)  # Rate limiting buffer
            return await classify_one(text, categories)

    return await asyncio.gather(*[limited_classify(t) for t in texts])


# Usage: Process 100 documents
texts = ["document 1...", "document 2..."]  # ... 100 texts in total
results = asyncio.run(classify_batch(texts, ["finance", "tech", "legal"]))

Text Classification Use Cases
Beyond invoice extraction, Haiku excels at several classification tasks:
Sentiment Analysis with Quote Extraction
def analyze_sentiment(text: str) -> dict:
    """Analyze sentiment and extract key quotes."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"""Analyze the sentiment and extract key quotes.

Return JSON with:
- overall_sentiment: "positive" | "negative" | "neutral"
- confidence: 0-1
- key_quotes: list of 2-3 most impactful sentences
- topics: list of main topics

Text: {text}"""
        }]
    )
    return json.loads(response.content[0].text)


review = """
I've been using this product for three months now.
The interface is incredibly intuitive - I was productive within hours.
However, the sync feature sometimes lags behind.
Overall, it's transformed how our team collaborates.
"""

result = analyze_sentiment(review)

A Reddit user confirmed Haiku’s advantage here:
“sentiment analysis and key quote extraction. Costs more than 10x grok fast but the actual results are worth. Grok leaves out a lot”
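Because key_quotes are meant to be lifted from the input, one cheap safeguard is to check that each returned quote actually occurs in the source text. The following is a minimal sketch using only the standard library. The function verify_quotes and its return shape are hypothetical and not part of the pipeline above:

```python
import re


def _normalize(s: str) -> str:
    """Lowercase and collapse whitespace so line wraps do not break matching."""
    return re.sub(r"\s+", " ", s).strip().lower()


def verify_quotes(source_text: str, quotes: list[str]) -> dict:
    """Split quotes into those found in the source text and those that are not."""
    haystack = _normalize(source_text)
    found, missing = [], []
    for quote in quotes:
        if _normalize(quote) in haystack:
            found.append(quote)
        else:
            missing.append(quote)
    return {"found": found, "missing": missing}
```

A call such as verify_quotes(review, result["key_quotes"]) would flag any paraphrased or invented quote under "missing". Matching is by substring after whitespace and case normalization, so a quote with altered punctuation will also land in "missing"; whether that is too strict depends on the use case.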
Document Categorization
def categorize_document(text: str, categories: list[str]) -> dict:
    """Classify document into predefined categories."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": f"""Classify the following text into exactly one of these categories: {categories}

Return JSON with:
- category: the selected category
- confidence: float between 0-1
- reasoning: brief explanation

Text: {text}"""
        }]
    )
    return json.loads(response.content[0].text)


result = categorize_document(
    "The new product launch exceeded Q3 projections by 15%",
    ["finance", "technology", "marketing", "operations"]
)
# Output: {"category": "finance", "confidence": 0.92, ...}

The Reason
I think Haiku works well for these tasks because:
- Semantic understanding: Unlike keyword-based classifiers, Haiku understands context. It identifies topics even when synonyms or paraphrasing are used.
- Structured output: Haiku returns clean JSON, making it ideal for automated pipelines.
- Zero-shot capability: No training data required. Just describe what you want.
- Cost-effective scale: Running 100 Haiku instances for parallel processing is economically viable.
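For the structured output to be safe in an automated pipeline, it helps to check a parsed classification before passing it on: the category should be one of the labels that were offered, and the confidence should be a number between 0 and 1. Here is a minimal sketch using only the standard library. The function validate_classification is a hypothetical helper, and the confidence is a number the model reports about itself, so it is best treated as a rough signal:

```python
def validate_classification(result: dict, categories: list[str]) -> dict:
    """Return a cleaned {'category', 'confidence'} dict or raise ValueError."""
    # Compare case-insensitively, but return the label exactly as it was offered.
    allowed = {c.lower(): c for c in categories}
    raw_category = str(result.get("category", "")).strip().lower()
    if raw_category not in allowed:
        raise ValueError(f"Unexpected category: {result.get('category')!r}")

    try:
        confidence = float(result.get("confidence"))
    except (TypeError, ValueError):
        raise ValueError(f"Confidence is not a number: {result.get('confidence')!r}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"Confidence out of range: {confidence}")

    return {"category": allowed[raw_category], "confidence": confidence}
```

Applied to the output of categorize_document, this turns a silently wrong label (for example a category the model made up) into an explicit error that the pipeline can retry or log.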
Common Mistakes
I made several mistakes when building this pipeline:
1. Over-prompting
Haiku responds well to concise instructions. In my experience, padding the prompt with generic instructions made the output less reliable.
# WRONG: Too much context
"I need you to analyze this text very carefully, considering all possible interpretations, nuances, and edge cases. Please think step by step..."

# CORRECT: Concise instruction
"Classify into one category from {categories}. Return JSON with category and confidence."

2. No output validation
Always validate JSON structure before downstream processing:
from pydantic import ValidationError

# Always validate
try:
    data = json.loads(response.content[0].text)
    validated = InvoiceData(**data)
except (json.JSONDecodeError, ValidationError) as e:
    # Handle error, maybe retry or log
    pass

3. Ignoring rate limits
When running 100+ parallel instances, implement proper rate limiting. I use semaphores and small delays.
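Semaphores and fixed delays reduce the chance of hitting a limit, but they do not help once a request has already been rejected. A common complement is retrying with exponential backoff. Here is a minimal sketch using only asyncio. The helper with_backoff and its parameters are hypothetical. In real use you would probably narrow retry_on to the SDK's rate-limit exception (anthropic.RateLimitError) instead of the broad default shown here:

```python
import asyncio
import random


async def with_backoff(make_call, max_attempts=5, base_delay=0.5, retry_on=(Exception,)):
    """Await make_call(), retrying with exponential backoff plus jitter on failure.

    make_call is a zero-argument function returning a fresh coroutine each time.
    """
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each attempt; jitter keeps parallel workers from retrying in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)
```

Inside limited_classify, the call could then be wrapped as await with_backoff(lambda: classify_one(text, categories)). The lambda matters: a coroutine object can only be awaited once, so each retry needs a new one.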
4. Using it as a search engine
Haiku classifies and extracts; it doesn’t retrieve external information. Don’t ask it to “find information about X.”
Summary
In this post, I showed how to use Claude Haiku for text classification and structured extraction. The key point is Haiku’s ability to handle messy real-world data with zero-shot capability.
The use cases I found most valuable:
- Invoice and form extraction
- Sentiment analysis with quote extraction
- Document categorization
- API response formatting
- High-volume text processing (10k+ pages)
The cost savings compared to larger models are significant, and in my use cases the results were better than those from fast budget alternatives. Build proper rate limiting and output validation into your pipelines, and Haiku becomes a powerful tool for production NLP workflows.