How to Use Claude Haiku for Text Classification and Structured Extraction

Problem

When I tried to build a text classification pipeline for invoice processing, I ran into a familiar problem: traditional NLP models require extensive training data and struggle with inconsistent formatting.

My invoices came from different vendors, each with their own layout. Rule-based extraction broke when facing new formats. I needed a solution that:

  • Understands context without training
  • Handles messy, real-world data
  • Scales cost-effectively
  • Produces consistent structured output

Environment

  • Claude 3.5 Haiku via the Anthropic API
  • Python 3.11
  • Pydantic for data validation
  • Async processing for batch operations

What Happened?

I was processing invoices from multiple vendors. Each had different layouts, terminology, and formatting quirks. Traditional regex-based extraction was brittle and required constant maintenance.

A Reddit discussion about Claude Haiku’s capabilities caught my attention:

“Structured extraction — pulling specific fields from messy text (invoices, emails, forms). Haiku handles this very well”

Another user shared:

“context-aware Text classification. like a regular text classification model but 100x smarter. Find patterns in text, group text into parts, extract sentences that discuss a certain topic. wire it all into automated pipelines and run 100x haiku instance at once”

This sounded like exactly what I needed.

How to Solve It?

I built a structured extraction pipeline using Haiku. Here’s how it works.

Step 1: Define Your Extraction Schema

First, I defined the structure I wanted to extract:

schemas.py
from typing import Optional

from pydantic import BaseModel

class LineItem(BaseModel):
    description: str
    amount: float

class InvoiceData(BaseModel):
    vendor_name: str
    invoice_number: str
    date: str
    total_amount: float
    line_items: list[LineItem]
    due_date: Optional[str] = None

Step 2: Create the Extraction Function

Then I built the extraction logic:

extractor.py
import json

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_invoice_data(invoice_text: str) -> dict:
    """Extract structured data from messy invoice text."""
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Extract invoice data from this text and return as JSON.
Match this schema:
{{
  "vendor_name": "string",
  "invoice_number": "string",
  "date": "YYYY-MM-DD",
  "total_amount": float,
  "line_items": [{{"description": "string", "amount": float}}],
  "due_date": "YYYY-MM-DD or null"
}}
Invoice text:
{invoice_text}"""
        }]
    )
    # The first content block holds the reply text, expected to be a JSON object
    return json.loads(response.content[0].text)
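
The `json.loads` call above assumes the reply is a bare JSON object. A model can also wrap JSON in a Markdown code fence or add a sentence around it, so here is a minimal sketch of a more tolerant parser. The helper name `parse_json_reply` is my own and not part of the Anthropic SDK; it only uses the standard library.

```python
import json
import re

# Build the fence marker programmatically so this snippet stays easy to embed.
FENCE = "`" * 3
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_json_reply(reply_text: str) -> dict:
    """Parse a model reply as JSON, tolerating code fences and surrounding prose."""
    text = reply_text.strip()
    fenced = _FENCE_RE.search(text)
    if fenced:
        # Keep only what sits between the opening and closing fence.
        text = fenced.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span in the reply.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(text[start:end + 1])
```

Under that assumption, the last line of `extract_invoice_data` could call `parse_json_reply(response.content[0].text)` instead of `json.loads` directly.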

Step 3: Test with Real Messy Data

I tested it with a messy real-world invoice:

test_extraction.py
import json

from extractor import extract_invoice_data

invoice = """
ACME Corp
Invoice # INV-2024-0892
Billed to: Customer X
Services rendered:
- Consulting services $2,500.00
- Software license $800.00
- Support package $450.00
Date: March 15, 2024
Payment due: April 14, 2024
Total Due: $3,750.00
"""

result = extract_invoice_data(invoice)
print(json.dumps(result, indent=2))

Output:

{
  "vendor_name": "ACME Corp",
  "invoice_number": "INV-2024-0892",
  "date": "2024-03-15",
  "total_amount": 3750.00,
  "line_items": [
    {"description": "Consulting services", "amount": 2500.00},
    {"description": "Software license", "amount": 800.00},
    {"description": "Support package", "amount": 450.00}
  ],
  "due_date": "2024-04-14"
}

It worked. Haiku extracted clean structured data from messy text.

Step 4: Add Validation

I added Pydantic validation to ensure data quality:

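validated_extractor.py
from extractor import extract_invoice_data
from schemas import InvoiceData

def extract_and_validate(invoice_text: str) -> InvoiceData:
    """Extract with validation."""
    data = extract_invoice_data(invoice_text)
    return InvoiceData(**data)

# Raises pydantic.ValidationError if the extracted data does not match the schema.
# `invoice` is the sample text from test_extraction.py.
validated = extract_and_validate(invoice)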
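
Schema validation checks field types, not whether the numbers agree with each other. As a sketch of one extra sanity check, the hypothetical helper below compares the sum of the line items with the extracted total. It assumes the invoice has no separate tax or discount lines, so a mismatch is better treated as a flag for review than as a hard error.

```python
from decimal import Decimal


def line_items_match_total(data: dict, tolerance: str = "0.01") -> bool:
    """Return True when the line item amounts add up to total_amount."""
    # Convert via str() so binary float artifacts do not leak into the sum.
    items_sum = sum(
        (Decimal(str(item["amount"])) for item in data["line_items"]),
        Decimal("0"),
    )
    total = Decimal(str(data["total_amount"]))
    return abs(items_sum - total) <= Decimal(tolerance)


sample = {
    "total_amount": 3750.00,
    "line_items": [
        {"description": "Consulting services", "amount": 2500.00},
        {"description": "Software license", "amount": 800.00},
        {"description": "Support package", "amount": 450.00},
    ],
}
print(line_items_match_total(sample))  # True: 2500 + 800 + 450 = 3750
```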

Batch Processing at Scale

The real power comes from parallel processing. I can queue 100+ Haiku requests at once and let a semaphore cap how many run concurrently:

batch_processor.py
import asyncio

from anthropic import AsyncAnthropic

client = AsyncAnthropic()

async def classify_one(text: str, categories: list[str]) -> dict:
    response = await client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": f"Classify into one category from {categories}. Return JSON with 'category' and 'confidence'. Text: {text}"
        }]
    )
    return {"text": text[:50], "result": response.content[0].text}

async def classify_batch(texts: list[str], categories: list[str]) -> list[dict]:
    """Process multiple texts in parallel with rate limiting."""
    semaphore = asyncio.Semaphore(10)  # at most 10 requests in flight

    async def limited_classify(text: str) -> dict:
        async with semaphore:
            await asyncio.sleep(0.1)  # Rate limiting buffer
            return await classify_one(text, categories)

    return await asyncio.gather(*[limited_classify(t) for t in texts])

# Usage: process 100 documents
texts = ["document 1...", "document 2..."]  # ... and so on, up to 100 texts
results = asyncio.run(classify_batch(texts, ["finance", "tech", "legal"]))
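
To see what the semaphore actually does without spending API credits, here is a small self-contained sketch. `fake_request` is a stand-in I made up for the real API call, and `run_limited` records the highest number of jobs in flight at any moment so the cap can be checked.

```python
import asyncio


async def run_limited(jobs, limit: int):
    """Run coroutine factories with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    # gather() returns results in the same order as the submitted jobs.
    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_request(i: int) -> int:
    # Stand-in for an API call; sleeps briefly instead of hitting the network.
    await asyncio.sleep(0.01)
    return i * 2


jobs = [lambda i=i: fake_request(i) for i in range(100)]
results, peak = asyncio.run(run_limited(jobs, limit=10))
print(len(results), peak)  # 100 results; peak never exceeds the limit of 10
```

All 100 jobs are submitted up front, but only 10 are ever awaiting the "API" at the same time. That is the same pattern `classify_batch` relies on.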

Text Classification Use Cases

Beyond invoice extraction, Haiku excels at several classification tasks:

Sentiment Analysis with Quote Extraction

sentiment.py
import json

from anthropic import Anthropic

client = Anthropic()

def analyze_sentiment(text: str) -> dict:
    """Analyze sentiment and extract key quotes."""
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"""Analyze the sentiment and extract key quotes.
Return JSON with:
- overall_sentiment: "positive" | "negative" | "neutral"
- confidence: 0-1
- key_quotes: list of 2-3 most impactful sentences
- topics: list of main topics
Text: {text}"""
        }]
    )
    return json.loads(response.content[0].text)

review = """
I've been using this product for three months now.
The interface is incredibly intuitive - I was productive within hours.
However, the sync feature sometimes lags behind.
Overall, it's transformed how our team collaborates.
"""
result = analyze_sentiment(review)

A Reddit user confirmed Haiku’s advantage here:

“sentiment analysis and key quote extraction. Costs more than 10x grok fast but the actual results are worth. Grok leaves out a lot”
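
Since the prompt asks for quotes taken from the text, it is cheap to check that each returned quote really appears in the source. The sketch below is one way to do that; `ungrounded_quotes` is a hypothetical helper, and the second quote in the example is deliberately invented to show a miss.

```python
def ungrounded_quotes(source_text: str, key_quotes: list[str]) -> list[str]:
    """Return the quotes that do not appear verbatim in the source text."""

    def normalize(s: str) -> str:
        # Collapse whitespace and case so line breaks do not cause false alarms.
        return " ".join(s.split()).lower()

    haystack = normalize(source_text)
    return [q for q in key_quotes if normalize(q) not in haystack]


review = """
The interface is incredibly intuitive - I was productive within hours.
However, the sync feature sometimes lags behind.
"""
quotes = [
    "The interface is incredibly intuitive",
    "The sync feature never works",  # not in the review
]
print(ungrounded_quotes(review, quotes))  # ['The sync feature never works']
```

An empty list means every quote was found in the text. Anything else can be logged or sent back for another attempt.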

Document Categorization

categorizer.py
import json

from anthropic import Anthropic

client = Anthropic()

def categorize_document(text: str, categories: list[str]) -> dict:
    """Classify document into predefined categories."""
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": f"""Classify the following text into exactly one of these categories: {categories}
Return JSON with:
- category: the selected category
- confidence: float between 0-1
- reasoning: brief explanation
Text: {text}"""
        }]
    )
    return json.loads(response.content[0].text)

result = categorize_document(
    "The new product launch exceeded Q3 projections by 15%",
    ["finance", "technology", "marketing", "operations"]
)
# Output: {"category": "finance", "confidence": 0.92, ...}
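
Nothing in the prompt forces the reply to stay inside the category list, so a small guard before downstream use can help. This is a sketch under my own assumptions: `check_classification` is a hypothetical helper, and the confidence it passes through is the model's self-reported number, not a calibrated probability.

```python
import json


def check_classification(raw_reply: str, categories: list[str]) -> dict:
    """Parse a classifier reply and reject labels outside the allowed set."""
    data = json.loads(raw_reply)
    category = str(data.get("category", "")).strip().lower()
    # Map lowercase labels back to the spelling used in the category list.
    allowed = {c.lower(): c for c in categories}
    if category not in allowed:
        raise ValueError(f"unexpected category: {data.get('category')!r}")
    # Treat a missing or non-numeric confidence as 0.0 and clamp it to [0, 1].
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    return {
        "category": allowed[category],
        "confidence": min(max(confidence, 0.0), 1.0),
    }
```

For example, `check_classification('{"category": "Finance", "confidence": 0.92}', ["finance", "technology"])` returns `{"category": "finance", "confidence": 0.92}`, while a reply with a label such as "sports" raises `ValueError`.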

The Reason

I think Haiku works well for these tasks because:

  1. Semantic understanding - Unlike keyword-based classifiers, Haiku understands context. It identifies topics even when synonyms or paraphrasing are used.

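  2. Structured output - Haiku returns clean JSON when the prompt spells out the expected fields, which makes it a good fit for automated pipelines (as long as the output is still validated).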

  3. Zero-shot capability - No training data required. Just describe what you want.

  4. Cost-effective scale - Running 100 Haiku instances for parallel processing is economically viable.

Common Mistakes

I made several mistakes when building this pipeline:

1. Over-prompting

Haiku responds well to concise instructions. In my experience, padding the prompt with extra context can degrade performance.

# WRONG: Too much context
"I need you to analyze this text very carefully, considering all possible interpretations, nuances, and edge cases. Please think step by step..."
# CORRECT: Concise instruction
"Classify into one category from {categories}. Return JSON with category and confidence."

2. No output validation

Always validate JSON structure before downstream processing:

# Always validate
from pydantic import ValidationError

try:
    data = json.loads(response.content[0].text)
    validated = InvoiceData(**data)
except (json.JSONDecodeError, ValidationError) as e:
    # Handle error, maybe retry or log
    pass
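
The "maybe retry" branch can be sketched as a small loop. Everything here is illustrative: `extract_with_retry` and its `call_model` parameter are names I chose, and the key check is a lightweight stand-in for the Pydantic validation so that the example runs with the standard library alone.

```python
import json
from typing import Callable


def extract_with_retry(
    call_model: Callable[[str], str],
    text: str,
    required_keys: set[str],
    max_attempts: int = 3,
) -> dict:
    """Call the model until the reply parses and has the required keys."""
    last_error: Exception | None = None
    for _ in range(max_attempts):
        try:
            data = json.loads(call_model(text))
            if not isinstance(data, dict):
                raise ValueError("reply is not a JSON object")
            missing = required_keys - data.keys()
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            return data
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            last_error = exc
    raise RuntimeError(f"no valid reply after {max_attempts} attempts") from last_error
```

In a real pipeline `call_model` would wrap the API request, and the final `RuntimeError` is the place to log the document for manual review.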

3. Ignoring rate limits

When running 100+ parallel instances, implement proper rate limiting. I use semaphores and small delays.
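
Fixed delays help, but a request can still be rejected under load. One common complement is exponential backoff with jitter, sketched below. `RateLimited` is a hypothetical stand-in exception so the example stays self-contained; with the real client you would catch whatever rate-limit error it raises, and it is worth checking the SDK documentation for its own built-in retry settings first.

```python
import asyncio
import random


class RateLimited(Exception):
    """Hypothetical stand-in for the client's rate-limit error."""


async def with_backoff(make_call, max_retries: int = 5, base_delay: float = 0.5):
    """Retry an async call with exponential backoff and jitter when rate limited."""
    for attempt in range(max_retries + 1):
        try:
            return await make_call()
        except RateLimited:
            if attempt == max_retries:
                raise
            # Delay doubles on each attempt; jitter spreads out simultaneous retries.
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
```

`make_call` is a zero-argument function that returns a fresh coroutine each time, for example `lambda: classify_one(text, categories)`.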

4. Using it as a search engine

Haiku classifies and extracts; it doesn’t retrieve external information. Don’t ask it to “find information about X.”

Summary

In this post, I showed how to use Claude Haiku for text classification and structured extraction. The key point is that Haiku handles messy real-world data zero-shot, without any training data.

The use cases I found most valuable:

  • Invoice and form extraction
  • Sentiment analysis with quote extraction
  • Document categorization
  • API response formatting
  • High-volume text processing (10k+ pages)

The cost savings compared to larger models are significant, and, going by the reports quoted above, the results are more complete than those of faster budget alternatives. Build proper rate limiting and output validation into your pipelines, and Haiku becomes a powerful tool for production NLP workflows.

Final Words

My intention with this article was to share my knowledge and experience in the hope that it helps others.
