How to Add Contracts and Validation to LLM Tool Calls
The Problem: Hallucinated Tool Arguments
I built an AI agent that calls database tools. In testing, it worked perfectly. In production, I started seeing weird errors.
A user asked the agent to search for “10 recent orders”. The agent called the database tool with limit="ten" instead of limit=10. The query failed. No error message. Just an empty result returned confidently.
Another time, the agent passed query=123 where a string was expected. The tool crashed somewhere deep in the codebase. I spent hours debugging why the model decided to pass an integer for a search query.
The logs showed a pattern:
Input: "find 5 users named john"
Expected call: search_users(query="john", limit=5)
Actual call: search_users(query="john", limit="five", type="exact")

Input: "get order details for ID 12345"
Expected call: get_order(order_id=12345)
Actual call: get_order(id="12345", include_items=True, format="json")

Same intent. Different parameters. Hallucinated arguments. Silent wrong executions.
My First Mistake: Trusting Model Output Directly
My original code passed model output directly to tools:
from langchain.tools import tool

@tool
def search_database(query: str, limit: int) -> dict:
    """Search the database."""
    # What if limit="ten" or limit=-5?
    # What if query=123 instead of string?
    # What if model hallucinates extra parameters?
    return db.execute(f"SELECT * FROM items WHERE name LIKE '%{query}%' LIMIT {limit}")

# Model output goes directly to execution
result = agent.run("find 5 items matching python")
# No validation. No contract. Just hope.

The model generates tool calls based on its training. But models do not guarantee:
- Correct types (string vs integer)
- Valid ranges (limit between 1 and 100)
- Required fields present
- No extra hallucinated fields
I needed contracts. Typed, validated inputs before anything executes.
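To see why the type hints alone do not help, here is a minimal sketch. `build_query` is a hypothetical stand-in for the tool body (it only builds the SQL text and touches no database); it is not code from the agent. Python does not enforce annotations at runtime, so the bad value flows straight into the query string.

```python
def build_query(query: str, limit: int) -> str:
    """Hypothetical stand-in for the tool body: builds the SQL text only."""
    # Annotations are documentation; nothing checks them at call time.
    return f"SELECT * FROM items WHERE name LIKE '%{query}%' LIMIT {limit}"


# Arguments as a model might hallucinate them.
bad_sql = build_query(query="python", limit="ten")
print(bad_sql)
```

The call succeeds and produces SQL ending in `LIMIT ten`, which only fails later, inside the database layer.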
The Solution: Input Contracts with Pydantic
I started by defining exactly what each tool accepts:
from pydantic import BaseModel, Field, ValidationError, field_validator
from typing import Literal

class DatabaseSearchInput(BaseModel):
    """Contract for database search tool - strict typing with constraints"""

    query: str = Field(..., min_length=1, max_length=500)
    limit: int = Field(..., ge=1, le=100)  # Between 1 and 100
    search_type: Literal["exact", "fuzzy", "semantic"] = "fuzzy"

    @field_validator("query")
    @classmethod
    def validate_query(cls, v: str) -> str:
        # Prevent SQL injection patterns
        if ";" in v or "--" in v:
            raise ValueError("Invalid query characters detected")
        return v.strip()

# Now I can validate model output before execution
try:
    validated = DatabaseSearchInput(
        query="python",
        limit=5,
        search_type="fuzzy"
    )
    # All validations pass, safe to execute
except ValidationError as e:
    # Contract violation caught before any damage
    print(f"Invalid parameters: {e}")

If the model passes limit="ten", validation fails immediately. No silent wrong execution. No corrupted database queries.
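One caveat worth sketching: by default, Pydantic v2 ignores fields the model does not declare and coerces numeric strings such as "5" into integers. If the goal is to reject hallucinated extras and wrong types outright, the contract can be tightened. `StrictSearchInput` and `error_types` below are hypothetical names used only for illustration; this is one possible configuration, not necessarily the one used in production.

```python
from typing import Literal

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class StrictSearchInput(BaseModel):
    """Hypothetical stricter variant of DatabaseSearchInput."""

    # Reject any field the contract does not declare.
    model_config = ConfigDict(extra="forbid")

    query: str = Field(..., min_length=1, max_length=500)
    # strict=True turns off string-to-int coercion for this field.
    limit: int = Field(..., ge=1, le=100, strict=True)
    search_type: Literal["exact", "fuzzy", "semantic"] = "fuzzy"


def error_types(args: dict) -> list[str]:
    """Return the Pydantic error type codes raised for args, or [] if valid."""
    try:
        StrictSearchInput(**args)
    except ValidationError as exc:
        return [err["type"] for err in exc.errors()]
    return []


print(error_types({"query": "john", "limit": 5}))
print(error_types({"query": "john", "limit": "5"}))
print(error_types({"query": "john", "limit": 5, "format": "json"}))
```

The first call returns an empty list; the second is rejected because the string is no longer coerced; the third is rejected because `format` is not part of the contract.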
Adding Output Contracts
Input validation was not enough. I also needed to validate what tools return:
from pydantic import BaseModel
from typing import Literal

class DatabaseSearchOutput(BaseModel):
    """Contract for what the tool returns"""

    results: list[dict]
    total_count: int
    query_time_ms: float
    status: Literal["success", "partial", "empty"]

# Tool returns validated output
@tool(args_schema=DatabaseSearchInput)
def search_database(
    query: str,
    limit: int,
    search_type: str = "fuzzy"
) -> DatabaseSearchOutput:
    """Search the database with validated inputs."""

    result = db.search(query=query, limit=limit, mode=search_type)

    # Output is validated on construction
    return DatabaseSearchOutput(
        results=result.items,
        total_count=result.total,
        query_time_ms=result.duration_ms,
        status="success" if result.items else "empty"
    )

Now both inputs and outputs have contracts. If the tool returns malformed data, the output contract catches it.
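An output contract can check logic as well as structure. The sketch below is a hypothetical variant, `CheckedSearchOutput`, that adds a cross-field rule with Pydantic's `model_validator`. The rule itself (an "empty" status must mean zero results, and any other status must carry at least one) is an assumption made for illustration.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError, model_validator


class CheckedSearchOutput(BaseModel):
    """Hypothetical variant of DatabaseSearchOutput with a logical check."""

    results: list[dict]
    total_count: int = Field(..., ge=0)
    query_time_ms: float = Field(..., ge=0)
    status: Literal["success", "partial", "empty"]

    @model_validator(mode="after")
    def status_matches_results(self) -> "CheckedSearchOutput":
        # Assumed rule: "empty" means no rows; other statuses carry rows.
        if (self.status == "empty") != (len(self.results) == 0):
            raise ValueError("status does not match the number of results")
        return self


def output_problems(raw: dict) -> list[str]:
    """Return the Pydantic error type codes for a raw tool result."""
    try:
        CheckedSearchOutput(**raw)
    except ValidationError as exc:
        return [err["type"] for err in exc.errors()]
    return []


good = {"results": [{"id": 1}], "total_count": 1, "query_time_ms": 3.2, "status": "success"}
bad_status = {**good, "status": "ok"}
inconsistent = {**good, "results": [], "total_count": 0}

print(output_problems(good))
print(output_problems(bad_status))
print(output_problems(inconsistent))
```

The second payload fails the structural check (unknown status value); the third is structurally valid but fails the logical one.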
Wrapping Tools with Validation
I created a decorator to add contracts to any tool:
from functools import wraps
from pydantic import BaseModel, ValidationError
import json

def validated_tool(input_model: type[BaseModel], output_model: type[BaseModel]):
    """Decorator that adds contract validation to any tool."""

    def decorator(func):
        @wraps(func)
        def wrapper(**kwargs):
            # 1. Validate inputs against contract
            try:
                validated_input = input_model(**kwargs)
            except ValidationError as e:
                # Surface as data, not exception
                return {
                    "error": "input_validation_failed",
                    "details": json.loads(e.json()),
                    "rejected_args": kwargs
                }

            # 2. Execute with validated inputs
            try:
                result = func(**validated_input.model_dump())
            except Exception as e:
                return {
                    "error": "execution_failed",
                    "details": str(e),
                    "validated_input": validated_input.model_dump()
                }

            # 3. Validate outputs against contract
            try:
                validated_output = output_model(**result)
                return validated_output.model_dump()
            except ValidationError as e:
                return {
                    "error": "output_validation_failed",
                    "details": json.loads(e.json()),
                    "raw_output": result
                }

        return wrapper
    return decorator

# Usage
@validated_tool(
    input_model=DatabaseSearchInput,
    output_model=DatabaseSearchOutput
)
def search_database(query: str, limit: int, search_type: str = "fuzzy") -> dict:
    # Function only receives validated data
    result = db.search(query=query, limit=limit, mode=search_type)
    return {
        "results": result.items,
        "total_count": result.total,
        "query_time_ms": result.duration_ms,
        "status": "success" if result.items else "empty"
    }

The wrapper creates a validation boundary. Invalid inputs never reach the tool. Invalid outputs never leave the agent.
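The `details` payload produced by `e.json()` can be verbose. If the error is going to be fed back to the model, a shorter summary may be easier for it to act on. The helper below is a sketch: `summarize_errors` and the tiny `LimitOnly` contract are hypothetical names, and the "field: message" format is just one possible choice.

```python
from pydantic import BaseModel, Field, ValidationError


class LimitOnly(BaseModel):
    """Tiny hypothetical contract used only to trigger an error."""

    limit: int = Field(..., ge=1, le=100)


def summarize_errors(exc: ValidationError) -> list[str]:
    """Turn Pydantic error entries into compact 'field: message' strings."""
    lines = []
    for err in exc.errors():
        # "loc" is a tuple path to the offending field, e.g. ("limit",).
        field_path = ".".join(str(part) for part in err["loc"]) or "<root>"
        lines.append(f"{field_path}: {err['msg']}")
    return lines


feedback: list[str] = []
try:
    LimitOnly(limit="ten")
except ValidationError as exc:
    feedback = summarize_errors(exc)

print(feedback)
```

Each entry names the field and the reason it was rejected, which is usually enough context for a retry.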
JSON Schema for OpenAI Function Calling
For OpenAI’s function calling API, I define contracts as JSON Schema:
from openai import OpenAI
import json
from pydantic import ValidationError

client = OpenAI()

# Define contract as JSON Schema
search_tool_schema = {
    "type": "object",
    "properties": {
        "query": {
            "type": "string",
            "minLength": 1,
            "maxLength": 500,
            "description": "Search query string"
        },
        "limit": {
            "type": "integer",
            "minimum": 1,
            "maximum": 100,
            "description": "Maximum number of results"
        },
        "search_type": {
            "type": "string",
            "enum": ["exact", "fuzzy", "semantic"],
            "description": "Type of search to perform"
        }
    },
    # Strict mode expects every property to be listed as required
    "required": ["query", "limit", "search_type"],
    "additionalProperties": False  # Reject unknown fields
}

tools = [{
    "type": "function",
    "function": {
        "name": "search_database",
        "description": "Search the database with validated inputs",
        "parameters": search_tool_schema,
        "strict": True  # OpenAI strict mode for exact schema adherence
    }
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Find 5 items matching 'python'"}],
    tools=tools
)

# Validate the tool call matches our contract
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)

    # Re-validate with Pydantic for extra safety
    try:
        validated = DatabaseSearchInput(**args)
        result = search_database(**validated.model_dump())
    except ValidationError as e:
        print(f"Contract violation: {e}")
        # Handle gracefully, possibly ask model to retry

The strict: True flag tells OpenAI to follow the schema exactly. I still re-validate with Pydantic because I do not trust external APIs completely.
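Keeping a hand-written JSON Schema and a Pydantic model in sync is easy to get wrong. One option is to generate the schema from the model with Pydantic's `model_json_schema()`. In the sketch below, `SearchArgs` is a hypothetical copy of the search contract. The generated schema includes extra keys such as `title` and `default`, and whether a given provider's strict mode accepts it unchanged is an assumption to verify, so `strict` is left out here.

```python
from typing import Literal

from pydantic import BaseModel, ConfigDict, Field


class SearchArgs(BaseModel):
    """Hypothetical copy of the search contract, with extras forbidden."""

    model_config = ConfigDict(extra="forbid")

    query: str = Field(..., min_length=1, max_length=500, description="Search query string")
    limit: int = Field(..., ge=1, le=100, description="Maximum number of results")
    search_type: Literal["exact", "fuzzy", "semantic"] = "fuzzy"


# Single source of truth: the JSON Schema is derived from the model.
schema = SearchArgs.model_json_schema()

tool_definition = {
    "type": "function",
    "function": {
        "name": "search_database",
        "description": "Search the database with validated inputs",
        "parameters": schema,
    },
}

print(schema["required"])
print(schema["properties"]["limit"])
```

The same class then validates the arguments that come back, so the schema sent to the model and the check applied to its output cannot drift apart.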
Observability: Tracking Validation Failures
Validation failures are data, not exceptions. I track them as metrics:
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable
import json

from pydantic import BaseModel, ValidationError

@dataclass
class ValidationMetrics:
    input_validations_passed: int = 0
    input_validations_failed: int = 0
    output_validations_passed: int = 0
    output_validations_failed: int = 0
    failures: list[dict] = field(default_factory=list)

    def record_input_validation(
        self,
        passed: bool,
        tool_name: str,
        args: dict,
        error: str | None = None
    ):
        if passed:
            self.input_validations_passed += 1
        else:
            self.input_validations_failed += 1
            self.failures.append({
                "type": "input_validation",
                "tool": tool_name,
                "args": args,
                "error": error,
                # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
                "timestamp": datetime.now(timezone.utc).isoformat()
            })

    def summary(self) -> dict:
        total_input = self.input_validations_passed + self.input_validations_failed
        total_output = self.output_validations_passed + self.output_validations_failed
        return {
            "input_pass_rate": self.input_validations_passed / max(total_input, 1),
            "output_pass_rate": self.output_validations_passed / max(total_output, 1),
            "total_failures": len(self.failures),
            "recent_failures": self.failures[-5:]
        }

# Usage in agent loop
metrics = ValidationMetrics()

def safe_tool_execution(
    tool_name: str,
    tool_func: Callable[..., dict],
    args: dict,
    input_model: type[BaseModel]
) -> dict:
    # Validate input
    try:
        validated_input = input_model(**args)
        metrics.record_input_validation(True, tool_name, args)
    except ValidationError as e:
        metrics.record_input_validation(False, tool_name, args, str(e))
        return {"error": "validation_failed", "details": json.loads(e.json())}

    # Execute
    result = tool_func(**validated_input.model_dump())
    return result

Now I can see validation pass rates in production. When pass rates drop, I know the model is having trouble with certain tools.
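To tell which tools the model struggles with, the counts need to be broken down by tool name. The sketch below uses only the standard library; `PerToolMetrics` is a hypothetical companion to `ValidationMetrics`, not part of the original code.

```python
from dataclasses import dataclass, field


@dataclass
class PerToolMetrics:
    """Hypothetical companion to ValidationMetrics, keyed by tool name."""

    passed: dict = field(default_factory=dict)
    failed: dict = field(default_factory=dict)

    def record(self, tool_name: str, ok: bool) -> None:
        # Increment the pass or fail counter for this tool.
        bucket = self.passed if ok else self.failed
        bucket[tool_name] = bucket.get(tool_name, 0) + 1

    def pass_rates(self) -> dict:
        # Every name seen here has at least one record, so no division by zero.
        rates = {}
        for name in sorted(set(self.passed) | set(self.failed)):
            ok = self.passed.get(name, 0)
            bad = self.failed.get(name, 0)
            rates[name] = ok / (ok + bad)
        return rates


per_tool = PerToolMetrics()
for outcome in (True, True, True, False):
    per_tool.record("search_users", outcome)
per_tool.record("get_order", True)

print(per_tool.pass_rates())
```

A tool whose rate sits well below the others is a candidate for a clearer description or a tighter schema.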
What Changed in Production
After adding contracts:
Before:
Input: "find 5 users named john"
Call: search_users(query="john", limit="five") -> crash

After:
Input: "find 5 users named john"
Call: search_users(query="john", limit="five") -> validation error
Retry: model corrects to limit=5 -> success

The validation layer catches hallucinations before they reach my tools. Errors surface as data, not silent wrong executions.
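The retry step can be sketched as a small loop that feeds the validation error back to whatever proposes the arguments. Everything here is illustrative: `SearchUsersInput` is a hypothetical contract, and `fake_model` is a stand-in that plays the role of the LLM so the example runs without an API call.

```python
from typing import Callable, Optional

from pydantic import BaseModel, Field, ValidationError


class SearchUsersInput(BaseModel):
    """Hypothetical contract for the search_users tool."""

    query: str = Field(..., min_length=1)
    limit: int = Field(..., ge=1, le=100)


def call_with_retry(
    propose_args: Callable[[Optional[str]], dict],
    max_attempts: int = 3,
) -> SearchUsersInput:
    """Ask for arguments until they satisfy the contract or attempts run out."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        args = propose_args(feedback)
        try:
            return SearchUsersInput(**args)
        except ValidationError as exc:
            # The error text becomes the feedback for the next proposal.
            feedback = str(exc)
    raise RuntimeError(f"no valid arguments after {max_attempts} attempts")


def fake_model(feedback: Optional[str]) -> dict:
    """Stand-in for the LLM: wrong on the first try, corrected after feedback."""
    if feedback is None:
        return {"query": "john", "limit": "five"}
    return {"query": "john", "limit": 5}


validated = call_with_retry(fake_model)
print(validated)
```

Capping the number of attempts matters: if the model keeps producing invalid arguments, the loop ends with an explicit error instead of retrying forever.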
Common Mistakes I Made
Trusting model output blindly: I passed model outputs directly to tools without validation. This let hallucinated parameters through to production.
Vague tool definitions: I used loose typing and optional fields everywhere. Models exploited the vagueness to pass wrong parameters.
Validation inside tools: I put validation logic inside tool implementations instead of at boundaries. This made validation inconsistent across tools.
No output validation: I checked inputs but let malformed outputs propagate to downstream code.
Silent fallbacks: I returned None or empty values on validation failure instead of explicit errors. This hid problems instead of surfacing them.
Summary
Contracts transform LLM tool calls from fragile operations into robust, debuggable components. Define strict Pydantic models or JSON Schemas for every tool’s inputs and outputs. Validate at boundaries. When validation fails, surface errors as structured data.
The key insight: if parameters do not match the schema, the call does not happen. No hallucinated arguments. No silent wrong executions. Every output gets checked structurally and logically before it leaves the agent.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Reddit discussion on LLM agent production failures
- Pydantic Documentation
- OpenAI Function Calling Guide
- JSON Schema Specification
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!