
How to Add Contracts and Validation to LLM Tool Calls

The Problem: Hallucinated Tool Arguments

I built an AI agent that calls database tools. In testing, it worked perfectly. In production, I started seeing weird errors.

A user asked the agent to search for “10 recent orders”. The agent called the database tool with limit="ten" instead of limit=10. The query failed. No error message. Just an empty result returned confidently.

Another time, the agent passed query=123 where a string was expected. The tool crashed somewhere deep in the codebase. I spent hours debugging why the model decided to pass an integer for a search query.

The logs showed a pattern:

Input: "find 5 users named john"
Expected call: search_users(query="john", limit=5)
Actual call: search_users(query="john", limit="five", type="exact")

Input: "get order details for ID 12345"
Expected call: get_order(order_id=12345)
Actual call: get_order(id="12345", include_items=True, format="json")

Same intent. Different parameters. Hallucinated arguments. Silent wrong executions.

My First Mistake: Trusting Model Output Directly

My original code passed model output directly to tools:

unsafe_tool_call.py
from langchain.tools import tool

@tool
def search_database(query: str, limit: int) -> dict:
    """Search the database."""
    # What if limit="ten" or limit=-5?
    # What if query=123 instead of string?
    # What if model hallucinates extra parameters?
    return db.execute(f"SELECT * FROM items WHERE name LIKE '%{query}%' LIMIT {limit}")

# Model output goes directly to execution
result = agent.run("find 5 items matching python")
# No validation. No contract. Just hope.
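
Types aside, the f-string query above is open to SQL injection however the arguments are validated. Here is a minimal sketch of the usual remedy, parameter binding, using the standard-library sqlite3 module as a stand-in for the unspecified `db` object. The `items` table layout and the `search_items` helper are assumptions made for illustration:

```python
import sqlite3

def search_items(conn: sqlite3.Connection, query: str, limit: int) -> list[tuple]:
    """Search by name using bound parameters instead of string formatting."""
    # Escape LIKE wildcards so the user's text is matched literally.
    escaped = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    sql = "SELECT id, name FROM items WHERE name LIKE ? ESCAPE '\\' ORDER BY id LIMIT ?"
    return conn.execute(sql, (f"%{escaped}%", limit)).fetchall()

# Hypothetical in-memory table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (name) VALUES (?)",
    [("python basics",), ("advanced python",), ("rust intro",)],
)

rows = search_items(conn, "python", 5)
# An injection attempt is treated as plain text and simply matches nothing.
attack = search_items(conn, "x'; DROP TABLE items; --", 5)
```

With bound parameters the database driver never interprets the search text as SQL, so the contract only has to worry about types and ranges.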

The model generates tool calls based on its training. But models do not guarantee:

  • Correct types (string vs integer)
  • Valid ranges (limit between 1 and 100)
  • Required fields present
  • No extra hallucinated fields

I needed contracts. Typed, validated inputs before anything executes.

The Solution: Input Contracts with Pydantic

I started by defining exactly what each tool accepts:

input_contracts.py
from typing import Literal

from pydantic import BaseModel, Field, ValidationError, field_validator

class DatabaseSearchInput(BaseModel):
    """Contract for database search tool - strict typing with constraints"""
    query: str = Field(..., min_length=1, max_length=500)
    limit: int = Field(..., ge=1, le=100)  # Between 1 and 100
    search_type: Literal["exact", "fuzzy", "semantic"] = "fuzzy"

    @field_validator("query")
    @classmethod
    def validate_query(cls, v: str) -> str:
        # Reject obvious SQL injection patterns (a coarse filter,
        # not a substitute for parameterized queries)
        if ";" in v or "--" in v:
            raise ValueError("Invalid query characters detected")
        return v.strip()

# Now I can validate model output before execution
try:
    validated = DatabaseSearchInput(
        query="python",
        limit=5,
        search_type="fuzzy"
    )
    # All validations pass, safe to execute
except ValidationError as e:
    # Contract violation caught before any damage
    print(f"Invalid parameters: {e}")

If the model passes limit="ten", validation fails immediately. No silent wrong execution. No corrupted database queries.
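
One caveat worth checking: by default, Pydantic v2 ignores unknown fields and coerces numeric strings such as "5" to integers. The model above therefore rejects limit="ten" but would quietly accept a hallucinated extra argument. Here is a minimal sketch of a tighter variant, assuming Pydantic v2. The class name `StrictSearchInput` and the `violations` helper are hypothetical:

```python
from typing import Literal

from pydantic import BaseModel, ConfigDict, Field, ValidationError

class StrictSearchInput(BaseModel):
    # extra="forbid" rejects unknown fields; strict=True disables type coercion.
    model_config = ConfigDict(extra="forbid", strict=True)

    query: str = Field(..., min_length=1, max_length=500)
    limit: int = Field(..., ge=1, le=100)
    search_type: Literal["exact", "fuzzy", "semantic"] = "fuzzy"

def violations(args: dict) -> list[str]:
    """Return the names of the offending fields, or an empty list if args are valid."""
    try:
        StrictSearchInput(**args)
        return []
    except ValidationError as e:
        return [str(err["loc"][0]) for err in e.errors()]

ok = violations({"query": "john", "limit": 5})
extra = violations({"query": "john", "limit": 5, "type": "exact"})
coerced = violations({"query": "john", "limit": "5"})
```

Whether to enable strict mode is a judgment call. Coercing "5" to 5 is often harmless, but forbidding extra fields is what actually enforces the "no hallucinated fields" part of the contract.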

Adding Output Contracts

Input validation was not enough. I also needed to validate what tools return:

output_contracts.py
from typing import Literal

from langchain.tools import tool
from pydantic import BaseModel

# DatabaseSearchInput is the input contract defined in the previous section

class DatabaseSearchOutput(BaseModel):
    """Contract for what the tool returns"""
    results: list[dict]
    total_count: int
    query_time_ms: float
    status: Literal["success", "partial", "empty"]

# Tool returns validated output
@tool(args_schema=DatabaseSearchInput)
def search_database(
    query: str,
    limit: int,
    search_type: str = "fuzzy"
) -> DatabaseSearchOutput:
    """Search the database with validated inputs."""
    result = db.search(query=query, limit=limit, mode=search_type)
    # Output is validated on construction
    return DatabaseSearchOutput(
        results=result.items,
        total_count=result.total,
        query_time_ms=result.duration_ms,
        status="success" if result.items else "empty"
    )

Now both inputs and outputs have contracts. If the tool returns malformed data, the output contract catches it.
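
To make that failure mode concrete, here is a minimal sketch, assuming Pydantic v2, of the output contract rejecting a malformed payload. The `raw` dictionary is a made-up example of what a buggy backend might return:

```python
from typing import Literal

from pydantic import BaseModel, ValidationError

class DatabaseSearchOutput(BaseModel):
    results: list[dict]
    total_count: int
    query_time_ms: float
    status: Literal["success", "partial", "empty"]

# Hypothetical malformed payload: results is None and status is not an allowed value.
raw = {"results": None, "total_count": 0, "query_time_ms": 1.2, "status": "ok"}

try:
    DatabaseSearchOutput.model_validate(raw)
    bad_fields = []
except ValidationError as e:
    # Each error entry carries the field path in "loc".
    bad_fields = sorted(str(err["loc"][0]) for err in e.errors())
```

Both problems are reported in one pass, so downstream code never sees the half-valid result.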

Wrapping Tools with Validation

I created a decorator to add contracts to any tool:

validated_tool.py
import json
from functools import wraps

from pydantic import BaseModel, ValidationError

def validated_tool(input_model: type[BaseModel], output_model: type[BaseModel]):
    """Decorator that adds contract validation to any tool."""
    def decorator(func):
        @wraps(func)
        def wrapper(**kwargs):
            # 1. Validate inputs against contract
            try:
                validated_input = input_model(**kwargs)
            except ValidationError as e:
                # Surface as data, not exception
                return {
                    "error": "input_validation_failed",
                    "details": json.loads(e.json()),
                    "rejected_args": kwargs
                }
            # 2. Execute with validated inputs
            try:
                result = func(**validated_input.model_dump())
            except Exception as e:
                return {
                    "error": "execution_failed",
                    "details": str(e),
                    "validated_input": validated_input.model_dump()
                }
            # 3. Validate outputs against contract
            try:
                validated_output = output_model(**result)
                return validated_output.model_dump()
            except ValidationError as e:
                return {
                    "error": "output_validation_failed",
                    "details": json.loads(e.json()),
                    "raw_output": result
                }
        return wrapper
    return decorator

# Usage
@validated_tool(
    input_model=DatabaseSearchInput,
    output_model=DatabaseSearchOutput
)
def search_database(query: str, limit: int, search_type: str = "fuzzy") -> dict:
    # Function only receives validated data
    result = db.search(query=query, limit=limit, mode=search_type)
    return {
        "results": result.items,
        "total_count": result.total,
        "query_time_ms": result.duration_ms,
        "status": "success" if result.items else "empty"
    }

The wrapper creates a validation boundary. Invalid inputs never reach the tool. Invalid outputs never leave the agent.
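
The `details` payload produced by `e.json()` is fairly verbose. If the error is going back to the model, a shorter message may be easier for it to act on. Here is a sketch, assuming Pydantic v2's `ValidationError.errors()`. The `explain` helper and the trimmed-down `SearchInput` model are hypothetical:

```python
from pydantic import BaseModel, Field, ValidationError

class SearchInput(BaseModel):  # trimmed-down stand-in for DatabaseSearchInput
    query: str = Field(..., min_length=1)
    limit: int = Field(..., ge=1, le=100)

def explain(error: ValidationError) -> str:
    """Condense Pydantic's error list into one line per offending field."""
    lines = []
    for err in error.errors():
        path = ".".join(str(part) for part in err["loc"])
        lines.append(f"{path}: {err['msg']}")
    return "\n".join(lines)

try:
    SearchInput(query="john", limit="five")
    feedback = ""
except ValidationError as e:
    feedback = explain(e)
```

Here `feedback` holds a single line naming the `limit` field and Pydantic's description of the problem, which can be returned as the tool result in place of the full JSON.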

JSON Schema for OpenAI Function Calling

For OpenAI’s function calling API, I define contracts as JSON Schema:

openai_contracts.py
import json

from openai import OpenAI
from pydantic import ValidationError

# DatabaseSearchInput and search_database are defined in the earlier sections

client = OpenAI()

# Define contract as JSON Schema
search_tool_schema = {
    "type": "object",
    "properties": {
        "query": {
            "type": "string",
            "minLength": 1,
            "maxLength": 500,
            "description": "Search query string"
        },
        "limit": {
            "type": "integer",
            "minimum": 1,
            "maximum": 100,
            "description": "Maximum number of results"
        },
        "search_type": {
            "type": "string",
            "enum": ["exact", "fuzzy", "semantic"],
            "default": "fuzzy",
            "description": "Type of search to perform"
        }
    },
    # Strict mode expects every property to be listed as required
    "required": ["query", "limit", "search_type"],
    "additionalProperties": False  # Reject unknown fields
}

tools = [{
    "type": "function",
    "function": {
        "name": "search_database",
        "description": "Search the database with validated inputs",
        "parameters": search_tool_schema,
        "strict": True  # OpenAI strict mode for exact schema adherence
    }
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Find 5 items matching 'python'"}],
    tools=tools
)

# Validate the tool call matches our contract
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)
    # Re-validate with Pydantic for extra safety
    try:
        validated = DatabaseSearchInput(**args)
        result = search_database(**validated.model_dump())
    except ValidationError as e:
        print(f"Contract violation: {e}")
        # Handle gracefully, possibly ask model to retry

The strict: True flag tells OpenAI to follow the schema exactly. I still re-validate with Pydantic because I do not trust external APIs completely.
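
Maintaining the JSON Schema by hand next to the Pydantic model invites drift between the two. One option is to generate the schema from the model. Here is a sketch, assuming Pydantic v2's `model_json_schema()`. I have left `strict` out because strict mode places extra requirements on the schema that the generated output may not meet as-is, so check it against the current OpenAI documentation before enabling it:

```python
from typing import Literal

from pydantic import BaseModel, ConfigDict, Field

class DatabaseSearchInput(BaseModel):
    # extra="forbid" is emitted as "additionalProperties": false in the schema.
    model_config = ConfigDict(extra="forbid")

    query: str = Field(..., min_length=1, max_length=500, description="Search query string")
    limit: int = Field(..., ge=1, le=100, description="Maximum number of results")
    search_type: Literal["exact", "fuzzy", "semantic"] = Field(
        "fuzzy", description="Type of search to perform"
    )

# Single source of truth: the same class validates arguments and describes the tool.
schema = DatabaseSearchInput.model_json_schema()

tool_definition = {
    "type": "function",
    "function": {
        "name": "search_database",
        "description": "Search the database with validated inputs",
        "parameters": schema,
    },
}
```

The generated schema also contains `title` entries that the hand-written version lacks. They are ordinary JSON Schema annotations, but it is worth printing the result once to see exactly what gets sent.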

Observability: Tracking Validation Failures

Validation failures are data, not exceptions. I track them as metrics:

validation_metrics.py
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable

from pydantic import BaseModel, ValidationError

@dataclass
class ValidationMetrics:
    input_validations_passed: int = 0
    input_validations_failed: int = 0
    output_validations_passed: int = 0
    output_validations_failed: int = 0
    failures: list[dict] = field(default_factory=list)

    def record_input_validation(
        self,
        passed: bool,
        tool_name: str,
        args: dict,
        error: str | None = None
    ):
        if passed:
            self.input_validations_passed += 1
        else:
            self.input_validations_failed += 1
            self.failures.append({
                "type": "input_validation",
                "tool": tool_name,
                "args": args,
                "error": error,
                "timestamp": datetime.now(timezone.utc).isoformat()
            })

    def record_output_validation(
        self,
        passed: bool,
        tool_name: str,
        error: str | None = None
    ):
        # Mirrors record_input_validation so the output counters are populated
        if passed:
            self.output_validations_passed += 1
        else:
            self.output_validations_failed += 1
            self.failures.append({
                "type": "output_validation",
                "tool": tool_name,
                "error": error,
                "timestamp": datetime.now(timezone.utc).isoformat()
            })

    def summary(self) -> dict:
        total_input = self.input_validations_passed + self.input_validations_failed
        total_output = self.output_validations_passed + self.output_validations_failed
        return {
            "input_pass_rate": self.input_validations_passed / max(total_input, 1),
            "output_pass_rate": self.output_validations_passed / max(total_output, 1),
            "total_failures": len(self.failures),
            "recent_failures": self.failures[-5:]
        }

# Usage in agent loop
metrics = ValidationMetrics()

def safe_tool_execution(
    tool_name: str,
    tool_func: Callable[..., dict],
    args: dict,
    input_model: type[BaseModel]
) -> dict:
    # Validate input
    try:
        validated_input = input_model(**args)
        metrics.record_input_validation(True, tool_name, args)
    except ValidationError as e:
        metrics.record_input_validation(False, tool_name, args, str(e))
        return {"error": "validation_failed", "details": json.loads(e.json())}
    # Execute
    result = tool_func(**validated_input.model_dump())
    return result

Now I can see validation pass rates in production. When pass rates drop, I know the model is having trouble with certain tools.
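
The summary above reports one pass rate across all tools, so finding out which tools are struggling needs a per-tool breakdown. Here is a small standard-library sketch. The event shape (`tool`, `passed`) and the 0.9 threshold are assumptions for illustration:

```python
from collections import defaultdict

def pass_rates_by_tool(events: list[dict]) -> dict[str, float]:
    """Compute the validation pass rate per tool from recorded events."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for event in events:
        total[event["tool"]] += 1
        if event["passed"]:
            passed[event["tool"]] += 1
    return {tool: passed[tool] / total[tool] for tool in total}

def tools_below(rates: dict[str, float], threshold: float) -> list[str]:
    """Return the tools whose pass rate is under the threshold, sorted by name."""
    return sorted(tool for tool, rate in rates.items() if rate < threshold)

# Hypothetical event log; in practice each validation attempt would append one entry.
events = [
    {"tool": "search_users", "passed": True},
    {"tool": "search_users", "passed": False},
    {"tool": "get_order", "passed": True},
    {"tool": "get_order", "passed": True},
]
rates = pass_rates_by_tool(events)
flagged = tools_below(rates, threshold=0.9)
```

A tool that keeps showing up in `flagged` is usually a sign that its description or schema is ambiguous to the model.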

What Changed in Production

After adding contracts:

Before:
  Input: "find 5 users named john"
  Call: search_users(query="john", limit="five") -> crash

After:
  Input: "find 5 users named john"
  Call: search_users(query="john", limit="five") -> validation error
  Retry: model corrects to limit=5 -> success

The validation layer catches hallucinations before they reach my tools. Errors surface as data, not silent wrong executions.

Common Mistakes I Made

Trusting model output blindly: I passed model outputs directly to tools without validation. This let hallucinated parameters through to production.

Vague tool definitions: I used loose typing and optional fields everywhere. Models exploited the vagueness to pass wrong parameters.

Validation inside tools: I put validation logic inside tool implementations instead of at boundaries. This made validation inconsistent across tools.

No output validation: I checked inputs but let malformed outputs propagate to downstream code.

Silent fallbacks: I returned None or empty values on validation failure instead of explicit errors. This hid problems instead of surfacing them.

Summary

Contracts transform LLM tool calls from fragile operations into robust, debuggable components. Define strict Pydantic models or JSON Schemas for every tool’s inputs and outputs. Validate at boundaries. When validation fails, surface errors as structured data.

The key insight: if parameters do not match the schema, the call does not happen. No hallucinated arguments. No silent wrong executions. Every output gets checked structurally and logically before it leaves the agent.

Final Words + More Resources

My intention with this article was to share my knowledge and experience with others. If you want to get in touch, you can reach me by email.

