How to Implement Tool Calling with Confidence Levels in Spring AI

The Problem

My AI agent was calling tools, but I couldn’t tell how certain the model was about its choices. The same log entry appeared whether the LLM was highly confident or just guessing:

application.log
2026-03-26 09:15:22 INFO Tool called: retrievePatientHealthStatus
2026-03-26 09:15:22 INFO Arguments: {patientId=PAT-001}

When the agent made a wrong decision, I had no signal to detect it. I needed to know: was this a confident choice or an uncertain guess?

The Solution

Spring AI’s Tool Argument Augmenter lets you add a confidence field to every tool call. The LLM evaluates its own certainty and reports “low”, “medium”, or “high” for each tool selection.

Adding Confidence Scoring

Step 1: Create the Thinking DTO

Define a record with a required confidence field:

AgentThinking.java
import org.springframework.ai.tool.annotation.ToolParam;
public record AgentThinking(
        @ToolParam(description = """
                Your step-by-step reasoning for why you're calling this tool.
                """, required = true)
        String innerThought,
        @ToolParam(description = "Confidence level (low, medium, high) in this tool choice", required = true)
        String confidence
) {}

The required = true is critical. Without it, the LLM might skip the confidence field inconsistently.
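
Because confidence arrives as a plain String, the model may return variations such as "High" or " low ". Below is a minimal sketch of normalizing that value before acting on it. It assumes a hypothetical ConfidenceLevel enum, which is my own illustration and not part of Spring AI:

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper (not a Spring AI type): normalizes the reported confidence. */
enum ConfidenceLevel {
    LOW, MEDIUM, HIGH;

    /** Returns empty for null, blank, or unrecognized values instead of throwing. */
    static Optional<ConfidenceLevel> parse(String raw) {
        if (raw == null || raw.isBlank()) {
            return Optional.empty();
        }
        try {
            // Trim and upper-case so "High" and " low " map onto the enum constants
            return Optional.of(valueOf(raw.trim().toUpperCase(Locale.ROOT)));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }
}
```

An empty result is probably best treated like low confidence, since an unparseable value says nothing about how certain the model was.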

Step 2: Configure the Augmented Provider

Wire up the augmenter with confidence-aware handling:

ToolConfig.java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.tool.augment.AugmentedToolCallbackProvider;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
@Configuration
public class ToolConfig {
    private static final Logger log = LoggerFactory.getLogger(ToolConfig.class);
    private final AlertService alertService;
    private final MetricsService metricsService;
    public ToolConfig(AlertService alertService, MetricsService metricsService) {
        this.alertService = alertService;
        this.metricsService = metricsService;
    }
    @Bean
    public AugmentedToolCallbackProvider<AgentThinking> augmentedToolProvider(
            HealthTools healthTools) {
        return AugmentedToolCallbackProvider
                .<AgentThinking>builder()
                .toolObject(healthTools)
                .argumentType(AgentThinking.class)
                .argumentConsumer(event -> {
                    AgentThinking thinking = event.arguments();
                    log.info("Tool: {}", event.toolDefinition().name());
                    log.info("Reasoning: {}", thinking.innerThought());
                    log.info("Confidence: {}", thinking.confidence());
                    // Track confidence metrics
                    metricsService.recordConfidence(
                            event.toolDefinition().name(),
                            thinking.confidence()
                    );
                    // Alert on low confidence (case-insensitive, matching MetricsService)
                    if ("low".equalsIgnoreCase(thinking.confidence())) {
                        log.warn("Low confidence tool selection detected");
                        alertService.notifyLowConfidence(event, thinking);
                    }
                })
                .build();
    }
}

Step 3: Conditional Execution Based on Confidence

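The real value comes from acting on confidence levels. Here’s the skeleton of a service that keeps a queue of low-confidence calls awaiting review and reports whether a response completed without any. Filling that queue from the argument consumer and executing approved calls are left as placeholders: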

ConfidenceAwareAgentService.java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.tool.augment.AugmentedToolCallbackProvider;
import org.springframework.stereotype.Service;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
@Service
public class ConfidenceAwareAgentService {
    private final ChatClient chatClient;
    // Call ID -> description of a low-confidence call awaiting review
    private final Map<String, String> pendingLowConfidenceCalls = new ConcurrentHashMap<>();
    public ConfidenceAwareAgentService(ChatClient.Builder builder,
            AugmentedToolCallbackProvider<AgentThinking> provider) {
        this.chatClient = builder
                .defaultToolCallbacks(provider)
                .build();
    }
    public AgentResponse process(String userInput) {
        String response = chatClient.prompt()
                .user(userInput)
                .call()
                .content();
        // AgentResponse is an application-defined type (not shown);
        // the second argument is true when nothing is waiting for review
        return new AgentResponse(response, pendingLowConfidenceCalls.isEmpty());
    }
    public Map<String, String> getPendingReviews() {
        return Map.copyOf(pendingLowConfidenceCalls);
    }
    public void approveLowConfidenceCall(String callId) {
        pendingLowConfidenceCalls.remove(callId);
        // Execute the approved call
    }
}
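
The pending-call map only becomes useful once something adds entries to it. As a rough sketch of one way to do that, here is a small queue that parks a flagged call and hands it back on approval. PendingReviewQueue, PendingCall, and their field names are hypothetical, and how a parked call is actually executed afterwards depends on your setup and is not shown:

```java
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical holding area for tool calls flagged as low confidence. */
class PendingReviewQueue {

    /** Minimal description of a deferred call; the fields are illustrative. */
    record PendingCall(String toolName, String arguments, String reasoning) {}

    private final Map<String, PendingCall> pending = new ConcurrentHashMap<>();

    /** Parks a call and returns the ID a reviewer uses to approve or reject it. */
    String park(String toolName, String arguments, String reasoning) {
        String callId = UUID.randomUUID().toString();
        pending.put(callId, new PendingCall(toolName, arguments, reasoning));
        return callId;
    }

    /** Removes and returns the call so the caller can run it; empty if the ID is unknown. */
    Optional<PendingCall> approve(String callId) {
        return Optional.ofNullable(pending.remove(callId));
    }

    /** Discards the call; returns true only if something was removed. */
    boolean reject(String callId) {
        return pending.remove(callId) != null;
    }

    boolean hasPending() {
        return !pending.isEmpty();
    }
}
```

Returning the removed call from approve, rather than just deleting it, gives the caller what it needs to run the call and makes a second approval of the same ID a harmless no-op.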

And here’s an alert service for human-in-the-loop:

AlertService.java
import org.springframework.stereotype.Service;
import java.time.Instant;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
@Service
public class AlertService {
    private final NotificationService notificationService;
    private final Map<String, LowConfidenceEvent> pendingReviews = new ConcurrentHashMap<>();
    public AlertService(NotificationService notificationService) {
        this.notificationService = notificationService;
    }
    public void notifyLowConfidence(ToolCallEvent event, AgentThinking thinking) {
        // Unique ID so a reviewer can refer to this specific call
        String callId = UUID.randomUUID().toString();
        LowConfidenceEvent lowConfEvent = new LowConfidenceEvent(
                callId,
                event.toolDefinition().name(),
                thinking.innerThought(),
                Instant.now()
        );
        pendingReviews.put(callId, lowConfEvent);
        // Alert ops team
        notificationService.sendAlert("""
                Low Confidence Tool Call Detected
                Tool: %s
                Reasoning: %s
                Call ID: %s
                Time: %s
                Review and approve or reject.
                """.formatted(
                lowConfEvent.toolName(),
                lowConfEvent.reasoning(),
                callId,
                lowConfEvent.timestamp()
        ));
    }
    public Map<String, LowConfidenceEvent> getPendingReviews() {
        return Map.copyOf(pendingReviews);
    }
}

The Result

Now when I run my agent, I see confidence in the logs:

application.log
2026-03-26 09:20:15 INFO Tool: retrievePatientHealthStatus
2026-03-26 09:20:15 INFO Reasoning: The user asked about patient PAT-001's health. I need to retrieve their current status to provide accurate information.
2026-03-26 09:20:15 INFO Confidence: high

When confidence is low, my alert service fires:

alert.log
2026-03-26 09:22:31 WARN Low confidence tool selection detected
2026-03-26 09:22:31 INFO Alert sent to ops team for review

Why This Matters

Confidence levels give you four hooks that production AI systems need:

  1. Conditional Execution: Skip or queue low-confidence calls for review
  2. Human-in-the-Loop: Route uncertain decisions to operators
  3. Quality Metrics: Track confidence distribution over time
  4. Fallback Mechanisms: Use alternative approaches when confidence drops
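
The routing idea behind the list above can be sketched as a small policy that maps the reported confidence to a next step. ConfidencePolicy and its Action values are hypothetical names chosen for illustration, and the mapping itself (for example, sending unknown values to review) is an assumption you would tune to your own risk tolerance:

```java
import java.util.Locale;

/** Hypothetical routing policy: maps a reported confidence to a next step. */
final class ConfidencePolicy {

    enum Action { EXECUTE, EXECUTE_AND_LOG, QUEUE_FOR_REVIEW }

    private ConfidencePolicy() {}

    static Action decide(String confidence) {
        String normalized = confidence == null
                ? ""
                : confidence.trim().toLowerCase(Locale.ROOT);
        return switch (normalized) {
            case "high" -> Action.EXECUTE;
            case "medium" -> Action.EXECUTE_AND_LOG;
            // "low", missing, and unrecognized values all take the cautious path
            default -> Action.QUEUE_FOR_REVIEW;
        };
    }
}
```

Keeping the decision in one pure function makes it easy to unit test and to change without touching the tool wiring.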

Common Mistake: Making Confidence Optional

I initially made the confidence field optional:

WrongApproach.java
// DON'T DO THIS
@ToolParam(description = "Confidence level (optional)")
String confidence // May be null, inconsistent data

This broke my monitoring. The LLM sometimes skipped the field, giving me inconsistent data. Always mark it required = true:

CorrectApproach.java
// ALWAYS DO THIS
@ToolParam(description = "Confidence level (low, medium, high)", required = true)
String confidence // Always populated, consistent monitoring

Tracking Confidence Over Time

Here’s a simple metrics service to track confidence patterns:

MetricsService.java
import org.springframework.stereotype.Service;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
@Service
public class MetricsService {
    // One counter per tool name for each confidence level
    private final Map<String, AtomicLong> highConfidenceCount = new ConcurrentHashMap<>();
    private final Map<String, AtomicLong> mediumConfidenceCount = new ConcurrentHashMap<>();
    private final Map<String, AtomicLong> lowConfidenceCount = new ConcurrentHashMap<>();
    public void recordConfidence(String toolName, String confidence) {
        // Guard against a missing value so a malformed response cannot throw here
        if (confidence == null) {
            return;
        }
        switch (confidence.trim().toLowerCase(Locale.ROOT)) {
            case "high" -> highConfidenceCount
                    .computeIfAbsent(toolName, k -> new AtomicLong())
                    .incrementAndGet();
            case "medium" -> mediumConfidenceCount
                    .computeIfAbsent(toolName, k -> new AtomicLong())
                    .incrementAndGet();
            case "low" -> lowConfidenceCount
                    .computeIfAbsent(toolName, k -> new AtomicLong())
                    .incrementAndGet();
            default -> { } // Unrecognized values are not counted
        }
    }
    public ConfidenceReport getReport() {
        // Map.copyOf copies the maps, but the AtomicLong counters inside are still live
        return new ConfidenceReport(
                Map.copyOf(highConfidenceCount),
                Map.copyOf(mediumConfidenceCount),
                Map.copyOf(lowConfidenceCount)
        );
    }
    public record ConfidenceReport(
            Map<String, AtomicLong> high,
            Map<String, AtomicLong> medium,
            Map<String, AtomicLong> low
    ) {}
}
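
Raw counts are hard to compare across tools that are called at different rates, so a share is often easier to read. A minimal sketch, assuming the counters have been read into plain Long values first (for example with AtomicLong.get()); ConfidenceRatios is a hypothetical helper, not part of the service above:

```java
import java.util.Map;

/** Hypothetical helper that turns per-tool counts into a low-confidence share. */
final class ConfidenceRatios {

    private ConfidenceRatios() {}

    /** Share of calls to toolName reported as "low", from 0.0 to 1.0; 0.0 when there are no calls. */
    static double lowShare(String toolName,
                           Map<String, Long> high,
                           Map<String, Long> medium,
                           Map<String, Long> low) {
        long lowCount = low.getOrDefault(toolName, 0L);
        long total = high.getOrDefault(toolName, 0L)
                + medium.getOrDefault(toolName, 0L)
                + lowCount;
        // Avoid dividing by zero for tools that have not been called yet
        return total == 0 ? 0.0 : (double) lowCount / total;
    }
}
```

For a tool with 6 high, 3 medium, and 1 low call, lowShare returns 0.1.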

Confidence Patterns to Watch

When analyzing confidence metrics, look for:

  • Consistently low confidence on a tool: Description may be unclear
  • Confidence drops after model changes: Test thoroughly before deploying
  • Low confidence on specific inputs: Edge cases needing special handling
  • High variance in confidence: User prompts may be ambiguous
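
The second pattern, a drop after a model change, can be checked by comparing the low-confidence share before and after the change. The sketch below is one simple way to do that. ConfidenceDrift, its parameters, and the idea of an absolute threshold are my own assumptions, and it is not a statistical test:

```java
/** Hypothetical check for a confidence regression between two measurement windows. */
final class ConfidenceDrift {

    private ConfidenceDrift() {}

    /**
     * Returns true when the low-confidence share rose by more than maxIncrease,
     * an absolute difference (0.10 means ten percentage points).
     */
    static boolean regressed(long baselineLow, long baselineTotal,
                             long currentLow, long currentTotal,
                             double maxIncrease) {
        if (baselineTotal <= 0 || currentTotal <= 0) {
            return false; // not enough data to compare
        }
        double baselineShare = (double) baselineLow / baselineTotal;
        double currentShare = (double) currentLow / currentTotal;
        return currentShare - baselineShare > maxIncrease;
    }
}
```

With small sample sizes this comparison is noisy, so it is probably worth requiring a minimum number of calls in each window before alerting on the result.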

Environment

  • Spring Boot 3.3.x
  • Spring AI 1.0.0
  • Java 21

Summary

Add confidence scoring to Spring AI tool calls by defining a DTO with a required confidence field and registering it with AugmentedToolCallbackProvider. The LLM populates “low”, “medium”, or “high” for each tool selection. Use this signal for conditional execution, human-in-the-loop triggers, quality metrics, and fallback mechanisms. Never make confidence optional, or your monitoring data becomes inconsistent.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
