Spring Boot Micrometer Monitoring Best Practices - A Practical Guide
The Problem
I had set up Prometheus and Grafana for my Spring Boot application, and I could see JVM memory usage and HTTP request metrics. But when my product manager asked specific questions, I couldn’t answer them:
- “How many orders are being processed per minute?”
- “What’s the failure rate for payment transactions?”
- “How long does it take to sync data from the external API?”
The built-in metrics weren’t enough. I needed custom metrics tailored to my business logic. I tried adding metrics using System.currentTimeMillis() and log statements, but that approach was messy and didn’t integrate with my monitoring stack.
I needed a better way to track business-specific metrics that would work seamlessly with Prometheus and Grafana.
Why Micrometer?
I discovered Micrometer, and it’s the standard solution for Spring Boot metrics. Think of it as “SLF4J for metrics” - you instrument your code once, and Micrometer handles translation to your monitoring backend.
Here’s what makes Micrometer the right choice:
    +-------------------+
    | Your Application  |
    | (Business Logic)  |
    +-------------------+
              |
              v
    +-------------------+     +------------+     +---------+
    | Micrometer        |     |            |     |         |
    | (MeterRegistry)   |---->| Prometheus |---->| Grafana |
    +-------------------+     +------------+     +---------+
              |                                       ^
              |                                       |
              +-----> Datadog ------------------------+
              |                                       |
              +-----> InfluxDB -----------------------+
              |                                       |
              +-----> New Relic ----------------------+

Key benefits I found:
- Vendor neutrality: I can switch from Prometheus to Datadog by changing one dependency
- Spring Boot integration: Auto-configured via Spring Boot Actuator
- Dimensional metrics: Tags enable filtering and drill-down in dashboards
- Battle-tested: Used in production at scale by major companies
Step 1: Understand Meter Types
Micrometer provides four main meter types. I needed to understand when to use each one:
| Meter Type | Behavior | Use Case | Example |
|---|---|---|---|
| Counter | Only increases | Counting events | Orders placed, errors occurred |
| Gauge | Can go up or down | Current state | Active connections, queue size |
| Timer | Measures duration + count | Latency tracking | API response time, DB query time |
| DistributionSummary | Tracks distribution | Size metrics | Request payload size, batch size |
Counter - For Counting Events
Counters only go up. I use them for monotonically increasing values like request counts or completed tasks.
    import io.micrometer.core.instrument.Counter;
    import io.micrometer.core.instrument.MeterRegistry;
    import org.springframework.stereotype.Service;

    @Service
    public class OrderService {
        // The application's own repository, injected by Spring
        private final OrderRepository orderRepository;
        private final Counter ordersCreated;
        private final Counter ordersFailed;

        public OrderService(MeterRegistry registry, OrderRepository orderRepository) {
            this.orderRepository = orderRepository;

            this.ordersCreated = Counter.builder("orders.created")
                    .description("Number of orders created")
                    .tag("type", "online")
                    .baseUnit("orders")
                    .register(registry);

            this.ordersFailed = Counter.builder("orders.failed")
                    .description("Number of failed orders")
                    .tag("type", "online")
                    .register(registry);
        }

        public Order createOrder(OrderRequest request) {
            try {
                Order order = orderRepository.save(request);
                ordersCreated.increment();
                return order;
            } catch (Exception e) {
                ordersFailed.increment();
                throw e;
            }
        }
    }

Important: Never use a Counter for values that can decrease. If you need to track a current value (like active users), use a Gauge instead.
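Because a counter only ever increases, a question like "how many orders per minute?" is answered from the difference between two samples, not from the raw value. The following is a minimal sketch of that arithmetic in plain Java. `CounterRate`, `perMinute`, and the sample values are illustrative assumptions, not part of Micrometer or Prometheus, and a real backend function such as Prometheus's `rate()` does more (for example, extrapolation across the query window).

```java
import java.time.Duration;

/** Illustrative only: derives a per-minute rate from two samples of a monotonic counter. */
final class CounterRate {

    /**
     * @param earlier counter value at the first scrape
     * @param later   counter value at the second scrape
     * @param between time elapsed between the two scrapes
     */
    static double perMinute(double earlier, double later, Duration between) {
        if (between.isZero() || between.isNegative()) {
            throw new IllegalArgumentException("scrapes must be separated by a positive interval");
        }
        // A counter that went down can only mean the process restarted and the
        // counter began again from zero, so the whole later value counts as new.
        double increase = later >= earlier ? later - earlier : later;
        double minutes = between.toMillis() / 60_000.0;
        return increase / minutes;
    }

    public static void main(String[] args) {
        // Hypothetical samples: 1,200 orders at the first scrape, 1,260 thirty seconds later.
        System.out.println(perMinute(1_200, 1_260, Duration.ofSeconds(30))); // prints 120.0
    }
}
```

This is also why a value that can legitimately decrease does not belong in a counter: any drop would be read as a restart.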
Timer - For Measuring Duration
Timers measure both duration and count. They can also publish percentiles (p50, p95, p99), but only when you enable them with publishPercentiles().
    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.Timer;
    import org.springframework.stereotype.Service;
    import java.time.Duration;
    import java.util.concurrent.TimeUnit;

    @Service
    public class PaymentService {
        // The application's own gateway client (type name assumed), injected by Spring
        private final PaymentGateway paymentGateway;
        private final Timer paymentTimer;

        public PaymentService(MeterRegistry registry, PaymentGateway paymentGateway) {
            this.paymentGateway = paymentGateway;
            this.paymentTimer = Timer.builder("payment.processing.time")
                    .description("Time spent processing payments")
                    .tag("gateway", "stripe")
                    .publishPercentiles(0.5, 0.95, 0.99)
                    .publishPercentileHistogram()
                    .minimumExpectedValue(Duration.ofMillis(1))
                    .maximumExpectedValue(Duration.ofSeconds(30))
                    .register(registry);
        }

        // Option 1: Record with supplier
        public PaymentResult processPayment(PaymentRequest request) {
            return paymentTimer.record(() -> paymentGateway.charge(request));
        }

        // Option 2: Manual timing (a separate method, since code after a return is unreachable)
        public PaymentResult processPaymentManually(PaymentRequest request) {
            long start = System.nanoTime();
            try {
                return paymentGateway.charge(request);
            } finally {
                paymentTimer.record(System.nanoTime() - start, TimeUnit.NANOSECONDS);
            }
        }
    }

Timers give me both the count of operations and duration statistics in one meter.
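To make "count plus duration statistics" concrete, here is a small stand-in written in plain Java. It is a sketch under stated assumptions, not Micrometer's implementation: `MiniTimer` is a hypothetical class, it keeps every sample in memory, it is not thread-safe, and it computes exact nearest-rank percentiles, whereas a real Timer keeps bounded state and its published percentiles are approximations.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative stand-in for a timer; NOT Micrometer's implementation and not thread-safe. */
final class MiniTimer {
    private final List<Long> samplesNanos = new ArrayList<>();

    void record(Duration duration) {
        samplesNanos.add(duration.toNanos());
    }

    long count() {
        return samplesNanos.size();
    }

    Duration total() {
        long sum = 0;
        for (long nanos : samplesNanos) {
            sum += nanos;
        }
        return Duration.ofNanos(sum);
    }

    Duration mean() {
        return samplesNanos.isEmpty() ? Duration.ZERO : total().dividedBy(count());
    }

    /** Exact nearest-rank percentile, e.g. percentile(0.95) for p95. */
    Duration percentile(double p) {
        if (samplesNanos.isEmpty()) {
            return Duration.ZERO;
        }
        List<Long> sorted = new ArrayList<>(samplesNanos);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.size());
        int index = Math.min(Math.max(rank, 1), sorted.size()) - 1;
        return Duration.ofNanos(sorted.get(index));
    }

    public static void main(String[] args) {
        MiniTimer timer = new MiniTimer();
        for (long millis : new long[] {100, 200, 300, 400}) { // hypothetical latencies
            timer.record(Duration.ofMillis(millis));
        }
        System.out.println(timer.count());                     // prints 4
        System.out.println(timer.mean().toMillis());           // prints 250
        System.out.println(timer.percentile(0.95).toMillis()); // prints 400
    }
}
```

The count and total are enough to derive a mean on the dashboard side; percentiles need extra data, which is why they have to be switched on explicitly.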
Gauge - For Current Values
Gauges sample a value at observation time. They’re perfect for current state metrics.
    import io.micrometer.core.instrument.Gauge;
    import io.micrometer.core.instrument.MeterRegistry;
    import java.util.concurrent.atomic.AtomicLong;
    import org.springframework.stereotype.Service;

    @Service
    public class ConnectionPoolService {
        private final AtomicLong activeConnections = new AtomicLong(0);
        // connectionPool is the application's own pool; its declaration is omitted here

        public ConnectionPoolService(MeterRegistry registry) {
            // Gauge samples the value from AtomicLong
            Gauge.builder("connections.active", activeConnections, AtomicLong::get)
                    .description("Number of active database connections")
                    .tag("pool", "main")
                    .register(registry);
        }

        public Connection acquireConnection() {
            Connection conn = connectionPool.borrowObject();
            // Count only after a successful borrow, so a failure cannot inflate the gauge
            activeConnections.incrementAndGet();
            return conn;
        }

        public void releaseConnection(Connection conn) {
            activeConnections.decrementAndGet();
            connectionPool.returnObject(conn);
        }
    }

Gotcha I learned: Gauges are sampled, not pushed. Prometheus scrapes the current value when it queries the endpoint.
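The "sampled, not pushed" behavior can be sketched in a few lines of plain Java. `SampledGauge` below is a hypothetical stand-in, not Micrometer's Gauge: it only holds a reference to the observed object and a function to read it, and it computes a value when asked. Anything that happens between two reads is invisible.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.ToDoubleFunction;

/** Illustrative stand-in for a gauge: the value is read only when someone asks for it. */
final class SampledGauge<T> {
    private final T source;
    private final ToDoubleFunction<T> reader;

    SampledGauge(T source, ToDoubleFunction<T> reader) {
        this.source = source;
        this.reader = reader;
    }

    /** Called at observation time, e.g. once per scrape. */
    double value() {
        return reader.applyAsDouble(source);
    }

    public static void main(String[] args) {
        AtomicLong active = new AtomicLong();
        SampledGauge<AtomicLong> gauge = new SampledGauge<>(active, AtomicLong::get);

        active.incrementAndGet(); // 1
        active.incrementAndGet(); // 2 (this peak is never observed)
        active.decrementAndGet(); // 1

        System.out.println(gauge.value()); // prints 1.0
    }
}
```

A short spike that starts and ends between two scrapes never reaches the dashboard, so a gauge suits state that changes slowly relative to the scrape interval.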
Step 2: Follow Naming Conventions
I made mistakes with metric naming initially. Here’s what I learned:
Use Lowercase Dot Notation
    // GOOD - Consistent, readable
    Counter.builder("http.server.requests")
            .tag("method", "GET")
            .tag("uri", "/api/users");

    Counter.builder("database.queries")
            .tag("table", "users")
            .tag("operation", "select");

    // BAD - Inconsistent, hard to filter
    Counter.builder("HTTP_Requests");
    Counter.builder("databaseQueries");
    Counter.builder("user-query-count");

Name the Thing Being Measured
The metric name should answer “what is being measured?”:
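    // Clear what's being measured
    Timer.builder("http.server.requests");
    Counter.builder("orders.created");
    // A Gauge builder also needs the object to observe and a function to read it
    Gauge.builder("jvm.memory.used", Runtime.getRuntime(), r -> r.totalMemory() - r.freeMemory());

    // Unclear - what is this?
    Timer.builder("timing");
    Counter.builder("count");
    Gauge.builder("value", Runtime.getRuntime(), r -> r.totalMemory() - r.freeMemory());

Step 3: Use Tags Effectively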
Tags (also called labels in Prometheus) enable dimensional metrics. They let me filter and group metrics in Grafana.
Tags for Granularity
    // Track HTTP requests by method, endpoint, and status
    registry.counter("http.server.requests",
            "method", "GET",
            "uri", "/api/users",
            "status", "200");

    // Track database calls by operation
    registry.counter("database.queries",
            "database", "users",
            "operation", "select",
            "table", "orders");

    // Track business metrics by region
    registry.counter("orders.processed",
            "region", "us-east-1",
            "service", "order-service");

Now in Grafana, I can query:
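- Total orders: sum(orders_processed_total)
- Orders by region: sum by (region) (orders_processed_total)
- US-east orders only: orders_processed_total{region="us-east-1"}

Note the _total suffix: the Prometheus registry appends it to counter names, so orders.processed is exposed as orders_processed_total.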
Common Tags for All Metrics
I set application-wide tags so every metric includes context:
    import io.micrometer.core.instrument.MeterRegistry;
    import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class MetricsConfig {

        @Bean
        public MeterRegistryCustomizer<MeterRegistry> commonTags() {
            return registry -> registry.config()
                    .commonTags("application", "order-service")
                    .commonTags("environment", "production")
                    .commonTags("region", "us-east-1");
        }
    }

Or via application properties:
    spring:
      application:
        name: order-service

    management:
      metrics:
        tags:
          application: ${spring.application.name}
          environment: ${DEPLOY_ENV:development}
          region: ${AWS_REGION:us-east-1}

Step 4: Avoid Cardinality Explosion
I learned this the hard way. High-cardinality tags can crash Prometheus.
The Problem: Too Many Time Series
    // DANGEROUS - Creates a time series for each user
    Counter.builder("api.requests")
            .tag("userId", userId) // Could be millions of users!
            .register(registry)
            .increment();

    // DANGEROUS - Creates a time series for each request ID
    Counter.builder("http.requests")
            .tag("requestId", UUID.randomUUID().toString()) // Unique per request!
            .register(registry)
            .increment();

With 100,000 users, the first counter creates 100,000 time series. With 1 million requests, the second creates 1 million time series. Prometheus will run out of memory.
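The numbers grow this fast because tag values multiply: in the worst case, one metric produces a time series for every combination of tag values. A minimal sketch of that arithmetic, with a hypothetical class name and made-up tag counts chosen only for illustration:

```java
/** Illustrative only: worst-case number of time series for a single metric. */
final class SeriesEstimate {

    /** Upper bound: the product of the number of distinct values each tag can take. */
    static long maxSeries(long... distinctValuesPerTag) {
        long total = 1;
        for (long distinctValues : distinctValuesPerTag) {
            // multiplyExact throws instead of silently overflowing
            total = Math.multiplyExact(total, distinctValues);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical tag counts: 5 methods x 20 endpoint patterns x 8 status codes
        System.out.println(maxSeries(5, 20, 8));          // prints 800

        // The same metric with a userId tag for 100,000 users
        System.out.println(maxSeries(5, 20, 8, 100_000)); // prints 80000000
    }
}
```

One unbounded tag is enough to turn a few hundred series into tens of millions.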
The Solution: Low-Cardinality Tags
    // GOOD - Low cardinality (3-4 values)
    Counter.builder("api.requests")
            .tag("userType", getUserType(userId)) // "free", "premium", "enterprise"
            .tag("tier", getUserTier(userId))     // "tier1", "tier2", "tier3"
            .register(registry)
            .increment();

    // GOOD - Bounded values
    Counter.builder("http.requests")
            .tag("endpoint", getEndpointPattern(uri)) // "/api/users/{id}"
            .tag("method", method)                    // "GET", "POST", etc.
            .tag("status", String.valueOf(status))    // "200", "404", "500"
            .register(registry)
            .increment();

Rule of thumb: Keep cardinality under 10 for each tag, and total time series under 100,000.
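The snippet above calls getEndpointPattern(uri) without showing it. One possible shape for such a helper is sketched below, assuming that identifiers in the path are either purely numeric or UUIDs and that {id} is an acceptable placeholder for both. The class name and regular expressions are illustrative choices; an application with other identifier formats would need different rules.

```java
import java.util.regex.Pattern;

/** Illustrative helper: collapses concrete IDs in a URI into a bounded endpoint pattern. */
final class EndpointPatterns {
    // A path segment made only of digits, e.g. /123
    private static final Pattern NUMERIC_SEGMENT = Pattern.compile("/\\d+(?=/|$)");

    // A path segment that is a UUID, e.g. /123e4567-e89b-12d3-a456-426614174000
    private static final Pattern UUID_SEGMENT = Pattern.compile(
            "/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?=/|$)");

    static String getEndpointPattern(String uri) {
        // Query strings are unbounded too, so drop them first
        int queryStart = uri.indexOf('?');
        String path = queryStart >= 0 ? uri.substring(0, queryStart) : uri;

        String withoutUuids = UUID_SEGMENT.matcher(path).replaceAll("/{id}");
        return NUMERIC_SEGMENT.matcher(withoutUuids).replaceAll("/{id}");
    }

    public static void main(String[] args) {
        System.out.println(getEndpointPattern("/api/users/123"));
        // prints /api/users/{id}
        System.out.println(getEndpointPattern("/api/users/123/orders/456?page=2"));
        // prints /api/users/{id}/orders/{id}
    }
}
```

Segments such as /v2 are left alone because they are not purely numeric, so versioned paths keep their own series.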
Step 5: A Complete Example
Here’s a real-world example I use in my payment service:
    import io.micrometer.core.instrument.*;
    import org.springframework.stereotype.Service;
    import java.util.concurrent.atomic.AtomicLong;

    @Service
    public class PaymentService {
        // The application's own Stripe client (type name assumed), injected by Spring
        private final StripeGateway stripeGateway;
        private final Counter paymentsProcessed;
        private final Counter paymentsFailed;
        private final Timer paymentTimer;
        private final AtomicLong pendingPayments = new AtomicLong(0);

        public PaymentService(MeterRegistry registry, StripeGateway stripeGateway) {
            this.stripeGateway = stripeGateway;

            // Counter for successful payments
            this.paymentsProcessed = Counter.builder("payments.processed")
                    .description("Total payments processed successfully")
                    .tag("service", "payment")
                    .tag("gateway", "stripe")
                    .register(registry);

            // Counter for failed payments
            this.paymentsFailed = Counter.builder("payments.failed")
                    .description("Total failed payments")
                    .tag("service", "payment")
                    .tag("gateway", "stripe")
                    .register(registry);

            // Timer for payment processing duration
            this.paymentTimer = Timer.builder("payments.processing.time")
                    .description("Time to process payment")
                    .tag("service", "payment")
                    .tag("gateway", "stripe")
                    .publishPercentiles(0.5, 0.95, 0.99)
                    .publishPercentileHistogram()
                    .register(registry);

            // Gauge for pending payments
            Gauge.builder("payments.pending", pendingPayments, AtomicLong::get)
                    .description("Number of payments currently being processed")
                    .tag("service", "payment")
                    .register(registry);
        }

        public PaymentResult processPayment(PaymentRequest request) {
            pendingPayments.incrementAndGet();

            try {
                PaymentResult result = paymentTimer.record(() -> stripeGateway.charge(request));

                paymentsProcessed.increment();
                return result;

            } catch (Exception e) {
                paymentsFailed.increment();
                throw new PaymentException("Payment failed", e);
            } finally {
                pendingPayments.decrementAndGet();
            }
        }
    }

This gives me:
- payments_processed_total: Total successful payments
- payments_failed_total: Total failed payments
- payments_processing_time_seconds: Duration statistics (p50, p95, p99)
- payments_pending: Current number of payments in flight
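The exposed names differ from the dotted names used in the code: dots become underscores, counters gain a _total suffix, and timers gain a _seconds suffix. The sketch below reproduces only that visible pattern for the four meters in this example. It is a simplified illustration with hypothetical names (`PrometheusNames`, `MeterKind`), not Micrometer's actual naming code, which handles more cases such as base units and invalid characters.

```java
/** Illustrative only: mimics how the dotted names above show up in Prometheus. */
final class PrometheusNames {

    enum MeterKind { COUNTER, GAUGE, TIMER }

    static String toPrometheusName(String dottedName, MeterKind kind) {
        String base = dottedName.replace('.', '_');
        switch (kind) {
            case COUNTER:
                // Counters are exposed with a _total suffix
                return base.endsWith("_total") ? base : base + "_total";
            case TIMER:
                // Timers are exposed in seconds
                return base.endsWith("_seconds") ? base : base + "_seconds";
            default:
                return base;
        }
    }

    public static void main(String[] args) {
        System.out.println(toPrometheusName("payments.processed", MeterKind.COUNTER));
        // prints payments_processed_total
        System.out.println(toPrometheusName("payments.processing.time", MeterKind.TIMER));
        // prints payments_processing_time_seconds
        System.out.println(toPrometheusName("payments.pending", MeterKind.GAUGE));
        // prints payments_pending
    }
}
```

Knowing the mapping helps when a dashboard query returns nothing: the name to search for in Prometheus is the converted one, not the one passed to the builder.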
Step 6: Test Metrics Locally
I always test metrics before deploying. Here’s my workflow:
Start the Application
    ./mvnw spring-boot:run

Check Available Metrics

    # List all available metrics
    curl http://localhost:8080/actuator/metrics

    # Get specific metric
    curl http://localhost:8080/actuator/metrics/payments.processed

    # Check Prometheus format
    curl http://localhost:8080/actuator/prometheus | grep payments

Expected Output
    # HELP payments_processed_total Total payments processed successfully
    # TYPE payments_processed_total counter
    payments_processed_total{application="order-service",environment="production",gateway="stripe",service="payment"} 42.0

    # HELP payments_processing_time_seconds Time to process payment
    # TYPE payments_processing_time_seconds summary
    payments_processing_time_seconds_count{application="order-service",gateway="stripe",service="payment"} 42.0
    payments_processing_time_seconds_sum{application="order-service",gateway="stripe",service="payment"} 12.34
    payments_processing_time_seconds{application="order-service",gateway="stripe",service="payment",quantile="0.5"} 0.25
    payments_processing_time_seconds{application="order-service",gateway="stripe",service="payment",quantile="0.95"} 0.89
    payments_processing_time_seconds{application="order-service",gateway="stripe",service="payment",quantile="0.99"} 1.23

Common Issues I Encountered
Issue 1: Metric Not Showing Up
Symptoms: Metric doesn’t appear in /actuator/prometheus
Causes I found:
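- Meter not registered: Call .register(registry)
- Meter never created: A meter registered in a constructor shows up immediately with a value of 0, but one created inline, such as registry.counter(...) inside a method, only appears after that code has run
- Filtered by management settings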
Fix:
    // WRONG - Not registered (this is still only a builder, no Counter exists yet)
    Counter.Builder builder = Counter.builder("my.metric")
            .tag("key", "value");

    // CORRECT - Registered
    Counter counter = Counter.builder("my.metric")
            .tag("key", "value")
            .register(registry); // <-- Important!

Issue 2: Wrong Metric Type
Symptoms: Using Counter for values that go down
Wrong approach:
    // WRONG - Counter only increases
    Counter userCount = registry.counter("users.active");
    userCount.increment(); // When user logs in
    // Can't decrement when user logs out!

Correct approach:

    // CORRECT - Gauge can go up and down
    AtomicLong activeUsers = new AtomicLong(0);

    Gauge.builder("users.active", activeUsers, AtomicLong::get)
            .register(registry);

    // Increment/decrement as needed
    activeUsers.incrementAndGet(); // User logs in
    activeUsers.decrementAndGet(); // User logs out

Issue 3: Prometheus Memory Issues
Symptoms: Prometheus runs out of memory, slow queries
Cause: High-cardinality tags
How to find the problem:
    topk(10, count by (__name__) ({__name__=~".+"}))

This shows the top 10 metrics by cardinality. Look for metrics with millions of time series.
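The same check can be run against a single application before Prometheus is involved, by counting series per metric name in the text returned from /actuator/prometheus. The following is a rough sketch, assuming the response body has already been fetched into a string. `SeriesPerMetric` is a hypothetical helper and the parsing is deliberately simple: it skips comment lines and takes everything before the first `{` or space as the metric name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: counts time series per metric name in Prometheus text output. */
final class SeriesPerMetric {

    static Map<String, Integer> count(String expositionText) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : expositionText.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // blank line, or a HELP/TYPE comment
            }
            int brace = trimmed.indexOf('{');
            int space = trimmed.indexOf(' ');
            // The name ends at the label block if there is one, else at the first space
            int end = brace >= 0 && (space < 0 || brace < space) ? brace : space;
            if (end <= 0) {
                continue; // not a sample line
            }
            counts.merge(trimmed.substring(0, end), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical scrape output
        String sample = String.join("\n",
                "# TYPE orders_processed_total counter",
                "orders_processed_total{region=\"us-east-1\"} 10.0",
                "orders_processed_total{region=\"eu-west-1\"} 4.0",
                "jvm_threads_live_threads 23.0");
        System.out.println(count(sample));
        // prints {orders_processed_total=2, jvm_threads_live_threads=1}
    }
}
```

A metric whose count keeps climbing between two runs of this check is the likely source of the cardinality problem.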
Issue 4: Percentiles Missing
Symptoms: No p95/p99 values in Timer metrics
Fix: Enable percentile publishing
    Timer.builder("my.timer")
            .publishPercentiles(0.5, 0.95, 0.99) // Enable percentiles
            .publishPercentileHistogram()        // For Prometheus histogram
            .register(registry);

Production Considerations
Separate Management Port
I expose metrics on a different port for security:
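    management:
      server:
        port: 9090
        address: 0.0.0.0

    server:
      port: 8080

Now the metrics endpoint is served on port 9090, separate from application traffic. Because address 0.0.0.0 binds to every network interface, the port is only private if firewall or network rules limit it to internal monitoring systems.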
Disable Unnecessary Metrics
Spring Boot enables many metrics by default. I disable what I don’t need:
    management:
      metrics:
        enable:
          jvm: true
          process: true
          tomcat: true
          http: true
          logback: false
          uptime: false

Resource Limits
In Kubernetes, I point the health probes at the management port and set memory limits on the container:

    livenessProbe:
      httpGet:
        path: /actuator/health
        port: 9090
      initialDelaySeconds: 30

    readinessProbe:
      httpGet:
        path: /actuator/health
        port: 9090
      initialDelaySeconds: 10

    resources:
      limits:
        memory: "512Mi"
      requests:
        memory: "256Mi"

Summary
In this post, I covered how to set up and use Micrometer for Spring Boot monitoring with best practices. I started with the problem of needing business-specific metrics beyond the defaults, then walked through the solution using Micrometer’s meter types.
The key practices I follow:
- Choose the right meter type: Counter for totals, Gauge for current values, Timer for durations
- Follow naming conventions: Use lowercase dot notation and descriptive names
- Use tags wisely: Enable filtering without creating cardinality explosion
- Set common tags: Add application and environment context to all metrics
- Test locally: Verify metrics appear correctly before deploying
- Configure for production: Separate management port and resource limits
Micrometer’s integration with Spring Boot Actuator means I can focus on instrumenting my business logic while Micrometer handles translation to Prometheus, Datadog, or any other monitoring backend.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Micrometer Documentation
- Spring Boot Actuator Reference
- Prometheus Configuration Guide
- Micrometer Concept Reference
- Spring Boot Metrics Documentation
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!