The Only Prometheus Metrics I Actually Alert On

I used to instrument everything. Every function call, every cache hit, every database query. My Prometheus instance was ingesting somewhere north of 50,000 samples per second across three services, and I thought that density meant rigor. Then our checkout flow went down at 3 AM during a sale event, and I spent twenty minutes scrolling through dashboards before I found the single metric that mattered: connection pool exhaustion on the payments database. It had been queuing for six minutes before queries started timing out. I had a metric for it. I just wasn't alerting on it.

That's the gap this post is about. Not the theoretical list of things you could track, but what I've found worth waking up for.

Golden Signals are a starting point, not an answer

Google's four Golden Signals (latency, traffic, errors, saturation) point you in the right direction. But they're underspecified enough that following them naively leads to useless alerts.

Latency without percentiles tells you nothing actionable. If your p99 is 2 seconds and your mean is 50ms, the mean is actively misleading. Users hit the tail, not the average. I track p50, p95, p99, and max. The gap between p95 and p99 is often the most interesting number. A large gap usually means a specific slow path (a missing index, lock contention, an N+1 query) rather than a general performance problem.

Errors need to distinguish user-visible failures from internal retries. A database timeout that triggers a retry and eventually succeeds is not the same failure mode as a 500 returned to the user, but both increment error counters in most client libraries by default. I split by severity: critical for user-visible failures, warning for degraded-but-functional, and info for retried-successfully events. Only critical fires a page.
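One way to make that split mechanical is to classify every failure before it touches a counter. A minimal, client-library-agnostic sketch (the function name and severity strings are mine, not a standard):

```python
def classify_failure(user_visible: bool, retried_ok: bool) -> str:
    """Map a failure to the alert severity it should carry.

    critical -> pages; warning -> business hours; info -> logged only.
    """
    if user_visible:
        return "critical"   # a 500 actually reached the user
    if retried_ok:
        return "info"       # timed out, but the retry succeeded
    return "warning"        # degraded internally, not user-visible
```

The returned severity then becomes a label on the error counter (e.g. errors_total{severity="critical"}), and only the critical series feeds the paging route.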

What I actually instrument in HTTP services

I instrument at the edge, a single middleware or handler wrapper, so every endpoint gets consistent labels automatically. Three metrics:

http_requests_total{method, path, status}
http_request_duration_seconds{method, path, status, le}
http_in_flight_requests{method, path}

http_requests_total gives you rate and error ratio. The histogram gives you latency at any percentile. In-flight requests catch saturation before it shows up in latency. By the time requests slow down, you've usually been saturated for a while.
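As a sketch of the edge-wrapper idea, here is a dict-backed version of those three metrics, with plain counters standing in for a real client library's Counter, Histogram, and Gauge (the handler signature and names are illustrative, not a specific framework's API):

```python
import time
from collections import defaultdict

# Dict-backed stand-ins for the three edge metrics.
requests_total = defaultdict(int)          # (method, path, status) -> count
request_duration_sum = defaultdict(float)  # (method, path) -> total seconds
in_flight = defaultdict(int)               # (method, path) -> current gauge

def instrument(handler):
    """Wrap a handler(method, path) -> status so every endpoint is measured
    with the same labels, no per-endpoint instrumentation needed."""
    def wrapped(method, path):
        key = (method, path)
        in_flight[key] += 1                # saturation signal, updated first
        start = time.monotonic()
        try:
            status = handler(method, path)
            requests_total[(method, path, status)] += 1
            return status
        finally:
            in_flight[key] -= 1
            request_duration_sum[key] += time.monotonic() - start
    return wrapped
```

With a real client library the wrapper body stays the same shape; only the three stand-ins change.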

The mistake I made early on: I used the full request path as a label, so /users/12345/profile and /users/67890/profile became separate time series. At any meaningful user count, cardinality explodes and Prometheus runs out of memory. Sanitize paths before labeling. Replace ID segments with a placeholder like /users/{id}/profile. This is obvious in retrospect but I've seen it sink three separate teams' setups.
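A sanitizer can be a small pure function applied before the path becomes a label. A sketch, assuming numeric and UUID path segments are the ID shapes you care about (extend the patterns for your own URL scheme):

```python
import re

# Segment shapes to collapse into placeholders. Anything that can take
# unbounded values must not survive into a label.
_NUMERIC = re.compile(r"^\d+$")
_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)

def sanitize_path(path: str) -> str:
    """Collapse ID-like segments so /users/12345/profile and
    /users/67890/profile become one time series."""
    parts = []
    for seg in path.split("/"):
        if _NUMERIC.match(seg):
            parts.append("{id}")
        elif _UUID.match(seg):
            parts.append("{uuid}")
        else:
            parts.append(seg)
    return "/".join(parts)
```

Run this in the middleware, once, so no handler can forget it.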

For gRPC, same pattern, but I add grpc_code as a label. gRPC status codes are more expressive than HTTP codes (RESOURCE_EXHAUSTED vs UNAVAILABLE vs DEADLINE_EXCEEDED all have different remediation paths), and I reference them directly in alert conditions.

Database metrics: the failure mode that sneaks up on you

Connection pool exhaustion is the failure mode that hurts most and is hardest to detect early. By the time queries are timing out, you've been at capacity for minutes. These come from the application, not the database server:

db_pool_connections_active
db_pool_connections_idle
db_pool_connections_wait_count_total
db_query_duration_seconds{query, le}

The wait count is your leading indicator. I alert when rate(db_pool_connections_wait_count_total[5m]) > 0, but I treat it as a warning, not a page. Brief queuing can happen under bursty traffic without indicating a real problem. Sustained queuing (for more than a few minutes) usually means an undersized pool or a connection leak. The alert tells me to look; I don't automatically assume the worst.
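To make the leading-indicator idea concrete, here is a toy pool that ticks a wait counter at the moment a borrower finds it empty, which is when db_pool_connections_wait_count_total would increment in a real driver (class and method names are illustrative):

```python
import queue

class InstrumentedPool:
    """Toy fixed-size connection pool that counts checkout waits.

    wait_count is the analogue of db_pool_connections_wait_count_total:
    it ticks whenever the pool is empty at acquire time, before anything
    times out. That is the signal that fires minutes ahead of user pain.
    """
    def __init__(self, size):
        self._conns = queue.Queue()
        for i in range(size):
            self._conns.put(f"conn-{i}")
        self.wait_count = 0

    def acquire(self, timeout=1.0):
        if self._conns.empty():   # leading indicator: we are about to queue
            self.wait_count += 1
        return self._conns.get(timeout=timeout)  # raises queue.Empty on timeout

    def release(self, conn):
        self._conns.put(conn)
```

A real pool would export wait_count as a monotonic counter; the alert then watches its rate, not its absolute value.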

Query duration histograms need custom buckets. Default client library buckets assume sub-second operations, but a report query or a schema migration step might legitimately take 30 seconds. I use: 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0. If your query durations don't fit this range, adjust. The buckets should bracket your actual distribution.
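For intuition about what those buckets actually record: a Prometheus histogram stores cumulative counts per le boundary, plus a +Inf bucket that counts everything. A sketch using the bucket list above (the function is mine, for illustration; the client library does this internally):

```python
# Custom bucket boundaries from the text, in seconds.
BUCKETS = [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5,
           1.0, 2.5, 5.0, 10.0, 30.0]

def bucket_counts(observations):
    """Cumulative counts per `le` boundary, the way Prometheus stores them:
    each observation increments every bucket whose boundary it fits under."""
    counts = {le: 0 for le in BUCKETS}
    counts[float("inf")] = 0
    for obs in observations:
        for le in BUCKETS:
            if obs <= le:
                counts[le] += 1
        counts[float("inf")] += 1   # +Inf always counts everything
    return counts
```

Note how a 45-second migration step lands only in +Inf; if most of your traffic does that, histogram_quantile can't tell you anything above your largest boundary, which is why the buckets must bracket the real distribution.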

Server metrics: what node_exporter exports vs. what matters

Node exporter exports 700+ metrics by default in current versions (and more with optional collectors enabled). Most are noise for application operators. The ones I use consistently:

  • node_cpu_seconds_total: CPU utilization via rate()
  • node_memory_MemAvailable_bytes: not MemFree; available includes reclaimable cache and gives a realistic picture of OOM risk
  • node_filesystem_avail_bytes: disk space
  • node_load1: secondary signal, never primary

When I care about CPU, I filter by mode. High iowait means CPUs are sitting idle waiting on disk; high steal in virtualized environments means you're competing with other tenants for CPU time. High user and system time is usually just the application doing work. Iowait and steal are the modes that suggest something is wrong upstream of your application.

The Alertmanager config I wish I'd started with

My first Alertmanager config routed everything to a single Slack channel. During a cascading failure, I received 400 messages in 10 minutes and missed the root cause entirely. The cascade was loud enough to bury the signal. Here's the structure I've settled on:

route:
  group_by: ['alertname', 'cluster', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - matchers:
        - severity = critical
      receiver: pagerduty
      continue: true
    - matchers:
        - severity = warning
      receiver: slack
      group_interval: 30m
flowchart TD
  A[Prometheus fires alerts] --> B[Alertmanager]
  B --> C[Alert Processing]
  C --> D[Deduplication]
  C --> E[Grouping by alertname, cluster, severity]
  C --> F[Apply Silences]
  D --> G[Route to Receivers]
  E --> G
  F --> G
  G -->|severity=critical| H[PagerDuty]
  G -->|severity=critical + continue=true| I[Slack]
  G -->|severity=warning| J[Slack batched 30m]
  G -->|severity=info| K[Slack batched 30m]
  H --> L[Wake someone up]
  I --> M[Channel context]
  J --> N[Look in business hours]

Critical alerts hit PagerDuty immediately. Warnings batch into 30-minute windows. This prevents alert fatigue while keeping urgent pages urgent.

Alert fatigue is real. When everything pages, nothing pages. Teams start muting notifications because they can't sleep, and then the alerts that matter get buried too. Better to have three alerts you actually respond to than thirty you've learned to ignore.

The continue: true on the critical route matters. Without it, a matching route stops processing entirely, so critical alerts would never reach Slack. I want them in both places: PagerDuty wakes someone up, Slack gives the channel context.

A practical warning: Alertmanager's matching syntax changed meaningfully between versions. Before v0.22, you use match: and match_re: as maps. From v0.22 onward, the recommended syntax is matchers: with a list format. In practice, many deployments run mixed versions. The docs you find may not match what you've deployed. Check your version first.

Recording rules: when complexity pays off

Recording rules pre-compute expensive queries and make dashboards and alerts faster. I use them in two situations: when a query takes more than a second to evaluate, and when I reference the same complex expression in multiple alerts or dashboards. Outside those two cases, they add complexity without payoff.

groups:
  - name: http_rates
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)

With those recorded, the error ratio alert simplifies to:

- alert: HighErrorRate
  expr: job:http_errors:rate5m / job:http_requests:rate5m > 0.01
  for: 5m

Without recording rules, that division across all time series can time out on high-cardinality metrics. Whether you need recording rules at all depends on your scale. At small cardinality, raw queries work fine.

Alert thresholds and avoiding false positives

Every alert I write includes a for: duration. An instant threshold crossing is almost always a deployment blip, a brief GC pause, or a transient spike. Not something worth paging about. I use 5 minutes for error rate and latency, 10 minutes for saturation and capacity, and 0 minutes only for complete outages or security events.

I include a severity label in every alert. Without it, Alertmanager can't route correctly. Critical means wake someone up. Warning means look at it during business hours. Info means log it.

What I stopped monitoring

I don't alert on memory usage anymore. High memory isn't a problem, OOM is. For containers, alert on container_oom_events_total from cAdvisor. For VMs or bare metal, watch the OOM killer logs. Memory pressure without OOM is usually just efficient caching.

I don't alert on disk I/O wait directly. I alert on latency increases that correlate with high iowait. The user pain is the latency, not the disk activity.

I don't use predictive disk fill alerts ("disk will fill in 4 hours"). They generate false positives during batch jobs and log rotations without giving you actionable signal. Instead, I alert at 85% capacity with a 1-hour for: duration. This is a simpler threshold, but it assumes you're monitoring write rates separately for services where fill speed matters. If a service can fill a disk in under an hour, 85% may not give you enough runway, and you'd want to tighten the threshold or add a rate-based rule.

Grafana: dashboards for humans under pressure

When I get paged at 3 AM, I need to understand what's broken in under 30 seconds. That's the design constraint. My main dashboard has four panels: request rate (QPS), error rate percentage, p99 latency, and active alerts by severity. Everything else lives on sub-dashboards, linked from variable-based drill-downs.

A few things I've found useful beyond the basics: variable-based filtering lets you scope a dashboard to a single service or cluster without duplicating it. Template variables with $job and $instance selectors give you one dashboard that works across all your services. I also keep a separate dashboard for saturation signals: connection pool utilization, thread pool depth, queue length. These are leading indicators I want to check when latency starts rising but errors haven't followed yet.

For the panels themselves, I use recording rule data rather than raw metrics. Dashboards render faster and the difference in precision doesn't matter for human-readable trend graphs.

The cardinality trap

High cardinality is the most common way to break Prometheus. Each unique label combination creates a new time series, and the numbers compound fast: 1,000 pods each exporting 100 metrics with 10 label combinations is already 1 million series, and Prometheus's memory usage grows roughly linearly with series count.

My constraints: no more than 5 labels per metric, no unbounded labels (user IDs, session IDs, request IDs, UUIDs), and no single metric with more than ~1,000 unique label combinations in practice.
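The series-count constraint is easy to check mechanically, since a series is just a unique label combination. A toy illustration (the function name is mine; in production you'd query prometheus_tsdb_head_series instead):

```python
def series_count(samples):
    """Count unique label combinations across a batch of samples.

    Each distinct combination is one time series, which is the quantity
    that drives TSDB memory. Label order within a dict doesn't matter.
    """
    return len({tuple(sorted(labels.items())) for labels in samples})
```

Running this against a metric's label sets in a pre-merge test is a cheap way to catch an unbounded label before it ships.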

I use prometheus_tsdb_head_series to watch my own cardinality. When it grows unexpectedly between deploys, I track down the culprit with:

topk(10, count by (__name__)({job="myjob"}))

This shows which metric names have the most series. From there, a count by (label_name) query on the offending metric usually surfaces the high-cardinality label immediately.

About the Author

Asaduzzaman Pavel

Software Engineer who actually enjoys the friction of well-architected systems. 15+ years building high-performance backends and infrastructure that handles real-world chaos at scale.
