Debugging high-cardinality metrics in Prometheus


Our Prometheus instance started OOMing every few days. The usual suspects — long retention, too many targets — checked out fine. The actual cause took longer to find: a single metric with unbounded label cardinality.

Finding the offender

A rising prometheus_tsdb_symbol_table_size_bytes is a leading indicator, but it doesn’t tell you which metric is responsible. The TSDB status page does:

http://prometheus:9090/tsdb-status

It lists the top 10 metric names by series count, along with the label names that have the most distinct values and use the most memory. Ours had one metric at 4 million active series. For comparison, a normal service has a few thousand.
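
The leading indicator mentioned above can also be watched automatically. Here is a minimal sketch of an alerting rule that does so. The group name, alert name, 6h window, and 50 MB threshold are all hypothetical placeholders, not values from our setup, and would need tuning against your own baseline.

```yaml
groups:
  - name: cardinality-early-warning        # hypothetical group name
    rules:
      - alert: SymbolTableGrowingFast      # hypothetical alert name
        # delta() is used because this metric is a gauge.
        # 50e6 bytes over 6h is an assumed threshold, not a recommendation.
        expr: delta(prometheus_tsdb_symbol_table_size_bytes[6h]) > 50e6
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "TSDB symbol table is growing quickly; check /tsdb-status for high-cardinality metrics"
```

Like the metric itself, an alert of this kind only says that something is growing. The TSDB status page is still where you find out which metric it is.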

The root cause

A developer had added a request_id label to a counter, which made every HTTP request its own time series. Prometheus stores every unique label combination as a separate series, so a label with unbounded values is a cardinality bomb.

# Bad
http_requests_total{method="GET", path="/api/users", request_id="abc-123"} 1

# Good
http_requests_total{method="GET", path="/api/users", status="200"} 42

The fix

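As a stopgap, we added a recording rule that aggregates the label away, so dashboards and alerts could query a low-cardinality series while the developer fixed the instrumentation. The raw series are still ingested, so this does not reduce memory use on its own: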

- record: http_requests_total_safe
  expr: sum without(request_id) (http_requests_total)

Then we added a metric_relabel_configs rule in the scrape config to drop the label at ingest time as a long-term guard.
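
A guard of that kind might look roughly like the sketch below. The job name and target are hypothetical placeholders. The labeldrop action and its regex field are standard relabeling options, and the regex is matched against label names.

```yaml
scrape_configs:
  - job_name: api                      # hypothetical job name
    static_configs:
      - targets: ["api:8080"]          # hypothetical target
    metric_relabel_configs:
      # Runs after the scrape and before storage:
      # remove the request_id label from every scraped series.
      - action: labeldrop
        regex: request_id
```

One caveat: the Prometheus documentation warns that series must still be uniquely labeled after a labeldrop. If request_id is the only thing that distinguishes two series in the same scrape, they collide once the label is removed. In that case it may be safer to drop the affected metric entirely until the instrumentation is fixed.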

Takeaway

High cardinality is one of the most common causes of Prometheus memory issues. Label values should come from a bounded set such as status codes, HTTP methods, or service names, never user IDs, request IDs, or timestamps.