High‑cardinality metrics dos and don’ts — Performance Tuning Guide — Practical Guide (Jan 31, 2026)

Level: Intermediate

As of January 31, 2026

Introduction

High-cardinality metrics—those with a large number of unique label or tag values—can provide granular insight into system behaviour, but pose significant challenges to performance, storage, and querying efficiency in modern observability systems. Popular monitoring tools like Prometheus, Datadog, and OpenTelemetry-based platforms have evolved considerably, but managing high-cardinality metrics remains a critical tuning concern.

This guide covers practical dos and don’ts for handling high-cardinality metrics in your monitoring setup, focusing primarily on Prometheus (v2.40+), OpenTelemetry Collector (stable since v0.79), and related ecosystem tools as of early 2026.

Prerequisites

  • Familiarity with metric concepts such as labels (tags), time series, and cardinality
  • Experience using metrics collection tools (Prometheus, OpenTelemetry Collector, or similar)
  • Access to your metric storage backend (Prometheus TSDB, Thanos, Cortex, or hosted SaaS)

Note: This guide assumes use of stable features. Preview or experimental metrics APIs, such as those currently evolving in OpenTelemetry 1.20+, require separate consideration.

Understanding High Cardinality and Its Impact

Cardinality refers to the number of unique series generated by distinct combinations of labels. For example, a counter metric with labels region and user_id can easily produce millions of unique series when user IDs are unbounded.

Problems with excessive cardinality include:

  • Excessive memory and CPU usage in metric scrapers and servers
  • Increased storage costs and slower queries
  • Potential for metric ingestion failures when limits are exceeded
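You can watch for these symptoms directly. Prometheus tracks its own active series count in the head block, so a single query shows the live total and whether it is trending upwards:

# Total number of series currently held in the Prometheus head block
prometheus_tsdb_head_series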

Hands-on Steps

1. Identify high-cardinality labels

Begin by inspecting your current metrics. Use PromQL aggregations to measure the cardinality of individual labels:

# Count the number of distinct user_id values across all series
count(count by (user_id)({__name__=~".+"}))

The inner count by (user_id) groups every series by that label and the outer count tallies the groups, so the result is the number of distinct user_id values present in your metrics. A large number flags a label worth dropping or aggregating away.
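It is also useful to see which metric names contribute the most series overall. The following query (plain PromQL, nothing assumed beyond a reachable Prometheus server) ranks the ten worst offenders:

# Top 10 metric names by number of active series
topk(10, count by (__name__)({__name__=~".+"}))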

2. Apply cardinality limits early

In the OpenTelemetry Collector (v0.79+), remove high-cardinality attributes from metric data points before export. Note that the filter processor drops whole metrics or data points; to delete individual attributes such as user IDs or session tokens, the transform processor (or the attributes processor) is the usual choice. A minimal sketch using the transform processor (receivers and exporters omitted for brevity):

processors:
  transform/high_cardinality:
    metric_statements:
      - context: datapoint
        statements:
          - delete_key(attributes, "user_id")
          - delete_key(attributes, "session_token")
service:
  pipelines:
    metrics:
      processors: [transform/high_cardinality]

Similarly, configure Prometheus scrape jobs to drop high-cardinality labels from ingested samples. labeldrop matches label names with a regex (it does not use source_labels) and should live under metric_relabel_configs, which runs after each scrape:

scrape_configs:
- job_name: 'myapp'
  metric_relabel_configs:
  - regex: 'user_id'
    action: labeldrop

Be aware that dropping a label which distinguishes otherwise identical series merges them and can trigger duplicate-sample errors, so fixing the instrumentation at the source is preferable where possible.

3. Use aggregation to reduce dimensionality

When possible, aggregate metrics at the source (application or exporter level) to reduce label cardinality. For instance, count errors by broader categories like error_type or region rather than individual users.
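If you cannot change the instrumentation straight away, a Prometheus recording rule can at least pre-aggregate a noisy metric into a low-cardinality series for dashboards and alerts; the raw series are still ingested, so this complements rather than replaces source-side aggregation. A sketch, where app_errors_total, error_type, and region are placeholders for your own metric and labels:

groups:
- name: error_aggregation
  rules:
  # Low-cardinality 5-minute error rate keyed only by job, error_type and region
  - record: job:app_errors:rate5m
    expr: sum by (job, error_type, region) (rate(app_errors_total[5m]))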

4. Set sensible retention and downsampling policies

Retention and downsampling policies that delete or compact old data keep storage growth bounded and stop long-dead, churned series from accumulating in the index. With Thanos or Cortex, configure compaction and retention accordingly.
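As a hedged example (flag names reflect recent Prometheus and Thanos releases; verify them against your versions' --help), retention can be set on a local Prometheus and per-resolution retention on the Thanos compactor:

# Local Prometheus: keep raw samples for 15 days
prometheus --storage.tsdb.retention.time=15d

# Thanos compactor: age out raw data sooner than downsampled data
# (object storage and data directory flags omitted)
thanos compact \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=365d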

Common pitfalls

Dumping unique user IDs or request IDs as labels

Labels like user_id, session_id, or request_id often explode cardinality. Use logs or traces for such identifiers instead.

Ignoring metric limits in Prometheus servers

Prometheus 2.x supports per-scrape ingestion limits in the scrape configuration, such as sample_limit and label_limit (plus label_name_length_limit and label_value_length_limit). When a target exceeds one of these limits, the entire scrape is treated as failed and none of its samples are ingested, so breaching a limit effectively means data loss for that target until cardinality is brought back under control.
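A minimal prometheus.yml sketch of such limits (the numbers are illustrative, not recommendations):

scrape_configs:
- job_name: 'myapp'
  sample_limit: 10000   # fail the scrape if the target exposes more samples than this
  label_limit: 30       # fail the scrape if any sample carries more label names than this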

Complex label combinations

Multiple labels that each have many distinct values multiply together: for example, 100 endpoints × 50 regions × 10 status codes already allows 50,000 series for a single metric. Avoid labels that don’t add actionable insight.
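You can measure how a pair of labels combines in practice; the metric and label names below are placeholders for your own:

# Number of distinct (region, endpoint) combinations for one metric
count(count by (region, endpoint) (http_requests_total))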

Relying solely on client-side aggregation

While client-side aggregation helps, it can increase complexity and delay. Balance aggregation between client, collector, and storage layers.

Validation

To measure and validate cardinality and performance:

  1. Use Prometheus’ TSDB inspection commands (promtool tsdb analyze) to understand index size, series counts, and label churn; see the commands after this list.
  2. Leverage the Collector’s own telemetry, such as the processor dropped/refused metric-point counters (e.g. otelcol_processor_dropped_metric_points), to monitor how much data your filtering removes.
  3. Review scrape logs and monitoring dashboards for warnings about dropped series or failed scrapes.
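A short sketch of these checks from the command line (the data directory and host are placeholders):

# Analyze the on-disk TSDB: series counts, label-pair cardinality, churn
promtool tsdb analyze /var/lib/prometheus/data

# Live head-block statistics, including the highest-cardinality metric names and label pairs
curl -s http://localhost:9090/api/v1/status/tsdb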

Checklist / TL;DR

  • ✔ Identify high-cardinality labels early with queries and inspection tools.
  • ✔ Avoid using unbounded labels (e.g., user_id, session_id) as metric labels.
  • ✔ Filter or drop high-cardinality attributes at collection time.
  • ✔ Aggregate metrics to reduce dimensionality before storage.
  • ✔ Configure cardinality and series limits in Prometheus or backend systems.
  • ✔ Use logging and tracing tools for per-request or user-level identifiers.
  • ✔ Monitor your ingestion metrics and TSDB stats continuously.

When to choose labels vs logs/traces

Choose metric labels when the dimension is low-cardinality and has meaningful aggregation value (e.g., region, instance type). Choose logs or traces for high-cardinality identifiers such as user IDs, request IDs, or arbitrary strings.
