You launch a new service. Your dashboards look perfect and your alerts are silent, and then your observability bill suddenly jumps. Nothing failed, so why are your metrics causing a problem? Most likely, you have run into high-cardinality data.

In this guide, we look at high-cardinality metrics in a Kubernetes environment and offer ideas for controlling costs while maintaining visibility. We cover smarter instrumentation, sampling, aggregation, retention strategies, and making better use of OpenTelemetry.

Why high-cardinality metrics are a problem

High cardinality means a metric has a very large number of unique label-value combinations, and each unique combination is stored as its own time series. For instance, a metric labeled with pod name, user ID, request path, and region can produce an explosion of metric series. A typical Kubernetes environment includes many ephemeral pods, containers that scale up and down continuously, and labels such as pod UID or container ID that change with every restart. Each of these changes creates new metric series.

Kubernetes therefore multiplies the problem significantly. High-cardinality metrics are costly for several reasons: more metric series mean larger storage volumes, higher memory use, and slower queries. Most observability platforms charge by data volume or by number of active metric series, so costs can escalate quickly.
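
To make the multiplication concrete, here is a minimal sketch of the worst-case arithmetic. The label names and counts are made-up illustrative numbers, not measurements, and `worst_case_series` is a hypothetical helper. Real systems usually see fewer series than this upper bound because not every combination occurs.

```python
from math import prod

# Hypothetical label cardinalities for one metric (illustrative numbers only).
label_cardinalities = {
    "pod": 200,          # pods seen in the retention window, including restarts
    "request_path": 50,
    "region": 4,
}


def worst_case_series(cardinalities: dict[str, int]) -> int:
    """Upper bound on time series: the product of distinct values per label."""
    return prod(cardinalities.values())


baseline = worst_case_series(label_cardinalities)
# Adding a single unbounded-style label multiplies the whole total.
with_user_id = worst_case_series({**label_cardinalities, "user_id": 10_000})

print(baseline)      # 40000
print(with_user_id)  # 400000000
```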

The catch is that not all high-cardinality metrics are useless. There are times when you do need that level of detail to debug an issue. The solution, therefore, isn’t to get rid of all high-cardinality metrics, but to control when and where you use them.

Start with smarter instrumentation

The cheapest metric is the one you never collect. Before optimising your data store or query performance, evaluate how you produce your metrics in the first place. Do not add labels with unbounded values: user IDs, request IDs, and timestamps should almost never be used as labels.

Instead, add labels based on dimensions with a small, stable set of values (e.g., service name, endpoint group, region). If you need more detailed debugging information than your metrics provide, you can usually get it from logs or traces.

A good rule of thumb here is “if a label’s set of values can grow without bound, keep it out of your metrics.”
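
One way to apply this rule is to collapse raw request paths into an endpoint group before using them as a label. The sketch below is one possible approach, not a standard API: `endpoint_group` is a hypothetical helper, and it assumes that unbounded path segments are either numeric IDs or UUIDs.

```python
import re

# Assumed formats for path segments that take unbounded values.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")
_NUMERIC = re.compile(r"^\d+$")


def endpoint_group(path: str) -> str:
    """Collapse a raw request path into a bounded endpoint-group label."""
    path = path.split("?", 1)[0]  # the query string never belongs in a label
    segments = []
    for segment in path.strip("/").split("/"):
        if _NUMERIC.match(segment):
            segments.append("{id}")
        elif _UUID.match(segment):
            segments.append("{uuid}")
        else:
            segments.append(segment)
    return "/" + "/".join(segments)


print(endpoint_group("/users/12345/orders/987?page=2"))  # /users/{id}/orders/{id}
```

With this in place, every user and order maps to the same label value, so the number of series depends on the number of routes rather than the number of users.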

Sampling without losing insight

Sampling is an effective way to reduce the amount of data you collect while preserving useful patterns.

For metrics, you can collect data less frequently or record only a subset of events. For traces, sampling is a far more powerful tool: you can store only a percentage of requests and still get a representative picture of behavior.

The two most common methods used for sampling are:

  • Head sampling – deciding whether to keep a trace at the start of a request, before its outcome is known.
  • Tail sampling – deciding whether to keep a trace after the request has completed, when its full outcome is known.

Tail sampling tends to be more expensive than head sampling because data must be buffered until the request completes, but it usually gives you more control, particularly when you want to keep slow or failed requests.

In a Kubernetes environment, you can sample either at the application level or in an observability pipeline, such as an OpenTelemetry Collector.
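
The difference between the two methods can be sketched in a few lines. This is an illustration of the decision logic only, not the OpenTelemetry sampler API: `head_sample`, `tail_sample`, `FinishedTrace`, and the thresholds are all hypothetical names and values chosen for the example.

```python
import hashlib
from dataclasses import dataclass


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide up front, from the trace ID alone, whether to keep a trace."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Hashing makes the decision deterministic: the same ID always gets
    # the same answer, so every service in the request path agrees.
    return int.from_bytes(digest[:8], "big") < rate * 2**64


@dataclass
class FinishedTrace:
    trace_id: str
    duration_ms: float
    had_error: bool


def tail_sample(trace: FinishedTrace, slow_ms: float = 500.0,
                base_rate: float = 0.05) -> bool:
    """Decide after completion: keep all errors and slow traces, sample the rest."""
    if trace.had_error or trace.duration_ms >= slow_ms:
        return True
    return head_sample(trace.trace_id, base_rate)
```

The tail sampler needs the finished trace as input, which is where its extra cost comes from: something has to hold the data until the request is done.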

Aggregate before you store

Raw data is expensive to store and to query. Aggregated data costs less and is generally more useful. Instead of keeping every individual data point, aggregate at the source or in a collector-level aggregation layer (e.g., instead of recording latency per request ID, record latency percentiles per service or per endpoint).

Aggregation produces fewer time series, which makes queries more efficient and aligns more closely with how teams actually use their dashboards. Many teams use a two-tier approach:

  • High-resolution data – kept for a short time frame, typically for troubleshooting and debugging.
  • Aggregated data – kept long term for service availability monitoring, real-time analysis, and long-range trend analysis.
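
As a rough illustration of aggregating at the source, the sketch below reduces per-request latency samples to a handful of percentiles per service. The function names are hypothetical, and the nearest-rank percentile is a simplification; production systems typically use histograms or sketches so that percentiles can be merged across instances.

```python
from collections import defaultdict
from math import ceil


def percentile(sorted_values: list[float], p: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def aggregate_latency(samples: list[tuple[str, float]]) -> dict[str, dict[str, float]]:
    """Reduce (service, latency_ms) samples to count, p50, p95, p99 per service."""
    by_service: dict[str, list[float]] = defaultdict(list)
    for service, latency_ms in samples:
        by_service[service].append(latency_ms)

    summary = {}
    for service, values in by_service.items():
        values.sort()
        summary[service] = {
            "count": len(values),
            "p50": percentile(values, 50),
            "p95": percentile(values, 95),
            "p99": percentile(values, 99),
        }
    return summary
```

However many requests arrive, each service ends up with four stored values per interval instead of one series per request.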

Retention tiering that matches real usage

Most data does not need to be kept forever. In most cases, teams do not look at very detailed metrics older than a few days, yet many systems still keep full-resolution data for weeks or months.

Retention tiering is a better approach:

  • Keep the highest-resolution data for the most recent time period (typically 24 to 72 hours).
  • Aggregate or downsample data once it ages past that window.
  • For long-term trends, only store summary data.

This approach greatly reduces storage costs while still preserving valuable historical insight.
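
A tiering policy like this could be sketched as follows. The tier boundaries and resolutions are hypothetical examples, and `downsample` and `resolution_for_age` are illustrative helpers rather than functions from any particular metrics backend.

```python
from collections import defaultdict


def downsample(points: list[tuple[int, float]], bucket_seconds: int) -> list[tuple[int, float]]:
    """Average (unix_timestamp, value) points into fixed-width time buckets."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for timestamp, value in points:
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Hypothetical policy: (maximum age in hours, resolution in seconds).
RETENTION_TIERS = [
    (72, 15),        # up to 72 hours old: keep 15-second resolution
    (24 * 30, 300),  # up to 30 days old: 5-minute averages
]
LONG_TERM_RESOLUTION = 3600  # anything older: hourly summaries


def resolution_for_age(age_hours: float) -> int:
    """Pick the storage resolution for data of a given age."""
    for max_age_hours, resolution in RETENTION_TIERS:
        if age_hours <= max_age_hours:
            return resolution
    return LONG_TERM_RESOLUTION
```

Averaging suits gauge-style values such as memory usage. Counters and latency distributions need a different reduction (sums, last value, or merged histograms), so treat the averaging step as a placeholder.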

Use OpenTelemetry efficiently

OpenTelemetry is powerful, but it can also generate a lot of data if not configured carefully. Start by controlling what gets exported. Use processors in the OpenTelemetry Collector to filter out unnecessary labels or metrics. You can also apply sampling and aggregation directly in the pipeline.

Another important step is batching. Sending data in batches reduces network overhead and improves efficiency. Also, review default instrumentation. Many libraries collect more metrics than you actually need. Disable what is not useful.
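
To show why batching helps, here is a minimal sketch of the idea in plain Python. It is not the OpenTelemetry batching implementation: `Batcher` and its `export` callback are hypothetical, and a real pipeline would also flush on a timer so that data is not held back when traffic is low.

```python
from typing import Callable


class Batcher:
    """Buffer items and hand them to `export` in groups of up to `max_batch_size`."""

    def __init__(self, export: Callable[[list[dict]], None], max_batch_size: int = 100):
        if max_batch_size < 1:
            raise ValueError("max_batch_size must be at least 1")
        self._export = export
        self._max_batch_size = max_batch_size
        self._buffer: list[dict] = []

    def add(self, item: dict) -> None:
        self._buffer.append(item)
        if len(self._buffer) >= self._max_batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered; call this on shutdown as well."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._export(batch)
```

Seven data points with a batch size of three result in three export calls instead of seven, and the saving grows with the batch size.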

Build a cost-aware observability architecture

Cost control should not be an afterthought. It should be part of your design. A cost-aware setup usually includes:

  • A collector layer that filters and aggregates telemetry
  • Clear guidelines on metric naming and labeling
  • Separation between debugging data and monitoring data
  • Regular audits of high-cardinality metrics

You can also set budgets or alerts for observability costs. If usage spikes, you want to know early. Think of observability like any other production system. It needs scaling rules, limits, and governance.
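
A regular audit can start as something very small. The sketch below counts distinct label sets per metric and reports the ones over a budget. `audit_cardinality`, the input shape, and the budget value are assumptions for illustration; in practice the series list would come from your metrics backend.

```python
from collections import defaultdict


def audit_cardinality(series: list[tuple[str, dict[str, str]]],
                      budget: int) -> list[tuple[str, int]]:
    """Return (metric, series_count) for metrics over `budget`, largest first."""
    label_sets: dict[str, set] = defaultdict(set)
    for metric_name, labels in series:
        # A frozenset of label pairs identifies one unique time series.
        label_sets[metric_name].add(frozenset(labels.items()))

    over_budget = {
        name: len(sets) for name, sets in label_sets.items() if len(sets) > budget
    }
    return sorted(over_budget.items(), key=lambda item: item[1], reverse=True)
```

Running a check like this on a schedule, and alerting when a metric crosses its budget, gives you the early warning described above.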

Maintain visibility without overspending

The goal is to keep the visibility you need, not to cut data recklessly. High-cardinality metrics provide great value when used correctly: they help you debug complex problems and understand edge cases. When they are not controlled, however, they become one of the major hidden costs.

With the right instrumentation, sampling, aggregation, and retention strategies, you can make your Kubernetes observability both powerful and affordable.

Wrapping up

Observability should produce transparency, not unexpected invoices. If your metrics system ever feels out of control, it is seldom a tooling issue; it is a data strategy issue.

Start small. Audit the metrics in your system to find excessive labels, consolidate where possible, and apply aggregation where it makes sense. The cumulative effect of these decisions adds up to significant savings over time.

If you enjoyed reading this, take a few moments today to find one high-cardinality metric in your system and simplify it. That single action could noticeably improve your overall observability setup.
