It’s 3 AM on a Tuesday. Your pager goes off, jarring you awake. API requests are timing out, and your Kubernetes cluster is showing all the signs of distress. You log in, eyes half-closed, and pull up your Grafana dashboards. CPU usage is spiking, pods are restarting, and request latency is through the roof. The numbers are screaming: something is broken. But what, exactly, is the root cause? And more importantly, how do you fix it?
For many teams running microservices on Kubernetes, the answer to “what’s broken?” often comes from the widely adopted kube-prometheus-stack. It bundles Prometheus and Grafana into an opinionated setup, providing a robust foundation for monitoring your infrastructure. On the surface, it seems like the holy grail of Kubernetes visibility.
But here’s the catch: monitoring is not observability. And if you confuse the two, you’ll inevitably hit a wall when incidents strike, your cluster scales, or you simply need to understand the intricate dance of your distributed applications. In this first post of my observability series, I want to break down this critical distinction, highlight the often-overlooked gaps in kube-prometheus-stack, and suggest how we can move toward true Kubernetes observability.
The Monitoring Mirage: When “What” Isn’t Enough
I’ve witnessed this scenario countless times. A team deploys kube-prometheus-stack, sets up beautiful Grafana dashboards, and configures a slew of alerts. Everything looks fantastic until that fateful 3 AM page. In one instance, API requests were timing out, and while Prometheus showed CPU spikes and Grafana displayed pod restarts, the on-call engineer spent two agonizing hours manually correlating logs, checking recent deployments, and guessing at database queries. The culprit? A rogue batch job with an unoptimized query hammering the production database.
Their monitoring stack told them something was broken, but not why. With distributed tracing in place, they could have followed the slow requests back to that exact query in minutes instead of hours. This highlights the core difference:
- Monitoring answers, “What is happening?” You collect predefined metrics (CPU, memory, network I/O) and set alerts when thresholds are breached. Your alert fires: “CPU usage is 95%.” Now what?
- Observability answers, “Why is this happening?” It empowers you to investigate using interconnected data you didn’t even know you’d need. Which pod is consuming CPU? What specific user request triggered it? Which database query is slow? What changed in the last deployment?
The classic definition of observability relies on three pillars: metrics, logs, and traces. Metrics are numerical values over time, perfect for trends and alerts. Logs record discrete events with context, whether structured or free-form text. Traces map the full lifecycle of a request as it flows across services. Prometheus and Grafana excel at metrics, giving you one crucial piece of the puzzle. But true Kubernetes observability demands all three pillars working in concert.
Beyond the Dashboards: Common Pitfalls of a Metrics-Only Approach
Let’s be fair: kube-prometheus-stack is the default for a reason. With a simple Helm command, you can have Prometheus scraping metrics, Grafana serving dashboards, and Alertmanager ready for notifications. It feels like magic to spin up a comprehensive monitoring solution in minutes. The dashboards for cluster health, node metrics, and pod resource usage are immediately available, and the data flows in automatically.
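To make that concrete, here is a minimal sketch of what such an install might look like: add the prometheus-community chart repository and pass a small values file. The keys below exist in recent versions of the kube-prometheus-stack chart, but the password and resource sizing are placeholders you should replace for your environment.

```yaml
# values.yaml — minimal sketch for the kube-prometheus-stack chart.
# Install with:
#   helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
#   helm install monitoring prometheus-community/kube-prometheus-stack -f values.yaml
grafana:
  enabled: true
  adminPassword: change-me        # placeholder; keep real credentials in a Secret
alertmanager:
  enabled: true
prometheus:
  prometheusSpec:
    retention: 15d                # how long to keep metrics on local disk
    resources:
      requests:
        cpu: 500m                 # illustrative sizing; tune for your cluster
        memory: 2Gi
```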
Yet, relying solely on this powerful foundation comes with its own set of challenges, particularly as your environment grows and incidents become more complex.
The Cardinality Beast
Prometheus loves labels, and for good reason—they make metrics incredibly powerful for filtering and aggregation. But there’s a dark side: high cardinality. Cardinality is the number of unique time series created by combining a metric name with all its possible label values. If you add dynamic labels like user_id or transaction_id, you can inadvertently create millions of unique time series. I’ve personally witnessed production clusters go down not because of the application, but because Prometheus itself was choking under the immense load of high-cardinality metrics.
A simple counter tracking HTTP requests with method, endpoint, user_id, and transaction_id labels can easily explode into millions of unique time series, because every new user and transaction mints another series. Prometheus isn’t built for that kind of arbitrary uniqueness. Instead, stick to low-cardinality labels like method, endpoint, and status_code. This drastically reduces the number of time series while still providing meaningful aggregations. You can check your cardinality with count({__name__=~".+"}) by (__name__), and if you see individual metrics with hundreds of thousands of series, you know you have a problem.
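One practical guardrail is to strip offending labels before Prometheus stores them. The snippet below is a sketch of a ServiceMonitor that uses the Prometheus Operator’s metricRelabelings with a labeldrop action; the service name, namespace, and port are hypothetical stand-ins for your own workload.

```yaml
# Hypothetical ServiceMonitor: drop unbounded labels at scrape time.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service               # hypothetical name
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: api-service            # hypothetical app label
  endpoints:
    - port: metrics
      metricRelabelings:
        # Remove labels that create one series per user or transaction.
        - action: labeldrop
          regex: user_id|transaction_id
```

Dropping labels at scrape time is a stopgap; the cleaner long-term fix is to stop emitting unbounded labels from the application in the first place.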
Scaling Headaches
For small, single-cluster setups, a standalone Prometheus instance might be sufficient. But in large enterprises with multiple clusters, or even just a single, rapidly growing cluster, Prometheus’s inherent scaling limitations become apparent. Without advanced strategies like federation or sharding, a single Prometheus instance struggles beyond 10-15 million active time series. While Prometheus federation allows a global instance to scrape from cluster-specific instances, this adds complexity and still doesn’t solve the fundamental storage limits of a single backend.
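For reference, a federation job on the global instance looks roughly like this; the target hostnames are hypothetical, and the match[] filter deliberately pulls only pre-aggregated recording rules rather than every raw series.

```yaml
# Sketch of a federation scrape job on a "global" Prometheus instance.
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}'   # only series produced by recording rules
    static_configs:
      - targets:
          - prometheus.cluster-a.example.com:9090   # hypothetical per-cluster endpoints
          - prometheus.cluster-b.example.com:9090
```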
The Drowning Effect: Alert Fatigue
kube-prometheus-stack ships with a comprehensive set of default alerts. While useful as a starting point, they can quickly lead to alert fatigue. Engineers find themselves drowning in notifications that don’t actually help resolve issues. An alert like “KubePodCrashLooping” fires for every pod in CrashLoopBackOff, including those in development namespaces or expected restarts during deployments. The signal-to-noise ratio plummets.
A more effective approach involves tuning these alerts based on criticality. By filtering alerts to only fire for production-critical namespaces or services, you ensure that high-priority notifications get the attention they deserve, improving response times and reducing engineer burnout.
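As an illustration, here is a sketch of a custom PrometheusRule scoped to production namespaces. The namespace pattern, threshold, and alert name are placeholders; the expression relies on the kube-state-metrics restart counter that kube-prometheus-stack already collects.

```yaml
# Sketch: a crash-loop alert that only fires for production namespaces.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tuned-pod-alerts
  namespace: monitoring
spec:
  groups:
    - name: production-pods
      rules:
        - alert: ProdPodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total{namespace=~"prod-.*"}[15m]) > 3
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```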
Dashboards That Show What but Not Why
Grafana dashboards are undeniably powerful for visualizing metrics. They look impressive and give an immediate overview of system health. You can see CPU at 95%, a spike in network errors, or an increase in pod restarts. But these panels typically show symptoms, not root causes. They tell you what is happening, not why.
A PromQL query like 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) will tell you CPU usage per node. To understand why it’s high, you might follow up with topk(10, sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))) to find the top-consuming pods. But even then, you don’t know the specific request path, user action, or external dependency that caused the spike. Without distributed tracing, you’re often left guessing, resorting to questions in Slack like, “Did anyone deploy something?” or “Is the database slow?”
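If you find yourself typing these queries during every incident, it can help to codify them as recording rules so dashboards and alerts can reuse the results cheaply. A sketch, with rule names that follow the usual level:metric:operation convention but are otherwise my own invention:

```yaml
# Sketch: recording rules that pre-compute the CPU views used above.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-recording-rules
  namespace: monitoring
spec:
  groups:
    - name: cpu.rules
      rules:
        # Per-node CPU utilisation as a 0-1 ratio.
        - record: instance:node_cpu_utilisation:ratio
          expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
        # Per-pod CPU usage in cores.
        - record: namespace_pod:container_cpu_usage:rate5m
          expr: sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))
```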
Bridging the Observability Gap: A Path Forward
My opinion is clear: kube-prometheus-stack is a fantastic monitoring foundation, but it’s not the endgame for observability. It’s a crucial first step, but not the complete picture. Kubernetes observability requires more than just metrics. It needs logs, traces, and the ability to correlate all this data seamlessly to provide meaningful context.
So, how do we close this observability gap? The answer lies in augmenting your existing metrics setup with dedicated solutions for logs and traces, and then connecting them. This means:
- Adding a centralized logging solution: Tools like Loki (Grafana’s log aggregation system), Elasticsearch with Kibana, or your preferred cloud provider’s logging service (e.g., CloudWatch Logs, Google Cloud Logging) are essential.
- Adopting distributed tracing: Jaeger or Tempo (another Grafana-native solution) are excellent choices for visualizing request flows across your microservices.
The beauty is that within the Grafana ecosystem, adding Loki and Tempo is remarkably straightforward. With a few Helm commands, you can deploy them alongside your existing Prometheus setup. Then, by configuring Grafana to use these as data sources, you unlock the ability to jump from a metric spike in Prometheus to related logs in Loki and traces in Tempo. This is where monitoring starts evolving into true observability—where you can click on a problematic metric and immediately see the logs from that specific pod or trace the full journey of a slow request.
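Concretely, once Loki and Tempo are running, they can be registered as Grafana data sources straight from kube-prometheus-stack’s values via grafana.additionalDataSources. The service URLs below are hypothetical and depend on the release names, namespaces, and ports of your Loki and Tempo installs.

```yaml
# Sketch: wiring Loki and Tempo into Grafana through kube-prometheus-stack values.
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki-gateway.logging.svc.cluster.local        # hypothetical Loki endpoint
    - name: Tempo
      type: tempo
      access: proxy
      url: http://tempo.tracing.svc.cluster.local:3200          # hypothetical Tempo endpoint and port
```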
Looking ahead, OpenTelemetry offers an even more unified approach. It provides a vendor-neutral way to instrument your applications to capture metrics, logs, and traces in a single pipeline. Instead of bolting together siloed tools, OpenTelemetry allows you to build a cohesive observability foundation from the ground up, providing a consistent data format and collection mechanism across all three pillars. I’ll delve into this powerful evolution in the next post of this series.
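To give a flavor of what that unified pipeline looks like before the deeper dive in the next post, here is a heavily simplified OpenTelemetry Collector configuration with one pipeline per signal. The exporter endpoints are hypothetical, and it assumes the contrib Collector distribution, a Prometheus with its remote-write receiver enabled, and a Loki version that accepts OTLP logs natively.

```yaml
# Sketch: one OTLP receiver feeding metrics, logs, and traces to separate backends.
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  batch: {}
exporters:
  prometheusremotewrite:                                             # contrib distribution
    endpoint: http://prometheus.monitoring.svc:9090/api/v1/write    # hypothetical; remote-write receiver must be enabled
  otlphttp:
    endpoint: http://loki-gateway.logging.svc/otlp                  # hypothetical; Loki native OTLP ingestion
  otlp:
    endpoint: tempo.tracing.svc:4317                                # hypothetical Tempo OTLP gRPC endpoint
    tls:
      insecure: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```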
Conclusion
Kubernetes observability is a journey, not a destination, and it’s certainly more than just pretty Prometheus and Grafana dashboards. While kube-prometheus-stack gives you a strong monitoring foundation, it leaves critical gaps in logs, traces, and the crucial ability to correlate disparate pieces of information. If you rely solely on it, you’ll likely encounter metric cardinality explosions, alert fatigue, and dashboards that tell you what went wrong but not why.
True Kubernetes observability requires a mindset shift. You’re not just collecting numbers; you’re building a system that helps you ask questions you didn’t even know you’d need to answer during an incident. When that pager goes off at 3 AM, you want the power to trace a slow API call from the user, through your microservices, down to the exact database query that’s causing timeouts. Prometheus alone simply won’t get you there.
Embrace kube-prometheus-stack as the robust monitoring solution it is, but acknowledge its limits. Plan to integrate logs and traces into your pipeline, actively manage metric cardinality, and tune your alerts for signal over noise. Start moving toward a unified observability approach with OpenTelemetry. The robust observability foundation you build today will directly impact how quickly you can respond to incidents tomorrow, saving your team countless hours of frustrating detective work. Your future self – and your on-call team – will undoubtedly thank you for it.
In the next part of this series, I will show how to deploy OpenTelemetry in Kubernetes for centralized observability. That is where the real transformation begins.



