Using Kubernetes Events for Better Monitoring and Faster Troubleshooting
Sanplex Content
2026-03-01 09:00:00
I. Introduction
Monitoring is essential for maintaining system stability. In the Kubernetes ecosystem, there is no shortage of tools for collecting and analysing resource metrics. From the community-driven Metrics Server to CNCF-graduated Prometheus, teams have many mature options for tracking CPU, memory, storage, and other resource indicators.
But resource-based monitoring has limits.
In practice, metrics alone are often not enough to explain what is happening inside a Kubernetes cluster. This is especially true when teams need to understand short-lived failures, operational state changes, or issues that are not clearly reflected in resource usage.
Resource-based monitoring typically has two major weaknesses.
1. Limited real-time accuracy
Most resource monitoring systems collect data at periodic intervals using pull-based or push-based models. As a result, the monitoring data is not truly continuous. If an anomaly occurs between two collection points and then disappears before the next one, it may never be captured accurately.
Short-lived spikes, intermittent failures, or brief instability can easily be smoothed out or missed entirely. In some cases, the monitoring system effectively clips these transient peaks, reducing visibility into what really happened.
2. Limited scenario coverage
Some operational scenarios cannot be described well through resource metrics alone.
For example, the start, restart, or termination of a Pod cannot always be explained simply by CPU or memory usage. If a metric value drops to zero, that does not tell us whether the Pod completed normally, crashed, was evicted, or failed to start in the first place. Resource metrics show symptoms, but not always the cause.
So how does Kubernetes address this gap?
To provide more visibility into cluster behaviour, Kubernetes includes an Events system. Events record important changes to Kubernetes resources and store them in the API Server. These events can be queried through the API or viewed directly with kubectl.
By collecting and analysing Events, teams can detect cluster issues in real time, diagnose failures more quickly, and surface problems that resource monitoring may miss.
II. Understanding Kubernetes Events
1. What are Kubernetes Events?
A Kubernetes Event is a resource object that records an action taken by a component at a specific point in time. In simple terms, it captures what happened inside the cluster.
Whenever the state of a resource changes, Kubernetes may generate a new Event.
Components across the Kubernetes control plane and node runtime can report Events during execution. For example, the scheduler may emit an event explaining why a Pod was placed on a specific node, or why scheduling failed. Other components may record image pull failures, restarts, readiness issues, node-related warnings, and more.
These Events are sent to the Kubernetes API Server and stored in etcd.
By default, Kubernetes does not retain Events for long. To prevent etcd from filling up with transient operational data, the default retention policy removes Events roughly one hour after their last update. That means Event data is useful, but short-lived unless it is exported elsewhere.
Teams can inspect recent Events in the cluster with standard kubectl commands, either at the cluster level or for a specific resource. In most cases, only Events from the last hour will be available.
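For example, the following commands list recent Events at the cluster and resource level. They assume a reachable cluster; the Pod name my-app-pod is a placeholder:

```shell
# List recent Events in the current namespace (only ~1 hour of history by default)
kubectl get events

# List recent Events across all namespaces
kubectl get events --all-namespaces

# Show the Events related to one specific Pod (placeholder name)
kubectl describe pod my-app-pod
```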
2. Events shown in kubectl describe
When you run kubectl describe on a resource, Kubernetes displays Events directly related to that object.
For example, if you describe a Pod, you may see output such as:
28s (x240979 over 167d)
This means the same type of Event has occurred 240,979 times over 167 days, and the most recent occurrence happened 28 seconds ago.
This also reveals an important behaviour of Kubernetes Events: repeated Events are automatically aggregated. If you run kubectl get events, you will not see 240,979 separate records. Instead, Kubernetes merges repeated Events of the same type into a consolidated entry.
If you describe a different resource, such as the Deployment that owns the Pod, you may see a different set of Events or no Events at all. This shows that Events are tied closely to the involvedObject they reference. In other words, the Events you see depend directly on the resource you inspect.
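To see only the Events that reference a particular object, you can filter on the involvedObject fields with a field selector (the Pod name below is a placeholder):

```shell
# Events whose involvedObject is a specific Pod
kubectl get events --field-selector involvedObject.kind=Pod,involvedObject.name=my-app-pod
```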
3. Viewing all Events in a namespace
Kubernetes also allows you to retrieve all Events within a specific namespace.
However, by default, kubectl get events does not sort the output by time, which makes troubleshooting harder. In most cases, it is better to sort by creation timestamp so that the Event stream is easier to follow chronologically.
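Sorting by creation timestamp looks like this:

```shell
# Sort the Event stream chronologically by creation time
kubectl get events --sort-by=.metadata.creationTimestamp
```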
Since Events are standard Kubernetes resources, they also include metadata such as their names. This means you can retrieve the names of all Events in a namespace and then inspect individual entries in more detail if needed.
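For example, you can list the Event names first and then fetch a single entry in full. The Event name below is a made-up placeholder, though the `<pod-name>.<hex>` shape is typical:

```shell
# List Event resource names in the current namespace
kubectl get events -o name

# Inspect one Event in detail (placeholder name)
kubectl get event my-app-pod.17f3a2b9c0d1e2f3 -o yaml
```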
4. Anatomy of a single Event object
You can also inspect a specific Event in YAML format to better understand its structure.
Some of the most important fields include:
- count: how many times this Event has occurred
- firstTimestamp and lastTimestamp: when this Event first and most recently occurred
- involvedObject: the resource associated with the Event
- reason: a short machine-readable code describing why the Event was generated
- message: a more detailed human-readable explanation
- source: the component and host that generated the Event
- type: either Normal or Warning
In practice, the reason field is especially useful for filtering and automation, while message is more useful for human investigation.
The type field is also important:
- Normal means the state transition is expected
- Warning means the state transition is unexpected or indicates a problem
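Putting these fields together, an abridged Event object might look like the following. All values here are illustrative, not taken from a real cluster:

```yaml
apiVersion: v1
kind: Event
metadata:
  name: my-app-pod.17f3a2b9c0d1e2f3   # placeholder name
  namespace: default
count: 3
firstTimestamp: "2026-03-01T08:00:00Z"
lastTimestamp: "2026-03-01T08:05:00Z"
involvedObject:
  kind: Pod
  name: my-app-pod
  namespace: default
reason: BackOff
message: Back-off restarting failed container
source:
  component: kubelet
  host: node-1
type: Warning
```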
Take the lifecycle of a Pod as an example.
When a Pod is created, it typically starts in the Pending state while Kubernetes handles scheduling, image pulling, container creation, and startup. Once the health checks pass and the Pod becomes Running, Kubernetes may emit a Normal Event because the state transition is expected.
If the Pod later fails because of an OOM condition or another runtime problem, Kubernetes may emit a Warning Event because the Pod has entered an unexpected state.
This is exactly why Event monitoring matters. By watching Events, teams can detect operational issues that may never be obvious from metrics alone.
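A quick way to surface only the unexpected transitions is to filter on the type field:

```shell
# Show only Warning Events in the current namespace
kubectl get events --field-selector type=Warning
```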
III. Node-Problem-Detector
Several built-in Kubernetes components already generate Events, including kubelet, deployment-controller, and job-controller. But the built-in Event system is mainly focused on container and workload lifecycle behaviour.
It does not provide deep visibility into the operating system, container runtime, or lower-level node dependencies such as networking and storage.
That creates a gap.
When a Kubernetes node becomes unhealthy, the cluster may not always generate sufficient node-level Events by default. Yet workload stability depends heavily on node stability. Kubernetes has historically placed more responsibility for node management on the IaaS layer, but as Kubernetes has evolved into a more complete cloud-native operating platform, the need for better node diagnostics has become more obvious.
This is where Node-Problem-Detector (NPD) comes in.
NPD is deployed as a DaemonSet and performs diagnostic checks on Kubernetes nodes. Its outputs follow the Kubernetes Event model, allowing node-level anomalies to be converted into Events and pushed into the API Server for unified management.
When NPD detects an abnormal condition, it creates a node-related Event. Operations teams can then inspect the node and understand the issue more quickly using commands such as kubectl describe node.
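For instance, once NPD is running, node-level problems it reports show up in the Conditions and Events sections of the node description (the node name is a placeholder):

```shell
# Inspect a node's Conditions and recent Events, including those reported by NPD
kubectl describe node my-node-1
```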
NPD can detect a wide range of problems, including:
- basic service issues, such as NTP not running
- hardware problems affecting CPU, memory, disk, or network interfaces
- kernel-level issues, such as deadlocks or filesystem corruption
- kubelet health issues, including repeated restarts
- container runtime issues, such as unhealthy Docker or containerd services
By extending Event generation beyond core workload signals, NPD gives teams a much better operational view of cluster health.
IV. Persisting Kubernetes Events
Kubernetes Events are useful, but the default storage model is limited.
Because Events are stored in etcd for only about one hour, they are not suitable for long-term analysis. In addition, etcd is not designed for rich search, statistical analysis, or alerting workflows. By default, Kubernetes only provides basic filtering by attributes such as reason, type, and time. In most environments, Events are still consumed manually through commands like kubectl describe or kubectl get events.
For many teams, that is not enough.
In real production environments, there is often a strong need to:
- search Events over a longer retention period during incident analysis
- trigger real-time alerts for abnormal Events such as Failed, Evicted, FailedMount, or FailedScheduling
- subscribe to Events for custom monitoring workflows
- filter and analyse Events across multiple dimensions
- perform trend analysis and compare Event volumes over time
To support these use cases, Kubernetes Events need to be exported to a persistent external system.
Two commonly used tools for this purpose are kubernetes-event-exporter and kube-eventer.
Among them, kube-eventer, open-sourced by Alibaba Cloud, can run in the cluster as a Deployment and export Kubernetes Events to destinations such as DingTalk, Alibaba Cloud Log Service (SLS), Kafka, InfluxDB, and Elasticsearch.
This makes it possible to retain Events long term and integrate them into external monitoring, analytics, and alerting systems.
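As a rough sketch, a kube-eventer Deployment exporting Events to Kafka might look like the following. The image tag, broker address, and topic are illustrative placeholders, and the exact sink URL syntax should be checked against the kube-eventer documentation:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-eventer
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-eventer
  template:
    metadata:
      labels:
        app: kube-eventer
    spec:
      containers:
        - name: kube-eventer
          image: registry.aliyuncs.com/acs/kube-eventer:latest   # illustrative tag
          command:
            - /kube-eventer
            - --source=kubernetes:https://kubernetes.default
            # Illustrative sink; see the kube-eventer docs for exact sink syntax
            - --sink=kafka:?brokers=kafka-broker:9092&topic=k8s-events
```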
V. Building a Kubernetes Event Center
A practical way to operationalise Event monitoring is to build a centralised Kubernetes Event Center.
In this model, Node-Problem-Detector and kube-eventer are installed across all Kubernetes clusters.
- Node-Problem-Detector converts node-level anomalies into Kubernetes Events
- kube-eventer exports those Events to a persistent backend such as Alibaba Cloud Log Service
- the logging platform then provides querying, analysis, visualisation, dashboards, and real-time alerting
This creates a unified Event pipeline that goes far beyond the default Kubernetes experience.
1. Visual dashboards
With a centralised Event platform, teams can build dashboards such as:
- Event overview dashboards
- node Event search and analysis
- Pod Event search and analysis
This improves both observability and incident review by making cluster behaviour easier to inspect over time.
2. Real-time alerts
Abnormal Events can also be analysed and turned into actionable alerts.
Once Event data is exported into a system such as SLS, teams can notify the right people immediately when high-risk conditions occur. Alert channels may include DingTalk groups, SMS, phone calls, or custom webhooks.
Typical alerting scenarios include:
- Flink OOM events: when a Flink Pod runs out of memory, a targeted alert can be sent to the relevant application team
- Pod eviction events: when a Pod is evicted from the cluster, operations teams can be notified immediately
- Pod scheduling failures: if a Pod cannot be scheduled, the platform can alert the infrastructure team
- Conntrack table full: when the connection tracking table on a node reaches capacity, the responsible team can be notified before the issue causes wider impact
By using Events for real-time alerts, teams can respond faster to failures that may not be obvious from standard metrics alone.
Conclusion
Kubernetes Events contain a large amount of operationally useful information. They help platform and operations teams understand resource state changes, troubleshoot failures, and detect issues that are difficult to identify through metrics alone.
This article introduced the basic Kubernetes Event object, explained how Events work in practice, described how Node-Problem-Detector extends Event visibility to node-level problems, and showed how tools such as kube-eventer can be used to persist Events for dashboarding and alerting.
Event monitoring should not replace metrics monitoring, but it is an important complement to it. By combining both approaches, teams gain better real-time visibility, better diagnostic depth, and broader monitoring coverage across Kubernetes environments.