
That said, one SRE principle worth noting is the golden signals. These are the four most important metrics to monitor for any application:
Traffic
The amount of demand the system is responding to, normally measured as the number of requests per second. Think of this as your heartbeat or pulse rate. Just as a heart rate measures the number of times your heart beats in a minute, traffic indicates how many requests your system is handling. A suddenly elevated heart rate might indicate stress or excitement, much like a surge in traffic might hint at increased user activity or a possible DoS attack.
Saturation
How much of the available capacity the system is using. For example, this could be the percentage of CPU, memory, disk, or network capacity in use. It can be likened to lung capacity when you’re exercising. When you’re at rest, you’re using a small portion of your lung capacity; when you’re running, you’re pushing your lungs to use as much of their capacity as possible. Similarly, if your system’s resources are being fully utilized, your system is “breathing heavily,” potentially leading to exhaustion or slowdown.
Errors
The proportion of requests that fail or return an unexpected result in comparison to the total number of requests. This is a good indicator of the reliability and stability of a system. Imagine going for a health checkup and receiving some abnormal test results. These anomalies, like unusual blood work, might point toward specific health issues. Similarly, a higher rate of errors in a system could indicate underlying problems that need addressing.
Latency
The time to process and respond to a particular request. This is a good indicator of the performance of a system. It is akin to the reflex time of the human body: for instance, the time it takes for your hand to pull away from something hot. In an optimal state, you’d have a quick reflex, just as an efficient system would have low latency. Delays in reflex might suggest neurological concerns, just as high latency could point toward performance bottlenecks.
These are the metrics I will concentrate on in the system. The idea is that if you can monitor these four metrics, you will have a good understanding of the health of the system, much like I attempt to understand the health of my body.
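To make the four definitions concrete, here is a minimal sketch of how they could be computed from the requests seen in one time window. It is an illustration under stated assumptions, not the monitoring setup used in the system: the `Request` record, the `golden_signals` function, the choice of CPU as the saturation resource, and the use of the 95th percentile for latency are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    """One handled request (hypothetical record shape for this sketch)."""
    duration_ms: float  # time taken to process and respond
    failed: bool        # True if the request failed or returned an unexpected result


@dataclass
class GoldenSignals:
    traffic_rps: float     # requests per second
    saturation: float      # fraction of capacity in use, 0.0 to 1.0
    error_rate: float      # failed requests divided by total requests
    latency_p95_ms: float  # 95th-percentile response time


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile; returns 0.0 when there are no values."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(
    requests: Sequence[Request],
    window_seconds: float,
    cpu_used: float,
    cpu_capacity: float,
) -> GoldenSignals:
    """Summarize one observation window as the four golden signals.

    Assumption: saturation is approximated by CPU alone; a real system
    would usually also track memory, disk, and network.
    """
    total = len(requests)
    failures = sum(1 for r in requests if r.failed)
    return GoldenSignals(
        traffic_rps=total / window_seconds,
        saturation=cpu_used / cpu_capacity,
        error_rate=failures / total if total else 0.0,
        latency_p95_ms=percentile([r.duration_ms for r in requests], 95),
    )


# Example window: four requests observed over two seconds on a 2-core host.
window = [
    Request(duration_ms=120.0, failed=False),
    Request(duration_ms=80.0, failed=False),
    Request(duration_ms=450.0, failed=True),
    Request(duration_ms=95.0, failed=False),
]
signals = golden_signals(window, window_seconds=2.0, cpu_used=1.5, cpu_capacity=2.0)
print(signals)
```

A percentile is used for latency rather than the mean because a handful of slow requests can hide behind a healthy-looking average. In practice these values would normally come from a metrics system rather than being computed by hand.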