Five reasons to monitor a system and the four golden signals

Why should you monitor a system, and what? We’ll go through five reasons and the four golden signals. Effective monitoring relies on a culture of collaboration and trust and should not be underestimated.

In today’s world, applications are becoming more demanding and complex. So are the systems running the applications. Effectively monitoring a complex application and system is a lot of work and should not be underestimated.

Why should you monitor a system? And which signals should you pay attention to when setting up monitoring? This post reviews five reasons to monitor a system and the four golden signals.

Effective monitoring relies on a culture of collaboration and trust. It’s about shared understanding and responsibilities between developers and operations. Shorten the development life cycle by including monitoring throughout—from planning to operations.

Reasons

Monitoring enables a system to tell you what’s going on. Is the system behaving normally? Is there something broken or about to break? There are many reasons to monitor a system, including the following five:

Alerting: When something happens inside the system, you want to know about it. Just sitting there, looking at the monitoring information, and waiting for something to happen, will not cut it; this is where alerting comes into play. Monitoring allows you to set up alerts based on what is happening.
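To make the idea concrete, here is a minimal sketch of an alert rule. It does not use any particular monitoring tool; the `AlertRule` class, the `evaluate` function, and the 5% error-rate threshold are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AlertRule:
    """A hypothetical alert rule: fire when `condition` holds for a metric value."""
    name: str
    condition: Callable[[float], bool]
    message: str


def evaluate(rule: AlertRule, value: float) -> Optional[str]:
    """Return an alert message if the rule fires, otherwise None."""
    if rule.condition(value):
        return f"[ALERT] {rule.name}: {rule.message} (value={value})"
    return None


# Assumed example: alert when the error rate exceeds 5%.
high_error_rate = AlertRule(
    name="high_error_rate",
    condition=lambda rate: rate > 0.05,
    message="error rate above 5%",
)

print(evaluate(high_error_rate, 0.02))  # rule does not fire
print(evaluate(high_error_rate, 0.08))  # rule fires
```

In a real setup, the monitoring system evaluates rules like this on a schedule and routes the resulting message to the people or tools that need it.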

Analyzing trends: Systems change over time. Databases grow. The number of people using the application increases. These are just some examples of trends that you want to be able to recognize. Monitoring enables you to analyze these trends and act on them, for example by planning for extra capacity.
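As a rough illustration of trend analysis, the sketch below fits a straight line through daily disk-usage samples and estimates how many days remain until a given capacity is reached. The function names, the sample values, and the assumption of linear growth are all hypothetical; real growth is rarely this tidy.

```python
def fit_line(ys):
    """Least-squares fit of y = slope * x + intercept, with x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def days_until_full(daily_usage_gb, capacity_gb):
    """Estimate the days from the last sample until usage reaches capacity.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = fit_line(daily_usage_gb)
    if slope <= 0:
        return None
    day_full = (capacity_gb - intercept) / slope
    # Count from the last observed day, never below zero.
    return max(0.0, day_full - (len(daily_usage_gb) - 1))


# Assumed sample data: usage grows by 10 GB per day.
print(days_until_full([100, 110, 120, 130, 140], capacity_gb=200))
```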

Experimentation: Monitoring allows you to experiment scientifically. You’re able to determine if the changes have the desired effect. What is the system’s performance after adding an extra node or implementing a cache? You’ll never really know without monitoring the system.
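A simple way to check whether a change had the desired effect is to compare a metric before and after the change. The sketch below compares median latencies; the `relative_change` helper and the sample values are assumptions for illustration, and a real experiment would also need enough samples and a proper statistical test.

```python
from statistics import median


def relative_change(before, after):
    """Relative change of the median; -0.25 means the median dropped by 25%."""
    base = median(before)
    if base == 0:
        raise ValueError("baseline median is zero")
    return (median(after) - base) / base


# Assumed latency samples in milliseconds, before and after adding a cache.
before_ms = [180, 200, 220]
after_ms = [140, 150, 160]
print(f"median latency changed by {relative_change(before_ms, after_ms):.0%}")
```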

DevOps aims to shorten the development life cycle by developing faster, testing regularly, and releasing more frequently while maintaining quality. Including monitoring provides visibility throughout the entire development life cycle, from planning and development through testing and operations.

Visibility allows unified DevOps teams to respond quickly to problems that affect the user experience. Shifting left in the development life cycle lets you detect problems sooner and minimize production incidents. This approach embraces the “you build it, you run it” principle, where developers have operational responsibilities to enhance the quality of the services.

[Figure: The DevOps infinity loop.]

Debugging: When something happens with the system, you want to know why it happened. The latency just went through the roof; what else happened at the same time? The ability to dive deep into the system and find and solve the root cause is crucial for the system’s reliability.

Automation: Combining monitoring with event-based automation allows you to respond to events automatically and helps you minimize downtime. Automation reduces the need for human intervention, allowing you to spend more time on other topics.

Examples of automated responses to events include:

  • Scale the system based on the current load.
  • Restart a service when a specific error occurs.
  • Create an incident ticket when a particular incident occurs.
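The three examples above can be sketched as a small dispatch table that maps an event type to an action. The event names, the event fields, and the handler functions are hypothetical; the handlers only return a description of what they would do instead of touching a real system.

```python
from typing import Callable, Dict


# Hypothetical handlers; each returns a description instead of acting.
def scale_out(event: dict) -> str:
    return f"scale out: load={event['load']}"


def restart_service(event: dict) -> str:
    return f"restart service: {event['service']}"


def create_ticket(event: dict) -> str:
    return f"create ticket: {event['summary']}"


# Assumed mapping from event type to automated action.
HANDLERS: Dict[str, Callable[[dict], str]] = {
    "high_load": scale_out,
    "service_error": restart_service,
    "incident": create_ticket,
}


def handle(event: dict) -> str:
    """Run the action registered for the event type, if there is one."""
    handler = HANDLERS.get(event["type"])
    if handler is None:
        return f"no automation for event type {event['type']!r}"
    return handler(event)


print(handle({"type": "high_load", "load": 0.93}))
print(handle({"type": "service_error", "service": "api"}))
print(handle({"type": "disk_full"}))
```

Keeping the mapping in one place makes it easy to see which events are automated and which still need a human.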

The four golden signals

There are many signals you can monitor. But what signals matter for the performance and, eventually, the user experience? The four golden signals—defined by Google’s Site Reliability Engineers (SREs)—are latency, traffic, errors, and saturation. When approaching this from a security or business perspective, these signals revolve around reliability and change.

Latency: How much time does it take to service a request? The lower the time, the better. Latency occurs, for example, between the client and the serving API, or between the server and the database. Track the latency of successful and failed requests separately. Failed requests are often served quicker, so mixing them with successful requests skews the numbers. Do not filter out the failed requests, though; a slow error is worse than a fast error.
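The sketch below shows one way to keep the latencies of successful and failed requests apart and to read a percentile for each. The `LatencyTracker` class is hypothetical and keeps all samples in memory, which a real monitoring system would replace with histograms or a time-series database.

```python
import math
from collections import defaultdict


class LatencyTracker:
    """Hypothetical tracker that stores latencies per outcome."""

    def __init__(self):
        # "success" or "failure" -> list of latencies in seconds
        self._samples = defaultdict(list)

    def record(self, seconds: float, ok: bool) -> None:
        self._samples["success" if ok else "failure"].append(seconds)

    def percentile(self, outcome: str, p: float) -> float:
        """Nearest-rank percentile (0 < p <= 100) for one outcome."""
        data = sorted(self._samples[outcome])
        if not data:
            raise ValueError(f"no samples for {outcome!r}")
        rank = math.ceil(p / 100 * len(data))
        return data[max(rank, 1) - 1]


tracker = LatencyTracker()
for seconds in (0.1, 0.2, 0.3, 0.4):
    tracker.record(seconds, ok=True)
tracker.record(0.01, ok=False)  # a fast error
tracker.record(2.0, ok=False)   # a slow error, the worse kind

print(tracker.percentile("success", 50))
print(tracker.percentile("failure", 100))
```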

Traffic: How much demand is being placed on the system? We measure traffic using a high-level, system-specific metric, so it differs per system. For a storage system, for example, traffic might be measured in transactions per second. For a web service, it is usually HTTP requests per second.
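As an illustration, requests per second can be computed over a sliding time window. The `RequestRate` class below is a hypothetical sketch; it assumes timestamps arrive in non-decreasing order and takes the current time as an argument so that the example is deterministic.

```python
from collections import deque


class RequestRate:
    """Hypothetical sliding-window counter for requests per second."""

    def __init__(self, window_seconds: float = 60.0):
        self.window = window_seconds
        self._timestamps = deque()

    def record(self, now: float) -> None:
        """Register one request at time `now` (in seconds)."""
        self._timestamps.append(now)

    def per_second(self, now: float) -> float:
        """Average requests per second over the last `window` seconds."""
        # Drop requests that fall outside the window.
        while self._timestamps and self._timestamps[0] <= now - self.window:
            self._timestamps.popleft()
        return len(self._timestamps) / self.window


rate = RequestRate(window_seconds=10.0)
for t in range(20):  # one request per second for 20 seconds
    rate.record(float(t))
print(rate.per_second(now=19.0))
```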

Errors: Is the system successfully serving requests? Tracking the rate of failed requests is essential to determine if the system is operating correctly. A failed request can be explicit, for example, an HTTP 500 (server error) response. It can be implicit, for example, an HTTP 200 (success) response with the wrong content. Or it can fail by policy, for example, a response that takes longer than one second.
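The three kinds of failure can be expressed as a small classifier. This is a sketch under assumptions: the function names are hypothetical, only 5xx status codes count as explicit failures, `body_ok` stands in for whatever content check fits your service, and the one-second limit is the policy example from the text.

```python
from typing import Iterable, Optional, Tuple


def classify_failure(status: int, body_ok: bool, seconds: float,
                     max_seconds: float = 1.0) -> Optional[str]:
    """Return the kind of failure, or None when the request succeeded."""
    if status >= 500:
        return "explicit"  # the server reported an error
    if not body_ok:
        return "implicit"  # success status, but wrong content
    if seconds > max_seconds:
        return "policy"    # correct, but slower than the policy allows
    return None


def error_rate(requests: Iterable[Tuple[int, bool, float]]) -> float:
    """Fraction of requests that failed in any of the three ways."""
    results = [classify_failure(*request) for request in requests]
    if not results:
        return 0.0
    failed = sum(1 for result in results if result is not None)
    return failed / len(results)


# Assumed sample requests: (status, body_ok, seconds)
sample = [(500, True, 0.1), (200, False, 0.1), (200, True, 1.5), (200, True, 0.2)]
print(error_rate(sample))
```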

Saturation: What’s the utilization of the system? Are we reaching the maximum? By measuring the most constrained resources, you can determine how full the service is. For example, in an I/O-constrained storage system, monitor the I/O; in a memory-constrained system, monitor the memory. In more complex systems, latency increases can be an indicator of saturation.

The performance of the system usually degrades before reaching 100% utilization. Having a utilization budget or target is crucial. Set up alerts for these targets to be proactively informed about reaching the system’s limits.
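A utilization target can be checked with a few lines of code. The sketch below is hypothetical; the 80% target is an assumed value, and the right target depends on where your system starts to degrade.

```python
def saturation_status(used: float, capacity: float, target: float = 0.8) -> str:
    """Compare utilization against an assumed target (80% by default)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    utilization = used / capacity
    if utilization >= 1.0:
        return "saturated"
    if utilization >= target:
        return "over_target"  # time to alert, before the limit is reached
    return "ok"


print(saturation_status(used=50, capacity=100))
print(saturation_status(used=85, capacity=100))
print(saturation_status(used=100, capacity=100))
```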

Tracking the saturation of the system supports capacity planning. For example, it helps you decide when to add extra memory or storage to keep up with the usage.

Simplicity

Just as with other systems, monitoring can become complex, fast. A complex monitoring system can become fragile, hard to change, and a lot of work to maintain. Therefore, it’s essential to keep it as simple as possible. There is no one-size-fits-all set of best practices, but the monitoring system should always answer two questions: what’s broken, and why?

When designing the monitoring system, keep the following in mind:

  • Keep the collection and aggregation of metrics balanced and straightforward.
  • Minimize the number of alerting rules.
  • Remove signals that are not used in any dashboard or rule.
  • Automate actions where possible.

When setting up alerts, keep the following in mind:

  • Establish a baseline: the ‘normal’ behavior of the system.
  • Calibrate the alerts using the established baseline.
  • Limit the number of people that an alert notifies.
  • Ensure that the alerts are actionable.
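One common way to calibrate an alert against a baseline is to place the threshold a number of standard deviations above the baseline mean. The sketch below assumes that approach; the function names, the factor `k = 3`, and the sample values are illustrative, and other calibration methods may suit your system better.

```python
from statistics import mean, stdev


def alert_threshold(baseline, k: float = 3.0) -> float:
    """Threshold at the baseline mean plus k sample standard deviations."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    return mean(baseline) + k * stdev(baseline)


def is_anomalous(value: float, baseline, k: float = 3.0) -> bool:
    """True when the value exceeds the calibrated threshold."""
    return value > alert_threshold(baseline, k)


# Assumed baseline: requests per second during normal operation.
normal_rps = [10, 12, 11, 9, 13]
print(is_anomalous(12, normal_rps))
print(is_anomalous(20, normal_rps))
```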

Continuously review and optimize the monitoring system to ensure it stays simple and effective.

Summary

That’s it! Thank you so much for taking the time to read this post. We went through five reasons to monitor a system, the four golden signals, and what to consider when designing a monitoring system and setting up alerts.

Effective monitoring relies on a culture of collaboration and trust and should not be underestimated. Just as with other systems, monitoring can become complex, fast. Therefore, it’s essential to keep it as simple as possible.

The monitoring system should answer two questions: what’s broken, and why? Shorten the development life cycle while maintaining quality by including monitoring throughout—from planning to operations. Continuously review and optimize to make sure it stays simple and effective.