In today’s world, applications are becoming more demanding and complex. So are the systems running the applications. Effectively monitoring a complex application and system is a lot of work and should not be underestimated.

Why should you monitor a system? And which signals should you pay attention to when setting up monitoring? In this post, we'll go through five reasons to monitor a system and the four golden signals.

Effective monitoring relies on a culture of collaboration and trust. It’s about shared understanding and responsibilities between developers and operations. Shorten the development life cycle by including monitoring throughout—from planning to operations.


Reasons

Monitoring enables a system to tell you what’s going on. Is the system behaving normally? Is there something broken or about to break? There are many reasons to monitor a system, including the following five:

Alerting

When something happens inside the system, you want to know about it. Sitting in front of a dashboard and waiting for something to happen will not cut it; this is where alerting comes into play. Monitoring allows you to set up alerts based on what is happening.
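As a minimal sketch of the idea, an alert can be modeled as a rule that compares a metric against a threshold. The `AlertRule` class, the metric names, and the threshold values below are all hypothetical and chosen only for illustration; a real monitoring tool has its own rule format.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    """A hypothetical alert rule: fire when a metric exceeds a threshold."""
    name: str
    metric: str
    threshold: float


def evaluate(rules, sample):
    """Return the names of all rules whose metric exceeds its threshold."""
    fired = []
    for rule in rules:
        value = sample.get(rule.metric)
        # A metric missing from the sample never fires in this sketch.
        if value is not None and value > rule.threshold:
            fired.append(rule.name)
    return fired


# Illustrative rules and a single made-up measurement.
rules = [
    AlertRule("HighErrorRate", "error_rate", 0.05),
    AlertRule("HighLatency", "p95_latency_ms", 500.0),
]
sample = {"error_rate": 0.08, "p95_latency_ms": 320.0}

print(evaluate(rules, sample))  # ['HighErrorRate']
```

In practice the returned names would be handed to a notification channel instead of printed.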

Analyzing trends

Systems change over time. Databases grow. The number of people using the application increases. These are just some examples of trends that you want to be able to recognize. Monitoring enables you to analyze these trends and act on them, for example by planning for extra capacity.
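One simple way to turn a trend into a planning decision is to fit a straight line through historical usage and estimate when it crosses the capacity limit. The sketch below assumes roughly linear growth, which is often not the case in real systems; the function names and the numbers are made up for illustration.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_full(days, used_gb, capacity_gb):
    """Estimate the days left after the last sample until capacity is reached.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = linear_fit(days, used_gb)
    if slope <= 0:
        return None
    full_day = (capacity_gb - intercept) / slope
    return max(0.0, full_day - days[-1])


# Made-up daily samples of database size in GB.
days = [0, 1, 2, 3, 4]
used_gb = [100.0, 110.0, 120.0, 130.0, 140.0]

print(days_until_full(days, used_gb, capacity_gb=200.0))  # 6.0
```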

Experimentation

Monitoring allows you to experiment scientifically. You’re able to determine if the changes have the desired effect. How is the performance of the system after adding an extra node or implementing a cache? You’ll never really know without monitoring the system.
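To make this concrete, here is a small sketch that compares latency samples taken before and after a change, such as adding a cache. The sample values are invented, and comparing medians is only one possible choice; a real experiment would also need enough samples and comparable load on both sides.

```python
from statistics import median


def relative_change(before, after):
    """Relative change of the median, e.g. -0.25 means 25% lower."""
    baseline = median(before)
    return (median(after) - baseline) / baseline


# Invented request latencies in milliseconds.
before_ms = [120, 130, 125, 140, 135]
after_ms = [80, 85, 90, 95, 100]

change = relative_change(before_ms, after_ms)
print(f"Median latency changed by {change:+.1%}")
```

A negative value means the median latency dropped after the change.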

With DevOps, the goal is to shorten the development life cycle by developing faster, testing regularly, and releasing more frequently while maintaining quality. Including monitoring provides visibility throughout the entire development life cycle, from planning through development and testing to operations.

Visibility allows unified ‘DevOps’ teams to respond quickly to problems that affect the user experience. Shifting left in the development life cycle enables you to detect problems sooner and minimize production incidents. This approach embraces the ‘you built it, you run it’ principle, where developers have operational responsibilities to enhance the quality of the services.

Figure: DevOps life cycle

Debugging

When something happens with the system, you want to know why it happened. The latency just went through the roof; what else happened at the same time? The ability to dive deep into the system and find and solve the root cause is crucial for the system’s reliability.

Automation

Combining monitoring and event-based automation allows you to respond to events automatically and helps you minimize downtime. Automation reduces the need for human intervention, allowing you to spend more time on other topics.

Examples of such automated actions include:

  • Scale the system based on the current load.
  • Restart a service when a specific error occurs.
  • Create an incident ticket when a specific incident occurs.
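The actions above can be sketched as a mapping from event types to handlers. Everything here is hypothetical: the event types, the event fields, and the handlers, which only return a description of what they would do instead of calling a real orchestrator or ticketing system.

```python
def scale_out(event):
    # Placeholder: a real handler would call the platform's scaling API.
    return f"scale {event['service']} to {event['replicas']} replicas"


def restart_service(event):
    # Placeholder: a real handler would restart the affected service.
    return f"restart {event['service']}"


def create_ticket(event):
    # Placeholder: a real handler would call the ticketing system.
    return f"open ticket: {event['summary']}"


# Hypothetical mapping of event types to automated actions.
HANDLERS = {
    "high_load": scale_out,
    "service_error": restart_service,
    "incident": create_ticket,
}


def handle(event):
    """Dispatch an event to its handler; unknown events trigger no action."""
    handler = HANDLERS.get(event["type"])
    if handler is None:
        return "no action"
    return handler(event)


print(handle({"type": "high_load", "service": "api", "replicas": 5}))
print(handle({"type": "unknown"}))
```

Keeping the mapping in one place makes it easy to see which events are automated and which still need a human.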

The four golden signals

There are many signals you can monitor, but which ones matter for performance and, ultimately, the user experience? The four golden signals, defined by Google’s Site Reliability Engineers (SREs), are latency, traffic, errors, and saturation. These signals focus on reliability; the relevant signals change when you approach monitoring from a security or business perspective.

Latency

How much time does it take to service a request? The lower the time, the better. Latency occurs, for example, between the client and the serving API, or between the server and the database. Track successful and failed requests separately, because failed requests are usually served quicker and would skew the numbers. Do not filter out the failed requests, though; a slow error is worse than a fast error.
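A minimal sketch of tracking latency separately per outcome could look like this. The `LatencyTracker` class is hypothetical, it keeps all samples in memory, and it uses the simple nearest-rank method for percentiles; production systems typically use histograms instead.

```python
import math
from collections import defaultdict


class LatencyTracker:
    """Hypothetical in-memory tracker with separate buckets per outcome."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, duration_ms, ok):
        outcome = "success" if ok else "failure"
        self._samples[outcome].append(duration_ms)

    def percentile(self, outcome, pct):
        """Nearest-rank percentile for one outcome; None without samples."""
        data = sorted(self._samples[outcome])
        if not data:
            return None
        rank = max(1, math.ceil(pct / 100 * len(data)))
        return data[rank - 1]


tracker = LatencyTracker()
for duration in (10, 20, 30, 40):
    tracker.record(duration, ok=True)
tracker.record(5, ok=False)    # a fast error
tracker.record(900, ok=False)  # a slow error, the worse kind

print(tracker.percentile("success", 50))  # 20
print(tracker.percentile("failure", 95))  # 900
```

Because the buckets are separate, the fast error does not pull down the success percentile, and the slow error stays visible.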

Traffic

How much demand is being placed on the system? We measure traffic using a high-level, system-specific metric, so it differs per system. For a storage system, for example, traffic might be measured in transactions per second. For a web service, it is usually HTTP requests per second.
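As an illustration, requests per second can be derived from request timestamps over a sliding window. The function name, the window length, and the timestamps are assumptions made for this sketch.

```python
def requests_per_second(timestamps, now, window_s=60.0):
    """Average request rate over the last `window_s` seconds.

    `timestamps` and `now` are in seconds on the same clock.
    """
    recent = [t for t in timestamps if now - window_s < t <= now]
    return len(recent) / window_s


# Made-up arrival times in seconds; the last request is in the future
# relative to `now` and is therefore ignored.
arrivals = [0.5, 10.0, 30.0, 59.0, 61.0]

print(requests_per_second(arrivals, now=60.0))                 # 4 requests / 60 s
print(requests_per_second(arrivals, now=60.0, window_s=10.0))  # 1 request / 10 s
```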

Errors

Is the system successfully serving requests? Tracking the rate of failed requests is essential to determine if the system is operating correctly. A failed request can be explicit, for example, an HTTP 500 (server error) response. It can be implicit, for example, an HTTP 200 (success) response with the wrong content. Or it can fail by policy, for example, when the response takes longer than one second.
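The three kinds of failure can be sketched as a small classifier. The function names, the `body_valid` flag (standing in for whatever content check applies), and the one-second limit are assumptions for illustration.

```python
def classify_failure(status, body_valid, duration_s, limit_s=1.0):
    """Return the failure kind for a request, or None if it succeeded."""
    if status >= 500:
        return "explicit"   # the server reported an error
    if not body_valid:
        return "implicit"   # a success status, but wrong content
    if duration_s > limit_s:
        return "policy"     # correct, but slower than the agreed limit
    return None


def error_rate(requests):
    """Fraction of requests that failed in any of the three ways."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if classify_failure(*r) is not None)
    return failed / len(requests)


# Made-up requests as (status, body_valid, duration_s) tuples.
requests = [
    (200, True, 0.2),
    (500, True, 0.1),
    (200, False, 0.3),
    (200, True, 1.5),
]

print([classify_failure(*r) for r in requests])
print(error_rate(requests))  # 0.75
```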

Saturation

What is the utilization of the system? Are we reaching the maximum? By measuring the most constrained resources, you can determine how full the service is. For example, in an I/O-constrained storage system, monitor the I/O, whereas in a memory-constrained system, monitor the memory. In more complex systems, latency increases can be an indicator of saturation.

The performance of the system usually degrades before reaching 100% utilization. Having a utilization budget or target is crucial. Set up alerts for these targets to be proactively informed when the system approaches its limits.
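A utilization target can be checked with a few lines of code. The resource names and target values below are hypothetical; the right targets depend on where a particular system starts to degrade.

```python
def over_target(utilization, targets):
    """Return the resources whose utilization reached their target.

    Both arguments map resource names to fractions between 0 and 1.
    Resources without an explicit target only trigger at 100%.
    """
    return sorted(
        name
        for name, value in utilization.items()
        if value >= targets.get(name, 1.0)
    )


# Hypothetical targets set well below 100% to leave headroom.
targets = {"memory": 0.80, "disk_io": 0.70}
utilization = {"memory": 0.85, "disk_io": 0.40, "cpu": 0.95}

print(over_target(utilization, targets))  # ['memory']
```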

Tracking the saturation of the system supports capacity planning. For example, it helps you decide when to add extra memory or storage to keep up with the usage.


Simplicity

Just as with other systems, monitoring can become complex, fast. A complex monitoring system can become fragile, hard to change, and a lot of work to maintain. Therefore, it’s important to keep it as simple as possible. There is no one-size-fits-all set of best practices. The monitoring system should answer two questions: what’s broken, and why?

When designing the monitoring system, keep the following in mind:

  • Keep the collection and aggregation of metrics balanced and simple.
  • Minimize the number of alerting rules.
  • Remove signals that are not used in any dashboard or rule.
  • Automate actions where possible.
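Finding signals that are collected but never used comes down to a set difference. The metric names below are invented, and the sketch assumes you can export the list of collected metrics and the metrics referenced by dashboards and rules.

```python
def unused_signals(collected, dashboards, rules):
    """Signals that no dashboard and no alerting rule refers to."""
    return sorted(set(collected) - set(dashboards) - set(rules))


# Invented metric names for illustration.
collected = ["http_requests", "http_latency", "gc_pauses", "thread_count"]
used_in_dashboards = ["http_requests", "http_latency"]
used_in_rules = ["http_latency", "gc_pauses"]

print(unused_signals(collected, used_in_dashboards, used_in_rules))
# ['thread_count']
```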

When setting up alerts, keep the following in mind:

  • Establish a baseline, the ‘normal’ behavior of the system.
  • Calibrate the alerts using the established baseline.
  • Limit the number of people that an alert notifies.
  • Ensure that the alerts are actionable.
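One way to calibrate an alert against a baseline is to place the threshold a few standard deviations above the normal mean. This is a sketch under the assumption that the metric is reasonably stable and symmetric; the factor of three and the sample values are arbitrary choices for illustration.

```python
from statistics import mean, stdev


def calibrate_threshold(baseline, k=3.0):
    """Threshold at `k` sample standard deviations above the baseline mean."""
    return mean(baseline) + k * stdev(baseline)


def is_anomalous(value, baseline, k=3.0):
    return value > calibrate_threshold(baseline, k)


# Made-up latency samples (ms) recorded during normal operation.
baseline_ms = [100, 102, 98, 101, 99]

print(round(calibrate_threshold(baseline_ms), 1))
print(is_anomalous(103, baseline_ms))  # False, within normal variation
print(is_anomalous(110, baseline_ms))  # True, well above the baseline
```

Recalculating the threshold as the baseline shifts keeps the alert from firing on normal variation.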

Continuously review and optimize the monitoring to make sure it stays simple and effective.


Conclusion

There it is! Thank you so much for taking the time to read this post. We went through five reasons to monitor a system, the four golden signals, and what to keep in mind when designing a monitoring system and setting up alerts.

Effective monitoring relies on a culture of collaboration and trust and should not be underestimated. Just as with other systems, monitoring can become complex, fast. Therefore, it’s important to keep it as simple as possible.

The monitoring system should answer two questions: what’s broken, and why? Shorten the development life cycle while maintaining quality by including monitoring throughout—from planning to operations. Continuously review and optimize to make sure it stays simple and effective.

As always, I’d love to hear what you think. You can find me on LinkedIn and Twitter. See you next time. Bye for now!