
Every Saturday morning, I record a set of health metrics in a spreadsheet. I weigh myself, take my blood pressure, check my resting heart rate, and note how long I slept each night and how much exercise I did that week. I also go for a 5 km run and record the time and how I feel. This acts as a system check. I have been doing this for several years, and I now have a lot of data.
This data shows me what constitutes my “normal,” allowing me to see shifts over time and detect any abnormalities. For instance, a gradual increase in my weight could prompt me to reevaluate my diet, while elevated blood pressure might lead me to seek medical advice. The modest effort I invest each Saturday morning gives me an enlightening view of my health and fuels my motivation to keep improving my fitness.
Monitoring the System
With the factory and citadel in place, you have an automated system for safely deploying the application and an environment in which to run it securely. However, there are a lot of moving parts, and it is going to be tough to understand what is happening. It will be difficult to notice problems, let alone fix them. This is why you should start collecting metrics and logs and making them available in a central place, as I do with my spreadsheet. By understanding what is normal, tracking what changes over time, and noticing any anomalies, you will understand the health of the system and identify opportunities for improvement.
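The idea of learning what is normal and then noticing anomalies can be sketched in a few lines of Python. This is only an illustration of the principle, not how Google Cloud detects anomalies: the function name `find_anomalies`, the trailing window of eight readings, the threshold of three standard deviations, and the sample heart rate values are all assumptions made up for this example.

```python
from statistics import mean, stdev


def find_anomalies(readings, window=8, threshold=3.0):
    """Return (index, value) pairs that deviate sharply from the trailing baseline.

    The baseline for each reading is the mean and sample standard deviation
    of the `window` readings that precede it (hypothetical defaults).
    """
    anomalies = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline gives no scale to measure deviation against.
            continue
        z_score = (readings[i] - mu) / sigma
        if abs(z_score) > threshold:
            anomalies.append((i, readings[i]))
    return anomalies


# Made-up weekly resting heart rate readings with one obvious outlier.
resting_heart_rate = [58, 57, 59, 58, 60, 57, 58, 59, 58, 71, 58, 57]
print(find_anomalies(resting_heart_rate))  # [(9, 71)]
```

The same pattern applies to a system metric such as request latency: a baseline built from recent history makes an unusual value stand out, whereas a single reading viewed in isolation tells you very little.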
In an on-premises environment, you have the advantage of physical access to the hardware for metric collection. However, this is not an option in Google Cloud or any other cloud environment, as the hardware is owned by the service provider. Fortunately, Google Cloud is engineered with built-in metrics, logging, and tracing from the ground up, and these signals are centrally aggregated. The platform automatically collects thousands of metrics, which you can supplement with custom metrics from your applications for a full picture.
The crux is that, while most of the data is readily available, you need an observatory: a central point from which to monitor this vast universe of data. This chapter will guide you in building that observatory.
Note
The code for this chapter is in the observatory folder of the GitHub repository.
Site Reliability Engineering
Operating an application in a cloud environment is a discipline in its own right. Site reliability engineering (SRE) is Google’s preferred approach, and the tools supplied by Google Cloud (as you would expect) support SRE. There are three excellent free books on the subject available at the Google SRE website. These O’Reilly books are also highly recommended: Building Secure and Reliable Systems by Heather Adkins et al. and Observability Engineering by Charity Majors et al.
This chapter will not delve into the mechanics of SRE per se. Instead, it will introduce you to a collection of tools specifically designed to monitor applications operating on Google Cloud. Gaining insights into your application’s behavior is critical for identifying, debugging, and rectifying issues.