Automated Alert & Aggregation Rule Generation For CockroachDB Metrics

As with any software system, metrics are crucial for understanding a system's inner workings and getting a pulse on how it is functioning. No monitoring and debugging framework is complete without them. To use metrics effectively, however, it is important to understand two things: which aspect of the system a particular metric describes, and how it should be used to interpret the health of the system. Building effective monitoring dashboards and alerts also requires identifying correlations between multiple metrics. Here at Cockroach we have frequently faced problems with the underuse and misuse of metrics, caused by a lack of documentation and guidance on how to aggregate and use them as indicators of system health and performance.
Source: CockroachDB