High Availability & Fault Tolerance for Monitoring Stack

Shishir Khandelwal
3 min read · Dec 27, 2021


In this article, we are going to think through and answer some important questions, such as:

If the monitoring stack looks after the reliability of an application, then who looks after the reliability of the monitoring stack?

How do we run the monitoring stack in a highly available & fault-tolerant manner?

Just so we are on the same page:

HA = High availability

FT = Fault tolerance

Prerequisites

A good understanding of and experience with the Prometheus-Grafana-Alertmanager stack are required to follow the problem outlined in this article and the approach suggested for it.

These detailed guides are a good starting point:

Prometheus & Grafana — https://shishirkh.medium.com/monitoring-stack-setup-part-1-prometheus-grafana-372e5ae25402

Alertmanager & Pushgateway — https://shishirkh.medium.com/monitoring-stack-setup-part-2-alert-manager-push-gateway-3a533461e37c

The flow

Let’s analyze the fault tolerance of each component of the monitoring stack.

Prometheus:

  • If only a single instance of Prometheus is running, there could be a loss of metric data when this instance crashes or runs into an error, as there will be no component left to scrape the metrics.
  • Running multiple instances of Prometheus can be a good solution.

Grafana:

  • If only a single instance of Grafana is running, there won't be any loss of metric data, since Grafana only queries the Prometheus time-series database to display it.
  • If the Grafana instance crashes, it can be restarted manually or via an automated cron job.

Alertmanager:

  • If only a single instance of Alertmanager is running, there will be a negative impact when this instance crashes or runs into an error, as data about some alert states may get lost or some alerts may not be sent at all.
  • Running multiple instances of Alertmanager can be a good solution.

Pushgateway:

  • If only a single instance of Pushgateway is running, there could be a loss of data when this instance crashes or runs into an error, as there will be no receiver left for the pushed metric data.
  • Running multiple instances of Pushgateway can be a good solution.

Conclusion

From the above discussion, it is clear that to make the monitoring stack HA & FT, efforts need to be directed at:

  • Prometheus
  • Alertmanager
  • Pushgateway

How would we do this?

There are multiple approaches to doing this; one of them is discussed below.

Running Alertmanager in cluster mode: Under this model, multiple Alertmanager instances are set up in sync, i.e. the instances are in constant communication with each other.

When this happens, Prometheus is configured to send alerts to all the Alertmanager instances. Since the Alertmanager instances are in sync, they detect duplicate alerts and send only one notification to the receiver.
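As a rough sketch, assuming three Alertmanager replicas reachable at the hypothetical hostnames alertmanager01, alertmanager02 and alertmanager03, there are two pieces to this: each Alertmanager is started with cluster flags so the replicas gossip with each other, and Prometheus lists all of them under its alerting section.

```yaml
# prometheus.yml (excerpt): point Prometheus at every Alertmanager replica.
# Each Alertmanager instance itself runs in cluster mode, started roughly like:
#   alertmanager --config.file=alertmanager.yml \
#     --cluster.listen-address=0.0.0.0:9094 \
#     --cluster.peer=alertmanager02:9094 \
#     --cluster.peer=alertmanager03:9094
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager01:9093   # hostnames and ports here are illustrative
            - alertmanager02:9093
            - alertmanager03:9093
```

With this in place, each firing alert reaches every replica, and the cluster layer deduplicates so the receiver is notified only once.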

In this way, HA & FT of Alertmanager can be ensured.

Running multiple Prometheus instances: Under this model, multiple instances of Prometheus are set up with the same scrape targets.

If one (or more) Prometheus instances go down, the others ensure there is no loss of data.
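As a minimal sketch, assuming two replicas named prometheus01 and prometheus02 (hypothetical names, kept to two for brevity), both instances can simply run the exact same scrape configuration; the optional replica external label only helps tell the two apart later.

```yaml
# prometheus.yml: the same file is shipped to prometheus01 and prometheus02.
global:
  scrape_interval: 15s
  external_labels:
    replica: prometheus01   # set to prometheus02 on the second instance (optional)
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets:
          - app-server-01:9100   # illustrative exporter targets
          - app-server-02:9100
```

Because both replicas scrape the same targets independently, either one holds a complete copy of the metric data.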

Grafana would have each Prometheus instance as a data source. For example, if prometheus01 is down, Grafana can still display the metric data by using prometheus02 or prometheus03 as its data source.
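A minimal Grafana data source provisioning sketch for this, assuming the replicas are reachable at http://prometheus01:9090 and http://prometheus02:9090 (hypothetical URLs):

```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: prometheus01
    type: prometheus
    access: proxy
    url: http://prometheus01:9090
    isDefault: true
  - name: prometheus02
    type: prometheus
    access: proxy
    url: http://prometheus02:9090
```

If prometheus01 becomes unreachable, dashboards can be switched (or templated) to use the prometheus02 data source instead.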

In this way, HA & FT of Prometheus can be ensured.

Final words

In the next parts of this series, we will see all this in action in a step-by-step guide. Stay tuned!

If you found this post helpful & knowledgeable, be sure to follow & leave lots of 👏🏻 Claps 👏🏻 It encourages me to keep writing and helps other people find it :)

I share tips, experiences & articles on my LinkedIn account. You'll love it if you are into Cloud, DevOps, Kubernetes, Integrations, etc. Follow me on LinkedIn — https://www.linkedin.com/in/shishirkhandelwal/

