High Availability & Fault Tolerance for the Monitoring Stack

In this article, we will think through and answer some important questions, such as:

If the monitoring stack looks after the reliability of an application, then who looks after the reliability of the monitoring stack?

How can the monitoring stack be run in a highly available & fault-tolerant manner?

Just so we are on the same page:

HA = High availability

FT = Fault tolerance

Prerequisites

A good understanding of, and some experience with, the Prometheus-Grafana-Alertmanager stack is required to follow the problem outlined in this article & the approach suggested for it.

These detailed guides are a good starting point:

Prometheus & Grafana: https://shishirkh.medium.com/monitoring-stack-setup-part-1-prometheus-grafana-372e5ae25402

Alertmanager & Pushgateway: https://shishirkh.medium.com/monitoring-stack-setup-part-2-alert-manager-push-gateway-3a533461e37c

The flow

Let’s analyze the fault tolerance of each component of the monitoring stack.

Prometheus:

  • If only a single instance of Prometheus is running, metric data is lost whenever that instance crashes or runs into an error, because nothing is left to scrape the targets while it is down.
  • Running multiple instances of Prometheus can be a good solution.

Grafana:

  • If only a single instance of Grafana is running and it crashes, no metric data is lost, since Grafana only queries the Prometheus time-series database to display the metric data. The dashboards are simply unavailable until it is back.
  • If it crashes, it can be restarted manually or via an automated cron job.
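The cron-based restart mentioned above could look roughly like the following Python sketch. It is only an illustration under assumptions that are not part of the article: Grafana listening on localhost:3000, its /api/health endpoint being reachable without authentication, and a package install that registers the grafana-server systemd unit. The function names are hypothetical.

```python
import http.client
import subprocess
import urllib.request
from typing import Callable

# Assumptions (not from the article): Grafana listens on localhost:3000 and
# runs as the "grafana-server" systemd unit.
GRAFANA_HEALTH_URL = "http://localhost:3000/api/health"
RESTART_COMMAND = ["systemctl", "restart", "grafana-server"]


def grafana_is_healthy(url: str = GRAFANA_HEALTH_URL, timeout: float = 5.0) -> bool:
    """Return True if Grafana's health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, broken response, etc.
        return False


def restart_grafana() -> None:
    """Restart the Grafana service; raises if the command fails."""
    subprocess.run(RESTART_COMMAND, check=True)


def ensure_grafana_running(
    is_healthy: Callable[[], bool] = grafana_is_healthy,
    restart: Callable[[], None] = restart_grafana,
) -> bool:
    """Restart Grafana if it is unhealthy. Return True if a restart was triggered."""
    if is_healthy():
        return False
    restart()
    return True
```

To use it from cron, a script would call ensure_grafana_running() once per run and be scheduled every minute or so. The health check and the restart action are passed in as parameters so that the logic can be tried out without touching a real Grafana instance.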

Alertmanager:

  • If only a single instance of Alertmanager is running, a crash or error in that instance has a direct negative impact: the state of some alerts may be lost, and some alerts may never be sent.
  • Running multiple instances of Alertmanager can be a good solution.

Pushgateway:

  • If only a single instance of Pushgateway is running, data can be lost when that instance crashes or runs into an error, because there is no receiver left for the metrics that jobs try to push.
  • Running multiple instances of Pushgateway can be a good solution.

Conclusion

From the above discussion, it is clear that to make the monitoring stack HA & FT, the efforts need to be directed at:

  • Prometheus
  • Alertmanager
  • Pushgateway

How would we do this?

There are multiple approaches to do this; one of them is discussed below.

Running Alertmanagers in cluster mode: Under this model, multiple Alertmanager instances are set up in sync, i.e. the instances are in constant communication with each other.

With the cluster in place, each Prometheus instance is configured to send its alerts to all the Alertmanager instances. Since the Alertmanager instances are in sync with each other, they detect a duplicate alert and send only one notification to the receiver.

In this way, HA & FT of the Alertmanagers can be ensured.
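As a rough illustration of the cluster setup, the Python sketch below builds the launch command for each Alertmanager instance and the matching alerting section for prometheus.yml. The host names and the alertmanager.yml file name are assumptions made for this example; the ports are Alertmanager's defaults (9093 for the API, 9094 for cluster traffic), and --cluster.listen-address and --cluster.peer are the flags Alertmanager uses for clustering.

```python
import json
from typing import Dict, List

# Hypothetical host names, chosen only for this example.
ALERTMANAGER_HOSTS = ["alertmanager01", "alertmanager02", "alertmanager03"]
API_PORT = 9093      # Alertmanager default API/UI port
CLUSTER_PORT = 9094  # Alertmanager default cluster port


def alertmanager_command(host: str, all_hosts: List[str]) -> List[str]:
    """Build the launch command for one instance, peering with all the others."""
    command = [
        "alertmanager",
        "--config.file=alertmanager.yml",  # assumed config file name
        f"--cluster.listen-address=0.0.0.0:{CLUSTER_PORT}",
    ]
    for peer in all_hosts:
        if peer != host:
            command.append(f"--cluster.peer={peer}:{CLUSTER_PORT}")
    return command


def prometheus_alerting_section(all_hosts: List[str]) -> Dict:
    """Build the `alerting` section of prometheus.yml listing every Alertmanager."""
    targets = [f"{host}:{API_PORT}" for host in all_hosts]
    return {"alerting": {"alertmanagers": [{"static_configs": [{"targets": targets}]}]}}


# Print one command per instance, then the Prometheus section as JSON
# (which maps one-to-one onto the YAML structure of prometheus.yml).
for host in ALERTMANAGER_HOSTS:
    print(host, "->", " ".join(alertmanager_command(host, ALERTMANAGER_HOSTS)))
print(json.dumps(prometheus_alerting_section(ALERTMANAGER_HOSTS), indent=2))
```

Note that Prometheus lists every Alertmanager directly instead of going through a load balancer; the deduplication relies on each instance receiving the alert.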

Running multiple Prometheus instances: Under this model, multiple instances of Prometheus would be set up with the same targets.

If one (or two) of the Prometheus instances go down, the remaining ones keep scraping the same targets, so the metric data is still being collected.

Grafana would have each Prometheus instance as a data source. For example, if prometheus01 is down, Grafana can display the metric data by using prometheus02 or prometheus03 as the data source.

In this way, HA & FT of Prometheus can be ensured.
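The article does not spell out how the switch between data sources happens. One possible selection logic, which a helper script or a proxy in front of the Prometheus instances could apply, is sketched below in Python. The instance names follow the example above; the port 9090 (Prometheus's default) and the function names are assumptions, and the check uses Prometheus's /-/healthy endpoint.

```python
import http.client
import urllib.request
from typing import Callable, List, Optional

# Instance names follow the article; port 9090 is assumed (Prometheus default).
PROMETHEUS_URLS = [
    "http://prometheus01:9090",
    "http://prometheus02:9090",
    "http://prometheus03:9090",
]


def prometheus_is_healthy(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if the instance's /-/healthy endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/-/healthy", timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Unreachable host, timeout, HTTP error status, broken response, etc.
        return False


def pick_data_source(
    urls: List[str],
    is_healthy: Callable[[str], bool] = prometheus_is_healthy,
) -> Optional[str]:
    """Return the first healthy Prometheus URL in order, or None if all are down."""
    for url in urls:
        if is_healthy(url):
            return url
    return None
```

Calling pick_data_source(PROMETHEUS_URLS) returns the first instance that answers its health check, so prometheus02 is chosen only when prometheus01 is down. Since each instance scrapes independently, the values they store can differ slightly, which is worth keeping in mind when switching between them.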

Final words

In the next parts of this series, we will see all this in action in a step-by-step guide. Stay tuned!

If you found this post helpful & informative, be sure to follow & leave lots of 👏🏻 Claps 👏🏻 It encourages me to keep writing and helps other people find it :)

I share tips, experiences & articles on my LinkedIn account. You’ll love it if you are into Cloud, DevOps, Kubernetes, Integrations, etc. Follow me on LinkedIn: https://www.linkedin.com/in/shishirkhandelwal/


I spend my day learning AWS, Kubernetes & Cloud Native tools. Nights on LinkedIn & Medium. Work: Engineering @ PayPal.

Shishir Khandelwal
