High Availability & Fault Tolerance for the Monitoring Stack

  • If only a single Prometheus instance is running, metric data can be lost when that instance crashes or hits an error, because nothing is left to scrape the targets during the outage.
  • Running multiple Prometheus instances that scrape the same targets is a good way to avoid this gap.
  • If only a single Grafana instance is running, no metric data is lost when it goes down, since Grafana only queries the Prometheus time-series database to display the data; dashboards are simply unavailable until it is back.
  • If Grafana crashes, it can be restarted manually or by an automated cron job.
  • If only a single Alertmanager instance is running, a crash or error can lose the state of some alerts or prevent some notifications from being sent.
  • Running multiple Alertmanager instances is a good solution.
  • If only a single Pushgateway instance is running, metric data can be lost when that instance crashes or hits an error, because the jobs pushing metrics have no receiver.
  • Running multiple Pushgateway instances can be a good solution.
  • Prometheus
  • Alertmanager
  • Pushgateway

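For the automated Grafana restart mentioned above, a cron job could run a small watchdog like the following sketch. It assumes Grafana listens on its default port 3000, exposes its `/api/health` endpoint there, and was installed as the `grafana-server` systemd service; the function names are hypothetical, and both the URL and the restart command would need adjusting for other setups (for example, containers).

```python
import subprocess
import urllib.request

# Assumption: default Grafana port on the local machine.
HEALTH_URL = "http://localhost:3000/api/health"
# Assumption: Grafana installed from a package as a systemd service.
RESTART_CMD = ["systemctl", "restart", "grafana-server"]


def grafana_is_healthy(url=HEALTH_URL, timeout=5):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeout, or an HTTP error status.
        return False


def ensure_grafana_running(check=grafana_is_healthy, restart=None):
    """Restart Grafana only if the health check fails.

    Returns True if a restart was triggered, False otherwise.
    """
    if check():
        return False
    if restart is None:
        restart = lambda: subprocess.run(RESTART_CMD, check=True)
    restart()
    return True
```

A script that calls `ensure_grafana_running()` and is scheduled by cron every minute or so would bring Grafana back shortly after a crash. Since no metric data lives in Grafana itself, a short gap like this only affects dashboard availability.
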
Shishir Khandelwal


I spend my day learning AWS, Kubernetes & Cloud Native tools. Nights on LinkedIn & Medium. Work: Engineering @ PayPal.