It’s important that we provide users with the best experience. Part of that means keeping our service available through hardware failures. And when things do go wrong, we need systems in place that monitor key metrics and send alerts to services and to our team. Initially, we chose Datadog as our monitoring solution because it was easy to set up and it provided integrations with the services we used. Then we started scaling our customers’ infrastructure to keep up with demand and saw our infrastructure go from 5 to 500+ servers. This didn’t jibe with Datadog’s per-server cost model, which increased our bill from $75 to $7,500+ per month. To move away, we needed something that auto-discovered new servers, collected host and container metrics, alerted us on abnormal conditions, and offered an easy way to visualize data. We turned to the open-source world and discovered Prometheus, a monitoring solution built by SoundCloud.

Prometheus is a poll-based system: at a regular interval it makes an HTTP request to each metrics exporter it knows about, stores the response in its time-series database, and later sends alerts if necessary. Configuring and starting Prometheus with standard exporters and alerts is straightforward because all of the components are Dockerized. All of the components we used have a UI, as well as a metrics route that exposes Prometheus-formatted metrics. DigitalOcean has a great setup tutorial, and RobustPerception maintains a blog about Prometheus that has been a great resource for us.
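
To make the "metrics route" concrete, here is a minimal Python sketch of what a scraper has to do with the plain-text response an exporter returns. It is not how Prometheus itself is implemented. The sample payload is made up for illustration, and the parser skips details of the real format such as unescaping label values and optional timestamps.

```python
import re

# Matches sample lines such as: metric_name{label="value"} 12.5
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# Matches one label pair inside the braces: key="value"
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and "# HELP" / "# TYPE" comment lines carry no samples.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative payload only; real exporters return many more series.
EXAMPLE_SCRAPE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP node_filesystem_avail_bytes Filesystem space available in bytes.
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="/dev/xvda1",mountpoint="/"} 5.0e+09
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(EXAMPLE_SCRAPE):
        print(name, labels, value)
```

In a real scrape the text would come from an HTTP GET against the exporter's metrics route rather than from a string constant.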

Discovering

To find the IP addresses of the servers to query, Prometheus offers both static file configuration and dynamic service discovery. Since our stack is hosted on EC2, we used the EC2 service discovery mechanism to find instances with specific tags and so determine what to monitor. Having EC2 provide this list is advantageous: with a push-based agent, a server whose agent never started would be invisible to us even though it was up. When we rolled out Prometheus, we discovered ghost servers that Datadog didn’t know about, because the push-based agent couldn’t send information due to infrastructure networking issues.
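
The selection logic is simple enough to sketch in Python. The example below is an illustration, not our production setup: the instance records, the tag names, and the `instance_id` label are all hypothetical, and instead of calling the EC2 API it filters an in-memory list. The output uses the JSON shape that the static file option reads (a list of target groups with `targets` and `labels`), and 9100 is Node exporter's default port.

```python
import json

NODE_EXPORTER_PORT = 9100  # Node exporter's default listen port


def build_targets(instances, required_tags):
    """Return file-based discovery target groups for matching instances.

    An instance matches when every key/value in required_tags is present
    in its tags. The instance record layout here is hypothetical.
    """
    groups = []
    for inst in instances:
        tags = inst.get("tags", {})
        if all(tags.get(key) == value for key, value in required_tags.items()):
            groups.append({
                "targets": [f"{inst['private_ip']}:{NODE_EXPORTER_PORT}"],
                "labels": {"instance_id": inst["id"]},
            })
    return groups


def render_file_sd(instances, required_tags):
    """Serialize the matching target groups as JSON."""
    return json.dumps(build_targets(instances, required_tags), indent=2)


# Made-up inventory standing in for what EC2 would report.
INSTANCES = [
    {"id": "i-0aaa", "private_ip": "10.0.0.11",
     "tags": {"monitor": "true", "role": "dock"}},
    {"id": "i-0bbb", "private_ip": "10.0.0.12",
     "tags": {"role": "dock"}},
    {"id": "i-0ccc", "private_ip": "10.0.0.13",
     "tags": {"monitor": "true", "role": "api"}},
]

if __name__ == "__main__":
    print(render_file_sd(INSTANCES, {"monitor": "true"}))
```

The important property is that the list of targets comes from the inventory and not from the servers themselves, so an instance that is up but silent still shows up as a failed scrape.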

Collecting

To replicate the Datadog agent, we needed to run two exporters: Node exporter and cAdvisor. To check the health of our instances, we use Node exporter to expose server metrics like CPU, RAM, and disk usage. We use cAdvisor to get similar information about the containers on our instances. You’ll need to get familiar with some of the metric names, as they’re different from what Datadog calls them. Note: the recommendation is to run Node exporter directly on the host rather than in a container, so that it reports the most accurate host metrics, filesystem metrics in particular.

Alerting

Prometheus periodically looks at its data, compares it with the rules it’s given, and sends an alert to Alertmanager if a rule matches. Alertmanager receives alerts from Prometheus and routes them to services that can get your attention, like PagerDuty, Slack, or a custom webhook. In our setup we send all alerts to a Slack channel, software-recoverable alerts to our recovery service, and critical alerts to PagerDuty. Alertmanager can also dedupe and group alerts as they come in, and it can inhibit alerts, which is useful when you want to mute minor alerts while a critical alert is firing. For example, if a host is unreachable, then the CPU and RAM alerts for that host should be muted.
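
The routing and muting rules above can be sketched in a few lines of Python. In practice this lives in Alertmanager's configuration rather than in code; this is only an illustration of the logic. The alert names (`HostUnreachable`, `HighCPU`, `HighRAM`), the `recoverable` label, and the receiver names are all hypothetical.

```python
def route(alert):
    """Return the receivers for one alert, following the rules in the text."""
    labels = alert["labels"]
    receivers = ["slack"]  # every alert goes to the Slack channel
    if labels.get("recoverable") == "true":
        receivers.append("recovery-service")  # software-recoverable alerts
    if labels.get("severity") == "critical":
        receivers.append("pagerduty")  # critical alerts page a human
    return receivers


def inhibit(alerts, source_name="HostUnreachable",
            target_names=("HighCPU", "HighRAM")):
    """Drop target alerts for any instance that also has the source alert."""
    unreachable = {
        a["labels"]["instance"]
        for a in alerts
        if a["labels"]["alertname"] == source_name
    }
    return [
        a for a in alerts
        if not (a["labels"]["alertname"] in target_names
                and a["labels"]["instance"] in unreachable)
    ]


# Made-up alerts: host .11 is unreachable and also reports high CPU.
ALERTS = [
    {"labels": {"alertname": "HostUnreachable", "instance": "10.0.0.11",
                "severity": "critical"}},
    {"labels": {"alertname": "HighCPU", "instance": "10.0.0.11",
                "severity": "warning"}},
    {"labels": {"alertname": "HighRAM", "instance": "10.0.0.12",
                "severity": "warning", "recoverable": "true"}},
]

if __name__ == "__main__":
    for alert in inhibit(ALERTS):
        print(alert["labels"]["alertname"], "->", route(alert))
```

Matching on the `instance` label is what keeps the mute scoped: the RAM alert on the second host still fires because only the first host is unreachable.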

Visualizing

The final component we needed was a dashboard tool. The Prometheus docs recommend Grafana, which uses Prometheus as a datasource and provides many ways to visualize data, including gauges and bar graphs. One advantage Grafana has over Datadog is the ability to export and import dashboards as JSON files. This allowed us to check our dashboards in to GitHub and, thanks to the JSON format, to generate dashboards programmatically. In Datadog, we had issues with alerts and dashboards accidentally getting modified or deleted with no way to recover them. With Prometheus, since these are committed to GitHub, we can easily recover them.
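
As an illustration of generating a dashboard programmatically, here is a small Python sketch that builds a dashboard as a dictionary and serializes it to JSON. It is not our actual generator. The field names (`panels`, `targets`, `expr`, `gridPos`) and the `timeseries` panel type follow the dashboard JSON model as we understand it, and the datasource name `Prometheus` is an assumption, so compare the output against a dashboard exported from your own Grafana version before importing it.

```python
import json

GRID_COLUMNS = 24  # assumed width of Grafana's layout grid
PANEL_HEIGHT = 8


def make_panel(panel_id, title, expr, x, y, width, panel_type="timeseries"):
    """Build one panel that plots a single Prometheus query."""
    return {
        "id": panel_id,
        "type": panel_type,          # assumption: depends on Grafana version
        "title": title,
        "datasource": "Prometheus",  # assumption: name of your datasource
        "targets": [{"refId": "A", "expr": expr}],
        "gridPos": {"x": x, "y": y, "w": width, "h": PANEL_HEIGHT},
    }


def make_dashboard(title, queries, columns=2):
    """Lay out (panel title, query) pairs left to right, row by row."""
    width = GRID_COLUMNS // columns
    panels = []
    for index, (panel_title, expr) in enumerate(queries):
        panels.append(make_panel(
            panel_id=index + 1,
            title=panel_title,
            expr=expr,
            x=(index % columns) * width,
            y=(index // columns) * PANEL_HEIGHT,
            width=width,
        ))
    return {"title": title, "panels": panels}


def dump_dashboard(dashboard):
    """Serialize with sorted keys so diffs in version control stay small."""
    return json.dumps(dashboard, indent=2, sort_keys=True)


# Example queries against Node exporter metrics.
QUERIES = [
    ("Load (1m)", "node_load1"),
    ("Root disk available", 'node_filesystem_avail_bytes{mountpoint="/"}'),
    ("Memory available", "node_memory_MemAvailable_bytes"),
]

if __name__ == "__main__":
    print(dump_dashboard(make_dashboard("Host overview", QUERIES)))
```

Sorting the keys is a deliberate choice: it keeps the committed file stable between runs, so a change in GitHub reflects a real change to the dashboard.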

Datadog provided our team with a lot of value, but it became too costly as we scaled. Prometheus has helped us reduce our monitoring cost by 98% and has all the features Datadog offered, for free! If you have any questions or need help getting set up, feel free to tweet me @anandAKAdjfaze.