Rancher: Part 4 - Monitoring

Welcome to part 4 of a multi-part series where I will walk through the setup and configuration of Rancher.

In the previous posts, I walked through the installation of Rancher, deploying the server and 3 hosts that can run dockerized workloads. I then deployed a Jenkins Stack and a Load Balancer from the default catalogs. In part 3, I created a private catalog and added a service to it.

In this post I will show you how to monitor Rancher and the services deployed within it.

My, very personal, Needs

If I am going to move some of my services to Docker and Rancher, there are a couple of things I need. They are not very complex, but... I want to know:

  • How many containers are running. It'd be great if I had a way to show the number of containers over time as it would help me figure out if workloads had been added or removed.
  • Some basic node statistics, like CPU, memory, and disk utilization. This helps me figure out if I need to add more nodes to my system.
  • When 'bad things' happen. For me, the list of bad things is fairly simple:
    • A node goes down
    • A container that used to be running, no longer is
    • Some service becomes unavailable. To simplify this, all I really need to know is if a publicly accessible port/service goes away.

All that said, if I just knew about the bad things... and received an email when they happened... I'd be a happy camper.
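To make the "publicly accessible port goes away" check concrete, here is a minimal sketch in Python. It is not something I have running; the function names and the example addresses are made up for illustration. It tries a TCP connection to each endpoint and builds (but does not send) an email listing the ones that did not answer.

```python
import socket
from email.message import EmailMessage


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS failures.
        return False


def check_endpoints(endpoints):
    """Return the (host, port) pairs that are not accepting connections."""
    return [(host, port) for host, port in endpoints if not port_is_open(host, port)]


def build_alert(failed, sender, recipient):
    """Build an email message describing the unreachable endpoints."""
    msg = EmailMessage()
    msg["Subject"] = f"{len(failed)} service(s) unreachable"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [f"{host}:{port} is not accepting connections" for host, port in failed]
    msg.set_content("\n".join(lines))
    return msg
```

Run from cron every few minutes, something like this would cover the last bullet above. Actually sending the message is left out on purpose, since it depends on which mail server you have; with the standard library it would go through `smtplib.SMTP(...).send_message(msg)`.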

Other organizations might have many more requirements (historical trending, chargeback/showback, SNMP traps for alerts, etc.) but my needs are simple.

Preconfigured Monitoring Options

There are a few monitoring options built into the default catalogs.

Datadog

Datadog is a cloud monitoring service that is extremely powerful. There are built-in catalog entries that deploy the Datadog agent containers on the Rancher nodes and start reporting details to the service. Datadog offers multiple tiers of service, including one that is free.

I signed up for my free account and then deployed the Datadog services. Like most things with Rancher, this was painless and just worked. Returning to the Datadog site, I logged in and created a dashboard that showed some details: number of containers, CPU and memory usage by container, network tx/rx information. All very cool stuff. There are integrations with almost every conceivable cloud service and/or application, so you should be able to build any kind of reporting you want and/or need.

Realizing that Datadog could report on everything I need, I turned to alerting. I created a monitor to let me know when no data had been reported from an agent. This would tell me when the Datadog agent, the entire host, or the network between my house and the Datadog servers had failed. It was simple to create the monitor, and I tested it by shutting down one of my nodes. Sure enough, the monitor started showing that no data had been received from that node and the alert triggered.

Datadog clearly is a very rich service that would support everything I need. There is, however, a downside.

The Datadog free tier does not allow for setting up monitors and alerts. Since my most critical need is to know when something fails, I'd need to move to their paid service, which runs $15 per node per month. To be fair to Datadog, that's only $45/month for three nodes, and if I were an organization that depended on services being available, it would be a reasonable price. But I'm just a guy hacking at a system that can afford a little downtime until I happen to notice something is down.

Prometheus

Prometheus is another offering directly in the Rancher catalog. Similar to Datadog, Prometheus gathers a bunch of data and lets you build dashboards and monitors based on the data. Deployment is as simple as any other Rancher stack, just supply a couple of values in a form and Rancher takes care of it for you.

I had very high hopes for Prometheus. In my past professional life, I did some work on/with an operational management tool that gathered data, pushed it to a time series database, and let you build reports and alerts against the data. During this time, I had often looked at Prometheus as a better solution than what we had. After playing with Prometheus, I am not so sure anymore.

It is certainly powerful. It collected a ton of data and provided some out-of-the-box reports that were interesting: CPU usage over time, number of containers over time, and similar. When it came to configuring a monitor, however, I could not figure it out. The alerting functionality in Prometheus is still very early and I hope they'll get it working in an easier fashion, but for now... I'll admit that I lost patience with it (especially after seeing how easy it was with Datadog).

If anyone reading this has some pointers for me with setting up monitoring and alerting in Prometheus, please let me know. I'd love to try it out!

My Opinion

Well... I am not sure what to say. Datadog ticked all the boxes but I just can't pull the trigger at $15/month/node. If I were running this for my employer or relied upon services being up more, it'd be a no-brainer as the price is really reasonable. But for home hobbyists, it's overkill.

What I will likely end up doing is building something custom. My needs are so simple that a series of bash scripts to query nodes and containers is not too terribly difficult. Maybe I'll even turn it into a service in my private catalog.
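As a rough idea of what that custom check might look like, here is a minimal sketch. It is written in Python rather than bash, and it is an illustration rather than the script I will actually end up with. It assumes the `docker` CLI is available on the node, and the function names are my own invention. It lists the running container names with `docker ps` and compares two snapshots to find containers that used to be running and no longer are.

```python
import subprocess


def running_containers():
    """Return the set of running container names on this host.

    Assumes the docker CLI is installed and the current user may call it.
    """
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}


def diff_snapshots(previous, current):
    """Compare two sets of container names taken at different times."""
    return {
        "stopped": sorted(previous - current),  # was running, now gone
        "started": sorted(current - previous),  # newly appeared
        "count": len(current),                  # number running right now
    }
```

Saving each snapshot to a file and calling `diff_snapshots(last_snapshot, running_containers())` on a schedule would give both the container count over time and the list of containers that disappeared since the last run.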