By Kobus Steenekamp, CTO, Autochartist
Don’t get me wrong, we love AWS, and we use a ton of great products from them, but when it comes to cost control, spending can easily get out of hand. Case in point: AWS’s CloudWatch. Like any tech team, we are constantly under pressure to do more cool stuff, faster and cheaper. We’ve also found that CloudWatch isn’t easy to use for services outside of the AWS infrastructure or for services hidden away inside Docker containers.
Instead of CloudWatch we use a little home-grown system, appropriately named “Checks”, which was developed about 12 years ago by one of the Autochartist founders. We’ve used it since forever, and the initial code was so primitive that I suspect the first version was written in Notepad in the middle of the night on a Pentium III computer with a dial-up modem!
Our business is linked to trading on global financial markets, so we need to be up and running pretty much all the time. We only get a small gap for weekly maintenance on Saturdays. To make things even more complex, not every application and service is everybody’s responsibility. That is why we also needed our Checks system to be able to notify different people at different times for different problems.
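To make that routing requirement concrete, here is a minimal sketch of how "different people at different times for different problems" could be expressed. It is not the actual Checks code; the `ROUTES` table, the `recipient_for` function, the categories, and the addresses are all hypothetical.

```python
from datetime import datetime, time

# Hypothetical routing table: (category, window start, window end, recipient).
# A window whose start equals its end is treated as "all day".
ROUTES = [
    ("database", time(8, 0), time(18, 0), "dba-day@example.com"),
    ("database", time(18, 0), time(8, 0), "dba-night@example.com"),
    ("web", time(0, 0), time(0, 0), "web-oncall@example.com"),
]


def in_window(now, start, end):
    """Return True if `now` is in [start, end); windows may wrap past midnight."""
    if start == end:
        return True
    if start < end:
        return start <= now < end
    return now >= start or now < end


def recipient_for(category, when):
    """Return the first recipient whose category and time window match, else None."""
    for route_category, start, end, recipient in ROUTES:
        if route_category == category and in_window(when.time(), start, end):
            return recipient
    return None
```

With this table, a database problem at 09:00 would go to the day address and the same problem at 23:00 to the night address.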
So Checks has developed over time and now consists of a handful of elegantly designed and written classes that span only a few hundred lines of code, but can monitor pretty much anything we throw at it. For example, it can alert us when a server crashes, a service on a server fails, a batch process does not complete, an application error occurs, or an application’s performance degrades.
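To give a feel for how so little code can cover so many cases, here is a minimal sketch of one possible design: every check is just a name plus a function that returns True when things are healthy. This is an illustration under assumptions, not the real Checks classes; `Check`, `port_open`, and `run_checks` are hypothetical names.

```python
import socket
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    name: str
    probe: Callable[[], bool]  # returns True when the target is healthy


def port_open(host, port, timeout=2.0):
    """Example probe: can a TCP connection be opened to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks(checks):
    """Run every check and return the names of those that failed.

    A probe that raises an exception is counted as a failure.
    """
    failed = []
    for check in checks:
        try:
            ok = check.probe()
        except Exception:
            ok = False
        if not ok:
            failed.append(check.name)
    return failed
```

A server check could then be declared as `Check("web-01 ssh", lambda: port_open("web-01.example.com", 22))`, where the host name is a placeholder. Batch-job and performance checks would only need a different probe function.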
Best of all, the entire system runs on a micro EC2 instance, which is incredibly cheap.
Our little home-grown system can also execute custom scripts when alarms trigger, allowing us to restart services or make API calls when certain conditions are met. It can even run native database queries, giving us fine-grained monitoring of data consistency and integrity.
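Both ideas are easy to sketch. The example below is a hedged illustration, not the Checks implementation: `run_action` and `query_check` are hypothetical helpers, the `prices` table is made up, and SQLite stands in for whatever database is actually being monitored.

```python
import sqlite3
import subprocess
import sys


def run_action(command, timeout=30):
    """Run a remediation command (a list of arguments); True if it exits with 0."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return result.returncode == 0


def query_check(conn, sql, expected=0):
    """Healthy when the first column of the first row equals `expected`."""
    row = conn.execute(sql).fetchone()
    return row is not None and row[0] == expected


if __name__ == "__main__":
    # Hypothetical integrity rule: no row may have a missing price.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prices (symbol TEXT, price REAL)")
    conn.execute("INSERT INTO prices VALUES ('EURUSD', NULL)")
    healthy = query_check(conn, "SELECT COUNT(*) FROM prices WHERE price IS NULL")
    if not healthy:
        # Stand-in for a real restart script or API call.
        run_action([sys.executable, "-c", "print('restarting feed service')"])
```

Passing the command as a list rather than a shell string avoids shell quoting problems when a check name or path ends up in the arguments.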
From an alerting standpoint, we are also able to notify people using several different channels. For issues that are not so critical, we use email. For important but routine issues, we use Slack. Finally, for issues that are urgent or critical and need a response from somebody who might not be online right away, we use good old SMS.
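The mapping from severity to channel can be sketched in a few lines. This is an assumption about how such routing might look, not the actual Checks code: `Severity`, `CHANNEL_BY_SEVERITY`, and `notify` are hypothetical names, and the real email, Slack, and SMS senders are left as callables supplied by the caller.

```python
from enum import Enum


class Severity(Enum):
    LOW = 1
    NORMAL = 2
    CRITICAL = 3


# Severity-to-channel mapping as described in the text above.
CHANNEL_BY_SEVERITY = {
    Severity.LOW: "email",
    Severity.NORMAL: "slack",
    Severity.CRITICAL: "sms",
}


def notify(severity, message, senders):
    """Send `message` through the channel for `severity`.

    `senders` maps a channel name to a callable that takes the message;
    the channel name that was used is returned.
    """
    channel = CHANNEL_BY_SEVERITY[severity]
    senders[channel](message)
    return channel
```

Keeping the senders as plain callables means the routing can be exercised with simple stand-ins (for example, functions that append to a list) before any real messages are sent.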
All that said, we’ve found it extremely difficult to beat the cost and functionality of running our home-grown monitoring system.