A monitoring system is only as valuable as it is dependable. If you can’t trust that you’ll get alerted any time something is wrong with your servers, then your monitoring isn’t worth whatever you’re paying for it. Our goal at Panopta is to let you relax in confidence, knowing that if your phone isn’t going off, your infrastructure is fine, and that whenever there are problems, you’ll know right away.
Everyone knows that Murphy’s law applies especially well to the Internet. Things break in all sorts of different ways, from an obscure bug in code that pops up to take things down to a complete power outage at your datacenter because a truck drives into it. You never know what will break when, so you just have to plan for the worst and make sure your system can adapt and keep running as things go downhill.
This applies to monitoring systems just like everything else. But a monitoring system has one subtle but crucial difference from the typical website or online application, and it further complicates our design. Most websites have regular patterns to their traffic, with peaks during the day when most of their visitors are awake, and deep lulls after hours which are great for doing maintenance. Short disruptions during these times have minimal impact, so maintenance can often be done without raising the ire of customers (or bosses!).
The pattern of a monitoring system is completely different, with a flat load across any time period (well, except for a steady upward trend as we grow!). We perform the same number of checks in the wee hours of the morning as we do in the early afternoon, and our customers rely on us to continuously check their servers regardless of the hour.
We’ve designed our infrastructure from the ground up with this in mind. Planning for every piece of the system to eventually fail has allowed us to keep our system running without disruption for the past four years, through billions of checks and millions of outages, and lets us accommodate regular maintenance by shifting load between pools of resources.
Every piece of our infrastructure is designed with redundancy in mind, starting with our monitoring nodes, which are distributed around the world and hosted in the datacenters of dozens of different providers. Each one of these can (and does) fail at any time. When this happens, our central infrastructure detects the failure and seamlessly moves the node’s checks to a nearby neighbor, fast enough that the overall system never misses a check. Once the node comes back, the checks are returned and the system continues on. The same mechanism allows for rolling upgrades of monitoring nodes when we release new functionality.
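To make the hand-off concrete, here is a minimal Python sketch of the idea, not our production code. The Node class, the fail_node and recover_node helpers, and the crude straight-line distance on latitude and longitude are all assumptions made for illustration. It shows a failed node's checks moving to the nearest healthy neighbor and returning once the node recovers.

```python
from dataclasses import dataclass, field
from math import hypot


@dataclass
class Node:
    """Hypothetical monitoring node; field names are illustrative only."""
    name: str
    lat: float
    lon: float
    healthy: bool = True
    checks: set = field(default_factory=set)       # checks running here right now
    home_checks: set = field(default_factory=set)  # checks normally assigned here


def nearest_healthy(failed, nodes):
    """Pick the closest healthy node (crude distance on raw lat/lon)."""
    candidates = [n for n in nodes if n.healthy and n is not failed]
    if not candidates:
        raise RuntimeError("no healthy node available to take over checks")
    return min(candidates,
               key=lambda n: hypot(n.lat - failed.lat, n.lon - failed.lon))


def fail_node(failed, nodes):
    """Mark a node as failed and move everything it was running to a neighbor."""
    failed.healthy = False
    target = nearest_healthy(failed, nodes)
    target.checks |= failed.checks
    failed.checks = set()
    return target


def recover_node(node, nodes):
    """Mark a node healthy again and hand its own checks back to it."""
    node.healthy = True
    for other in nodes:
        if other is not node:
            other.checks -= node.home_checks
    node.checks = set(node.home_checks)
```

Draining a node for a rolling upgrade could reuse the same two calls: fail_node before the upgrade and recover_node afterwards.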
In the event of a total hardware failure of a node or a sudden spike in demand for monitoring capacity, we can completely reprovision a monitoring node from a bare OS image in about five minutes using a single Fabric command.
Our central infrastructure, which hosts our outage detection, notification, reporting and control panel applications, runs on two clusters of servers in the Dallas and Seattle datacenters of our partner SoftLayer. Configuration and monitoring data is replicated in both directions in near-realtime between locations, making use of SoftLayer’s secure and high-speed private network. All applications are able to run in either location, which gives us the ability to withstand a partial or complete failure of either datacenter and to shift operations around as needed for maintenance purposes.
Fortunately, SoftLayer’s been quite stable over the past four years, and the cases where we have had to fail over primary operations from one location to the other have all been due to planned maintenance that was scheduled in advance. But we sleep well at night knowing that whenever something unexpected does come up, we’re prepared.
Of course, this raises the question of who’s monitoring the monitoring system. We are covered there as well, with a separate system that continually watches all of the core components of our system to ensure that they’re functioning as expected. That means all of our customers can sleep easy at night, knowing that we’ll continue to watch their systems, every minute of the day!