Lessons from the Recent Basecamp Outage
The weeks of March 4 and March 11 weren’t great for the popular project management and team communication software Basecamp. Basecamp’s CTO, David Heinemeier Hansson (also known as DHH), wrote two very honest posts about the outages and promised to do better for their customers. Luckily, Basecamp had experienced few outages up until now, and even came close to achieving the sought-after 99.999% uptime, so their reputation likely will not suffer too much, especially given the fast-acting and transparent way the company handled reporting the outages.
Why is this outage significant?
We found something rather interesting in DHH’s posts, however, and wanted to explore the cloud outage Basecamp experienced on March 12 from 9:13 PM to 1:53 AM Central time. This outage was caused by a widespread outage at Google Cloud Storage, which took out several of Google’s own applications, such as Gmail, Google Drive, Hangouts, and Google Maps, for several hours over large portions of the globe. You can read more about the Google outage here.
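To put the outage window in perspective against the 99.999% figure, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes availability is measured over a 365-day year and counts the full 9:13 PM to 1:53 AM window as downtime, neither of which DHH's posts confirm. The function names are our own.

```python
# Rough availability arithmetic. Assumption: a 365-day measurement window.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget (in minutes) that a given availability target permits."""
    return period_minutes * (1 - availability_pct / 100)


def availability_pct(downtime_minutes, period_minutes=MINUTES_PER_YEAR):
    """Availability (in percent) implied by a given amount of downtime."""
    return 100 * (1 - downtime_minutes / period_minutes)


# 9:13 PM to midnight, plus midnight to 1:53 AM.
outage_minutes = (24 * 60 - (21 * 60 + 13)) + (1 * 60 + 53)

if __name__ == "__main__":
    print(f"Outage length: {outage_minutes} minutes")
    print(f"Yearly budget at 99.999%: {allowed_downtime_minutes(99.999):.2f} minutes")
    print(f"Yearly budget at 99%: {allowed_downtime_minutes(99) / 60:.1f} hours")
    print(f"Availability after this outage alone: {availability_pct(outage_minutes):.3f}%")
```

Under those assumptions, a single outage of four hours and forty minutes uses up dozens of years' worth of a five-nines downtime budget, which is roughly five minutes per year.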
In his post, DHH mentions several times that the outage is due to their cloud storage provider experiencing an issue, but what caught our attention the most was DHH’s genuine promise to do better for Basecamp users:
“We’re stopping all major product development at Basecamp for the moment and dedicating all our attention to fixing these single points of failure that the recent outages have revealed. We’re also going to pull back from our big migration to the cloud for a while, until we’re able to comfortably commit to a multi-region, multi-provider setup that’s more resilient against these outages.” DHH, signalvnoise.com
We think it’s important to note that part of Basecamp’s promise to do better involves pausing their migration to the cloud. While we think it is a very valid reaction to their situation, we also felt that there were potentially some blind spots in Basecamp’s monitoring, and that closing them could have helped limit this issue. Periods of transition are difficult to monitor, and it’s far easier to leave a gap while transitioning to a cloud environment than it is when monitoring an already established cloud infrastructure.
DHH reports several times in his play-by-play of events that his team was working to establish even intermittent access to their services while their storage provider tried to determine the root cause of the issue. At one point, the team even reports that while Basecamp is largely available, there are still errors coming through on file uploads. Unfortunately, there’s not a lot that would have saved Basecamp. Such an important part of their infrastructure was with a single provider that they didn’t have a lot of room to maneuver.
How can monitoring help prevent outages during a transition?
Despite Basecamp’s outage being largely unpreventable, we wanted to talk a bit about the role monitoring can play in the transition from one environment to another. With the continued movement from on-prem and bare metal IT infrastructure to cloud environments, we’re seeing a need for more extensive monitoring while companies are in transition.
One of the biggest takeaways from Basecamp’s outage is that you might need a more extensive and detailed monitoring architecture during a transition, so that you find failures before they turn into an outage rather than after one has already occurred.
Your monitoring system needs to offer you a broad set of tools, both to alert you to any issues in your infrastructure and to help you solve problems when they occur. A tool that can gather the necessary data during an incident and help systems admins diagnose problems, or even find a workaround, can be a game changer during a sustained outage. An automated remediation product (for example, Panopta’s CounterMeasures) can help your team work more efficiently when dealing with an issue.
Automated remediation is a new type of monitoring tool which can do anything from automatically rerouting traffic to gathering useful data for systems admins as they try to resolve issues. While we can’t say for sure that automated remediation would have helped in the Basecamp situation, since the outage originated with an external cloud provider, we wanted to note how it could potentially help resolve issues in general before they turn into sustained outages. In addition, gathering data for a systems admin before they even begin to look at a problem can make a world of difference. When you’re moving from one environment to another, resolving incidents quickly can make the transition smoother and put less stress on your team.
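As a very rough illustration of the idea, the sketch below shows what the two halves of automated remediation might look like in code: collect diagnostics the moment a check fails, then try a list of fixes in order. This is not how Panopta’s CounterMeasures or any other product is implemented. Every name here (Incident, handle_failure, the collectors and remediations) is hypothetical, and the collectors and fixes are stand-in functions rather than real system calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Incident:
    """Everything gathered for a systems admin about one failed check."""
    check_name: str
    diagnostics: Dict[str, str] = field(default_factory=dict)
    actions_taken: List[str] = field(default_factory=list)


def handle_failure(
    check_name: str,
    collectors: Dict[str, Callable[[], str]],
    remediations: List[Tuple[str, Callable[[], bool]]],
) -> Incident:
    """Gather data first, then try each remediation until one reports success."""
    incident = Incident(check_name)

    # Step 1: collect diagnostics before anything changes the system state.
    for name, collect in collectors.items():
        try:
            incident.diagnostics[name] = collect()
        except Exception as exc:  # a broken collector must not block the rest
            incident.diagnostics[name] = f"collector failed: {exc}"

    # Step 2: attempt remediations in order; stop at the first that succeeds.
    for name, remediate in remediations:
        if remediate():
            incident.actions_taken.append(name)
            break

    return incident


if __name__ == "__main__":
    # Stand-in collectors and fixes, for illustration only.
    incident = handle_failure(
        "file-uploads",
        collectors={"recent_errors": lambda: "503 from storage backend"},
        remediations=[
            ("restart_worker", lambda: False),
            ("reroute_to_secondary", lambda: True),
        ],
    )
    print(incident)
```

The ordering is the design choice worth noting: diagnostics are captured before any fix runs, so the admin sees the system as it was when the check failed, even if a later remediation changes its state.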
Having a robust monitoring system which offers a broad range of tools can keep you a step ahead in a catastrophe. As we mentioned previously, Basecamp has an incredibly stable and useful product overall. While they may have found themselves scrambling for even 99% uptime when they had previously been reaching for the coveted 99.999%, we think their frank and straightforward reporting gave great insight into both the issues that can occur when transitioning to a cloud environment and the kind of company they are.