In the IT industry, infrastructure fails all the time. It’s a fact that everyone accepts, yet thousands of tech professionals won’t tell you anything about it. Usually, the details of failures are kept private, either never mentioned or sanitized into a generic root cause analysis (RCA) that gives only basic information. Very seldom do you get to see what really happened behind the scenes when things go wrong, which is truly unfortunate.
However, failure, as the old adage goes, is a great teacher. The battle scars and war stories of seasoned system administrators have built their character and established the skills to quickly assess and resolve problems as they come up. Watching problems arise and seeing how they’re dealt with is the best way for more junior staff to learn their trade. Unfortunately, the important problems aren’t textbook cases, so gaining real knowledge often takes real fires and good teachers.
One of our customers recently ran into a series of intermittent hardware problems, which led to a number of outages for their SaaS application over a period of 24 hours. They’ve agreed to allow us to describe the problems they ran into and the steps they took to resolve them, along with the lessons they took away from the event and their plans for improving their infrastructure.
While this experience gives some degree of Schadenfreude to over-stressed techies, it is also a learning opportunity that can help others avoid similar pain when Murphy’s Law visits them.
Our customer, who will remain nameless, runs a cutting-edge online application with clients spread across the United States. The company has been growing quickly since the public launch of their service, which has kept them on a strict agile development process that produces a new release roughly every month. The fast pace of deployments combined with a heavy focus on increasing functionality means continual changes to their technical infrastructure. These circumstances came to a head recently when they encountered problems with their production infrastructure.
The first outage occurred at 5:30 AM local time. Fortunately for them, they had invested the time to set up proper monitoring of all their production components. As soon as the main server went down, the systems administrator who was on call that week was woken by SMS alerts to investigate. Initially it seemed like a random server hang, so he rebooted the machine and grabbed a bit more sleep. When the second round of alerts started a couple of hours later, they went to the main operations team members as well as the head of customer service. Because their customer base is primarily in the US and would soon be online to use the system, pressure began to mount.
Due to complications in their database infrastructure, they did not have a failover site set up to shift customers to while the production server was diagnosed. They also had no good way to proactively notify customers, so once their users started to come online, their support team was flooded with calls asking about the problem. For a relatively small team based in one location, the constant chime of ringing phones only raised the overall stress level.
As their customer support team worked the phones, the operations team focused on the problem server with their hosting provider. The complexity of their virtualized environment left a number of possible scenarios to investigate.
One thing they had going for them was a regular, well-tested backup system, which had stored a snapshot of their main databases offsite earlier that morning. This would allow them to rebuild a new server if needed, but more importantly they were able to stage a copy of the database on another machine and extract some critical data that their key customers needed to operate. This is obviously not the same as having a fully operational application, but it is better than being completely dead in the water, and it allowed them to minimize the impact on their most important customers.
After a number of false leads, the operations team ultimately determined that all of the problems were caused by a flaky power supply whose voltage fluctuated intermittently, triggering reboots. Once their hosting provider swapped in a replacement, the issues no longer occurred.
Unfortunately, failures of servers and network devices are a fact of life. Depending on your budget and appetite for downtime, you can build more resilient infrastructure to minimize the impact of failures, but true 100% availability is rarely realistic. Instead, you can prepare for things to break and have pre-arranged processes in place to follow when they do, which limits the damage.
Based on this customer’s hardware problems, there are a number of lessons to be learned which can help you minimize the impact of outages with the right process.
- Have a failover site available with DNS configured with low TTLs. When you have a catastrophic problem with your primary infrastructure, you can update DNS and point visitors to a site that responds. Ideally this would be a warm version of your application/data. If that’s not possible, a simple static site will help inform customers of the issue and avoid a generic browser/connection error. This alone will head off countless phone calls.
- Establish emergency communication channels with customers in advance. If you’re going to use email or Twitter to let them know when there are problems, make sure customers know about this in advance – add it to your welcome information, help docs, set it as a footer on outgoing emails, etc. Then make sure to use it – post as soon as you know there are problems, and regularly follow up with status updates.
- Have a plan in place for proactively alerting key customers, the ones that you can least afford to lose or upset. Make sure you keep their contact information somewhere other than your production server, so you can still reach it when problems hit.
- Ensure that you have good off-site backups of your key application data, and that you have verified their integrity with regular test restores. Even if you don’t find yourself in a position of having to do a rebuild, they can come in handy by allowing you to access some critical data while dealing with the primary problem.
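As a rough illustration of the backup lesson, here is a minimal Python sketch that records a checksum for each backup file when the backup is taken and checks the files against that record later. It is not what our customer ran; the function names, the JSON manifest format, and the directory layout are all assumptions made for the example. A checksum match only shows that the files arrived intact, so it complements regular test restores rather than replacing them.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(backup_dir: Path, manifest: Path) -> None:
    """Record a checksum for every file in backup_dir (run at backup time).

    The manifest format (file name -> digest, as JSON) is hypothetical.
    """
    sums = {
        p.name: sha256_of(p)
        for p in sorted(backup_dir.iterdir())
        if p.is_file() and p != manifest
    }
    manifest.write_text(json.dumps(sums, indent=2))


def verify_backup(backup_dir: Path, manifest: Path) -> list[str]:
    """Return a list of problems; an empty list means every file matched."""
    expected = json.loads(manifest.read_text())
    problems = []
    for name, checksum in expected.items():
        candidate = backup_dir / name
        if not candidate.is_file():
            problems.append(f"missing: {name}")
        elif sha256_of(candidate) != checksum:
            problems.append(f"checksum mismatch: {name}")
    return problems
```

Calling `write_manifest` right after each backup and `verify_backup` on the off-site copy from a scheduled job would flag missing or corrupted files long before you need them in an emergency.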
In the end, one bad component resulted in a very long and stressful day for our customer and their users. After this was resolved, their operations team set up a static emergency maintenance site and began work on the changes needed to set up full database replication. This will give them a full warm standby environment, ready the next time Murphy’s Law strikes. Take this as an opportunity to learn from their experience and be ready for disaster.
Have your own horror story?
If you’ve been in the IT industry for a while, you can almost certainly empathize. What’s the worst situation you’ve been in? What are the most valuable lessons that came out of it? I invite you to send me your story, and we’ll share it (anonymously, of course) with our readers so everyone can learn from it and, hopefully, be spared some sleepless nights!