Updated June 26, 2017
This post is part of our six-part series on DNS. The complete list is here: Part 1: DNS Basics, Part 2: DNS and Performance, Part 3: Common Problems and Solutions, Part 4: Best Practices for Setup, Part 5: Monitoring an Anycast Service, Part 6: The Importance of Highly Available DNS.
In our DNS Series we’ve covered the basics of DNS, why DNS is important to performance, common problems and solutions, and best practices for setup. For the last two bonus posts, we’re going deeper into two specific areas. Last week we went into detail on how to effectively monitor an Anycast service, and this week we’re wrapping up our DNS Series with the importance of highly available DNS. Thanks for reading!
Back in October of 2016, a large-scale attack disrupted much of the internet in North America, making tens of thousands of sites unavailable, including well-known sites like Twitter, GitHub, Spotify, and Reddit. This major outage was due to an attack on the DNS services provided by Dyn. The attack was a reminder of the fragility of the internet, of where its vulnerabilities are, and of the importance of highly available DNS. A lot of discussion focuses on building failover and redundancy into the compute infrastructure on which your applications run. However, a solid DNS infrastructure is often overlooked, which leaves the door open to these types of outages.
The emergence of public mega clouds like AWS, Azure, and Google Cloud has given users the tools to be largely immune to individual node failures. Beyond those sorts of micro failures, the next level of redundancy is typically to run your infrastructure in multiple geographically independent data centers, which allows for complete site failovers. Again, the mega clouds are huge enablers of this with their current (and expanding) global footprints.
While this is certainly helpful, there is more you can do on your own to protect yourself from these types of outages. A lack of consideration for DNS in your disaster recovery and high availability planning can be a crucial weakness. This weakness was made visible during the October 2016 attack on Dyn. The go-to move for most companies is “more cloud”, but all the cloud in the world can’t help if your DNS is not functional. We’d like to focus on why an attack on a single core provider should not have caused this type of widespread disruption.
DNS has inherent fault tolerance built in through its support for secondary authoritative DNS servers. The common approach is to list multiple nameservers, all within the same provider.
~ dev$ host -t ns twitter.com
twitter.com name server ns3.p34.dynect.net.
twitter.com name server ns4.p34.dynect.net.
twitter.com name server ns2.p34.dynect.net.
twitter.com name server ns1.p34.dynect.net.
Having multiple nameservers from a single provider protects you from single node failures, but does not save you from an outage of the provider itself. This lack of a highly available DNS setup is exactly what caused the downtime for a majority of the sites affected by the attack. A targeted DDoS attack taking out an entire provider is an extreme case, but an internal problem like a buggy release could result in the same sort of outage. The solution is to keep using the built-in secondary authoritative DNS support for your zones, but to also host your DNS with other, independent providers. This way, a provider-wide outage will not have a catastrophic effect.
Full disclosure: although we were not affected by the Dyn outage, we were still susceptible to this type of failure. We saw it as a wakeup call and took these additional measures ourselves. We’ve historically relied on IBM SoftLayer to manage our authoritative DNS, and their globally distributed Anycast DNS servers have proven to be quite reliable. To build an additional layer of protection, we replicated our DNS zone to Google’s authoritative DNS services, which gives us that highly available DNS setup.
~ dev$ host -t ns panopta.com
panopta.com name server ns2.softlayer.com.
panopta.com name server ns1.softlayer.com.
panopta.com name server ns-cloud-b1.googledomains.com.
panopta.com name server ns-cloud-b2.googledomains.com.
panopta.com name server ns-cloud-b3.googledomains.com.
panopta.com name server ns-cloud-b4.googledomains.com.
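If you want to check a delegation for single-provider dependence, a minimal sketch might look like the following. It takes a list of nameserver hostnames, such as the ones returned by host -t ns, and groups them by a rough provider key. The function names are ours for illustration, and the "last two labels" heuristic is an assumption: it works for names like ns1.p34.dynect.net, but it would misjudge suffixes such as .co.uk and cannot tell when two different domains belong to the same company.

```python
from collections import defaultdict


def provider_of(nameserver: str) -> str:
    """Return a rough provider key: the last two labels of the hostname.

    Assumption: this naive heuristic is good enough for names like
    ns1.p34.dynect.net; it is wrong for suffixes such as .co.uk.
    """
    labels = nameserver.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def group_by_provider(nameservers):
    """Map each provider key to the sorted nameservers that belong to it."""
    groups = defaultdict(list)
    for ns in nameservers:
        groups[provider_of(ns)].append(ns)
    return {provider: sorted(hosts) for provider, hosts in groups.items()}


def has_provider_redundancy(nameservers) -> bool:
    """True when the delegation spans at least two distinct provider keys."""
    return len(group_by_provider(nameservers)) >= 2


# Nameserver lists taken from the host output shown in this post.
single = ["ns1.p34.dynect.net.", "ns2.p34.dynect.net.",
          "ns3.p34.dynect.net.", "ns4.p34.dynect.net."]
mixed = ["ns1.softlayer.com.", "ns2.softlayer.com.",
         "ns-cloud-b1.googledomains.com.", "ns-cloud-b2.googledomains.com."]

print(has_provider_redundancy(single))  # False
print(has_provider_redundancy(mixed))   # True
```

A check like this could run as part of a deployment or audit job, so that a zone which quietly drifts back to a single provider gets flagged.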
Adding more DNS providers is the easy part; the larger challenge is ongoing management. Unless you can ensure strict discipline in applying every DNS change to all of your providers, you leave room for human error and for records that drift out of sync. A better solution is to automate the syncing across providers. We did this with the help of Libcloud.
Apache’s Libcloud project abstracts the various provider APIs so that DNS can be managed automatically across multiple providers. We wrote a Python script which reads our authoritative DNS records out of a JSON file hosted on a private server and uses Libcloud to sync those settings out to all providers. This allows us to seamlessly migrate to a new provider or incorporate more DNS providers in the future. Our ops team knows to make all DNS modifications in that JSON file. In addition, we’re using Panopta’s DNS checks to directly query and monitor each DNS provider node we rely on for a select set of critical DNS entries; this confirms, at a high level, that the syncing is operating as expected.
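To make the idea concrete, here is a minimal sketch of the sync logic. It is not the script described above. The JSON layout, the function names, and the MANAGED_TYPES set are assumptions made for illustration. The driver argument is expected to behave like a Libcloud DNS driver, using its list_records, create_record, update_record, and delete_record methods; not every Libcloud driver implements all of them. The sketch also assumes one record per name and type, and it leaves NS and SOA records alone, since those differ per provider.

```python
import json

# Assumption: only these record types are managed from the JSON file.
# NS and SOA are provider-specific and are deliberately left untouched.
MANAGED_TYPES = {"A", "AAAA", "CNAME", "MX", "TXT"}


def load_desired(json_text):
    """Parse the source-of-truth JSON into a {(name, type): data} mapping.

    Hypothetical layout:
    {"records": [{"name": "www", "type": "A", "data": "192.0.2.10"}]}
    """
    doc = json.loads(json_text)
    return {(r["name"], r["type"]): r["data"] for r in doc["records"]}


def plan_changes(desired, existing):
    """Diff two {(name, type): data} mappings.

    Returns three sorted lists of keys: to create, to update, to delete.
    """
    create = sorted(k for k in desired if k not in existing)
    update = sorted(k for k in desired
                    if k in existing and existing[k] != desired[k])
    delete = sorted(k for k in existing if k not in desired)
    return create, update, delete


def sync_zone(driver, zone, desired):
    """Bring one provider's zone in line with the desired records."""
    current = {(r.name, r.type): r
               for r in driver.list_records(zone)
               if r.type in MANAGED_TYPES}
    existing = {key: rec.data for key, rec in current.items()}
    create, update, delete = plan_changes(desired, existing)
    for name, rtype in create:
        driver.create_record(name, zone, rtype, desired[(name, rtype)])
    for key in update:
        rec = current[key]
        driver.update_record(rec, name=rec.name, type=rec.type,
                             data=desired[key])
    for key in delete:
        driver.delete_record(current[key])
    return create, update, delete


desired = load_desired(
    '{"records": ['
    '{"name": "www", "type": "A", "data": "192.0.2.10"},'
    '{"name": "mail", "type": "A", "data": "192.0.2.20"}]}'
)
existing = {("www", "A"): "192.0.2.99", ("old", "A"): "192.0.2.30"}
print(plan_changes(desired, existing))
# ([('mail', 'A')], [('www', 'A')], [('old', 'A')])
```

Running sync_zone once per provider, with that provider's driver and zone object, keeps the JSON file as the single place where changes are made. Keeping plan_changes separate from the driver calls also makes it easy to print the plan as a dry run before anything is modified.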
We feel more secure having this additional layer of redundancy built into our infrastructure, and we urge all of you to consider similar changes. We’ll be publishing the Libcloud script mentioned above to our public GitHub repository so you can implement a similar solution. We’ll be sure to post an update on Twitter and on this blog post when it’s available for you to download.