With this winter’s historic snow still piling up here in Chicago, our developers
have been stuck indoors delivering lots of new functionality for all of our
customers. We’ve got a wide range of improvements this time, hitting most of
our major systems so there should be something for everyone in this release. It
looks like spring is still several months away (at least), so there will
definitely be more to come. Keep an eye on our blog for details of
what's coming up.
For anyone who's using PagerDuty for unified notifications across different monitoring systems, or Uptime.ly for displaying application status pages, we've partnered with both companies and now support integrating with both systems.
We've also added the ability to control all-clear alerts. Normally, once an outage has been resolved, we notify everyone who was previously alerted to let them know. However, there are definitely cases where this extra alert is unneeded, such as when you're fixing something in the middle of the night and the rest of your team doesn't need to get a text to know that you've finished.
You can disable these alerts at the notification-schedule level, if you never want to receive them, or for individual outages. See our knowledge base for details on how to set this up.
Do you get tired of configuring the same checks each time you bring a new server online? Or even worse, worry that you've missed important checks after you've set up servers? Our new server template functionality is here to help with both of those. You can now set up templates for monitoring, then apply them to existing or newly created servers and have everything automatically configured for you. We've been beta testing this with our own internal monitoring for a while, and our ops team loves that they can quickly configure new servers with a single click.
Templates can be created from scratch, or you can base them on already-configured servers, then apply the template through the Add Server wizard or as part of an API call to create a new server. Please see our knowledge base for details on how to create server templates and how to apply them.
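As a rough sketch of what the API route looks like, here's how a create-server request referencing a template might be assembled. The endpoint path and the field names ("fqdn", "template_id") are illustrative assumptions, not the documented API; check the knowledge base for the real call.

```python
import json

def build_create_server_request(api_key, fqdn, template_id):
    """Assemble the URL, headers, and JSON body for a hypothetical
    create-server call that applies a monitoring template."""
    url = "https://api.panopta.com/v2/server"  # assumed endpoint
    headers = {
        "Authorization": "ApiKey %s" % api_key,  # assumed auth scheme
        "Content-Type": "application/json",
    }
    body = json.dumps({"fqdn": fqdn, "template_id": template_id})
    return url, headers, body

url, headers, body = build_create_server_request("MY-KEY", "web01.example.com", 42)
```

From there, any HTTP client can POST the body to the URL with those headers.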
In addition to the tree-based grouping structure that we’ve always had for managing servers, we’ve now added the ability to organize servers by tags.
Apply one or more tags to each of your servers to classify them by operating system, purpose or other dimensions. Then filter by tag when searching for servers or when viewing your outage history. We’ll continue to build on tagging in future releases with new bulk management functionality that will let you make configuration updates to groups of servers based on tags and drive email and public reports based on tags. To learn more about implementing server tags please see our knowledgebase.
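The filtering model described above is simple to reason about; here's a minimal sketch of it, independent of Panopta itself: each server carries a set of tags, and a filter keeps the servers matching every requested tag.

```python
# Example inventory; names and tags are made up for illustration.
servers = [
    {"name": "web01", "tags": {"linux", "webserver"}},
    {"name": "db01",  "tags": {"linux", "database"}},
    {"name": "win01", "tags": {"windows", "webserver"}},
]

def filter_by_tags(servers, *tags):
    """Return the names of servers whose tag set contains every requested tag."""
    wanted = set(tags)
    return [s["name"] for s in servers if wanted <= s["tags"]]

print(filter_by_tags(servers, "linux"))              # ['web01', 'db01']
print(filter_by_tags(servers, "linux", "webserver")) # ['web01']
```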
New Detailed Reports
A common request that our support team gets is for access to the raw check data for individual servers, either for the duration of an outage or an entire month. Until now, we've had to perform an export from our database to get this information, which could take a while to process. We've now fully automated the process, allowing you to generate detailed outage and check data anytime you'd like.
You’ll now see Export Data buttons at various points throughout the control panel that submit export requests. Because some of these can generate a substantial amount of data, these requests are queued and processed asynchronously. Once completed, you’ll receive an email with the export file, and can find the exports in the Reporting section of the control panel for future reference.
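The submit-then-process-later workflow can be sketched as a toy in-memory queue. This is not Panopta's implementation (results there arrive by email and in the Reporting section); it just illustrates the queued, asynchronous pattern.

```python
import uuid

class ExportQueue:
    """Toy model of queued export requests processed asynchronously."""

    def __init__(self):
        self.jobs = {}

    def submit(self, server_id, month):
        # The caller gets a job id back immediately; the work happens later.
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = {"status": "queued", "server_id": server_id,
                             "month": month, "result": None}
        return job_id

    def process_next(self):
        # Worker side: pick a queued job and "generate" its export file.
        for job in self.jobs.values():
            if job["status"] == "queued":
                job["result"] = "export-%s-%s.csv" % (job["server_id"], job["month"])
                job["status"] = "done"
                return

    def status(self, job_id):
        return self.jobs[job_id]["status"]

q = ExportQueue()
jid = q.submit(17, "2014-01")
q.process_next()        # in production this runs asynchronously
print(q.status(jid))    # 'done'
```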
Reseller Management UIs
We haven't forgotten about our co-branded and white-label resellers who offer monitoring services to their customers. You can now view summary information and details on all of the accounts that you have set up, with support for suspending, upgrading and closing accounts as needed. You can also jump directly to individual customers' accounts to manage their configuration and outage responses for them.
Try out the new interfaces and let us know what else would help make the reseller experience smoother and more productive.
In addition to a number of minor UI enhancements and bug fixes, we've also added API support for everything described above. If you haven't yet tried out our API, it lets you make any configuration change you can make in the control panel through a REST-based API. To try out the API, go to Settings > API Key Management in the control panel, where you can set up an API key and try out our complete web-based API explorer.
Finally, we’ve added a bit of (useful) eye candy – you’ll now see a badge in your browser’s tab bar showing the number of active outages on your account. If you regularly keep the control panel open in a tab, this is a great way to instantly see if you have any outages, even before you get alerts sent to your phone.
American politics is always a hectic affair, and the rollout of Healthcare.gov for Americans everywhere has been a bumpy path. In response, we would like to release some facts about the response time and availability of the Healthcare.gov website for bloggers and journalists to use as a resource in their own coverage. Using our own Panopta server monitoring system, we set up network checks on the Affordable Care Act's Healthcare.gov website and found it was available for use by the American public only 86% of the time during the month of November!
That 86% availability is, by the standards of any online industry, abysmal. Now, it is understood that the rollout of healthcare.gov was "fumbled", but how and where was it fumbled? We checked the healthcare.gov servers every minute, testing different aspects of the public-facing infrastructure, including authoritative DNS, HTTP availability and content checks.
The HTTP content checks were set up to see whether Americans could sign up for insurance through the login page, which we found was available only 86% of the time. We checked https://www.healthcare.gov/marketplace/global/en_US/registration looking for the phrase "the system is down at the moment." If that text was present on the page (see image below), we registered it as time when the site was unavailable for the public to sign up for insurance.
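The core of that check boils down to a substring test on the fetched page, which can be sketched in a few lines (the fetching itself is omitted; this just shows the up/down decision):

```python
# If the outage phrase appears in the page body, that minute counts as down.
DOWN_PHRASE = "the system is down at the moment"

def site_is_up(page_html):
    """True if the outage phrase is absent from the page body."""
    return DOWN_PHRASE not in page_html.lower()

print(site_is_up("<h1>Welcome! Please log in.</h1>"))          # True
print(site_is_up("<p>The System is down at the moment.</p>"))  # False
```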
This unavailability totaled over 4 days and 6 hours across 54 outages during the month of November. This content unavailability made up the bulk of the downtime for healthcare.gov, but other aspects of their network infrastructure had outages as well. It is important to note that some of those outages may have been planned maintenance on the healthcare website, which does impact users but is sometimes the only good way of making improvements.
Healthcare.gov suffered over 6 hours of DNS outages. However, Healthcare.gov has a robust DNS infrastructure on separate coasts with multiple backups, so it is possible (because at no point were all of the DNS servers down) that users across the nation were not all impacted by these isolated DNS outages. Though it is not ideal to lose a DNS server, Healthcare.gov's expansive DNS infrastructure keeps it available to users and can help it withstand DDoS attacks.
A quick reminder: DNS is the domain name system that allows websites to be found using a name like "healthcare.gov" instead of a set of numbers (an IP address) like "18.104.22.168". DNS is a critical aspect of any website's infrastructure because it allows users to find your website easily with a domain name. Having an expansive set of backup DNS servers is part of our recommendations for maintaining good DNS-related uptime.
The combined DNS and HTTP outages totaled 4 days and 11 hours of downtime for healthcare.gov. This startlingly bad start means there is definitely room to improve.
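The availability figures above can be sanity-checked from the downtime totals with some quick arithmetic, assuming a 720-hour (30-day) November:

```python
HOURS_IN_NOVEMBER = 30 * 24          # 720 hours in the month

http_downtime_h = 4 * 24 + 6         # "4 days and 6 hours" of content outages = 102 h
total_downtime_h = 4 * 24 + 11       # "4 days and 11 hours" combined = 107 h

http_availability = 100 * (1 - http_downtime_h / HOURS_IN_NOVEMBER)
total_availability = 100 * (1 - total_downtime_h / HOURS_IN_NOVEMBER)

print(round(http_availability, 1))   # 85.8, roughly the 86% quoted above
print(round(total_availability, 1))  # 85.1
```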
Those improvements will probably include changes like the page below, which has been added but still represents time when Americans cannot sign up for insurance. We have added HTTP content checks that will watch for outages when this page appears.
We will continue to keep our eye out for improvements and outages on the healthcare.gov website, and we will release another summary of its December performance to see whether those improvements benefit the American public, the end user. For the most up-to-date facts, follow us on Twitter or send us an email at firstname.lastname@example.org.
The holidays are now officially in full swing. From the wreath on your neighbor's door to the lights on the trees on Main Street, it's time for another update from the Panopta Holiday Index.
Last year, we published our first e-commerce whitepaper on the availability of 132 websites of major retailers. In that whitepaper, we found that nearly 35% of the sites could not meet the mandatory minimum, as established by hosting companies, of 99.9% website availability leading up to the sales season. During the main holiday shopping season, November 15th to January 4th, we set up the Panopta Holiday Index for executives, retailers, and shoppers as a public resource. During that holiday season, over a fifth of domains failed to meet the 99.9% availability standard, which has given us continued reason to keep examining and publishing our data.
It is with great pleasure that we present our second annual whitepaper on the status and outlook for major retailers' websites this upcoming Black Friday, Cyber Monday, and the rest of the holiday shopping season. To receive a copy of the whitepaper, just add your email address to the form below and it'll be waiting in your inbox shortly!
And stay tuned to the Panopta Holiday Index to see how your favorite retailers do this season.
If you have been keeping up with the news, you'll know that anyone trying to buy health insurance on the new health insurance exchange gets a message like this:
One of our employees, Gareth, decided, in light of all of the news, to monitor healthcare.gov. Here are the steps he is taking to figure out exactly when healthcare.gov is going to go live again so that anybody can buy insurance. Follow us on Twitter, because we'll tweet exactly when we can buy insurance online again. This breaking news is brought to you by Panopta.
Below are the details for setting up content checks on your favorite webpages, which Gareth did in 5 minutes. We love this new way of using Panopta. It definitely wasn't designed for this, but we love what people come up with to monitor.
First, we set up the server (using our bootstrap application wizard), making sure to set up HTTP checks.
Next we modified our HTTP fullpage requests to take in HTTP options…
Next, we added in that URI from above and told Panopta that the string "The System is down at the moment" must not be present for the site to count as available.
Then we tested it to make sure it worked.
In the end, we threw in some DNS monitoring to make sure that everyone in America could find the webpage at all. You can find more about that here. Now the Panopta alert system will ping Gareth and me directly whenever anything else goes down, and will email me when there's an all-clear. I'll show you in Part II how to set up alerts for healthcare.gov.
In fact, another employee, Taylor, is monitoring the arrival date of Google’s Nexus 5. More on that later.
We're excited to announce a large number of enhancements to Panopta as part of our v3.11 release! This release includes a fair amount of backend improvements, which bring an even more robust and reliable monitoring experience. In addition to those backend improvements, here are some of the user-facing enhancements you can benefit from:
Sign in with Twitter
Now you can sign into Panopta using your Twitter login in addition to our normal username/password authentication. Eliminate the need for another login you have to remember by linking your Twitter account to your Panopta login! To set this up, go to Settings | My Account in the control panel and click the Connect Your Twitter Account button. This will walk you through Twitter's approval flow, and afterwards you can always log in with the button on the login page.
While you’re at it, make sure to follow @Panopta to catch all of our latest news and announcements!
Enhancements to maintenance schedules
Many Panopta customers use the maintenance schedule functionality extensively. Maintenance schedules allow you to configure Panopta with periods of time when your systems/servers will be down intentionally and you don't want alerts to be sent. To make this even easier to use, we've added the ability to search through archived maintenance schedules, along with the ability to copy an archived schedule to create a brand-new one.
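The core question a maintenance schedule answers is simple: does a given timestamp fall inside a scheduled window? When it does, alerts are suppressed. A minimal sketch of that logic:

```python
from datetime import datetime

def in_maintenance(now, windows):
    """True if `now` falls inside any (start, end) maintenance window."""
    return any(start <= now < end for start, end in windows)

# Hypothetical window: 2:00-4:00 AM on Sept 21 for a planned upgrade.
windows = [
    (datetime(2013, 9, 21, 2, 0), datetime(2013, 9, 21, 4, 0)),
]

print(in_maintenance(datetime(2013, 9, 21, 3, 0), windows))  # True, suppress alert
print(in_maintenance(datetime(2013, 9, 21, 5, 0), windows))  # False, alert as usual
```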
In addition to this, maintenance schedules now apply to checks performed by the monitoring Agent as well.
New REST API
Earlier this month, we announced the release of our new REST API and we’re already seeing some interesting uses. You can read about it in more detail here: http://www.panopta.com/2013/09/13/brand-new-panopta-api/
Monitoring node white/black listing
Our global monitoring network allows you to perform remote checks on your servers from around the world. However, there are times where it is helpful to restrict checks to only a subset of locations, usually for services that have more stringent security policies in place where connections can only be made from certain source IP addresses. Or you may want to exclude a single location from performing checks, such as when your server is hosted in the same datacenter as one of our monitoring nodes.
To accommodate this, we’ve implemented a way for you to explicitly white/black list any of our external monitoring nodes to be used for confirmation checks. Since we have over 30 monitoring nodes (and growing), this will allow you to limit the number of IP addresses which you have to add to your firewalls. Full white/blacklist controls are included in our Growth package and above, and can be found in the Monitoring Location tab in the Edit Server window.
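The resolution rule can be sketched simply: if a whitelist is set, only those nodes are eligible; otherwise every node not on the blacklist is. (Node names here are made up for illustration.)

```python
ALL_NODES = {"chicago", "london", "tokyo", "sydney", "dallas"}

def eligible_nodes(all_nodes, whitelist=None, blacklist=None):
    """Resolve which monitoring nodes may run confirmation checks."""
    if whitelist:
        # A whitelist, when present, wins outright.
        return set(whitelist) & all_nodes
    # Otherwise exclude only the blacklisted locations.
    return all_nodes - set(blacklist or ())

print(sorted(eligible_nodes(ALL_NODES, whitelist={"chicago", "london"})))
# ['chicago', 'london']
print(sorted(eligible_nodes(ALL_NODES, blacklist={"dallas"})))
# ['chicago', 'london', 'sydney', 'tokyo']
```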
Even more fine grained checks and controls added
In addition to the basic ICMP ping check, which looks for any response, we've added support for detecting and alerting on packet loss as well. Now you can be alerted if packets are being dropped along the way, which can be one cause of higher latency to your servers.
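Packet loss from a round of probes is just the fraction that got no reply, compared against an alert threshold; a quick sketch (the threshold here is an arbitrary example, not Panopta's default):

```python
def packet_loss_pct(replies):
    """Percentage of lost probes; `replies` is one boolean per probe sent."""
    lost = replies.count(False)
    return 100.0 * lost / len(replies)

# Ten probes, two of which got no reply.
probes = [True, True, False, True, False, True, True, True, True, True]
loss = packet_loss_pct(probes)
print(loss)       # 20.0
print(loss > 10)  # True, so a 10% threshold would trigger an alert
```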
In addition to checking that an A record from your DNS server resolves to a specific IP address, we've also added the ability to check the value of your MX records. Soon we'll add the ability to check CNAME records as well.
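At its core, a record-value check compares what the server returned against what you expect, treating order as irrelevant; a sketch that works for A and MX lookups alike (the hostnames and IPs are illustrative):

```python
def records_match(expected, actual):
    """True if the set of returned records equals the expected set."""
    return set(expected) == set(actual)

# A record: one expected IP.
print(records_match(["203.0.113.10"], ["203.0.113.10"]))  # True

# MX records: order doesn't matter, values do.
print(records_match(["10 mx1.example.com", "20 mx2.example.com"],
                    ["20 mx2.example.com", "10 mx1.example.com"]))  # True
print(records_match(["10 mx1.example.com"],
                    ["10 mx-old.example.com"]))  # False
```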
With the latest release, you’re now able to view the detailed check results (node location, date/time, response time) for response time outages as well.
Lastly, we’ve added the ability to Pause any currently configured checks. This allows you to stop the checks from occurring in case you need to diagnose an issue and want to avoid any noise in your logs from Panopta!
That's a small taste of what v3.11 brings you. Log in and take advantage of the new functionality. A large amount of our feature development is driven by customer requests, so we'd love to hear your feedback and any other suggestions; please add your thoughts in the comments below or email email@example.com.
In the IT industry, infrastructure fails all the time. It's a known fact that everyone accepts, and yet there are thousands of tech professionals who won't tell you anything about it. Usually, the details of failures are kept private, either never mentioned or sanitized into a generic root cause analysis (RCA) that gives only basic information. Very seldom do you get to see what really happened behind the scenes when things go wrong, which is truly unfortunate.
However, these failures, as the old adage goes, are a great teacher. The battle scars and war stories of seasoned system administrators have built their character and established the skill to quickly assess and resolve problems as they come up. Seeing problems arise and how they're dealt with is the best way for more junior staff to learn their trade. Unfortunately, the important problems aren't in textbooks, and gaining real knowledge often requires real fires and good teachers.
One of our customers recently ran into a series of intermittent hardware problems, which led to a number of outages for their SaaS application over a period of 24 hours. They’ve agreed to allow us to describe the problems they ran into and the steps they took to resolve them, along with the lessons they took away from the event and their plans for improving their infrastructure.
While this experience may give some degree of schadenfreude to over-stressed techies, it is also a learning opportunity that can help others avoid similar pain when Murphy's Law visits them.
Our customer, who will remain nameless, runs a cutting-edge online application with clients spread across the United States. The company has been growing quickly since the public launch of their service, which has kept them on a strict agile development process that produces a new release roughly every month. The fast pace of deployments combined with a heavy focus on increasing functionality means continual changes to their technical infrastructure. These circumstances came to a head recently when they encountered problems with their production infrastructure.
The first outage occurred at 5:30 AM local time. Fortunately for them, they had invested the time to set up proper monitoring of all their production components. As soon as the main server went down, their systems administrator who was on call that week was woken by SMS alerts to investigate. Initially it seemed like a random server hang, so the machine was rebooted and he grabbed a bit more sleep. When the second round of alerts started a couple of hours later, the alerts went to the main operations team members as well as the head of customer service. Because their customer base is primarily in the US and would soon be online to use the system, pressure began to mount.
Due to complications in their database infrastructure, they did not have a failover site set up to shift customers to while the production server was diagnosed. In addition, they did not have a good way to proactively notify customers. So once their users started to come online, their support team started getting flooded with calls asking about the problem. For a relatively small team based in one location, the constant chime of ringing phones only raised the overall stress level.
As their customer support team worked the phones, the operations team focused on the problem server with their hosting provider. The complexity of their virtualized environment left a number of possible scenarios to investigate.
One thing they had going for them was a regular, well-tested backup system which had stored a snapshot of their main databases offsite earlier that morning. This would allow them to rebuild a new server if needed, but more importantly they were able to stage a copy of the database on another machine and extract some critical data that their key customers needed to operate. This is obviously not the same as having a fully operational application, but is better than being completely dead in the water and allowed them to minimize the impact on their most important customers.
After a number of false leads, the operations team ultimately determined that all of the problems were caused by a flaky power supply that was intermittently fluctuating voltage levels and triggering reboots. Once their hosting provider swapped in a replacement, the issues no longer occurred.
Unfortunately, failures of servers and network devices are a fact of life. Depending on your budget and appetite for downtime, you can build more resilient infrastructure to minimize the impact of failures, but true 100% availability is rarely realistic. Instead, you can prepare for things to break and have pre-arranged processes in place to follow when things go bad, which can minimize the impact.
Based on this customer’s hardware problems, there are a number of lessons to be learned which can help you minimize the impact of outages with the right process.
- Have a failover site available with DNS configured with low TTLs. When you have a catastrophic problem with your primary infrastructure, you can update DNS and point visitors to a site that responds. Ideally this would be a warm version of your application/data. If that’s not possible, a simple static site will help inform customers of the issue and avoid a generic browser/connection error. This alone will head off countless phone calls.
- Establish emergency communication channels with customers in advance. If you’re going to use email or Twitter to let them know when there are problems, make sure customers know about this in advance – add it to your welcome information, help docs, set it as a footer on outgoing emails, etc. Then make sure to use it – post as soon as you know there are problems, and regularly follow up with status updates.
- Have a plan in place for proactively alerting key customers, the ones that you can least afford to lose or upset. Make sure you have their contact information someplace other than in your production server where you can reach it in the event of problems.
- Ensure that you have good off-site backups of your key application data, and that you have verified its integrity with regular test restores. Even if you don't find yourself in a position of having to do a rebuild, it can come in handy by allowing you to access some critical data while dealing with the primary problem.
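One cheap piece of the backup-verification advice above can be sketched in code: record a checksum when the backup is taken and compare it after transfer. (A real test restore into a scratch database is still the gold standard; this only catches corruption in transit or at rest.)

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Hex digest used as a fingerprint of the backup contents."""
    return hashlib.sha256(data).hexdigest()

# At backup time: hash the dump and store the digest alongside the offsite copy.
backup = b"-- pg_dump output would go here --"
recorded = sha256_of(backup)

# Later: fetch the offsite copy and confirm it matches before trusting it.
retrieved = backup
print(sha256_of(retrieved) == recorded)  # True means the copy is intact
```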
In the end, one bad component resulted in a very long and stressful day for our customer and their users. After this was resolved, their operations team set up a static emergency maintenance site and began work on the changes needed to set up full database replication. This will allow them to have a full warm standby environment ready the next time Murphy's Law strikes. Take this as an opportunity to learn from their experience and be ready for disaster.
Have your own horror story?
If you've been in the IT industry for a while, you can almost certainly empathize. What's the worst situation you've been in? What are the most valuable lessons that came out of it? I invite you to send me your story, and we'll share them (anonymously, of course) with our readers so everyone can learn from them, and hopefully save everyone some sleepless nights!
We’re extremely excited to announce the launch of version 2 of the Panopta REST API! The V2 API exposes all of the functionality which is available to you through our control panel site and has been in beta use for several weeks now. We’ve now released this to all of our customers and have made it available on all of our plans.
The Panopta API is a great way to enhance your monitoring experience even further by pulling critical uptime and outage data into external apps for a powerful mashup. Acknowledge an outage, or escalate it to advance the notification schedule to its next notification step. Or use the API to automate adding or removing servers, checks and users in Panopta when something changes on your end, for a fully integrated experience. The possibilities go on. We're happy to help you think of other ways to integrate as well; just reach out to our first-class support team.
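To make the acknowledge-an-outage use case concrete, here's a rough sketch of assembling such a request. The endpoint path, auth scheme, and field names are illustrative assumptions, not the documented v2 API; use the API explorer to find the real operation and parameters.

```python
import json

def build_ack_request(api_key, outage_id):
    """Assemble a hypothetical acknowledge-outage request."""
    url = "https://api.panopta.com/v2/outage/%d/acknowledge" % outage_id  # assumed path
    headers = {
        "Authorization": "ApiKey %s" % api_key,  # assumed auth scheme
        "Content-Type": "application/json",
    }
    body = json.dumps({"acknowledged_by": "ops-oncall"})  # assumed field
    return url, headers, body

url, headers, body = build_ack_request("MY-KEY", 12345)
print(url)  # https://api.panopta.com/v2/outage/12345/acknowledge
```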
To begin using the API, simply log in to your control panel and access the API key management page under the settings menu. You can set up API keys for different purposes and designate them as read-only or write access. Once you've set up an API key, you'll be able to immediately see it in action using our API explorer tool.
A powerful API is only as useful as the documentation showing how to use it. By inputting your API key, you can immediately see all the API endpoints available to you and the operations/parameters they take. See below:
Once you've selected an API endpoint to explore, you're presented with an easy-to-use interface to input any of the optional parameters, each with an explanation. Submit the form and immediately see the request URL, response code, body and headers! Not a single line of code is needed, and you're able to explore the power of the API.
You can access the API explorer here or just visit the API key section in your control panel. Developers will be happy to hear that every new feature added to Panopta will be represented in the API as well.
Have an idea for additional APIs which should be added? Or want to show us how you've used the Panopta API to enhance your monitoring experience? Email our dev team or send us a note on Twitter (@panopta) and tell us all about it. You'll make our day!
Last but not least, we want to sincerely thank the developers of Swagger for building a framework which enabled us to bring such a powerful tool to all our valued customers.
We're excited to announce that we've reworked our monitoring packages from the ground up! We think you'll find the new offering to be extremely competitive while still maintaining Panopta's dedication to a high level of quality and customer service.
The screenshot below shows just a glimpse of our new lineup of packages and how they scale with your growing business.
So what makes these new plans so special? Here are a few reasons why this change in our plans benefits our customers in a big way:
Packages that scale with your business – We put a lot of thought into the various stages a company goes through as it progresses. As you grow, your monitoring requirements change and we’ve done our best to capture that with our new package lineup! We’ve made it easier to find the right “fit” for your company with comfortable transitions between plans so that you’re getting what you need for a price that fits into your budget.
More bang for your buck – We’re including even more checks and features in each plan to provide great value with the same enterprise quality you’ve come to expect from Panopta. One noteworthy change is we’ve removed SMS and voice alert limits across the board. Now, all plans come with unlimited SMS/Voice alerts!
Monitoring Agent – We’ve separated agent resource checks from network service checks. We believe monitoring with the agent provides you with invaluable insight when assessing how your infrastructure is performing. Separating the agent checks out allows you to monitor both without having to cut back because of limitations.
That, along with a newly streamlined signup and setup process, makes getting set up with Panopta easy and painless.
New to Panopta? Sign up now and give it a try for 30 days at no cost! We're confident that you'll be satisfied with what Panopta has to offer, and our first-class support team is here to help you along the way.
Already a Panopta customer? Don't worry: we'll be migrating all of our existing customers onto these new and improved plans so that you benefit from the same savings as all of our new customers. We'll be sending out communication about this and processing plan changes in batches of customers at a time. If you have any questions or need assistance before then, feel free to let our billing team know and we'll do our best to assist.
Network Solutions has become both villain and victim in the hosting world, with long outages and intermittent service over the past few weeks caused by several DDoS (Distributed Denial of Service) attacks that have compromised its hosting and DNS services. These intentional attacks against Network Solutions are, obviously, not their fault, but their critical server issues do trickle down and affect us all.
From the Network Solutions customer, to the customer of a Network Solutions customer, to the general users of the internet, these attacks compromise not just the discrete products Network Solutions sells but many of our interactions on the web.
Compounding the technical problems that Network Solutions encountered was their tepid non-response to the attack. In the wake of their most recent outages this week, there were no reports or news distributed on the Network Solutions homepage, via Twitter, or through any other source directly confirming the outages until service had been restored. This is problematic because Network Solutions' customers, unless they have a very refined monitoring infrastructure in place, have no way of confirming and confronting outage problems.
For example, a Network Solutions customer could simultaneously have outages on their non-Network Solutions servers caused by internal issues and also be inaccessible to users at large because of the DDoS attack affecting their Network Solutions-hosted authoritative name servers. Or that customer could have no issue with their hosting servers at all. The real problem lies in Network Solutions' handling of the outage and its failure to give clear feedback to customers, so that those customers could address real problems instead of a murky grey area of possible issues.
This is a fundamental flaw for anyone dealing with network and server outages. Panopta firmly believes that contact and communication are the most important cornerstones of dealing with outages, and acknowledging outages to yourself and to the public as a whole is the first step.
The internet is built on a paradigm of open knowledge and open trust. We trust that the appropriate knowledge and information will be openly accessible, allowing us to trust the companies we buy from online, especially because much of the time there are no face-to-face interactions to establish traditional trust. A solution is no longer valuable when trust is compromised and information is not shared between consumer and provider. In the end, if we as internet companies behave like Network Solutions, we will end up like Verizon, AT&T, and Congress: providing an invaluable resource that is resented by all of our users.
When Network Solutions fails to meet this paradigm, whatever genuine solutions it delivers (improved network infrastructure, advanced programming, heroic troubleshooting) fall by the wayside. It is like throwing those man-hours and that intelligence away.
These DDoS attacks create a cascading ripple effect that even impacts those who aren't Network Solutions clients or clients of their clients. Because the internet is an ecosystem and an economy, when the servers holding widely used web-based tools and information went offline, inefficiencies were created and necessary work could not be done, impeding the flow of online services down to the real world. Read: the internet is not a game anymore, and these attacks affect the real economy of people, objects, and interactions of commerce.
Look at it like a series of concentric circles, or a bullseye. At the center is the Network Solutions outage, which by our estimates affected around 125,000 of their customers. If each of those customers (from small personal websites to large social networking companies) averages around 1,000 customers of their own, then at the very least 125,000,000 people were impacted at the edge of the bullseye. That impact varies from case to case.
The impact at the edge may be quite benign for a small personal page, but for something like an e-commerce company that doesn't itself use Network Solutions, being at the very edge of the bullseye can have some awful consequences. Imagine this e-commerce company relies on an outside billing application service, which does use Network Solutions for its DNS, to store and process credit card billing. The e-commerce company thinks that its sales are running through smoothly, but in fact outages in the domain name system have hampered its revenue because its customers cannot connect to those billing servers. Additionally, the e-commerce site owners may not be able to spot the problem themselves, because the billing service's IP address is cached in their own browsers. The e-commerce company probably didn't know who was hosting the authoritative name servers for their billing company in the first place, and what they didn't know came back to hurt them.
You can learn from these attacks, and from Network Solutions' response, to better prepare yourself for similar situations. As a first step in preparing for DNS attacks, make sure you have proper monitoring of your DNS infrastructure in place, so you know as soon as there are any problems that can impact your customers and visitors. Then work out your response plan, and be prepared to post updates to your website, Twitter, or other relevant communication channels.
The trickle-down damage caused by DDoS attacks has a flip side, however: accountability and responsibility can trickle up and make a name for your brand. By taking responsibility and informing your clients, you build a reputation for openness and reliability that can benefit your brand going forward.
What are the biggest threats that you've had to deal with, and what types of responses have worked best? Share your experiences in the comments below, and follow us for future updates on hosting options and server infrastructure news and analysis.
Over the past week, DNS-related outages have impacted a number of sites. LinkedIn, the third largest social network in the world, suffered several hours of downtime during which visitors to their site were instead directed to a domain sales page. After investigation, they determined that this was due to a domain hijacking incident at their DNS provider, Network Solutions. Network Solutions claims "a small number" of its clients were affected, including LinkedIn; Cisco offers contrary evidence that up to 5,000 sites were affected by the Network Solutions outage, including USPS.com, Subaru, Mazda USA, US Airways, Craigslist, and Weather.com.
Following the Network Solutions problem, hosting providers Zerigo and GoDaddy have had extended DNS outages from what appear to be DDoS (Distributed Denial of Service) attacks, taking many of the websites that they host and manage offline.
DNS is a key component of how the internet works, but it normally plays a behind-the-scenes role and doesn't get much attention. However, as the last week has shown, problems with it can be serious for online businesses. What should you know about DNS, and how can you minimize the impact of DNS problems on your site?
Today, we will start by stepping back and explaining the functional parts of the internet, and how your webpage, web applications, and everything else you've built fit into the Domain Name System.
What is the internet?
“[The internet] is a series of tubes.” – Alaskan Senator Ted Stevens in 2006
This famous quote from Senator Ted Stevens has been memed and mocked by the media as a sign of how disconnected lawmakers are from the way things actually work. But, honestly, do you know how the internet, at its most macroscopic level, is organized? Who or what organizes these so-called "tubes"?
Today, we will look at the basic cornerstone of the internet: the Domain Name System (DNS for short). It underlies almost every single interaction you have had with the internet this morning, yesterday, and likely your entire web presence. And, as was the case with Brookstone (noted below), it can mean your website is offline while your own servers hum along perfectly.
Fundamentals of the internet
The fundamental organizing structure of the internet is the Internet Protocol (IP) address. An IP address is a label for any device connected to any type of network, whether it is a printer, computer, smartphone, or server. For simplicity's sake, we will be talking about IP addresses as they relate to the public internet. The IP address's main function is to give a label and identity to servers and end-user computers. Each IPv4 address is a unique number between 0 and 4,294,967,295, but for readability they are written as four separate numbers joined by dots, like 22.214.171.124.
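Since the dotted form is just a human-friendly rendering of that single 32-bit number, converting between the two is simple bit arithmetic. Here is a minimal sketch in Python (the function names are our own, for illustration):

```python
def ip_to_int(dotted):
    """Convert a dotted-quad IPv4 address to its underlying 32-bit integer."""
    parts = [int(p) for p in dotted.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError("not a valid IPv4 address: %r" % dotted)
    # Each octet is one byte of the 32-bit value, most significant first.
    return (parts[0] << 24) | (parts[1] << 16) | (parts[2] << 8) | parts[3]


def int_to_ip(value):
    """Convert a 32-bit integer back to dotted-quad notation."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))
```

Running `int_to_ip(ip_to_int("22.214.171.124"))` round-trips back to the same address, which shows the two notations carry identical information.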
But as you know from experience, you don't often type digits into your web browser to bring up a webpage. Instead, you use a domain name like www.panopta.com.
These domain names are paired with IP addresses so that we, as human beings, can remember destinations simply by name. Web browsers then "translate" these domain names back into IP addresses.
This "translation" process, however, is not a decryption of numbers into letters; rather, it is a lookup of the correspondence between domain names and IP addresses, carried out across a series of servers.
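You can see this lookup from any programming language. In Python, for instance, the standard library's `socket` module hands the question to the operating system's resolver, which in turn walks the DNS:

```python
import socket


def resolve(hostname):
    """Ask the operating system's resolver (which consults DNS)
    for an IPv4 address corresponding to a hostname."""
    return socket.gethostbyname(hostname)


# A browser performs the same kind of lookup before it can open a
# connection: it never connects to "www.panopta.com" by name, only
# to whichever IP address the lookup returns.
```

Try `resolve("www.panopta.com")` from two different networks and you may even get different answers, a hint at the caching layers described below.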
Roots and Authorities
From a high-level overview, this correspondence begins with the Internet Corporation for Assigned Names and Numbers (ICANN) and their root name servers.
The root name servers are a series of servers located across the world, redundantly holding the starting point for every registered domain on the internet. They are the heart and origin point of the DNS, organized into thirteen lettered groupings, A through M. Those lettered groupings outsource the work of knowing the exact IP address to registered authoritative name servers (in practice by pointing first to the name servers for each top-level domain, such as .com, which in turn point to the authoritative servers).
Authoritative name servers are registered with ICANN as the certified source that the root name servers must use to find an IP address for a specific domain, but they are maintained by private individuals and companies outside of ICANN.
An authoritative name server is a service that you pay for, which hosts the pairing of your domain name and its current IP address. Because of the simplicity of the type of information stored on an authoritative name server, many IP address and domain name pairings are stored together on a single authoritative name server.
The authoritative name server is the pivotal part of the DNS that gives authoritative answers (the correct IP address) to queries from any browser in the world searching for your domain. These answers are keyed by Fully Qualified Domain Names (FQDNs). An FQDN is a domain name that is exact, and understands the distinction between www.panopta.com and my.panopta.com. Only authoritative name servers know all of the possible FQDNs for a domain; the root name servers only know the registered base domain name, like panopta.com.
As a testament to their importance, these authoritative name servers always have several redundant copies containing the same information, so that if one of the name servers goes down there are others to back it up. This redundancy also helps ensure that the DNS infrastructure as a whole does not collapse, because each of these authoritative name servers is a key rally point for internet users.
This system of root name servers and authoritative name servers, however, is not how all of your internet interactions occur. Instead, your interactions are generally handled much more efficiently by a caching server.
What is caching?
A caching server stores a small dictionary (think: an abridged dictionary) of IP address and domain name combinations, whose contents are determined by local usage and frequency. For example, Twitter is banned in China, where micro-blogging needs are instead filled by a service called Sina Weibo. Sina Weibo, though it has over 368 million users worldwide, will likely not be found on a caching server used by web users in Honduras.
DNS caching servers work as localized servers hosting a small but efficient, curated compendium of the most important FQDNs for their region. They exist so that you, the end-user, can access information with minimal lag time by avoiding round-trip requests to root and authoritative name servers.
These caching servers are maintained by a variety of different groups, including Google, local internet providers, and hosting providers. They are operated to reduce lag time for users, based on the general assumption that people in a given physical area are going to browse similar web pages.
Another critical element of DNS caching servers is that they constantly update themselves, eliminating webpages that are no longer operating (think pets.com) and bringing webpages that are getting more attention into their caching list. They do this by using their own resolver mechanism, driven by their users' queries, to fetch fresh IP address and domain name combinations from the name servers.
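The core mechanic – serve an answer from memory until its time-to-live runs out, then ask upstream again – can be sketched in a few lines of Python. This is a toy illustration with made-up names, not how a production caching resolver is built:

```python
import time


class DnsCache:
    """Toy illustration of TTL-based caching: answers are served from
    memory until their time-to-live expires, then re-fetched upstream."""

    def __init__(self, upstream_lookup):
        # upstream_lookup stands in for a full query out to the
        # root/authoritative chain.
        self.upstream_lookup = upstream_lookup
        self.entries = {}  # hostname -> (ip, expires_at)

    def resolve(self, hostname, ttl=300):
        entry = self.entries.get(hostname)
        now = time.time()
        if entry and entry[1] > now:
            return entry[0]  # cache hit: no upstream round trip
        ip = self.upstream_lookup(hostname)  # cache miss: ask upstream
        self.entries[hostname] = (ip, now + ttl)
        return ip
```

The TTL is set by the domain owner in their DNS records, which is why, as discussed later, a change or outage at the authoritative servers can take a while to become visible through caches.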
When DNS goes wrong
Given all the pieces that make up the Domain Name System, it should be clear that there are various ways that things could go wrong. If you want to keep your website available, it is important to prepare for these problems. Fortunately, there are a few ways monitoring can save you some future headaches.
To get started, let's walk through the different problems that could occur. There are two main problems that can occur in DNS that are beyond your control, but you ought to know about them in order to keep in mind what can go wrong.
At the most catastrophic level, the root name servers could go down. This would be an internet-wide problem, and would bring down most functional uses of the internet. This event is extremely unlikely, given that there are copies of the root name servers worldwide, many of them distributed via anycast routing for additional redundancy.
Second, a local caching server could break or serve stale records. When this happens, users behind that cache may be unable to reach your site even though your own infrastructure is healthy, and there is little you can do beyond waiting for the cache's operator to fix it or for the bad records to expire.
Beyond those two problems, there are other, preventable problems. An outage of your authoritative name servers can be very frustrating, because it will cut off access for your clients and send you on a wild goose chase through your other servers. Caching adds to the complexity of diagnosing these outages, as the view through your local caches will differ from what other visitors see. Directly monitoring your authoritative name servers gives you better insight into their state, and allows infrastructure decisions to be made quickly and accurately.
DNS outages are relatively frequent occurrences; based on our aggregate monitoring data, roughly 5% of all outages are DNS-related. Just the other day, Brookstone, a publicly traded gift retailer, had a prominent DNS outage. For over an hour, neither of Brookstone's two authoritative DNS servers (NS97.WORLDNIC.COM and NS98.WORLDNIC.COM, which are provided by Network Solutions) was returning IP addresses. This took their entire site, as well as email and any other services they rely on, offline for that entire hour.
A more troubling problem is that the configuration of your authoritative name servers could be changed without your knowledge. Authoritative name servers can be targeted by someone (read: a hacker), and your FQDN could be directed to a different IP address entirely.
What you should be monitoring
Fortunately, both of these situations can be detected and handled with appropriate monitoring of your authoritative DNS servers. Our DNS checks support full queries to resolve a given FQDN and compare the IP address that is returned against one or more correct addresses. By setting up checks against each authoritative server to ensure that your domain name can be resolved, and that it is returning the correct IP address, you can be alerted whenever there are problems, and can then work with your DNS provider to resolve them.
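The heart of such a check – resolve a name, then compare the answers against a known-good list – can be sketched as follows. Note that this simplified version goes through the system resolver; a real check like ours queries each authoritative server directly, which requires a DNS library rather than the standard library alone:

```python
import socket


def check_dns(hostname, expected_ips):
    """Resolve a hostname and verify every returned address is one we
    expect; a lookup failure or an unknown address is a failed check."""
    try:
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except socket.gaierror as exc:
        return False, "lookup failed: %s" % exc
    unexpected = [a for a in addresses if a not in expected_ips]
    if unexpected:
        # An address you never configured can indicate hijacking.
        return False, "unexpected addresses: %s" % ", ".join(unexpected)
    return True, "ok"
```

A check along these lines run on a schedule catches both failure modes above: an outage shows up as a failed lookup, and a hijacking shows up as an unexpected address.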
A DNS lookup is part of the process performed for every HTTP check, so if your DNS servers are having problems, they will eventually show up as failed HTTP checks. Be aware, though, that most monitoring systems use local caching servers when performing HTTP checks. Because of this, HTTP checks may not detect an authoritative DNS outage for some time, depending on the caching settings for your domain. Running separate DNS-specific checks against your authoritative servers is critical to maximizing your site's availability.
Because authoritative name servers are run redundantly, each of those servers should be checked separately, so that you can detect problems with any individual server. However, because of this redundancy, it's not necessary to trigger immediate, urgent alerts when a single authoritative server goes down. That is an important problem that should be addressed, but it doesn't necessarily warrant waking someone up in the middle of the night to fix it immediately.
A simultaneous outage of all of your authoritative servers, however, is a critical problem that needs an immediate response. Through the use of our compound services, which generate a separate set of alerts when outages occur across multiple servers, you can handle this situation as well.
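The alerting policy just described – a quiet notification when one redundant server fails, an urgent page only when all of them fail at once – boils down to a simple rule. A sketch, with hypothetical names:

```python
def alert_severity(server_status):
    """Given {server_name: is_up}, decide how loudly to alert.

    One redundant authoritative server down is worth a ticket; all of
    them down means the domain cannot resolve at all, and someone
    should be paged immediately.
    """
    down = [name for name, up in server_status.items() if not up]
    if not down:
        return "none"
    if len(down) == len(server_status):
        return "urgent"  # full outage: the domain is unresolvable
    return "non-urgent"  # redundancy is still covering for the failure
```

In practice you would feed this the results of the per-server DNS checks described above, rather than a hand-built dictionary.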
If you aren't actively monitoring your DNS servers, we recommend you set up checks now to avoid future problems. If you have questions on how best to configure monitoring for your DNS servers, or need any assistance getting set up, feel free to comment below or email us.