This past Friday, a large-scale attack plagued much of the internet in North America, making tens of thousands of sites unavailable, including well-known ones like Twitter, GitHub, Spotify, and Reddit. This major outage was caused by an attack on the DNS services provided by Dyn. The attack was a reminder of how fragile the Internet is and where its vulnerabilities lie. There is a lot of discussion focused on building failover and redundancy into the compute infrastructure your applications run on. However, a solid DNS infrastructure is often overlooked, which leaves the door open to exactly this type of outage.

The emergence of public mega clouds like AWS, Azure, and Google Cloud has given users the tools to be largely immune to individual node failures. Beyond those micro failures, running your infrastructure in multiple geographically independent data centers to allow for complete site failover is typically the next level of redundancy. Again, the mega clouds are huge enablers of this with their current (and expanding) global footprints.

While this is certainly helpful, there is more you can do on your own to protect yourself from these types of outages. A lack of consideration for DNS in your disaster recovery and high availability planning can be a crucial weakness, and that weakness was laid bare during Friday's attack on Dyn. The go-to move for most companies is “more cloud”, but all the cloud in the world can’t help if your DNS is not functional. We’d like to focus on why an attack on a single core provider should not have caused this type of widespread disruption.

DNS has inherent fault tolerance built in, with support for secondary authoritative DNS servers. The common approach is to list multiple nameservers from the same provider.

~ dev$ host -t ns twitter.com
twitter.com name server ns3.p34.dynect.net.
twitter.com name server ns4.p34.dynect.net.
twitter.com name server ns2.p34.dynect.net.
twitter.com name server ns1.p34.dynect.net.

Having multiple nameservers from a single provider protects you from single-node failures, but it does not save you when the provider itself experiences an outage. This is exactly what happened to the majority of the affected sites on Friday. A targeted DDoS attack taking out an entire provider is an extreme case, but an internal problem like a buggy release could cause the same sort of outage. The solution is to leverage the built-in support for secondary authoritative DNS and also host your zones with other, disconnected providers. That way, a provider-wide outage will not have a catastrophic effect.

Full disclosure – although we were not affected by the Dyn problems on Friday, we were still susceptible to this type of failure. We saw this as a wakeup call and took these additional measures ourselves. We’ve historically relied on IBM SoftLayer to manage our authoritative DNS, and their globally distributed AnyCast DNS servers have proven to be quite reliable. In order to build up an additional layer of protection, we replicated our DNS zone to Google’s authoritative DNS services.

~ dev$ host -t ns panopta.com
panopta.com name server ns2.softlayer.com.
panopta.com name server ns1.softlayer.com.
panopta.com name server ns-cloud-b1.googledomains.com.
panopta.com name server ns-cloud-b2.googledomains.com.
panopta.com name server ns-cloud-b3.googledomains.com.
panopta.com name server ns-cloud-b4.googledomains.com.

Adding more DNS providers is the easy part; the larger challenge is ongoing management. Unless you can ensure strict discipline about making every DNS change in all of your providers, you’re opening the door to human error in the form of inconsistent records. A better solution is to automate syncing across all providers. We did this with the help of Libcloud.

Libcloud

Apache’s Libcloud project abstracts the various provider APIs so you can manage DNS across multiple providers automatically. We wrote a Python script which reads our authoritative DNS records out of a JSON file hosted on a private server and uses Libcloud to sync those settings out to all providers. This allows us to seamlessly migrate to a new provider or incorporate more DNS providers in the future. Our ops team knows to make all DNS modifications in that JSON file. In addition, we’re using Panopta’s DNS checks to directly query and monitor each DNS provider node we leverage for a select set of critical DNS entries, which gives us high-level assurance that the syncing is operating as expected.
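
For the curious, the approach looks roughly like the sketch below. This is a minimal outline rather than the script we’ll be publishing: the credentials, zone name, and JSON layout are placeholders, and a production version would diff and update records rather than delete and recreate them.

# sync_dns.py -- a minimal sketch of syncing one zone out to multiple
# providers. Credentials, the zone name, and the JSON layout shown here
# are placeholders, not our production values.
import json
from libcloud.dns.providers import get_driver
from libcloud.dns.types import Provider, RecordType

ZONE_NAME = "example.com"  # hypothetical zone

# Desired state, e.g. {"www": {"type": "A", "data": "203.0.113.10"}}
with open("dns_records.json") as f:
    desired = json.load(f)

providers = [
    get_driver(Provider.SOFTLAYER)("sl-username", "sl-api-key"),
    get_driver(Provider.GOOGLE)("sa@example-project.iam.gserviceaccount.com",
                                "key.json", project="example-project"),
]

for driver in providers:
    # Locate our zone with this provider (some providers return fully
    # qualified domains with a trailing dot, so normalize first).
    zone = next(z for z in driver.list_zones()
                if z.domain.rstrip(".") == ZONE_NAME)
    existing = {(r.name, r.type): r for r in driver.list_records(zone)}
    # Remove records that dropped out of the desired state, leaving the
    # provider-managed NS and SOA records alone.
    for (name, rtype), record in existing.items():
        if rtype not in (RecordType.NS, RecordType.SOA) and name not in desired:
            driver.delete_record(record)
    # Create anything that is missing. (A real script would also detect
    # changed data and update records in place.)
    for name, spec in desired.items():
        rtype = getattr(RecordType, spec["type"])
        if (name, rtype) not in existing:
            driver.create_record(name=name, zone=zone, type=rtype,
                                 data=spec["data"])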

We feel more secure having this additional layer of redundancy built into our infrastructure, and we urge all of you to consider similar changes. We’ll be publishing the Libcloud script mentioned above to our public GitHub repository so you can implement a similar solution. We’ll be sure to post an update on Twitter and on this blog post when it’s available for you to download.

We are excited to announce that we are further expanding our monitoring network. On April 18th, we will add two new monitoring nodes: one in Toronto, Canada and one in Auckland, New Zealand. So that you can update your firewalls, here are the relevant IP addresses:

  • Auckland – 163.53.235.50
  • Toronto – 159.203.6.127

Three of our existing monitoring nodes will be changing IP addresses as well. Their new IPs are:

  • Chicago 2: previously 23.29.134.23, will become 45.63.67.141
  • Singapore 2: previously 180.210.201.164, will become 103.25.203.234
  • WDC 2: previously 199.58.161.213, will become 192.96.206.34

Lastly, due to stability issues, we will be shutting down our monitoring node in Brisbane, Australia (103.16.129.198). If you are currently making use of this node, all of its checks will be seamlessly moved to our Sydney 2, Australia checker (103.25.58.106).

Any of our customers that have firewall restrictions for our monitoring nodes should update their systems to account for these new IP addresses. Follow our RSS feed or subscribe to our monitoring network announcement list with the form on the right to be alerted whenever changes are made. If you are scripting firewall rules, you can visit http://bit.ly/pan-ips to get a plaintext version of our latest IP addresses.
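
If you do script your rules, the update can be as simple as the following sketch. It assumes the published list is served as plain text with one address per line; adapt the parsing and output to whatever firewall you run.

# Regenerate firewall allow rules from the published monitoring node list.
# Assumes plain text with one IP address per line; adjust if the actual
# format differs.
from urllib.request import urlopen

with urlopen("http://bit.ly/pan-ips") as resp:
    ips = [line.strip() for line in resp.read().decode().splitlines()
           if line.strip()]

for ip in ips:
    # One rule per monitoring node; pipe the output into a shell or adapt
    # it to your firewall of choice.
    print(f"iptables -A INPUT -s {ip} -j ACCEPT")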

We will continue to expand our fleet of monitoring nodes periodically. If there is a location that you would like us to consider expanding to, then please let us know via email.

We recently released a refresh for our Users, Groups, and Integrations. Not only does it look better, but it’s also a lot easier to use. A quick recap of the changes is below.

Merging of Users and Contacts

In the past, we split roles across two different team member types – Users and Contacts. See the image below for a refresher of what that looked like.

[Screenshot: the old Users and Contacts management page]

In hindsight, it doesn’t make a ton of sense to have these two separate; contact-only is really just a role that should be assignable to a User. So, we’ve merged them together. Now, you merely need to specify the role you’d like the User to have rather than managing two different types. Plus, we still allow you to grant a more intermediary role: Limited.

[Screenshot: the new merged Users page]

Cleaner UX

To make things easier to manage, we’ve given Users, Groups, Integrations, and On-Call Schedules their own pages. We’ve also given each a fresh UI to make the management process a bit easier on the eyes. We think it’s a lot more enjoyable to use, and we hope you do as well.

[Screenshots: the new Users, Integrations, Groups, and On-Call Schedules pages]

As always, please shoot any feedback or questions our way – hello@panopta.com. Happy monitoring!


With cyber threats at an all-time high, architectural best practices call for placing servers on properly segmented networks with limited access from the public internet. While this certainly helps mitigate your security risk, it leaves you with a significant monitoring blind spot. One of Panopta’s core strengths is its ability to provide both an external and an internal view of your infrastructure, using our external monitoring probes and Panopta OnSight. Panopta OnSight is a virtual appliance that sits behind your firewall and monitors the status of your infrastructure. It installs in just a few minutes on all well-known hypervisors, including VMware, Xen, Microsoft Hyper-V, and KVM. OnSight supports a wide range of monitoring options, including the following (a short sketch of a basic port check follows the list):

  • Uptime and availability monitoring of network services
  • ICMP Ping
  • TCP/IP port checks
  • Agentless resource polling
    • SNMP
    • SSH (Linux)
    • WMI (Windows)
    • CIM
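
To make that list concrete, the sketch below shows roughly what a basic TCP port check reduces to. It illustrates the underlying idea only and is not OnSight’s actual implementation; the host and port are hypothetical.

# A simplified sketch of the idea behind a TCP port check -- not
# OnSight's actual implementation.
import socket

def tcp_port_check(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(tcp_port_check("10.0.0.5", 80))  # hypothetical internal web server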


Why Use OnSight

If you manage business-critical applications on your company’s private corporate network, you can use OnSight to leverage the same system you rely on for public monitoring to watch over your internal applications and tools. In addition, public and private cloud environments (like AWS, Azure, and Rackspace) are making it easier to build complex, multi-tiered applications. Cloud environments, especially hybrid clouds, are more likely to span multiple private networks, making it harder to get complete coverage. Because most of a multi-tier architecture cannot be reached from the outside, deploying OnSight inside it is what gives you that depth of visibility.

OnSight is also helpful in environments where auto-scaling is utilized: with new nodes constantly being spun up and killed, it is important to ensure your monitoring keeps up. A typical application architecture may expose a public service (website, app, API) through a shared or dedicated load balancer, behind which sit several web application nodes and supporting servers that contribute to serving your application.

[Simplified diagram of a typical multi-tier application architecture]

Deploying OnSight onto the private segment of your cloud allows you to gain uptime and performance insight on each of the individual nodes and resources in your application stack. This additional insight helps diagnose problems you encounter and detect issues before they result in downtime or major service degradation. Combining the internal and external views of your infrastructure delivers the complete visibility you need to give your end users the best experience possible.


Security

The OnSight appliance communicates with the Panopta cloud exclusively via an outbound encrypted connection. It establishes this outbound connection to securely send monitoring data and events back to our SaaS cloud, powering all of your reporting, notifications, and dashboards, as well as to download its monitoring configuration. No inbound connectivity is required, which keeps your private infrastructure unexposed. In addition, if you have servers that do not have outbound internet access, you can install our monitoring agent on them and configure it to send its data to the OnSight appliance instead. The appliance then operates as a proxy, enabling monitoring on those private servers without requiring them to have outbound access.


Getting Started and Setting up OnSight

Once you’ve downloaded and imported the appropriate OnSight image, you can begin monitoring in one of two ways:

1) You can put OnSight into discovery mode with a range of IP addresses to scan, and it will build up a queue of the devices and servers it finds, along with the services running on each. You can then review the list of discovered servers and choose which ones you would like to add into Panopta for monitoring (a simplified sketch of this kind of scan follows below). We’ll soon have support for auto-provisioning rules using our existing templates as well.

2) You can also add servers manually (either through the control panel or the API) and configure each of them to use OnSight as their primary monitoring node. You can also handle provisioning of servers monitored by OnSight in bulk using our powerful template system.
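
As a rough illustration of what discovery mode does under the hood, the sketch below sweeps a private range for a few common service ports. The range and port list are hypothetical, and OnSight’s real scanner is more sophisticated.

# A simplified illustration of the kind of sweep discovery mode performs:
# probe a private range for a few common service ports.
import ipaddress
import socket

COMMON_PORTS = [22, 80, 443, 3306]  # SSH, HTTP, HTTPS, MySQL

def discover(cidr, timeout=0.5):
    """Yield (ip, open_ports) for each host with at least one port open."""
    for ip in ipaddress.ip_network(cidr).hosts():
        open_ports = []
        for port in COMMON_PORTS:
            try:
                with socket.create_connection((str(ip), port), timeout=timeout):
                    open_ports.append(port)
            except OSError:
                pass
        if open_ports:
            yield str(ip), open_ports

for host, ports in discover("10.0.0.0/28"):  # hypothetical private range
    print(host, ports)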


Advanced Setup Options for OnSight

OnSight also supports high-availability options for environments in which internal monitoring is imperative to operations. To take advantage of this, deploy multiple OnSight instances (preferably on different underlying hardware) and set them to be part of the same cluster. Panopta will then automatically distribute checks evenly across all OnSight nodes in the cluster, ensuring no single appliance gets overworked. In addition, our central infrastructure continually monitors each OnSight instance in the cluster, and in the event we lose our connection with any of the nodes, it will immediately fail over all of that node’s checks to the other nodes in the cluster so that your monitoring continues to run.

For more information on how to configure OnSight in a cluster, refer to our knowledge base article. If you have any other questions regarding OnSight or the installation process, our support team is happy to help via email or web chat.



Website developers and admins on today’s ever-expanding web have a number of solutions to handle high availability, failover, and performance. One of those solutions is Anycast, a routing scheme you can use to deal with the challenges of serving a global audience. With Anycast, you can ensure users are routed to the node closest to them, and you gain fault tolerance in the event that one of your POPs (points of presence) is unavailable.

(image credit: http://www.wpmayor.com/wp-mayor-guide-wordpress-content-delivery-networks/)

In the diagram above, you can see how an Anycast scheme works in the context of a CDN (Content Delivery Network). In a CDN, website visitors in different parts of the world have their requests routed to the server nearest them, while receiving the same content. This helps with page load time and also takes much of the burden off of your origin servers. Anycast is what enables routing each request to the nearest datacenter or network.

However, the benefits of Anycast add significant complications to monitoring. The Anycast POP you end up monitoring is determined by the location of your monitoring probes. Running a check from a single probe leaves you with a sizable blind spot: you’ll always be routed to the same location, so you won’t know about problems with the other Anycast POPs. In addition, any outage confirmation done by other nodes will likely test the wrong location, causing the outage to be dismissed as a false positive.

So how do you embrace Anycast in your architecture but still effectively monitor your resources?

Network Coverage
First, determine with your Anycast/CDN provider which POPs your sites are being served from. You will then need to work with your monitoring provider to understand which of their probes terminate at each of those locations. This is where a monitoring provider with a wide network footprint and diverse carrier backbone really helps. Upstream providers and carriers matter here because which probes terminate at which Anycast POPs is determined by more than just geographic location; it is determined by routing and the networks the requests travel through.
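
As a concrete example, some CDNs expose the POP that served a request directly in a response header. CloudFlare’s CF-RAY header ends with the IATA code of the datacenter that handled the request, so a quick script run from each monitoring location can help map your probes to POPs. The sketch below assumes a CloudFlare-fronted site; the URL is a placeholder.

# Which CloudFlare POP serves this vantage point? CloudFlare's CF-RAY
# response header ends with the IATA code of the handling datacenter.
from urllib.request import urlopen

def serving_pop(url):
    """Return the colo code from the CF-RAY header, if present."""
    with urlopen(url) as resp:
        ray = resp.headers.get("CF-RAY", "")
    return ray.rsplit("-", 1)[-1] if "-" in ray else "unknown"

print(serving_pop("https://www.example.com/"))  # hypothetical CDN-fronted site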

“Safe” Outage Confirmations
If your monitoring system attempts to verify outages from multiple locations (like Panopta does), an Anycast outage could get incorrectly ruled out. Panopta uses an outage voting process to verify the authenticity of an outage: if one of our nodes detects an outage on a server or website, 3-5 nearby nodes instantly attempt to confirm it from other locations, ruling out any local network or server issues with the primary node. If a majority vote is reached, the site or server is considered down and we begin the alerting process; if not, we check it again in 60 seconds per the normal schedule. With Anycast, there is the danger of a confirming node checking the wrong server because of the routing scheme. One way to mitigate this is to fine-tune which probes get used for confirmation; you can determine which probes are safe to use together with your monitoring provider (a simplified sketch of the voting idea follows below).
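
For illustration, the voting step itself boils down to something like the following sketch. This is an illustration of the technique, not Panopta’s production code.

# A simplified sketch of the outage-voting idea described above.
def confirm_outage(check_results):
    """Declare an outage only if a majority of confirmation probes
    (True = reachable) saw the target as down."""
    down_votes = sum(1 for up in check_results if not up)
    return down_votes > len(check_results) // 2

# Four nearby nodes re-check a target the primary node saw fail. With
# Anycast, each confirming probe must be one known to terminate at the
# same POP, or the vote ends up testing the wrong server.
votes = [False, False, True, False]  # hypothetical confirmation results
print(confirm_outage(votes))  # True -> begin the alerting process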

We’ve pre-determined the mappings of our monitoring probes for common CDN providers like CloudFlare and MaxCDN. If you are using either of these providers (or any other provider), feel free to get in touch with our support team in order to determine how to best set up your monitoring.

About Panopta: Panopta provides advanced network and server monitoring for online businesses and service providers. We go beyond providing basic monitoring to give operations teams the tools they need to detect issues before they occur and minimize the impact of outages or slow load time. Contact us with any questions you may have, or sign up for the free trial and see for yourself!