Looking to find out exactly what the Panopta agent can do? Want to know the best way to configure your notifications? We’ve got your answers!
We’re excited to announce the launch of the new Panopta Q&A site at answers.panopta.com. This site is filled with information about Panopta’s features and functionality, about how to best monitor your servers and network devices as well as general information on deploying and managing large-scale online infrastructure.
The site is open for anyone to ask questions, so if you can’t find what you’re looking for, ask away! Find a great answer? Vote it up so others will benefit from it as well. Get started by introducing yourself and sharing your story about how Panopta has helped your team.
We’re excited to be participating in an upcoming webinar with the folks at GigaSpaces, discussing how to enhance the scalability and performance of e-commerce applications. The webinar will include a review of our 2011 holiday monitoring results, with examples of sites that performed superbly and others that suffered during Black Friday and Cyber Monday.
Details on the webinar and a signup link are below.
WHAT: GigaSpaces webinar on Scalability Solutions for Peak Retail Demand
WHO: Jason Abate, Founder & CEO of Panopta and Ron Anderson, GigaSpaces Director of Architecture
WHEN: Thursday, February 9 at 12:00 pm EST
WHERE: To register go online at: https://www1.gotomeeting.com/register/834220008
A monitoring system is only as valuable as it is dependable. If you can’t trust that you’ll get alerted anytime something is wrong with your servers, then your monitoring isn’t worth whatever you’re paying for it. Our goal at Panopta is to let you relax in confidence, knowing that if your phone isn’t going off, then your infrastructure is fine. And whenever there are problems, you’ll know right away.
Everyone knows that Murphy’s law applies especially well to the Internet. Things break in all sorts of different ways, from an obscure bug in code that pops up to take things down to a complete power outage at your datacenter because a truck drives into it. You never know what will break when, so you just have to plan for the worst and make sure your system can adapt and keep running as things go downhill.
This applies to monitoring systems just like everything else. A monitoring system has one subtle but crucial difference from the typical website or online application that further complicates our design. Most websites have regular patterns to their traffic, with peaks during the day when most of their visitors are awake and deep lulls after hours, which are great for doing maintenance. Short disruptions during these times have minimal impact, and maintenance can often be done without raising the ire of customers (or bosses!).
The load pattern of a monitoring system is completely different: it is flat across any time period (well, except for a steady upward trend as we grow!). We perform the same number of checks in the wee hours of the morning as we do in the early afternoon, and our customers rely on us to continuously check their servers regardless of the hour.
We’ve designed our infrastructure from the ground up with this in mind. Planning for every piece of the system to eventually fail has allowed us to keep our system running without disruption for the past four years, through billions of checks and millions of outages, and lets us accommodate regular maintenance by shifting load between pools of resources.
Every piece of our infrastructure is designed with redundancy in mind, starting with our monitoring nodes, which are distributed around the world and hosted in the datacenters of dozens of different providers. Any one of these can (and does) fail at any time. When this happens, our central infrastructure detects the failure and seamlessly moves checks to a nearby neighbor, fast enough that the overall system never misses a check. Once the node comes back, the checks are returned and the system continues on. The same mechanism allows for rolling upgrades of monitoring nodes when we release new functionality.
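To make the failover idea concrete, here is a minimal sketch in Python of how checks could be moved to a healthy neighbor and later handed back. It is an illustration only, not Panopta's actual code: the `Node` and `Scheduler` classes, the node names and the "closest neighbor first" ordering are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    neighbors: list[str]  # nearby nodes, closest first (hypothetical ordering)
    healthy: bool = True
    checks: set[str] = field(default_factory=set)


class Scheduler:
    """Toy central scheduler that moves checks off failed nodes and back."""

    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}
        self.home = {}  # check id -> node that normally runs it

    def mark_down(self, name):
        node = self.nodes[name]
        node.healthy = False
        # Pick the closest neighbor that is still healthy.
        target = next(
            (self.nodes[n] for n in node.neighbors if self.nodes[n].healthy),
            None,
        )
        if target is None:
            raise RuntimeError(f"no healthy neighbor for {name}")
        # Remember where each check came from so it can be returned later.
        for check in node.checks:
            self.home.setdefault(check, name)
        target.checks |= node.checks
        node.checks = set()
        return target.name

    def mark_up(self, name):
        node = self.nodes[name]
        node.healthy = True
        # Pull back every check whose recorded home is this node.
        for other in self.nodes.values():
            if other is node:
                continue
            returning = {c for c in other.checks if self.home.get(c) == name}
            other.checks -= returning
            node.checks |= returning
        for check in node.checks:
            self.home.pop(check, None)
```

Calling `mark_down("node-a")` hands that node's checks to its nearest healthy neighbor, and `mark_up("node-a")` returns them. A rolling upgrade could use the same two calls, one node at a time.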
In the event of a total hardware failure of a node or a sudden spike in demand for monitoring capacity, we can completely reprovision a monitoring node from a bare OS image in about five minutes using a single Fabric command.
Our central infrastructure, which runs our outage detection, notification, reporting and control panel applications, is run out of two clusters of servers in the Dallas and Seattle datacenters of our partner SoftLayer. Configuration and monitoring data is replicated in both directions in near-realtime between locations, making use of SoftLayer’s secure and high-speed private network. All applications are able to run in either location, which gives us the ability to withstand a partial or complete failure of either datacenter and to shift operations around as needed for maintenance purposes.
Fortunately, SoftLayer’s been quite stable over the past four years, and the cases where we have had to fail over primary operations from one location to the other have all been due to planned maintenance that was scheduled in advance. But we sleep well at night knowing that whenever something unexpected does come up, we’re prepared.
Of course, this raises the question of who’s monitoring the monitoring system. For that we are covered as well, with a separate system that is continually watching all of the core components of our system to ensure that they’re all functioning as expected. Which means that all of our customers can sleep easy at night, knowing that we’ll continue to watch their systems, every minute of the day!
We’ve had a chance to run the numbers and now have the final results of our 2011 Holiday eCommerce Availability Index. Over the course of six weeks, starting right before Thanksgiving and ending on New Year’s Day, we tracked the downtime of all major online retailer websites, checking every minute to see whether holiday shoppers would be able to make their purchases or would be forced to head to a competitor.
Our analysis showed that some of the losers had a single outage lasting more than seven hours, while the winners had zero downtime throughout the entire holiday season, showing the difference between those who were prepared and those who were not.
Overall, online retailers who experienced outages around the beginning of the holiday season tended to fix their issues. There were three times as many downtime minutes during Thanksgiving weekend as in all of December, suggesting that once retailers saw their sites going down, they quickly reacted, possibly by adding more servers and enhancing their infrastructure. Site performance was much more consistent in December than in November, with one spike on Dec 23, the last day to order with overnight shipment that would be delivered in time for Christmas.
Over 50 sites had no downtime at all and had very quick response times, including Apple, Staples, REI, and Kohls. On the flipside, over 70 sites monitored did have significant downtime, including 22 sites with outages lasting more than three hours. Some of these include OfficeDepot and GameFly.
Compared to the 2010 index, the sites that performed well continued to do so (including the big three: Amazon, Target and Walmart). Some that had extensive trouble in 2010 made a big improvement (JCPenney and Express both had 100% uptime, and The North Face decreased its downtime by approximately 90%).
Not all improved, however. OfficeDepot had nearly eight hours of total downtime, an improvement from 2010 but still significant, while Levenger improved slightly but still had over four hours of total downtime.
Based on Forrester’s estimate of $60 billion in online sales during the holidays in the US alone, all this downtime certainly translated to lost revenue. If we make the rough assumption that these ecommerce sites represent 50% of the total, and sales are evenly spread throughout the holiday season (very rough approximations, I admit, but enough to get an order-of-magnitude estimate) then each minute of downtime corresponds to roughly $5000 in lost sales opportunity. Certainly some customers will wait for a site to come back and return, but regardless there is serious revenue at risk.
Aggregating all of the downtime across all 132 sites, we recorded nearly 9 days of downtime, which translates to nearly $65 million in lost sales opportunities!
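The last step of that estimate is easy to reproduce. Here is a small sketch using the rounded figures from this post (9 days of aggregate downtime and $5000 per minute). Since we measured slightly less than 9 days, the result is a slight upper bound. The function name is ours, purely for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def lost_sales(downtime_days, dollars_per_minute):
    """Convert aggregate downtime into a rough lost-sales figure."""
    return downtime_days * MINUTES_PER_DAY * dollars_per_minute


estimate = lost_sales(9, 5000)
print(f"${estimate:,}")  # $64,800,000
```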
Below are the full details of the measured downtime for the sites we monitored, along with their 2010 numbers. Interested in more details? Email info@panopta.com and we’ll send you the full dataset from the study.
How did your site do over the holidays? What worked (or didn’t) to prepare? Please share your ideas in the comments!
Update: We have been in contact with TigerDirect’s IT team and confirmed that one of the outages we recorded was actually due to their servers blocking our monitoring nodes rather than an actual website outage. The numbers below and on the holiday site have been updated to reflect this.
Happy New Year from all of us at Panopta. Since you’re reading this in early January, I assume you’re back at work after what was hopefully a quiet and relaxing holiday season free of too many late-night outage alerts!
Hopefully you got a chance to check out our Holiday eCommerce Availability Index, which tracked the performance and availability of all the big online shopping sites throughout the holiday season. We’re still working on crunching the final numbers, but it looks like November, namely Black Friday and Cyber Monday, was rough for many of the online retailers (some with outages of over seven hours!), while December was quieter, with fewer disruptions. Watch our blog for the final writeup later this week.
2011 was a great year for us at Panopta, with lots of new functionality added to our system, thanks to the great feedback that we received from our customers around the world. Compared to the Panopta of a year ago, we have new versions of our monitoring agent, new mobile apps for the iPhone/iPad and Android, and the first release of our monitoring appliance, as well as lots of smaller additions for more accurate monitoring and more flexible notification. Our monitoring network also continued to expand, with our first locations in Africa and continued growth in Asia and North America.
We are looking forward to 2012 and what it has in store, namely a number of new features currently under development to help you better manage your infrastructure.
We’ll also be ramping up the information available on our blog this year, with insights into Panopta’s architecture and technical operations, details on advanced Panopta features and capabilities that you might not know about, and a new series of customer profiles. So check back often! Also, if you’ve been able to solve some of your infrastructure management problems using Panopta and would like to be featured, please let me know and we’ll get an interview set up.
As always, we love to hear your feedback about the service, the blog or anything else!
We are extending our monitoring network with new nodes in Cairo, Egypt; Johannesburg, South Africa; and Tokyo, Japan. The address of our Denver, Colorado node will also be changing. These new nodes will be enabled on Friday, December 2. If you have firewall rules allowing access to our monitoring network, please update them to include these addresses.
For a complete list of our monitoring network addresses, see http://www.panopta.com/about-panopta/monitoring-network/.
For the major online shopping sites, the first major hurdle of the holiday season is behind them. Most of the sites we’re tracking made it through Black Friday without major website outages, although several did have disruptions. In addition to the widely publicized order form problems at Walmart, there were other sites with problems. The worst was clearly Gamestop, which had 23 measurable outages on Friday, with a total downtime of more than three hours. Ulta, Joann, Sephora and Cabelas also had multiple outages, but each was limited to less than 25 minutes of downtime.
With Comscore reporting record online sales for Black Friday, it is quite surprising to see how well most sites performed. Up next is Cyber Monday, the day that traditionally sees the highest online sales figures of the season.
Keep an eye on http://holiday.panopta.com to see how everyone handles the rush!
It’s been nearly two weeks since we started monitoring the biggest online shopping sites through the Panopta Availability Index – Holiday eCommerce edition, and we’ve already seen quite a bit of surprising activity surrounding these sites, even though the real shopping season hasn’t even started yet.
Of the 130+ sites we’re monitoring as part of this study, nearly half have had outages of some kind, including 36 sites (more than 25 percent) with outages longer than 15 minutes!
Three major sites in particular, TigerDirect, Sears and KMart, have had more than seven hours of downtime in the past two weeks, and others, such as Abercrombie, OfficeDepot and Onsale.com, have had more than four hours of downtime.
It’s possible that some of these were preventative maintenance periods to ramp up for the major rush of holiday traffic. But considering that many of these occurred during the daytime (not the late-night hours that IT teams typically use for planned maintenance), it might very well be a sign of what’s to come when the traffic increases during the major holiday shopping days.
Will the 70 sites that have had perfect uptime continue throughout Black Friday and Cyber Monday? You can find out by getting a real-time view of how all the sites are doing at http://holiday.panopta.com.
Retailers plan well in advance for the holiday season, but there are always uncontrollable variables. For brick-and-mortar stores it can be unexpected weather; for online retailers, it can be their website, essentially their storefront, going down. Based on our 2010 research, more than 40 percent of ecommerce sites can expect to have disruptions, ranging from a few minutes of slowdown to major outages such as those that hit The North Face, Express and JCPenney during last year’s season.
Today we’re excited to launch the Panopta Availability Index – 2011 Holiday eCommerce Edition. Panopta AI tracks downtime and availability of more than 130 major online shopping sites throughout the holiday season and is available at http://holiday.panopta.com. Follow along to see how your favorite sites survive Black Friday, Cyber Monday and the rest of the holiday season!
We’re very excited to make the latest release of our monitoring infrastructure available to all customers. This release has a number of big improvements centered around notification, plus several new network service checks for more advanced email-related functions.
The biggest feature, and by far the most heavily requested by our users, is the new rotating contact type. This lets you set up things like on-call schedules or support shift schedules: you load your calendar into the system and it automatically notifies the right person based on the schedule. No more having to remember to update configurations when shifts change, maintain separate email aliases or pass along shared phones or pagers. You can also use this functionality to set up conditional contacts who only get notified during certain hours of the day. For details on how to set up a rotating contact, see our knowledge base.
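If you’re curious what a schedule lookup like this involves, here is a minimal sketch in Python. It is not how our system is implemented. The `Shift` class, the contact names and the rule that an overnight shift belongs to the weekday it starts on are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Optional


@dataclass(frozen=True)
class Shift:
    contact: str
    start: time  # inclusive
    end: time    # exclusive; an end earlier than start wraps past midnight
    weekdays: frozenset = frozenset(range(7))  # 0 = Monday

    def covers(self, when: datetime) -> bool:
        t = when.time()
        if self.start <= self.end:
            # Same-day shift.
            return when.weekday() in self.weekdays and self.start <= t < self.end
        # Overnight shift: it belongs to the weekday on which it started.
        if t >= self.start:
            return when.weekday() in self.weekdays
        if t < self.end:
            return (when.weekday() - 1) % 7 in self.weekdays
        return False


def contact_for(shifts, when: datetime) -> Optional[str]:
    """Return the first contact whose shift covers `when`, or None."""
    for shift in shifts:
        if shift.covers(when):
            return shift.contact
    return None


# Hypothetical schedule: alice on weekdays 9-17, bob every night 17-9.
shifts = [
    Shift("alice", time(9), time(17), frozenset(range(5))),
    Shift("bob", time(17), time(9)),
]
```

With this schedule, an alert at 10:00 on a Monday goes to alice and one at 03:00 on Tuesday goes to bob. A Saturday-noon alert matches nobody, which is the kind of gap a real schedule would need a fallback contact for.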
We also now have support for sending alerts through Google Chat, IRC and Twitter, so you and your team can get notified with whatever channel works best. There’s even support for posting outage alerts to your public Twitter timeline to let your customers know about problems, with the ability to customize the outage messages to match your company’s voice.
This is the first of many planned additions to our notification support. Our goal is to not only have the most flexible notification system of any monitoring system, but to be able to handle all of the real-world setups that you might have – if you have a setup that you can’t configure in our system, we want to hear from you!
Finally, we added support for two new, advanced email checks:
- Microsoft Exchange monitoring through OWA
- Round-trip email delivery
For the round-trip email delivery, our system sends an email to your inbound SMTP server at regular intervals and waits to receive the message, via POP, IMAP or secure versions of either. Alerts can be generated if the message is never received, or if the delivery process takes longer than a configurable threshold, so you’ll always know that your email system is working correctly.
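For the curious, the logic of a round-trip check can be sketched in a few lines of Python. This is a simplified illustration, not our production code. The function name, the status strings and the default thresholds are assumptions, and the `send` and `received` callables stand in for real SMTP and POP/IMAP clients (for example, ones built on Python’s `smtplib` and `imaplib` modules).

```python
import time
import uuid
from typing import Callable


def round_trip_check(
    send: Callable[[str], None],
    received: Callable[[str], bool],
    slow_after: float = 60.0,      # seconds before a delivery counts as slow
    give_up_after: float = 300.0,  # seconds before the message counts as lost
    poll_every: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, float]:
    """Send a uniquely tagged message and wait for it to show up.

    Returns (status, elapsed_seconds), where status is "ok", "slow"
    or "missing".
    """
    token = uuid.uuid4().hex  # unique marker to look for in the mailbox
    started = clock()
    send(token)
    while True:
        elapsed = clock() - started
        if received(token):
            return ("ok" if elapsed <= slow_after else "slow"), elapsed
        if elapsed >= give_up_after:
            return "missing", elapsed
        sleep(poll_every)
```

Passing the clock and sleep functions in as parameters keeps the sketch easy to exercise without a real mail server. A "slow" or "missing" result is what would trigger an alert.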
