Lies, Damn Lies and 99.9% Uptime

"There are three kinds of lies: lies, damn lies, and statistics."

-Benjamin Disraeli, popularized by Mark Twain

Statistics don't lie outright. They just don't tell the whole truth.

Suppose your hosting provider claims 99.9% uptime during the past month. This means all the accumulated downtime during the whole month was no more than roughly 40 minutes (40 to 45, depending on the length of the month). Sounds great, right?

The numbers don't answer one important question: when did the downtime occur? What if you were down 40 minutes during your peak usage time on the busiest day of the week? Suddenly 99.9% of uptime doesn't sound so great. That's the whole truth often missing in uptime reports.

The All-Important Monitoring Interval
Convinced you can do better than 99.9%, you search for another hosting provider. You finally settle on one that offers an additional "nine": 99.99% uptime per month, or no more than about 4 minutes of downtime.

Before you get too excited, let's see where that extra nine comes from by examining the concept of the monitoring interval. The monitoring interval is how often your hosted server is checked to make sure everything is working A-OK. Think of it as the lines on a ruler. It's going to be pretty hard to measure down to one-eighth of an inch if your ruler only has one-inch marks on it.

Suppose your application is monitored every 15 minutes. Now say your server is rebooted. If the monitor runs while the server is down, your server will show as down for 15 minutes, even though it only takes 3 minutes to reboot. If the monitor misses the reboot window, it won't show as being down at all.
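
The reboot example can be sketched in a few lines of Python. This is only an illustration: the function name reported_downtime, its parameters, and the rule that every failed check counts as one full interval of downtime are assumptions made for the sketch, not a description of how any particular monitoring service works.

```python
import math


def reported_downtime(outage_start, outage_minutes, interval, first_check=0.0):
    """Minutes of downtime a fixed-interval monitor would report.

    Assumption: each failed check is charged as one full interval of
    downtime. All times are in minutes.
    """
    outage_end = outage_start + outage_minutes
    # Index of the first check that happens at or after the outage begins.
    k = max(0, math.ceil((outage_start - first_check) / interval))
    failed_checks = 0
    # Count every check that lands before the outage is over.
    while first_check + k * interval < outage_end:
        failed_checks += 1
        k += 1
    return failed_checks * interval


# A 3-minute reboot starting at minute 14: the check at minute 15 catches it.
print(reported_downtime(14, 3, 15))
# The same reboot starting at minute 16: the checks at 15 and 30 both miss it.
print(reported_downtime(16, 3, 15))
```

Under these assumptions the same 3-minute reboot is reported as either 15 minutes of downtime or none at all, depending only on when it happens relative to the checks.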

A provider that offers 99.99% uptime must have a small enough monitoring interval that it can measure down to the nearest 0.01%. How small is that exactly? Let's break it down using the shortest month:

28 days x 24 hours/day x 60 minutes/hour x 0.0001 = 4.03 minutes

A service provider needs a monitoring interval of no more than 4 minutes to back up a 99.99% monthly uptime guarantee.

Finally, what of 99.999%, the so-called "five nines" of uptime? Well, we would have to monitor every 0.4 minutes, or every 24 seconds. With the reporting period increased to a year instead of a month, it's possible to have accuracy up to five nines with a 5-minute monitoring interval. Trouble is, who wants to wait a whole year for a report?
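
The arithmetic above generalizes to any uptime figure and reporting period. Here is a minimal Python sketch; the function name downtime_budget_minutes is made up for this example, and it simply treats the downtime budget as the largest monitoring interval that can still resolve that level of uptime, as the article does.

```python
def downtime_budget_minutes(uptime_percent, days):
    """Allowed downtime, in minutes, for a given uptime percentage
    over a reporting period of the given number of days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


# Compare the shortest month with a full year for three, four and five nines.
for label, days in (("28-day month", 28), ("365-day year", 365)):
    for uptime in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(uptime, days)
        print(f"{uptime}% over a {label}: {budget:.2f} minutes")
```

Running it shows why the reporting period matters as much as the percentage: the same five nines that leave well under a minute in a month leave a little over five minutes in a year.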

The best reporting will include a combination of daily, weekly, monthly and yearly statistics for comparison.

What Do You Mean, Down?
Now that you understand what a monitoring interval is, this next one should be easy: what is the meaning of "down"? If your service provider is reporting uptime, how do they decide when something is down? Are they simply doing a "ping" of the server? Or are they testing the application itself?

If "up" to them means your server is running, even though your application is really "down", your uptime statistics take on a whole new meaning -- or lack of meaning.
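
To make the difference concrete, here is a rough Python sketch of the two kinds of check, using only the standard library. The function names, the choice of a TCP connection as a stand-in for "ping", and the idea of looking for a piece of expected text on a page are all assumptions for illustration; a real monitor would test whatever matters most in your application.

```python
import socket
import urllib.error
import urllib.request


def server_is_up(host, port=80, timeout=5.0):
    """Server-level check: can we open a TCP connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def application_is_up(url, expected_text, timeout=5.0):
    """Application-level check: does the page load with status 200
    and contain the text we expect to see?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and expected_text in body
    except (urllib.error.URLError, OSError, ValueError):
        # Connection failures, HTTP error statuses and malformed URLs
        # all count as "down" from the application's point of view.
        return False
```

A server that accepts connections but returns an error page would pass the first check and fail the second, which is exactly the gap between "the server is running" and "the application is working".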

Also, who is the one actually doing the monitoring? Ideally, you'd like to have a third-party monitoring service. That way you know your monitoring numbers are independently verified.

Availability From a Business Perspective
There is a better way. Instead of settling for the one-size-fits-all approach of "nines of uptime", set your own availability goals. The key is to examine availability from a business perspective:

What are my business-critical periods?
How much downtime is acceptable during off hours?
What kind of monitoring interval is needed?
How do we know if the application is down?
Who is actually doing the monitoring?

Always make a distinction between business hours and after hours. You should have different availability requirements for each period, even if your application is used 24x7. Next, create your goal using words and whole numbers, not percentages. For example:

Zero downtime during business-critical periods.
No more than 2 unscheduled downtime incidents per month of no more than 5 minutes per incident during after-hours periods.
No more than 1 scheduled maintenance period per month of no more than 30 minutes during after-hours periods.
Monitoring interval of 5 minutes.
Monitor key aspects of the application, not the server.
Independent third-party monitoring from multiple locations.
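
Goals written this way are easy to check against a month's incident log. The following Python sketch shows one way to do it; the Incident record and the goal_violations function are hypothetical, and the thresholds are simply the example numbers from the list above.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    minutes: float           # how long the outage lasted
    business_critical: bool  # did it overlap a business-critical period?
    scheduled: bool          # was it planned maintenance?


def goal_violations(incidents):
    """Return a list of the example goals that a month's incidents broke."""
    violations = []
    if any(i.business_critical for i in incidents):
        violations.append("downtime during a business-critical period")

    after_hours = [i for i in incidents if not i.business_critical]
    unscheduled = [i for i in after_hours if not i.scheduled]
    scheduled = [i for i in after_hours if i.scheduled]

    if len(unscheduled) > 2:
        violations.append("more than 2 unscheduled incidents")
    if any(i.minutes > 5 for i in unscheduled):
        violations.append("unscheduled incident longer than 5 minutes")
    if len(scheduled) > 1:
        violations.append("more than 1 scheduled maintenance period")
    if any(i.minutes > 30 for i in scheduled):
        violations.append("scheduled maintenance longer than 30 minutes")
    return violations


# One short after-hours outage and one maintenance window: goals met.
month = [
    Incident(minutes=4, business_critical=False, scheduled=False),
    Incident(minutes=25, business_critical=False, scheduled=True),
]
print(goal_violations(month))
```

An empty list means every goal was met for the month, which is a yes-or-no answer rather than a percentage.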

After defining exactly what your availability goals are, you can now strive to achieve them. The difference now is that your goal is 100% achievable. That's a statistic you can count on.

Glen Kendell is a network architect and owner of Release to Production. He publishes a monthly newsletter called In-Production: Achieving True High Availability.
