Jeremy Wright initially brought up capacity scaling issues on his blog, essentially saying that Web 2.0 companies need 99.99% uptime or else [CORRECTION: Jeremy was generally ranting about uptime and never specifically called for 99.99% uptime; I inferred this based on a later post by Jeremy, which is worth reading]. I enjoyed Signal vs. Noise’s response to Jeremy’s post because it provides a concise explanation of why 99.99% uptime is really hard:
To go from 98% to 99% can cost thousands of dollars. To go from 99% to 99.9% tens of thousands more. Now contrast that with the value. What kind of service are you providing? Does the world end if you’re down for 30 minutes?
If you’re Wal-Mart and your credit card processing pipeline stops for 30 minutes during prime time, yes, the world does end. Someone might very well be fired. The business loses millions of dollars. Wal-Mart gets in the news and loses millions more on the goodwill account.
Signal vs. Noise goes on to ask what the implications are for a service like Flickr if it operates at less than 99% uptime; the answer is that right now it’s probably no big deal.
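To put those percentages in perspective, here is a rough sketch of how much downtime each level actually allows. It assumes a 365-day year and a service measured around the clock; the function name allowed_downtime_hours is my own, not something from either post.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, measured 24x7


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime a given uptime percentage permits."""
    return period_hours * (100.0 - uptime_percent) / 100.0


# Print the yearly downtime budget for each uptime level discussed above.
for pct in (98.0, 99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% uptime allows about {hours:.2f} hours "
          f"({hours * 60:.0f} minutes) of downtime per year")
```

Under these assumptions, 99% uptime leaves roughly 87.6 hours of downtime a year, while 99.99% leaves under an hour, so a single 30-minute outage would use up more than half of the yearly budget at that level.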
The same questions and arguments can be applied to internal corporate applications. Is it ok if your company’s e-mail doesn’t work all the time? Probably not, especially with people accessing it over the web, via Treos and Blackberries, and working on the weekends. Is it ok if the human resources web portal doesn’t have 99% uptime? HR might tell you that it is, but chances are good that HR is not working 7 days a week in most companies.
Signal vs. Noise introduces a great term for evaluating uptime that came from Alistair Cockburn: “Criticality”. The percentage of uptime a service needs is directly related to its level of criticality.