What 99.99% uptime actually buys you
Every cloud reliability conversation eventually arrives at a number with four nines in it. Marketing pages quote it. Procurement scorecards weight it. Engineers cite it in design reviews. And almost nobody who quotes “99.99%” can tell you, without looking it up, what those nines actually buy them in minutes, or what the contract behind that number is willing to call “downtime” in the first place.
This is a guide to the second question. The math is the easy part. The hard part is that the headline number does not describe what most people think it describes, and the gap between the two is where reliability conversations go wrong.
The math, briefly
Availability is reported as the fraction of a defined time window during which a service was up. The “nines” notation collapses that fraction into a count of leading nines: 99.9% is three nines, 99.99% is four nines, 99.999% is five nines.
Each additional nine reduces the allowed downtime by an order of magnitude. The table below is the version every SRE has redrawn on a whiteboard at some point. It is worth memorising the per-month column, because that figure is the one that maps to how most SLAs are actually settled.
| Availability | Per year | Per month (30 d) | Per day |
|---|---|---|---|
| 99% (two nines) | 3 d 15 h 36 m | 7 h 12 m | 14 m 24 s |
| 99.9% (three nines) | 8 h 45 m 36 s | 43 m 12 s | 1 m 26 s |
| 99.95% | 4 h 22 m 48 s | 21 m 36 s | 43 s |
| 99.99% (four nines) | 52 m 34 s | 4 m 19 s | 8.6 s |
| 99.999% (five nines) | 5 m 15 s | 25.9 s | 0.86 s |
Two things are worth pulling out of that table.
First: the jump from three nines to four nines is the difference between an annual outage budget of nearly nine hours and an annual outage budget of under an hour. That is not a marginal upgrade — that is a different category of system, with a different category of cost.
Second: the per-day column is what your users actually experience. A service operating at “five nines” is allowed slightly under one second of downtime per day on average. If you have ever waited for a single failed API request to retry, you have already burned that budget. This is why monthly averages are deceptive — the number assumes that downtime spreads evenly across the window, which is not how outages happen in practice.
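The arithmetic behind the table is short enough to keep in a scratch file. The sketch below assumes a 365-day year and a 30-day month, as the table does; the function name `downtime_budget` is made up for illustration.

```python
from datetime import timedelta


def downtime_budget(availability_pct: float, window: timedelta) -> timedelta:
    """Return the downtime allowed in `window` at the given availability."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return window * (1.0 - availability_pct / 100.0)


# Window lengths are assumptions that match the table: 365-day year, 30-day month.
WINDOWS = {
    "year": timedelta(days=365),
    "month": timedelta(days=30),
    "day": timedelta(days=1),
}

for pct in (99.0, 99.9, 99.95, 99.99, 99.999):
    budgets = {
        name: round(downtime_budget(pct, window).total_seconds(), 1)
        for name, window in WINDOWS.items()
    }
    print(f"{pct}% -> allowed downtime in seconds: {budgets}")
```

For 99.99% over a 30-day month this comes to roughly 259 seconds, which is the 4 m 19 s in the table.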
The denominator is doing the work
Here is the part the marketing page does not tell you: the percentage in the SLA is a fraction, and the fraction has a numerator and a denominator. Vendors negotiate the denominator — what counts as “service time” — and the negotiation is where the headline number comes from.
There are four common patterns. Each one narrows what the fraction is allowed to count, which mechanically reduces how much real-world downtime is ever recorded against a given availability number.
Pattern 1: Regional scope. The SLA covers “the service” in a named region. If the entire region goes dark, that is downtime. If a single zone within the region goes dark, that is usually not — the SLA assumes you should have built for zone redundancy. Practically: a single-zone deployment can experience hours of unavailability without the provider owing you a service credit. Multi-region SLAs are rarer than they look and often require explicit architectural commitments on your side.
Pattern 2: Control-plane vs data-plane. “Service availability” frequently means data-plane availability only — the ability to call existing resources. The control plane (creating, modifying, deleting resources) carries a separate, lower SLA, if it carries one at all. A service whose data plane is up but whose control plane has been wedged for an hour is contractually 100% available. From the perspective of a team trying to scale up during an incident, it is not.
Pattern 3: Covered services. Cloud platforms have hundreds of products. The SLA enumerates which ones are covered. The popular ones almost always are. Newer products, regional products, and products tagged “preview” or “beta” are almost always not — and the boundary moves quietly, especially as services graduate out of preview. If you have built on a service that the SLA does not name, you do not have an SLA on that service.
Pattern 4: Error-rate thresholds. Most modern SLAs do not measure binary up/down. They measure error rate. A representative clause reads something like: “The service is considered unavailable when the error rate exceeds 5% for at least 5 consecutive minutes, calculated over rolling 5-minute windows.” Every number in that sentence, along with the word “consecutive,” is a vendor lever:
- The error rate threshold (typically 0.1% to 10%, depending on the product) determines how degraded the service has to be before any clock starts at all.
- The minimum sustained duration (typically 1 to 10 minutes) means brief spikes never accumulate, even if they happen continuously.
- The aggregation window (typically 1 to 5 minutes) determines whether two consecutive 4-minute outages count as one 8-minute event or as zero events of qualifying duration.
- The “consecutive” qualifier means that recovery for a single minute resets the clock. Under the clause quoted above, a service that alternates between four minutes fully broken and one minute barely working is contractually 100% available.
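To see how the threshold and the “consecutive” qualifier interact, here is a minimal sketch of a clause like the representative one above. It assumes one error-rate sample per minute and treats each minute as its own aggregation window, which is a simplification of a real rolling-window calculation. The function name and the sample data are hypothetical.

```python
from itertools import groupby


def qualifying_downtime_minutes(error_rates, threshold=0.05, min_consecutive=5):
    """Count minutes inside runs of at least `min_consecutive` consecutive
    minutes whose error rate exceeds `threshold`.

    `error_rates` holds one error-rate sample (0.0 to 1.0) per minute.
    """
    total = 0
    for is_bad, run in groupby(error_rates, key=lambda rate: rate > threshold):
        run_length = sum(1 for _ in run)
        if is_bad and run_length >= min_consecutive:
            total += run_length
    return total


# Hypothetical hour: 4 minutes fully broken, 1 minute healthy, repeated.
flapping = ([1.0] * 4 + [0.0]) * 12
# Hypothetical hour: one solid 10-minute outage.
solid = [0.0] * 25 + [1.0] * 10 + [0.0] * 25
# Hypothetical hour: degraded the whole time, just under the threshold.
degraded = [0.04] * 60

print(qualifying_downtime_minutes(flapping))  # 0, despite 48 broken minutes
print(qualifying_downtime_minutes(solid))     # 10
print(qualifying_downtime_minutes(degraded))  # 0
```

The flapping hour is broken for 48 of its 60 minutes and records nothing, because no run of bad minutes ever reaches five.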
None of these patterns is dishonest. Each one is a sensible engineering decision about what to measure. What they do collectively is push the boundary between “service is down” and “service is working as designed” toward a definition that the provider’s existing telemetry can comfortably report on. The result is a number that means something specific — and rarely the thing the procurement spreadsheet thinks it means.
The clock starts when the vendor says it starts
Even after you have agreed on what counts as downtime, the SLA controls how long a given incident is allowed to be on the record.
The most consequential clause is the minimum incident duration. An SLA that defines downtime as “qualifying error conditions sustained for at least 5 minutes” will not credit a 4-minute incident, no matter how complete the outage. From the operator’s perspective, four minutes of total unavailability is a four-minute outage. From the SLA’s perspective, it never happened.
Multiply this across a month and the effect is striking. A service with one 4-minute outage every other day — call it 15 events per month, 60 minutes of cumulative user-visible unavailability — reports as 100% available on the SLA scorecard. The same service with a single 60-minute event reports as 99.86%. The user experience may be much worse in the first case (15 separate incidents your team had to wake up for, 15 customer-visible disruptions, 15 reset retry budgets). The SLA does not capture that, and the credit calculation does not either.
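The comparison can be reproduced in a few lines. This is a sketch under the assumptions in the paragraph above: a 30-day month and a 5-minute minimum incident duration. The function names are illustrative, not taken from any provider's tooling.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 30-day month, as assumed in the text


def observed_availability(incident_minutes):
    """Availability if every minute of every incident counts."""
    return 100.0 * (1 - sum(incident_minutes) / MINUTES_PER_MONTH)


def reported_availability(incident_minutes, min_incident_minutes=5):
    """Availability if incidents shorter than the minimum are ignored."""
    qualifying = sum(m for m in incident_minutes if m >= min_incident_minutes)
    return 100.0 * (1 - qualifying / MINUTES_PER_MONTH)


fifteen_short = [4] * 15  # one 4-minute outage every other day
one_long = [60]           # a single 60-minute event

print(round(reported_availability(fifteen_short), 2))  # 100.0
print(round(observed_availability(fifteen_short), 2))  # 99.86
print(round(reported_availability(one_long), 2))       # 99.86
```

Both months contain the same 60 minutes of unavailability. Only the minimum-duration clause separates the two reported figures.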
The clock also stops when the vendor decides recovery has occurred, and the recovery definition is rarely “the customer-observable error rate is back to baseline.” Often it is “monitoring shows the underlying infrastructure is healthy again,” which is an internal measurement. The lag between internal recovery and external recovery can be substantial — caches need to warm, queues need to drain, DNS needs to converge, third-party CDNs need to refresh. None of those minutes count.
The credit is not the loss
When an outage clears every clause and qualifies for a service credit, the credit is almost always a fraction of the affected month’s billing for the affected service. The structure is usually tiered:
| Monthly uptime | Typical credit |
|---|---|
| ≥ 99.9% | 0% |
| 99.0% – 99.9% | 10% |
| 95.0% – 99.0% | 25% |
| < 95.0% | 100% |
Specific thresholds and percentages vary by provider and product, but the shape is consistent.
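As a sketch, the schedule above is a simple lookup. The tier boundaries here treat each lower bound as inclusive, which is an assumption; real agreements word the boundaries differently, and the function names are hypothetical.

```python
def credit_percent(monthly_uptime_pct):
    """Map monthly uptime to a credit tier, using the illustrative table above.

    Assumption: each tier's lower bound is inclusive.
    """
    if monthly_uptime_pct >= 99.9:
        return 0
    if monthly_uptime_pct >= 99.0:
        return 10
    if monthly_uptime_pct >= 95.0:
        return 25
    return 100


def service_credit(monthly_uptime_pct, monthly_bill):
    """Credit is a fraction of the affected service's bill for that month."""
    return monthly_bill * credit_percent(monthly_uptime_pct) / 100


# A hypothetical $400/month service that dipped to 99.5% uptime.
print(service_credit(99.5, 400.0))  # 40.0
```

Note that nothing in the calculation refers to what the outage cost the customer. The only inputs are the uptime figure and the bill.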
A few things follow from this shape.
First, the credit is paid as a discount on the same service that failed you. If you have moved off the service because it failed, you cannot collect. If your monthly bill on the affected service is small, the credit is correspondingly small — a $40 credit on a database that took down your $200,000/month e-commerce site for an afternoon is not how risk is normally priced.
Second, the credit is not damages. It is a refund of money you paid for capacity that did not work. The contract typically caps the provider’s total liability at the prior 12 months’ billing under the same agreement, and excludes consequential damages explicitly. Customers who plan to “rely on the SLA” for revenue protection are planning incorrectly.
Third — and this is the one that most often surprises operators — the credit is usually conditional on the customer filing a claim, in writing, with detailed evidence, within a narrow window (30 to 60 days is typical). No claim, no credit. Provider self-reported downtime data is not automatically applied to your account in most agreements. You have to ask.
Practically: SLAs are useful for procurement comparison and for setting reasonable expectations. They are not useful as a financial hedge. If your business case requires a financial hedge against provider downtime, the instrument you want is insurance, not a service-credit clause.
What “monthly” obscures about reliability
A single monthly availability number compresses two independent properties of a service into one figure: how often it fails, and how long it stays failed when it does.
Consider two providers, both reporting 99.95% monthly availability — about 22 minutes of allowed downtime in a 30-day month.
- Provider A experienced four separate 5.5-minute outages, evenly distributed.
- Provider B experienced one 22-minute outage on the 14th.
Both providers will quote the same number on their marketing page. Both will pass the same procurement filter. For the operator on the receiving end, they are very different systems.
Provider A’s failure pattern is consistent with a transient instability — perhaps a flapping load balancer, a marginal capacity headroom, a recurring noisy-neighbour problem. Each individual incident is short, but their frequency means your team is paged constantly, your retry budgets are eroded continuously, and your customers see error states they will remember. The operational tax is high.
Provider B’s failure pattern is consistent with a single, well-defined incident — a configuration push, a hardware failure, a regional network event. It is a worse fault on a single day, but it is a single fault. You write a postmortem, you incorporate the lesson, and you move on. The operational tax is lower than Provider A’s even though the cumulative downtime is identical.
The headline number tells you nothing about which pattern you are buying. Some providers do publish a distribution of incident durations alongside the monthly number, but the practice is not yet standard, and it is much less common than quoting the headline figure on a marketing page.
The same point applies to time-of-day distribution. A service whose 22 minutes of monthly downtime falls entirely between 03:00 and 05:00 UTC on a Sunday is operating on a fundamentally easier schedule than a service whose outages cluster during peak hours in its largest customer’s region. The numerator is identical. The user impact is not.
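The Provider A and Provider B comparison can be made concrete by summarising the incident list instead of collapsing it into one percentage. This is a minimal sketch: the durations are the ones from the example above, and `incident_profile` is a made-up helper.

```python
from statistics import mean


def incident_profile(durations_min):
    """Summarise a month of incidents by frequency and duration."""
    return {
        "incidents": len(durations_min),
        "total_min": sum(durations_min),
        "mean_min": mean(durations_min),
        "longest_min": max(durations_min),
    }


provider_a = [5.5, 5.5, 5.5, 5.5]  # four short outages
provider_b = [22.0]                # one longer outage

profile_a = incident_profile(provider_a)
profile_b = incident_profile(provider_b)

print(profile_a)
print(profile_b)
# Same cumulative downtime, so the same monthly availability figure.
print(profile_a["total_min"] == profile_b["total_min"])  # True
```

The totals match, so the headline number matches. The incident count and the longest single outage are where the two providers come apart.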
So what should you actually look at?
The headline availability number is a starting point. It tells you whether a provider is in a serious-engineering category at all. A provider quoting 99.9% on its data plane is asking you to plan for almost 9 hours of cumulative annual outage; a provider quoting 99.99% is asking you to plan for about 53 minutes. That difference is real and matters for capacity planning.
For everything beyond that initial filter, the headline is the wrong question. Better questions, in rough order of practical signal:
- What is the incident frequency? Count of incidents per month, not cumulative duration. A provider with two long incidents is different from a provider with twelve short ones, even at the same monthly availability number. Some providers publish post-incident reports for every event; reading six months of those is more informative than any SLA.
- What is the recovery profile? When this provider has an incident, how fast does it usually clear? Median time to detection, median time to mitigation, and the long tail (the 90th percentile) matter much more than the average. A vendor that detects fast and recovers slowly is a different proposition from a vendor that detects slowly but, once it sees the problem, fixes it in minutes.
- What is the dependency surface? Many provider incidents are not provider incidents at all — they are upstream incidents at a third party the provider depends on. DNS providers depending on a single registry, CDNs depending on a single TLS certificate authority, multi-region services with implicit single-region control planes. The dependency map determines correlated-failure risk.
- What is the independent measurement? A provider’s own data is one signal. Independent measurement from outside the provider’s network is a different signal, with no incentive to round in the provider’s favour. The two together are more informative than either alone — agreement strengthens the picture, divergence is itself data worth investigating.
- What do other customers’ incident postmortems say? When a real customer writes about an incident their provider caused, they describe what actually broke, how long it took to get a human on the line, whether the status page reflected reality, whether the recovery actually held. These are the artefacts that survive a contract negotiation.
The first three items in that list can usually be answered from public information if you are willing to read carefully. The fourth is what this project exists to produce. The fifth is hard work but disproportionately useful.
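The recovery-profile question is the easiest one to turn into numbers once you have a list of incident durations from post-incident reports. The sketch below uses invented mitigation times and a nearest-rank percentile; both are assumptions for illustration, and other percentile definitions would give slightly different tail values.

```python
import math
from statistics import mean, median


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical time-to-mitigation figures, in minutes, for ten incidents.
mitigation_minutes = [3, 4, 4, 5, 6, 7, 9, 12, 45, 180]

print(mean(mitigation_minutes))            # 27.5
print(median(mitigation_minutes))          # 6.5
print(percentile(mitigation_minutes, 90))  # 45
```

In this invented data set the average is pulled up by a single three-hour incident and describes no incident that actually happened. The median says what a typical incident looks like, and the 90th percentile says how bad the bad ones get.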
A worked example
Take “object storage” — a category where the SLA shape is unusually consistent across vendors, which makes the comparison cleaner than in most product areas.
Most major object storage products carry a monthly availability SLA in the 99.9% range, with a credit schedule that pays 10% when monthly uptime falls between 99.0% and 99.9%, 25% between 95.0% and 99.0%, and 100% below 95.0%. The error-rate qualifier is typically defined around “valid requests that result in 5xx responses or fail to complete,” sustained over a rolling window (usually 5 minutes), with the minimum incident duration in the same range.
Now overlay the patterns from earlier in this piece:
- The denominator is the affected region. A single-zone outage within the region does not qualify. The SLA assumes multi-zone deployment.
- The clock does not start until the error rate exceeds the contractual threshold for the contractual duration. A 3-minute total outage in a 5-minute-window SLA does not count. Two consecutive 3-minute outages with a 1-minute recovery between them do not count either.
- The covered scope is usually limited to the storage operations themselves (GET, PUT, DELETE). Adjacent services — IAM, replication, lifecycle policies — may carry separate SLAs or none.
- The credit is a discount on the affected month’s bill for the affected product, capped, and requires you to file a claim with evidence within 60 days.
The result is that two providers can both quote “99.9% monthly availability” on the same product class and still differ meaningfully on:
- How much of the platform is in the SLA’s scope at all.
- What aggregate behaviour qualifies as “an outage.”
- How short an outage can be before it stops counting.
- How easy or hard it is to actually collect on the credit.
- How frequently and for how long each provider has experienced credit-triggering events in practice (the SLA tells you what is contractually allowed; only measurement tells you what actually happens).
For procurement, the SLA is a comparison aid for the first four items. For operations, the fifth is the one that matters — and the fifth is what independent measurement is for.
What we publish that is different
Every number on the UptimeProject leaderboard comes from probes we operate ourselves, on networks no provider in the measurement set controls, against real endpoints, with the same checks an outside user would run. We measure on a fixed cadence — once per minute for HTTP, less frequently for slower-changing surfaces like TLS and DNS — and we aggregate by strict probe consensus. A minute counts as up only if a strict majority of reporting probes succeeded; a minute with fewer than three reporting probes is excluded from the availability denominator entirely so that gaps in our own coverage do not penalise the service being measured. The full description is on the methodology page, and the underlying data is exposed via a public read-only API.
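The consensus rule described above can be sketched in a few lines. This is an illustration of the rule as stated in the paragraph, not the project's actual implementation; the function names and the sample data are hypothetical.

```python
def classify_minute(probe_results, min_probes=3):
    """Classify one minute from a list of per-probe booleans (True = success).

    Fewer than `min_probes` reporting probes: the minute is excluded.
    Otherwise the minute is up only on a strict majority of successes.
    """
    reporting = len(probe_results)
    if reporting < min_probes:
        return "excluded"
    successes = sum(probe_results)
    return "up" if successes * 2 > reporting else "down"


def availability(minutes, min_probes=3):
    """Percentage of counted minutes that were up, or None if none counted."""
    states = [classify_minute(results, min_probes) for results in minutes]
    counted = [state for state in states if state != "excluded"]
    if not counted:
        return None
    return 100.0 * counted.count("up") / len(counted)


sample = [
    [True, True, True],         # up
    [True, True, False],        # up (2 of 3)
    [True, True, True, False],  # up (3 of 4)
    [False, False, True],       # down
    [True, False],              # excluded: only two probes reported
]
print(availability(sample))  # 75.0
```

A tie, such as two successes out of four probes, counts as down under a strict-majority rule. Excluded minutes drop out of both the numerator and the denominator.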
The point of doing the measurement this way is to produce a number that is meaningful for the reason an SLA number is not: nobody involved has a commercial incentive to round in any direction. The boundaries are stable, public, and versioned in git. When something here moves, you can see the diff that moved it.
Independent measurement does not replace the SLA. The SLA is a contract; our numbers are observations. What independent measurement does is give you a second reading — one whose denominator is not negotiable, whose clock does not pause, and whose credit schedule is irrelevant because it is not in the room when the data is collected. Put it next to the SLA, watch where the two agree, watch where the two diverge, and the divergence is where the interesting questions are.
The leaderboard is updated continuously as new measurements land. The data is published under CC BY 4.0 — copy it, cite it, argue with it.
Questions, corrections, requests for additional services to measure: hello [at] uptimeproject [dot] org.