Office 365 Achieved 99.99% Availability in Q3 2016. Does Anyone Still Care About Cloud SLAs?

The SLA Record for Office 365

I suspect most people don’t care to check whether Office 365 meets its Service Level Agreement (SLA). But I am the pernickety type who cares about details like this, which brought me to the page in the Office 365 Trust Center where Microsoft publishes SLA results on a quarterly basis. The information posted for the third quarter of 2016 covers availability through the end of September 2016. For the sake of completeness, Table 1 details the quarterly figures reported by Microsoft since it first started to report SLA results in 2013.

Year  Q1      Q2      Q3      Q4
2013  99.94%  99.97%  99.96%  99.98%
2014  99.99%  99.95%  99.98%  99.99%
2015  99.99%  99.95%  99.98%  99.98%
2016  99.98%  99.98%  99.99%

Table 1: Office 365 performance against SLA since 2013

Data for the last quarter is usually available six weeks after the quarter ends. That is, if the folks responsible for maintaining the page remember to update it. There have been times when vexed messages had to be dispatched to the Office 365 team to ask what had happened to the quarterly figures because the data didn’t appear on time. In any case, the process flowed smoothly on this occasion.

Calculating Availability

Of course, many games are played with SLA figures to make the availability of systems seem better. Excluding planned maintenance is a much-loved method of improving a system’s reported availability. However, both Microsoft and Google scrupulously avoid such games. Availability is measured simply by taking the total number of minutes that should be available to users in a given period and subtracting the time lost through incidents.

Each of the individual workloads running inside Office 365 has its own SLA definition. As an example, Microsoft’s SLA defines downtime for Exchange Online to be “Any period of time when end users are unable to send or receive email with Outlook on the web.” The actual calculation is a little more complex:

The “Monthly Uptime Percentage” for a Service is calculated by the following formula:

((User Minutes – Downtime)/User Minutes) * 100

where Downtime is measured in user-minutes; that is, for each month, Downtime is the sum of the length (in minutes) of each Incident that occurs during that month multiplied by the number of users impacted by that Incident.
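The formula quoted above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Microsoft’s implementation: the function name, the shape of the incident list, and the 1,000-user tenant are all assumptions made for the example.

```python
from typing import Iterable, Tuple


def monthly_uptime_percentage(
    user_minutes: float,
    incidents: Iterable[Tuple[float, int]],
) -> float:
    """Apply ((User Minutes - Downtime) / User Minutes) * 100.

    Each incident is a (length in minutes, users impacted) pair, so
    Downtime is expressed in user-minutes, as the SLA text describes.
    """
    downtime = sum(minutes * users for minutes, users in incidents)
    return (user_minutes - downtime) / user_minutes * 100


# Hypothetical tenant: 1,000 users in a 30-day month (illustrative numbers only).
tenant_user_minutes = 1_000 * 30 * 24 * 60
tenant_incidents = [
    (90, 200),    # 90-minute incident affecting 200 users
    (30, 1_000),  # 30-minute incident affecting every user
]
print(f"{monthly_uptime_percentage(tenant_user_minutes, tenant_incidents):.3f}%")
```

Because downtime is weighted by the number of users affected, a long incident that hits a handful of users can count for less than a short one that hits everybody.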

All of this is very good. The problem that we face is that Office 365 is now so large that any individual incident is likely to affect only a very small proportion of users and therefore barely moves the overall availability of the service. It would take an enormous incident involving millions of users for a sustained period to impact availability by 0.01%. After all, a lot of minutes have to be lost through outages before the figure shifts when 85 million active users connect to a service.
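To put a rough number on “enormous”, here is a small sketch of how many user-minutes must be lost to move availability by 0.01 of a percentage point. It assumes a 30-day month and the 85 million active users mentioned above; the function names and the sample outage sizes are hypothetical.

```python
ACTIVE_USERS = 85_000_000         # figure quoted in the article
MINUTES_IN_MONTH = 30 * 24 * 60   # assumption: a 30-day month


def user_minutes_for_drop(drop_in_points: float) -> float:
    """User-minutes that must be lost to cut availability by the given
    number of percentage points (0.01 means 99.99% becomes 99.98%)."""
    total_user_minutes = ACTIVE_USERS * MINUTES_IN_MONTH
    return total_user_minutes * drop_in_points / 100


def outage_hours_needed(users_affected: int, drop_in_points: float = 0.01) -> float:
    """Hours a single outage must last, for a given number of affected
    users, to produce that drop on its own."""
    return user_minutes_for_drop(drop_in_points) / users_affected / 60


print(f"User-minutes for a 0.01 point drop: {user_minutes_for_drop(0.01):,.0f}")
for users in (1_000_000, 5_000_000):  # hypothetical outage sizes
    print(f"{users:,} users affected: {outage_hours_needed(users):.2f} hours")
```

Under these assumptions, an outage affecting one million users would have to last a little over six hours to shave a single hundredth of a point off the monthly figure.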

How Incidents Affect the Office 365 SLA

To prove the point, let’s calculate the total number of minutes available to Office 365 users in a month and see what impact a massive outage has on the SLA. In a 30-day month, the total available user-minutes for the whole of Office 365 is some 3.672 trillion (Table 2). Now let’s assume that a small Office 365 datacenter region like the newly launched U.K. region (which is still in the process of transferring tenants) experiences an outage lasting a complete 8-hour working day. If we assume that some 2.5 million users are affected by the outage, the 1.2 billion user-minutes lost reduce the availability of Office 365 to 99.97% (rounded).

Minutes in a 30-day month 43,200
Office 365 active users 85,000,000
Minutes available to users in the month 3,672,000,000,000
Users in Office 365 Datacenter region 2,500,000
Outage in hours 8
Total lost minutes 1,200,000,000
Availability 99.96732%

Table 2: How the swelling size of Office 365 reduces the impact of a service incident
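The figures in Table 2 can be reproduced with some simple arithmetic. The sketch below uses only the inputs given in the table; the variable names are my own.

```python
# Inputs taken from Table 2
minutes_in_month = 30 * 24 * 60   # 43,200
active_users = 85_000_000
region_users = 2_500_000          # users assumed to be affected
outage_hours = 8

# Derived values
available_minutes = minutes_in_month * active_users
lost_minutes = region_users * outage_hours * 60
availability = (available_minutes - lost_minutes) / available_minutes * 100

print(f"Minutes available to users: {available_minutes:,}")
print(f"Total lost minutes:         {lost_minutes:,}")
print(f"Availability:               {availability:.5f}%")
```

Changing region_users or outage_hours is a quick way to see how large an incident has to be before the service-wide number moves.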

To be clear, Office 365 outages occur all the time. And when an incident affects your tenant its impact can be sudden and infuriating. You can really do nothing to restore service but wait for Microsoft to troubleshoot and fix. Sometimes that process takes too long, such as when nine hours were required to fix an Exchange Online Protection (EOP) problem on June 30, 2016. That being said, extraordinary care has to be taken when applying fixes to a massive, complex, and worldwide infrastructure like Office 365 in case the cure provokes worse results.

The vast majority of incidents are transient and affect a relatively small number of tenants. The reality is that the last really major Office 365 incident was in June 2014 when a 7-hour outage for some U.S. tenants was provoked by a failure in the directory infrastructure for Exchange Online. That blip resulted in a Q2 2014 SLA figure of 99.95%. A series of incidents in the same quarter of 2015 reduced availability to the same level. Despite the problems in those quarters, Office 365 still comfortably beat the 99.9% SLA guaranteed by Microsoft.

In fact, the only time I can remember Microsoft having to compensate tenants for not meeting the Office 365 SLA was way back in September 2011. That was soon after the launch of Office 365 when some manual updates went wrong and compromised availability. The systems are different today and changes are much more controlled, so that kind of thing doesn’t happen anymore.

Timing Is Everything

Another factor that should be taken into account when calculating availability is the “official timing” for incidents. A tenant might consider that they are affected by a problem as soon as an issue comes to light. That problem must then be reported to Microsoft support and then accepted by Microsoft as a valid incident before the lost-minute clock starts ticking. Sometimes the gap between initial report and incident acceptance is brief, sometimes it takes longer. And sometimes tenants don’t realize that anything untoward is going on because a problem happens when their users are asleep.
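A small sketch shows how the timing gap described above changes the numbers. Everything here is hypothetical: the timestamps, the 500-user tenant, and the assumption that the clock starts exactly at acceptance are invented to illustrate the point and do not describe how Microsoft records incidents.

```python
from datetime import datetime


def minutes_between(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60


# Hypothetical incident timeline for a 500-user tenant
first_noticed = datetime(2016, 1, 1, 9, 0)   # users start to see the problem
accepted = datetime(2016, 1, 1, 9, 45)       # incident accepted as valid
resolved = datetime(2016, 1, 1, 12, 0)       # service restored
users_affected = 500

# Downtime in user-minutes, as the tenant experiences it and as it is counted
experienced = minutes_between(first_noticed, resolved) * users_affected
counted = minutes_between(accepted, resolved) * users_affected

print(f"Experienced by the tenant: {experienced:,.0f} user-minutes")
print(f"Counted after acceptance:  {counted:,.0f} user-minutes")
print(f"Not counted:               {experienced - counted:,.0f} user-minutes")
```

In this made-up case a quarter of the downtime that the tenant experienced never reaches the calculation.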

To be fair to Microsoft, they have invested enormously in redundancy to ensure that Office 365 is highly resistant to failure. Redundancy is incorporated at all levels of the service from network to application and is backed up by sophisticated monitoring to ensure everything runs as expected.

With these points in mind, is it any wonder that Microsoft can consistently beat their 99.9% SLA target? In fact, as Office 365 user numbers continue to swell, it becomes harder and harder for any single incident to have a material impact on the Office 365 SLA.

Google’s Availability Record

Google beat Microsoft to the punch here because they were first to report SLA results (for Gmail initially) and to achieve a better than 99.9% SLA over a sustained period. Gmail attained an availability of 99.984% in 2010. Back then, this was an important step forward for cloud services as it marked the point where a credible claim could be made that at least some cloud services had matured to a point where they provided a better SLA than was possible for on-premises systems.

Lately Google hasn’t been as good as Microsoft in reporting SLA. I can’t find an equivalent quarter-by-quarter tracking of SLA performance on their site (perhaps my search skills are deficient). I did find an FAQ answer that offers some historical perspective.

G Suite offers a 99.9% Service Level Agreement (SLA) for covered services, and in recent years we’ve exceeded this promise. In 2013, Gmail achieved 99.978% availability. Furthermore, G Suite has no scheduled downtime or maintenance windows. Unlike most providers, we do not plan for our applications to be unavailable, even when we’re upgrading our services or maintaining our systems. Google Cloud Platform has a 99.95% SLA, Google BigQuery Service, and the standard storage class of Google Cloud Storage have a 99.9% SLA except for the Durable Reduced Availability Storage class of Google Cloud Storage which has a 99% SLA.

Anecdotally, I see no evidence that Google has experienced any recent problems that would have severely impacted the SLA for their cloud services. Services like Gmail run at even more massive scale than Office 365 so the same logic about the number of affected users required to impact the SLA holds true here, too.

Better Than On-Premises

Data can be argued many ways to prove different points. What I think the SLA data discussed here proves is that major cloud services are extremely robust. At this point, I’ll advance the heretical notion that cloud services are more reliable than on-premises services delivered by the majority of IT departments. The investment made by Microsoft in software engineering, redundancy, hardware, and operational processes has paid off in terms of Office 365. And that’s a good thing.

Follow Tony on Twitter @12Knocksinna.
