Office 365 continues to grow strongly and contribute to Microsoft’s cloud resources. Now supporting more than 100 million monthly active users, Office 365 has experienced some recent hiccups in service quality. Even so, the quarterly Service Level Agreement (SLA) data that Microsoft posts for Office 365 shows that service availability has been robust since 2013 (Table 1), when Microsoft first started to publish the SLA results.
|Q1 2013||Q2 2013||Q3 2013||Q4 2013||Q1 2014||Q2 2014||Q3 2014||Q4 2014|
|Q1 2015||Q2 2015||Q3 2015||Q4 2015||Q1 2016||Q2 2016||Q3 2016||Q4 2016|
|Q1 2017||Q2 2017|
Table 1: Office 365 SLA performance since 2013
The latest data, posted for Q2 2017 (April through June), shows that Office 365 delivered 99.97% availability in that period. The Q2 result is slightly lower than the availability reported for the prior seven quarters. Even so, the fact that a massive cloud service posts 99.97% availability is impressive.
Microsoft takes the SLA seriously because they “commit to delivering at least 99.9% uptime with a financially backed guarantee.” In other words, if the Office 365 SLA for a tenant slips below 99.9% in a quarter, Microsoft will compensate the customer with credits against invoices.
Last November, I posed the question of whether anyone still cared about the Office 365 SLA. The response I received afterwards showed that people do care, largely for two reasons. First, IT departments compare the Office 365 SLA against the SLA figures they have for on-premises servers to reassure the business that the cloud is a safe choice. Second, they use the data to resist attempts to move to other platforms like Google G Suite. Google offers a 99.9% SLA guarantee for G Suite, but seems to be less transparent about publishing its results.
Microsoft calculates the Office 365 SLA in terms of downtime, or minutes when incidents deprive users of a contracted service such as Exchange Online or SharePoint Online. As an example of the calculation, if you assume that Microsoft has 100 million active users for Office 365, the total number of minutes available to Office 365 users in a 90-day quarter is 12,960,000,000,000 (129,600 minutes per user). Achieving a 99.97% SLA means that Microsoft considers that incidents caused downtime of 3,888,000,000 minutes or 64,800,000 hours. These are enormous numbers, but put in the context of the size of Office 365, each Office 365 user lost an average of just 39 minutes to downtime during the quarter.
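The arithmetic is easy to check. The following is a minimal Python sketch, assuming 100 million users, a 90-day quarter, and a reported availability of 99.97%. The function names are my own and purely illustrative, and `Decimal` is used only to keep the percentage arithmetic exact.

```python
from decimal import Decimal

MINUTES_PER_DAY = 24 * 60


def total_user_minutes(users: int, days: int) -> int:
    """Minutes of service available to all users over the period."""
    return users * days * MINUTES_PER_DAY


def downtime_user_minutes(total_minutes: int, availability_pct: str) -> Decimal:
    """User-minutes of downtime implied by a reported availability percentage."""
    unavailable_pct = Decimal(100) - Decimal(availability_pct)
    return total_minutes * unavailable_pct / Decimal(100)


users = 100_000_000          # assumed number of active users
total = total_user_minutes(users, days=90)
lost = downtime_user_minutes(total, "99.97")
per_user = lost / users      # average minutes lost per user

print(f"Total user-minutes:    {total:,}")         # 12,960,000,000,000
print(f"Downtime user-minutes: {lost:,.0f}")       # 3,888,000,000
print(f"Downtime hours:        {lost / 60:,.0f}")  # 64,800,000
print(f"Minutes per user:      {per_user}")        # 38.88
```

The per-user figure is an average across the whole service. It says nothing about how the downtime was distributed among tenants.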
Of course, some users experienced zero downtime. Incidents might not have affected their tenant or they might not have been active when an incident happened. On the other hand, some tenants might have had a horrible quarter. Office 365 spreads across twelve datacenter regions and the service varies from region to region and from tenant to tenant, a fact worth keeping in mind whenever a Twitter storm breaks over a new outage.
To better understand what the Office 365 SLA means, we need to take some other factors into account. These are described in Microsoft’s Online Services Consolidated Service Level Agreement.
First, among the exclusions applied by Microsoft we find they can ignore problems that “result from the use of services, hardware, or software not provided by us, including, but not limited to, issues resulting from inadequate bandwidth or related to third-party software or services;”
Defining what inadequate bandwidth means is interesting. For example, if a new Office feature like AutoSave consumes added bandwidth and causes a problem for other Office 365 applications, is that an issue for Microsoft or the customer?
Second, although the term “number of users” occurs 38 times in Microsoft’s SLA document, no definition exists for how to calculate the number of users affected by an incident. This might be as simple as saying that Microsoft counts all the licensed users when an incident affects a tenant. On the other hand, it is possible that an incident is localized and does not affect everyone belonging to a tenant. Knowing how many users an incident affects is important because the number of lost minutes depends on how many people cannot work because an incident is ongoing.
Third, Microsoft must accept that an incident is real before it starts the downtime clock. A certain delay is therefore inevitable between a user first noticing a problem and the time when Microsoft support acknowledges that an issue exists. Users might not be able to work during this time, but Microsoft does not count this lost time in the SLA statistics, so the reported availability looks better than the availability that end users experience.
You might also quibble about when Microsoft declares an incident over and stops the downtime clock as it might take some further time before Microsoft fully restores a service to the satisfaction of a tenant. On the other hand, Microsoft does count time when an incident is in progress outside the normal working day when users might not be active, which evens things out somewhat.
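To see how much these factors can move the number, here is a rough Python sketch for a single incident in a hypothetical 5,000-user tenant. The tenant size, timestamps, and helper names are all invented for illustration, and the sketch assumes that availability is calculated as (user minutes − downtime) / user minutes × 100, with downtime expressed in user-minutes. It is not Microsoft’s actual accounting method.

```python
from datetime import datetime


def incident_user_minutes(start: datetime, end: datetime, affected_users: int) -> float:
    """Downtime for one incident: elapsed minutes multiplied by affected users."""
    elapsed_minutes = (end - start).total_seconds() / 60
    return elapsed_minutes * affected_users


def uptime_pct(total_user_minutes: float, downtime: float) -> float:
    """Assumed formula: (user minutes - downtime) / user minutes * 100."""
    return (total_user_minutes - downtime) / total_user_minutes * 100


# Hypothetical tenant and incident timeline (all values invented).
tenant_users = 5_000
quarter_user_minutes = tenant_users * 90 * 24 * 60

noticed = datetime(2017, 5, 2, 9, 0)        # users first see the problem
acknowledged = datetime(2017, 5, 2, 9, 40)  # downtime clock starts
resolved = datetime(2017, 5, 2, 11, 0)      # downtime clock stops

# What the SLA counts if every licensed user is considered affected.
counted = incident_user_minutes(acknowledged, resolved, tenant_users)
# What users lived through, including the delay before acknowledgement.
experienced = incident_user_minutes(noticed, resolved, tenant_users)
# What the SLA counts if only 1,000 users are considered affected.
localized = incident_user_minutes(acknowledged, resolved, 1_000)

for label, downtime in [("counted", counted),
                        ("experienced", experienced),
                        ("localized", localized)]:
    pct = uptime_pct(quarter_user_minutes, downtime)
    print(f"{label:12s} {downtime:>9,.0f} user-minutes  {pct:.4f}%")
```

The same incident yields three different availability figures depending on when the clock starts and how many users are deemed affected, which is why the missing definitions matter.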
As I have argued before, Office 365 is now so big that it is meaningless to report an SLA for the worldwide service. What tenants really care about is the quality and reliability of the service they receive from their local Office 365 datacenter region, whether that is in the U.S., Germany, Japan, or elsewhere. This is the reason why ISVs like Office365Mon and ENow Software create products to allow tenants to measure SLA or the quality of service on an ongoing basis.
It would be good if Microsoft sent tenant administrators a quarterly email to give the overall SLA performance and the performance for the tenant, together with details of the incidents that contributed to the quarterly result. Tenants could then compare Microsoft’s data with their own information about the reliability of Office 365. This would be real transparency about operations and make the SLA more realistic and usable.
The calculation of the Office 365 SLA is completely in Microsoft’s hands. I make no suggestion that the data reported by Microsoft is inaccurate or altered in any way. As a user of the service since its launch in 2011, I consider Office 365 to be very reliable. However, the lack of detail (for example, SLA performance by datacenter region and service) makes it easy to think that the reported SLA data is purely a marketing tool.
In fact, the only true measurement of a service’s ability to deliver great availability is what its end users think. That measurement is unscientific, totally subjective, and prone to exaggeration, but it is the way the world works.
Follow Tony on Twitter @12Knocksinna.