What Happened to Microsoft’s Clouds Last Week?


The Problems

Some people were trying to set up new Azure subscriptions for customers, but the required permissions work that Microsoft does in the background (Azure AD permissions) didn’t appear to be working. I suspected there was a glitch somewhere in Microsoft and told them to open a support call. Later in the day, I started hearing that some Microsoft partners and their customers were having issues with Office 365. When you hear of problems across multiple clouds, you start to look for a commonality.

The Tenant

Few people seem to understand what The Tenant is in a Microsoft cloud deployment. Any business that signs up for a Microsoft cloud service such as Office 365, CRM 365, Azure, and so on gets a tenant; that’s the process where you set up a something.onmicrosoft.com domain name, such as petri.onmicrosoft.com. The tenant is a unique directory that stores your usernames and password hashes. This service is powered by Azure AD. No, you do not have user accounts in Office 365 or Azure, just like you don’t have user accounts in on-premises Exchange. A directory provides authentication and authorization services. On-premises, you use a Windows domain powered by Active Directory Domain Services. In Microsoft’s clouds, you use Azure AD, even if you do not know it.
Microsoft’s cloud services authenticate and authorize against Azure AD. Whenever you sign into Office 365, Office 365 asks Azure AD to sign you in and authorize you. This is your tenant. Office 365 is a subscription that is associated with the tenant. You get single sign-on to Microsoft cloud services by associating other cloud services such as Intune, Azure Information Protection, CRM 365, and Azure with this tenant.
What happens to Exchange, your file servers, and your Windows-authenticated applications if your on-premises domain goes offline? With no Active Directory, there are no authentication/authorization services and everything is effectively dead. So imagine what would happen to your Microsoft cloud services if your tenant, Azure AD, went offline.

The Outage

Later on Tuesday, I started seeing more comments about problems on Twitter. People were having problems with Visual Studio Team Services, Office 365, managing Azure resources, and more. I looked at the Azure Status page. There was a problem with services in South Central US, but there were also global issues in Azure AD, Bot Service, and Azure Resource Manager (ARM).

The problems as they happened in Azure [Image Credit: Microsoft]
OK, South Central US appeared screwed, so if you had resources deployed in that Azure region then you were screwed. But Office 365 was also affected and, some myth-busting here, Office 365 is NOT hosted on Azure. So why were Office 365 and other clouds affected?
The clue was in the non-regional column (above). Azure AD was having a global issue. If Azure AD was in a weakened state, then that would affect all Microsoft cloud services.
Furthermore, in Azure, every “resource manager” (not classic) resource is managed via ARM – the APIs that sit between the admin tools that you work with and the resource managers that do the work in the background of Azure. This would explain the misbehaving VPN gateways, the virtual machines that wouldn’t start, and all the other Azure issues I was reading about on Tuesday.
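To make the reasoning above concrete, here is a minimal sketch of how a fault in one shared service fans out to everything that relies on it. The dependency map is purely illustrative: the edges are assumptions drawn from the description in this article, not an official Microsoft architecture, and the names DEPENDS_ON and impacted_by are made up for this example.

```python
from collections import deque

# Illustrative dependency map only (service -> what it directly relies on).
# These edges are assumptions based on the article's description, not an
# official Microsoft architecture diagram.
DEPENDS_ON = {
    "Office 365": ["Azure AD"],
    "Visual Studio Team Services": ["Azure AD"],
    "Azure Resource Manager": ["Azure AD"],
    "Virtual machines": ["Azure Resource Manager"],
    "VPN gateways": ["Azure Resource Manager"],
}


def impacted_by(failed, depends_on=DEPENDS_ON):
    """Return every service that directly or transitively relies on `failed`."""
    # Invert the map: dependency -> services that rely on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outward from the failed component.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen


if __name__ == "__main__":
    for name in sorted(impacted_by("Azure AD")):
        print(name)
```

In this toy model, a fault in Azure AD reaches every service in the map, while a fault in ARM alone reaches only the Azure resources managed through it. That matches the pattern seen on the status page: regional trouble in one column, global trouble in the non-regional column.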

What Happened?

Mary Jo Foley was all over this story, with her sources reporting that a lightning strike took out a data center in the Azure South Central US region. The cooling system had allegedly failed and equipment was shut down to avoid damage and loss of data.
Later, Microsoft posted a report online to discuss the issue. The report confirmed the lightning strike and that customers hosted in that data center (one of several data centers in the region) were affected.

Microsoft's explanation of the September 4th outage [Image Credit: Microsoft]
But what about the global impact? Microsoft only said:

Non-regional services such as Azure Active Directory, Visual Studio Team Services, and Azure Resource Manager may have also experienced impact.

Huh? That one-liner underplays what really happened. If you understand the importance of Azure AD and ARM, then you’ll understand why the problems in South Central US affected customers in the USA, Europe, and possibly further afield, including customers that don’t have anything deployed in any US regions.
It’s clear that there was a bigger issue than the above analysis explains, and that Microsoft has some serious engineering to do to prevent a repeat of a local issue affecting global services.

Does This End Cloud Computing?

You know the person – the one who has a shrine built around their SAN – this is the person who will greet an outage like the above with joy and proclaim “this is why I will never go to The Cloud”. I need to pick the right words here: that person is an idiot. I wonder how long their infrastructure would be out if they had suffered a lightning strike that killed their cooling system? Hmm – a week, more? Would they still have a job afterward? In May 2017, an outage in a British Airways data center (after which failover is rumored to have failed) cost the company £150 million, or around $193 million. A data center outage for Delta Air Lines in 2016 cost that company $150 million. As anyone who knows anything about IT will tell you, bad stuff inevitably happens, and what counts is how you respond and prevent it from happening again.