Microsoft Azure AD Outage Highlights Upcoming SLA Updates
If you had trouble accessing many of Microsoft’s services yesterday, you were not alone. For several hours, late into the afternoon on the East Coast, Teams, Azure AD, and many other services were inaccessible.
While Microsoft 365 outages are infrequent, they do happen, and each time the company posts a root cause analysis. In this instance, a rotation of security keys sparked the fire that took down the services.
The short version is that Microsoft rotates, on a regular schedule, the keys that Azure AD uses for cryptographic signing operations in support of OpenID and other identity standards. Because of a “complex cross-cloud migration”, one such key was marked ‘retain’, which means that it should not be pulled out of operation.
You can probably see where this is going: that key was not retained. It was pulled from operation, many services could no longer authenticate correctly, and they went down. The outage was caused by a bug in the functionality meant to keep that single key in rotation longer, not by any outside threat.
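To make the failure mode concrete, here is a minimal sketch of what a rotation step that honors a ‘retain’ flag might look like. This is not Microsoft’s implementation; the SigningKey type, its fields, and the keys_to_remove function are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SigningKey:
    """Hypothetical record for a signing key; not an Azure AD type."""
    key_id: str
    age_days: int
    retain: bool = False  # True means "do not pull this key from operation"


def keys_to_remove(keys: List[SigningKey], max_age_days: int) -> List[SigningKey]:
    """Return the keys a rotation run should remove.

    A key is removed only if it is older than max_age_days AND it is
    not marked retain. Dropping the retain check reproduces the kind of
    bug described above: a retained key gets pulled anyway.
    """
    return [k for k in keys if k.age_days > max_age_days and not k.retain]


keys = [
    SigningKey("fresh", age_days=3),
    SigningKey("old", age_days=45),
    SigningKey("migration", age_days=45, retain=True),
]
removed_ids = [k.key_id for k in keys_to_remove(keys, max_age_days=30)]
print(removed_ids)
```

In this sketch the "migration" key is just as old as the "old" key, but it stays in service because of the flag. The outage came down to that one check not being honored.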
The other thing to point out is that a similar incident occurred back in September, after which the company committed to improving the protection envelope around Azure AD services, and more specifically the backend, to prevent issues like this from happening. Those enhancements have not finished rolling out; had they been in place, they could have prevented this outage. Look for the rollout to be complete by mid-2021.
The biggest issue for Microsoft is that it has SLAs to meet. Starting April 1st, the company will raise the public SLA to 99.99%, and this outage would more than likely have tripped the circuit breaker on that agreement. That said, jumping through the hoops to receive credits for downtime can be complex, which is a barrier to holding the company to its SLA.
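For a sense of how little room 99.99% leaves, here is a rough back-of-the-envelope calculation. It assumes a 30-day month and a hypothetical two-hour outage. Both figures are illustrative, and the way Microsoft actually measures availability for SLA purposes is defined in its own terms and may differ.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed in a period while still meeting the SLA."""
    total_minutes = days * MINUTES_PER_DAY
    return total_minutes * (1 - sla_percent / 100)


def availability_percent(outage_minutes: float, days: int = 30) -> float:
    """Availability over the period, given total minutes of outage."""
    total_minutes = days * MINUTES_PER_DAY
    return 100 * (1 - outage_minutes / total_minutes)


# Assumption: a 30-day month and a hypothetical 120-minute outage.
budget = downtime_budget_minutes(99.99)
achieved = availability_percent(120)

print(f"Allowed downtime at 99.99%: {budget:.2f} minutes")
print(f"Availability after a 120-minute outage: {achieved:.3f}%")
```

Under those assumptions, the monthly budget at 99.99% is a little over four minutes, so an outage measured in hours would miss the target by a wide margin.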
Knowing the above, I’ll be curious to see whether Microsoft postpones the SLA update until the rollout of the new backend updates is complete or sticks to its current plans.