Microsoft Azure AD Outage Highlights Upcoming SLA Updates
If you had trouble accessing many of Microsoft’s services yesterday, you were not alone. For several hours, late into the afternoon on the East Coast, Teams, Azure AD, and many other services were inaccessible.
While outages are infrequent, they do happen with Microsoft 365, and each time one occurs, the company posts an analysis of the root cause. In this instance, it was the rotation of security keys that sparked the fire that took down the services.
The short version is that Microsoft, on a regular schedule, rotates the keys Azure AD uses for cryptographic signing operations in support of OpenID and other standards. Because of a “complex cross-cloud migration”, one such key was marked ‘retain’, which means that it should not be pulled out of operation.
You can probably see where this is going: that key was not retained. It was pulled from operation, and as a result many services could no longer authenticate correctly and went down. The outage was caused by a bug in the functionality meant to keep that single security key in rotation longer, not by any outside threat.
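To make the failure mode concrete, here is a minimal sketch of a time-based key cleanup job that is supposed to skip keys flagged ‘retain’. Everything in it is hypothetical and written for illustration only: the `SigningKey` class, the `keys_to_remove` function, and the `honor_retain` switch are assumptions, not Microsoft's actual automation. Setting `honor_retain=False` simulates the kind of bug described above, where the retain state is ignored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SigningKey:
    """Hypothetical record for a signing key (illustrative only)."""
    key_id: str
    created: datetime
    retain: bool = False  # True means: do not remove, even if past max age


def keys_to_remove(keys, now, max_age, honor_retain=True):
    """Return the IDs of keys the cleanup job would pull from operation.

    honor_retain=False simulates a bug where the 'retain' flag is ignored.
    """
    removed = []
    for key in keys:
        expired = (now - key.created) > max_age
        if not expired:
            continue  # still within its normal rotation window
        if honor_retain and key.retain:
            continue  # expired, but explicitly marked to stay in service
        removed.append(key.key_id)
    return removed


# Example data: one old key pinned for a migration, one old key, one fresh key.
NOW = datetime(2021, 3, 15)
KEYS = [
    SigningKey("migration-key", NOW - timedelta(days=90), retain=True),
    SigningKey("old-key", NOW - timedelta(days=90)),
    SigningKey("fresh-key", NOW - timedelta(days=5)),
]
MAX_AGE = timedelta(days=60)

correct = keys_to_remove(KEYS, NOW, MAX_AGE)                      # retain respected
buggy = keys_to_remove(KEYS, NOW, MAX_AGE, honor_retain=False)    # retain ignored
```

In the correct run only the unpinned old key is removed; in the buggy run the pinned key goes too, and anything still relying on it to validate signatures would start failing.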
It is also worth pointing out that a similar incident occurred back in September, after which the company committed to improving the protections around Azure AD services, and more specifically the backend, to prevent issues like this from happening. Those enhancements have not finished rolling out, but if they had, they could have prevented this outage. Look for the rollout to be completed by mid-2021.
The biggest issue for Microsoft, of course, is that it has SLAs it must meet. Starting April 1st, the company will raise the public SLA to 99.99%, and more than likely this outage would have breached that commitment. That said, jumping through the hoops to receive credits for downtime can be complex, which is a barrier to holding the company to its SLA.
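To put the 99.99% figure in perspective, a rough back-of-the-envelope calculation helps. The sketch below assumes a 30-day month and treats availability as simple wall-clock uptime; the actual SLA terms may measure availability differently, so take this as an approximation. The function names are made up for illustration, and the three-hour outage is an assumed stand-in for "several hours".

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted in a period at the given SLA percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


def achieved_availability(outage_minutes: float, days: int = 30) -> float:
    """Availability percentage for a period containing the given downtime."""
    total_minutes = days * 24 * 60
    return 100.0 * (total_minutes - outage_minutes) / total_minutes


def breaches_sla(outage_minutes: float, sla_percent: float, days: int = 30) -> bool:
    """True if the downtime pushes availability below the SLA target."""
    return achieved_availability(outage_minutes, days) < sla_percent


# Budget at 99.99% over a 30-day month (roughly 4.3 minutes).
budget = allowed_downtime_minutes(99.99)

# Assumed example: a three-hour outage in the same month.
outage = 3 * 60
availability = achieved_availability(outage)
breached = breaches_sla(outage, 99.99)
```

Under these assumptions, a 99.99% target leaves only a few minutes of downtime per month, so an outage lasting hours would exceed the budget many times over.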
Knowing the above, I’ll be curious to see whether Microsoft postpones the SLA update until the rollout of the new backend updates is complete or sticks to its current plans.