Last Update: Sep 04, 2024 | Published: Mar 06, 2017
Amazon’s cloud service, AWS, recently had a major outage that affected almost an entire region. With that outage fresh in our memories, I thought I’d write a post on how to prevent service outages in Azure.
Note that the focus of this article is Infrastructure-as-a-Service (IaaS).
Amazon Web Services had a major outage on February 28th that affected thousands of its customers, including some very well-known names, and this likely affected millions of users. I was busy teaching a course on Azure at the time and did not notice the outage myself, but I did read that a large section of the Internet was affected. An after-action report by Amazon stated that the entire outage was the result of a typo by an administrator, who was supposed to remove a few machines (by command line, with obviously no process to prevent such mistakes) but accidentally removed a lot more.
It would be easy for a pro-Microsoft person to crow about Amazon’s failure, but Microsoft has not been immune to operational error, either — in 2014, a procedural update error brought down Azure’s storage system.
One of the four cloud myths that I debunk in my aforementioned training course is that you get disaster recovery by default from your cloud vendor (such as Microsoft and Amazon). Everything in the cloud is a utility, and every utility has a price. If you want it, you need to pay for it and deploy it, and this includes a scenario in which a data center burns down and you need to recover. If you didn’t design in and deploy a disaster recovery solution, you’re as cooked as the servers in the smoky data center.
In this post, I’ll explain a few things that you can do, from the basic to the advanced, to design resilience into your in-Azure services.
Most outages in Azure are localized and planned, such as updates being deployed to the Hyper-V hosts by Microsoft. Luckily, the hosts are running a form of Nano Server (fewer updates and faster reboots) and Microsoft got warm reboots (no POST) working on its hosts (a feature that never made it past Technical Preview 1 in Windows Server 2016). Outages to virtual machines are generally very short, but we can still have service outages. There are two ways that we can get guaranteed uptime (i.e., the service level agreement or SLA) from Microsoft: place two or more virtual machines into an availability set (a 99.95 percent SLA), or run a single-instance virtual machine with all of its disks on Premium Storage (a 99.9 percent SLA).
The primary purpose of an availability set is to get the SLA from Microsoft; however, hosting veterans will tell you that this is a financial bet by the hoster. The big outage can and will happen, and the finances cover the hoster for those outages, but will your business survive?
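To make that concrete, here is a minimal sketch of creating such an availability set with the Azure SDK for Python; the subscription ID, resource group, region, and availability set name are placeholder assumptions, and a recent azure-identity/azure-mgmt-compute install is assumed.

```python
# Minimal sketch: create an availability set for the VMs that need the SLA.
# All names below are placeholders; assumes azure-identity and azure-mgmt-compute.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.compute.models import AvailabilitySet, Sku

subscription_id = "<subscription-id>"  # placeholder
compute = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

# Fault domains spread the VMs across racks (separate power and networking);
# update domains stagger planned host maintenance so only one group reboots at a time.
avset = compute.availability_sets.create_or_update(
    "my-rg",        # placeholder resource group
    "web-avset",    # placeholder availability set name
    AvailabilitySet(
        location="westeurope",
        platform_fault_domain_count=2,
        platform_update_domain_count=5,
        sku=Sku(name="Aligned"),  # "Aligned" is required when the VMs use managed disks
    ),
)
print(avset.id)

# Each VM is then created with its availability_set property pointing at avset.id,
# so that two or more of them share the set and qualify for the SLA.
```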
We can back up running Windows or Linux virtual machines using Azure Backup, and we can even do item-level restores from those backups. If a Microsoft engineer or one of your own administrators were to accidentally delete some machines, then Azure Backup can be used to restore them. If a region were to burn down, the default configuration of an Azure Recovery Services vault (the storage used by Azure Backup) replicates the backups to a neighboring region, and you can do your restores there.
But this brings up a point that I constantly have to explain to those who are new to disaster recovery: backup is used when I need to do a small restore, and disaster recovery is used when I need to get back an entire service or business. A restore from Azure Backup will take time, and that might be a lot of time if you’re talking about a large virtual machine or lots of virtual machines. I do not view Azure Backup (or any backup solution) as being suitable for a scenario such as the AWS one, but I would still want my backups at my secondary location so that I can do those operational restores after a failover.
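As an illustration of how little is involved in getting that protection in place, here is a minimal sketch that drives the Azure CLI from Python to create a Recovery Services vault and protect a VM; the resource group, vault, and VM names are placeholders, and it assumes az login has already been run.

```python
# Minimal sketch: enable Azure Backup for an existing VM via the Azure CLI.
# All names are placeholders; assumes `az login` has been run with sufficient rights.
import subprocess

def az(*args: str) -> None:
    """Run an Azure CLI command and fail loudly if it errors."""
    subprocess.run(["az", *args], check=True)

RG, VAULT, VM = "my-rg", "my-rsv", "my-vm"  # placeholder resource names

# Create a Recovery Services vault; by default its backup storage is geo-redundant,
# so the backups are also kept in the paired (neighboring) region.
az("backup", "vault", "create",
   "--resource-group", RG, "--name", VAULT, "--location", "westeurope")

# Protect the VM with the built-in default policy (daily backups).
az("backup", "protection", "enable-for-vm",
   "--resource-group", RG, "--vault-name", VAULT,
   "--vm", VM, "--policy-name", "DefaultPolicy")
```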
As I said earlier, if you want to have a disaster recovery solution, then you need to design it. Right now, there is no solution for replicating virtual machines in Azure from one region to another, but one is on the way; Microsoft demonstrated Azure-to-Azure recovery using Azure Site Recovery (ASR) at Microsoft Ignite 2016. In the meantime, if you want a DR solution, you’ll need to:
- Deploy a second copy of your virtual machines, networks, and storage in another Azure region.
- Replicate your data at the application layer (for example, SQL Server Always On availability groups or Active Directory replication).
- Put Azure Traffic Manager in front of both deployments to direct users to the active location.
In the event of a disaster, you can fail over from the primary location to the secondary one, either automatically or manually, by reconfiguring the Traffic Manager rules (I’d recommend scripting this via Azure Automation from a third region); a sketch of that step follows below. The downside of this type of design is that you have to double the number of virtual machines deployed and keep them running.
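Here is a minimal sketch of that failover step, driving the Azure CLI from Python rather than from an Azure Automation runbook (which would typically be PowerShell); the resource group, profile, and endpoint names are placeholder assumptions.

```python
# Minimal failover sketch: flip Traffic Manager endpoints so DNS sends users to
# the DR region. Names are placeholders; assumes `az login` has been run with
# rights to the Traffic Manager profile.
import subprocess

def az(*args: str) -> None:
    subprocess.run(["az", *args], check=True)

RG, PROFILE = "my-rg", "my-tm-profile"  # placeholder names

def set_endpoint(name: str, status: str) -> None:
    """Enable or disable an Azure endpoint in the Traffic Manager profile."""
    az("network", "traffic-manager", "endpoint", "update",
       "--resource-group", RG, "--profile-name", PROFILE,
       "--name", name, "--type", "azureEndpoints",
       "--endpoint-status", status)

# Stop sending traffic to the failed primary region and light up the secondary.
set_endpoint("primary-westeurope", "Disabled")
set_endpoint("secondary-northeurope", "Enabled")
```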
However, we do know that a better solution is coming from Microsoft; Azure-to-Azure Site Recovery will replicate most virtual machines (never mix application replication and virtual machine replication) from your production deployment to most other Azure regions (subject to service availability). In the event of an AWS-style outage, you would execute a recovery plan to fail over your virtual machines to another location, and your services would be up and running in a matter of minutes.
Note that some people are considering the use of GRS storage accounts as a DR solution; there are two problems with this approach:
- You cannot initiate a failover of a GRS storage account yourself; Microsoft decides if and when the secondary copy is brought online, and only after a prolonged regional outage.
- The replication is asynchronous and done blob by blob, so there is no consistency guarantee across the disks of a virtual machine, and the virtual machines themselves would still have to be rebuilt in the secondary region.
Those issues with GRS storage, combined with what’s coming from Azure Site Recovery, are why I recommend using the cheaper LRS option, which is all you get with Managed Disks anyway.
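For completeness, this is what that choice looks like when creating a storage account; a minimal sketch with placeholder names that explicitly selects the locally redundant Standard_LRS SKU.

```python
# Minimal sketch: create a storage account that uses locally redundant (LRS)
# storage rather than GRS. Names are placeholders; assumes `az login` was run.
import subprocess

subprocess.run([
    "az", "storage", "account", "create",
    "--resource-group", "my-rg",
    "--name", "mylrsstore01",        # must be globally unique and lowercase
    "--location", "westeurope",
    "--sku", "Standard_LRS",         # LRS: three copies kept within one region
    "--kind", "StorageV2",
], check=True)
```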
To be honest, the best possible solution is to use multiple cloud services. Imagine if the storage outage from 2014 were to repeat itself. Even if I were armed with a DR solution in Azure, the outage was global, and I would not be able to fail over to another Azure region.
Ideally, I should use multiple cloud vendors. For example, my production system would run in Azure and my failover site could be in AWS. But this isn’t a VM replication solution; it’s a data replication solution, so I would have to run networks, storage, and virtual machines in the secondary cloud. If Azure were to have a global outage, I could fail over from my production system in Azure to a secondary system in AWS.
The drawbacks of this design are huge:
- You have to pay to keep a second set of networks, storage, and virtual machines running in the other cloud.
- Your data replication has to be handled at the application layer and tested regularly.
- Your team has to build and maintain skills, tooling, and processes for two very different platforms.
Cloud-doubters often like to pick on these outages and say “this is why I don’t do cloud.” However, these things happen with on-premises deployments, too. Host clusters crash. SANs fail completely. Networks decide to go belly up. And the business suffers with each of these outages. The difference is that when an outage happens in the cloud, the very best of IT is working on a solution. As I like to remind my customers, if there’s a Hyper-V outage, I feel more confident in Microsoft fixing it in Azure than I do in a typical admin fixing it in their computer room or data center. When outages do happen within the big cloud platforms, they are typically short, and because of the financial penalties on the vendor, new processes are created to prevent a repeat. How many of us can truly state the same of our internal IT practices, product knowledge, and skills?