Planned Maintenance For Azure Virtual Machines

Microsoft is adding a new feature that lets you control when your virtual machines incur the forced outages that happen when patches are delivered to Azure’s compute hosts.
 

 

Minimize Service Downtime

Ideally, when you deploy a service using Azure virtual machines, those virtual machines should be a part of a valid availability set. Here’s a quick reminder on availability sets:

  • There is no live migration in Azure. Imagine tens of thousands of machines live migrating on a single cluster and what that would do to the infrastructure and the applications in the virtual machines! Instead, when the host reboots, the virtual machines have downtime.
  • An availability set tags virtual machines so that Azure knows to put them into different update domains. When Microsoft deploys updates to Azure, it does so in an ordered fashion, one update domain at a time. This means that only a small number of hosts are ever offline because of patching and rebooting. If you have configured anti-affinity by using availability sets, then only one (or a few) of your virtual machines will ever be down at one time.
  • The key part is: this must be a valid availability set to achieve the 99.95 percent SLA for the service on those machines. Putting one domain controller and one file server into an availability set achieves nothing for uptime, and the SLA won’t apply. But putting two load-balanced web servers into an availability set qualifies the web service for the SLA and minimizes downtime.
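
The update-domain behavior described above can be illustrated with a small simulation. This is only a sketch: the round-robin placement, the default of five update domains, and the names assign_update_domains, patch_rounds, and max_offline are assumptions made for illustration, not Azure's actual placement logic.

```python
from collections import defaultdict


def assign_update_domains(vm_names, domain_count=5):
    """Spread VMs round-robin across update domains.

    Round-robin and domain_count=5 are illustrative assumptions,
    not a description of how Azure really places machines.
    """
    placement = defaultdict(list)
    for index, name in enumerate(vm_names):
        placement[index % domain_count].append(name)
    return dict(placement)


def patch_rounds(placement):
    """Yield (domain, offline_vms), patching one update domain at a time."""
    for domain in sorted(placement):
        yield domain, placement[domain]


def max_offline(placement):
    """Largest number of VMs that are offline in any single patching round."""
    return max(len(vms) for _, vms in patch_rounds(placement))


# Two load-balanced web servers in one availability set.
web_tier = ["web-1", "web-2"]
placement = assign_update_domains(web_tier)

for domain, offline in patch_rounds(placement):
    online = [vm for vm in web_tier if vm not in offline]
    print(f"update domain {domain}: offline={offline} online={online}")
```

Under these assumptions, each patching round takes down at most one of the two web servers, which is why the load-balanced pair keeps serving traffic while hosts are patched.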

An Illustration of Update Domains and Fault Domains in Azure [Image Credit: Microsoft]
 
What about those services that don’t make up a part of a valid availability set? How much downtime do they incur? Once upon a time in Azure, they were down for as long as it took a physical server to reboot. Server admins know how long that can take. I’ve seen faster snail races. However, two years ago Microsoft introduced something called In-Place Migration, which was briefly known as Warm Reboot in Technical Preview 1 of Windows Server 2016. Almost every time Microsoft patches a host, it will:

  1. Pause the virtual machine.
  2. Reboot the host management operating system, without rebooting the hardware.
  3. Re-start the virtual machine.

The entire process takes between 15 and 30 seconds. Most of us never notice that brief amount of downtime. However:

  • Sometimes the guest OS will require a reboot.
  • For some customers, any downtime during production hours is unacceptable.


 

Planned Maintenance

To be honest, there’s not much to this feature from our point of view, but it will be very valuable to customers that can only tolerate downtime during maintenance windows of their own choosing. The new Planned Maintenance feature for Azure virtual machines allows that to happen:

  1. The customer will receive an alert saying that maintenance is scheduled for one or more virtual machines.
  2. The customer will schedule their own maintenance window.
  3. When that maintenance window starts, the customer can use Service Health in the Azure Portal to redeploy the virtual machines to hosts that Microsoft has already patched and rebooted.

Relocating Azure Virtual Machines Using Service Health – Planned Maintenance [Image Credit: Microsoft]
 
The redeploy action is one of the maintenance tasks in Azure. It allows you to reboot a virtual machine on another host. Planned Maintenance ensures that the destination host doesn’t have any planned outages in the near future.
To make Planned Maintenance work, Microsoft had to make some other improvements:

  • Alerting: You can create log-based alerts in Azure Monitor to inform you of upcoming maintenance in Azure.
  • Visibility: Azure Service Health in the Azure Portal will graphically inform you of upcoming maintenance to each of your affected virtual machines. Another feature called Scheduled Events reveals these tasks to applications in the guest OS via a REST API. You can also query for maintenance tasks using PowerShell and Azure CLI.
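
As a rough illustration of how an application in the guest OS might consume Scheduled Events, the sketch below polls the instance metadata endpoint and filters for events that affect a given virtual machine. Treat the details as assumptions to verify against the current documentation: the api-version value, the field names in the sample payload, and the helper names fetch_scheduled_events and disruptive_events are illustrative, and the endpoint is only reachable from inside an Azure virtual machine.

```python
import json
import urllib.request

# Scheduled Events endpoint on the instance metadata service.
# The api-version shown here is an assumption; check the current docs.
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
)


def fetch_scheduled_events(url=SCHEDULED_EVENTS_URL, timeout=5):
    """Fetch the Scheduled Events document (works only inside an Azure VM)."""
    request = urllib.request.Request(url, headers={"Metadata": "true"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def disruptive_events(document, vm_name,
                      event_types=("Reboot", "Redeploy", "Freeze")):
    """Return the events of the given types that list vm_name as a resource."""
    return [
        event
        for event in document.get("Events", [])
        if event.get("EventType") in event_types
        and vm_name in event.get("Resources", [])
    ]


# Illustrative payload with made-up values, shaped like a Scheduled Events
# response as assumed above.
sample_document = {
    "DocumentIncarnation": 1,
    "Events": [
        {
            "EventId": "evt-001",
            "EventType": "Redeploy",
            "ResourceType": "VirtualMachine",
            "Resources": ["web-1"],
            "EventStatus": "Scheduled",
            "NotBefore": "Mon, 19 Sep 2016 18:29:47 GMT",
        },
        {
            "EventId": "evt-002",
            "EventType": "Freeze",
            "ResourceType": "VirtualMachine",
            "Resources": ["web-2"],
            "EventStatus": "Scheduled",
            "NotBefore": "Mon, 19 Sep 2016 18:45:00 GMT",
        },
    ],
}

for event in disruptive_events(sample_document, "web-1"):
    print(event["EventId"], event["EventType"], event["NotBefore"])
```

Inside a real virtual machine you would pass the result of fetch_scheduled_events() to disruptive_events() instead of the sample payload, for example to drain connections from a web server before the event's NotBefore time.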


 
This is a simple feature, but it should save a few scalps in the operations departments of many Azure customers.