A lot of time has passed since Microsoft’s Wolfpack technology, the very crude failover clustering that was introduced in Windows NT Server 4.0, Enterprise Edition in 1997. These days, we primarily use Windows Server Failover Clustering to create Hyper-V clusters that host highly available virtual machines instead of services. That means we have many more highly available roles running on much bigger clusters, so the way we work with the nodes in a cluster had to change.
Up until Windows Server 2008, clustering was a rough and specialized field. Since then the corners have been rounded, administration has become easier, and Failover Clustering is a necessary skill for anyone working with Microsoft virtualization.
Part of that improvement was the addition of pause and drain functionality in Windows Server 2012 (WS2012), which Microsoft then built upon in Windows Server 2012 R2 (WS2012 R2).
One of the major features of WS2012 Hyper-V was the ability to perform concurrent live migrations. In combination with this new feature, Failover Clustering added the ability to queue up live migrations. For example, imagine that a host has 100 virtual machines running on it. Your hosts allow up to five simultaneous live migrations.
You can select all of the virtual machines at once in Failover Cluster Manager, right-click, and select Move > Live Migration > Best Possible Node and then the magic happens: Five virtual machines will start to live migrate to other nodes in the cluster, and the remaining virtual machines will wait for a live migration slot to open up. Assuming that there isn’t a fabric problem, every running virtual machine will be drained from the host.
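The queuing behavior described above can be sketched as a small simulation. This is only an illustrative model, not how the cluster service is implemented: the function name `simulate_drain`, the fixed 10-second migration time, and the VM names are all assumptions made for the example. It shows the one idea that matters, which is that a new live migration starts only when one of the five slots frees up.

```python
import heapq
from collections import deque


def simulate_drain(durations, max_concurrent=5):
    """Model queued live migrations off one host.

    durations: dict mapping VM name -> migration time in seconds (hypothetical).
    Returns a dict mapping VM name -> (start_time, end_time).
    """
    queue = deque(durations.items())   # VMs waiting for a live migration slot
    running = []                       # min-heap of (end_time, vm) in flight
    schedule = {}
    now = 0
    while queue or running:
        # Fill every free slot with the next queued VM.
        while queue and len(running) < max_concurrent:
            vm, seconds = queue.popleft()
            schedule[vm] = (now, now + seconds)
            heapq.heappush(running, (now + seconds, vm))
        # Jump to the moment the next migration finishes, freeing a slot.
        now, _ = heapq.heappop(running)
    return schedule


# 100 VMs on the host, each assumed to take 10 seconds to migrate.
durations = {f"VM{i:03d}": 10 for i in range(1, 101)}
schedule = simulate_drain(durations, max_concurrent=5)
print(schedule["VM001"])   # (0, 10): one of the first five to start
print(schedule["VM006"])   # (10, 20): waited for a slot to open
print(max(end for _, end in schedule.values()))   # 200
```

With equal migration times the VMs move in neat groups of five. In a real cluster the times vary, so slots free up one at a time and the queue empties unevenly.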
The pause action of a Failover Cluster builds upon queued Live Migration. You can select a node in a cluster and pause it by opening Failover Cluster Manager, browsing into Nodes, right-clicking the node, and selecting Pause. There are two ways to pause a node: Drain Roles, which moves the node’s roles to other nodes in the cluster before the node is paused, and Do Not Drain Roles, which pauses the node and leaves its roles running where they are.
The Pause action is a Hyper-V administrator’s friend. You can plan some kind of maintenance, such as patching or hardware repairs, and move virtual machines with no perceivable downtime to other hosts in the cluster. The host moves into a paused state and temporarily stops hosting clustered roles, so you are free to work on that node without impacting services.
Pausing a host drains it of roles and allows non-disruptive planned maintenance. (Image: Aidan Finn)
Note that System Center Virtual Machine Manager will issue a warning because this action is “out of band” in the view of System Center.
Those of you who tried a pause action on WS2012 might have noticed something odd. We have the ability to crudely order the failover of virtual machines using a high/medium/low priority flag for each virtual machine.
By default, virtual machines with a low priority on WS2012 clusters were moved using Quick Migration, and there was considerable perceivable downtime. Microsoft’s thinking was that low priority meant that downtime was OK. A Quick Migration has less impact on a system than a Live Migration without RDMA. However, most customers used the low priority as an ordering system for production systems, so downtime was not OK.
You can alter the MoveTypeThreshold on WS2012 Hyper-V clusters to force low priority virtual machines to live migrate when you pause a node. Note that, by default, low priority virtual machines will live migrate on WS2012 R2 Hyper-V (some parts of Microsoft do listen to feedback).
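To make the threshold idea concrete, here is a minimal sketch of the decision in Python. It is a model, not the cluster’s own code. The numeric priority values (3000, 2000, 1000), the rule that a virtual machine live migrates when its priority is at or above the threshold, and the default thresholds (2000 on WS2012, 1000 on WS2012 R2) are assumptions for illustration; check them against your own cluster before relying on them.

```python
# Assumed numeric values behind the high/medium/low priority flag.
PRIORITY = {"high": 3000, "medium": 2000, "low": 1000}

# Assumed default MoveTypeThreshold values per version (illustrative only).
ASSUMED_DEFAULT_THRESHOLD = {"WS2012": 2000, "WS2012 R2": 1000}


def move_type(priority, move_type_threshold):
    """Return how a VM would be moved when its node is paused and drained.

    Assumed rule: priority at or above the threshold -> Live Migration,
    priority below the threshold -> Quick Migration.
    """
    if PRIORITY[priority] >= move_type_threshold:
        return "Live Migration"
    return "Quick Migration"


for version, threshold in ASSUMED_DEFAULT_THRESHOLD.items():
    for priority in ("high", "medium", "low"):
        print(version, priority, "->", move_type(priority, threshold))
```

Under these assumptions, lowering the threshold on a WS2012 cluster to the low priority value is what makes low priority virtual machines live migrate, which matches the WS2012 R2 default behavior described above.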
You can resume a node to exit the paused state and return to normal cluster operations. Doing this gives you the choice to bring back the previously hosted virtual machines or leave them where they are.
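The pause and resume choices can be pictured with a toy model. The `Cluster` class below is hypothetical and greatly simplified: it spreads drained virtual machines round-robin across the remaining active nodes, whereas a real cluster picks the best possible node for each one. The `drain` and `fail_back` flags are assumptions standing in for the choices you make when you pause and resume a node.

```python
from itertools import cycle


class Cluster:
    """Toy model of pausing, draining, and resuming cluster nodes."""

    def __init__(self, nodes, placement):
        self.nodes = list(nodes)
        self.placement = dict(placement)   # VM name -> node name
        self.paused = set()
        self.drained = {}                  # node -> VMs moved off by its drain

    def pause(self, node, drain=True):
        if drain:
            targets = [n for n in self.nodes
                       if n != node and n not in self.paused]
            if not targets:
                raise RuntimeError("no active node left to receive roles")
            moved = [vm for vm, host in self.placement.items() if host == node]
            # Round-robin placement stands in for "Best Possible Node".
            for vm, target in zip(moved, cycle(targets)):
                self.placement[vm] = target
            self.drained[node] = moved
        self.paused.add(node)

    def resume(self, node, fail_back=False):
        self.paused.discard(node)
        moved = self.drained.pop(node, [])
        if fail_back:
            # Bring back the VMs this node hosted before it was drained.
            for vm in moved:
                self.placement[vm] = node


cluster = Cluster(
    ["Host1", "Host2", "Host3"],
    {"VM1": "Host1", "VM2": "Host1", "VM3": "Host2"},
)
cluster.pause("Host1")                    # VM1 and VM2 leave Host1
print(cluster.placement)
cluster.resume("Host1", fail_back=True)   # VM1 and VM2 return to Host1
print(cluster.placement)
```

Resuming with `fail_back=False` would leave VM1 and VM2 on the hosts they were drained to, which is the "leave them where they are" option.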
Shutting down a host was a gotcha for pre-WS2012 R2 clustered hosts. Many Hyper-V administrators made the mistake of shutting down a host and assuming that virtual machines would move without perceivable downtime to another host in the cluster. That was not the case at all; the host would shut down, taking the virtual machines down with it, and only then would the virtual machines be failed over to another node.
Microsoft prevented a lot of helpdesk calls when they changed this behavior in WS2012 R2. Now when you shut down a host, any virtual machines on that host will be automatically moved to other hosts in the cluster using the drain action.