Configuring Windows Server 2016 Hyper-V Compute Resiliency

hyper v abstract
In this post, I will explain how you can customize the functionality of Compute Resiliency, which is a feature that increases tolerance of transient errors in Hyper-V failover clusters.
 

 

Reminder of Compute Resiliency

Microsoft analyzed support calls and feedback for Hyper-V and found that a big pain point was how a Hyper-V cluster responded to very brief problems. For example, if a data center operator accidentally pulled the wrong network cable or a top-of-rack switch port became unstable, the cluster would probably react by assuming that hosts had gone offline and restarted the virtual machines on other hosts. The response to the problem, booting up virtual machines and restarting services, takes longer than the problem would take to resolve itself.
As a result, Microsoft created a number of features to become more flexible and tolerant in response to these short-term issues. One of these features or improvements is Compute Resiliency. Thanks to this improvement, Failover Clustering will be less aggressive with moving virtual machines from a host that is having heartbeat issues. The cluster will wait longer before failing over the virtual machines. In the event of a host having repeat issues (3 in one hour), the host will be quarantined for two hours. This results in virtual machines being live migrated to other healthy nodes.
Compute Resiliency has a collection of default configurations that Microsoft tuned to suit most customers but you might wish to modify this behavior. The settings for Compute Resiliency can be configured on a cluster or per-cluster group (virtual machine) basis.

Cluster Settings

The cluster has a number of settings that you can modify. Cluster settings are global settings that affect all cluster groups (virtual machines) unless overridden:

  • ResilienceLevel: You enable or disable Compute Resiliency, which is on by default (setting of 2 or AlwaysIsolate). You can disable Compute Resiliency (back to pre-WS2016 behavior) with a value of 1 or IsolateOnSpecialHeartbeat. With this setting, failover will always happen unless the node pre-communicates that a maintenance operation is taking place. In this case, the host should go into an isolated state without failover.
  • ResiliencyDefaultPeriod: This is the default amount of time that the cluster will allow a node to remain isolated. The default value of this setting is 240 seconds or 4 minutes.
  • QuarantineThreshold: This is the number of times that a node can become isolated in an hour before the cluster will be quarantined. This is set to 3 by default.
  • QuarantineDuration: This setting, set to 7200 seconds or 2 hours by default, controls how long a host will remain quarantined.

You can configure these cluster settings using the normal PowerShell method of configuring a cluster. In the below example, I am changing ResilienceLevel to 1, disabling Compute Resiliency:

(Get-Cluster).ResilienceLevel = 1

I can query the setting as follows:

(Get-Cluster).ResilienceLevel

Why might I disable Compute Resiliency? If I am running a black-box solution, such as a cluster-in-a-box (CiB), where the cluster can communicate over copper connections inside of the enclosure, then there are no external dependencies that have transient errors, such as a switch. In a CiB, if a node fails to heartbeat, then there really is an issue. Compute Resiliency would delay a real failover.

With normal Hyper-V clusters, most heartbeat failures are probably transient networking issues. Therefore, we should leave Compute Resiliency enabled. There are other options, such as extending the length of quarantine. This would allow engineers to come into the office during regular office hours to handle an overnight issue.

Configuring Virtual Machines

In the view of Failover Clustering, a virtual machine is a set of resources called a cluster group. The cluster group is named after the virtual machine.
We have one Compute Resiliency setting available to us on a per-virtual machine basis. This setting is called ResiliencyPeriod, which is set to -1 by default. This means that the virtual machine will inherit the ResiliencyDefaultPeriod value from the cluster. If you want some virtual machines to failover more aggressively, then you can reduce ResiliencyPeriod in the machine’s cluster group settings. For example, the below will configure VM01 to failover after a heartbeat timeout of 60 seconds:

(Get-ClusterGroup “VM01”).ResiliencyPeriod=60

Recommendation

Do not change these settings simply because you can. If your clustered virtual machines are behaving normally and your uptime is good, then leave things alone. If you do need to make changes, do so gradually or 1 at a time. Test thoroughly and document what you have modified. Therefore, you can reverse the changes if necessary.