Configuring Windows Server 2016 Hyper-V Compute Resiliency

In this post, I will explain how you can customize the functionality of Compute Resiliency, which is a feature that increases tolerance of transient errors in Hyper-V failover clusters.

 

 

Reminder of Compute Resiliency

Microsoft analyzed support calls and feedback for Hyper-V and found that a big pain point was how a Hyper-V cluster responded to very brief problems. For example, if a data center operator accidentally pulled the wrong network cable or a top-of-rack switch port became unstable, the cluster would probably react by assuming that hosts had gone offline and restarting the virtual machines on other hosts. That response, booting up virtual machines and restarting services, takes longer than the problem would have taken to resolve itself.

As a result, Microsoft created a number of features to make clusters more flexible and tolerant in response to these short-term issues. One of these improvements is Compute Resiliency. Thanks to this improvement, Failover Clustering is less aggressive about moving virtual machines from a host that is having heartbeat issues. The cluster waits longer before failing over the virtual machines. If a host has repeated issues (three in one hour), the host is quarantined for two hours. This results in virtual machines being live migrated to other healthy nodes.
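To see which nodes the cluster currently regards as isolated or quarantined, something like the following sketch can help. It assumes the FailoverClusters PowerShell module on Windows Server 2016 or later, and that the node object exposes a StatusInformation property reporting Normal, Isolated, or Quarantined. "Host02" is a hypothetical node name.

```powershell
# Run on a cluster node (or a machine with the Failover Clustering tools installed).
Import-Module FailoverClusters

# List each node with its membership state and its resiliency status.
# StatusInformation is assumed to report Normal, Isolated, or Quarantined.
Get-ClusterNode | Format-Table Name, State, StatusInformation

# After fixing the underlying fault, bring a quarantined node back
# without waiting for the quarantine period to expire.
# "Host02" is a hypothetical node name.
Start-ClusterNode -Name "Host02" -ClearQuarantine
```

Clearing quarantine early only makes sense once the cause of the repeated isolations has been found and fixed. Otherwise the node is likely to be quarantined again.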

Compute Resiliency has a collection of default configurations that Microsoft tuned to suit most customers, but you might wish to modify this behavior. The settings for Compute Resiliency can be configured on a per-cluster or per-cluster-group (virtual machine) basis.

Cluster Settings

The cluster has a number of settings that you can modify. Cluster settings are global settings that affect all cluster groups (virtual machines) unless overridden:

  • ResiliencyLevel: This enables or disables Compute Resiliency, which is on by default (a setting of 2 or AlwaysIsolate). You can disable Compute Resiliency (reverting to pre-WS2016 behavior) with a value of 1 or IsolateOnSpecialHeartbeat. With this setting, failover always happens unless the node communicates in advance that a maintenance operation is taking place. In that case, the host goes into an isolated state without failover.
  • ResiliencyDefaultPeriod: This is the default amount of time that the cluster will allow a node to remain isolated. The default value of this setting is 240 seconds or 4 minutes.
  • QuarantineThreshold: This is the number of times that a node can become isolated in an hour before the node is quarantined. This is set to 3 by default.
  • QuarantineDuration: This setting, set to 7200 seconds or 2 hours by default, controls how long a host will remain quarantined.
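Before changing anything, it is worth checking what the cluster is currently using. The following is a minimal sketch, assuming it is run on a cluster node with the FailoverClusters module available. Note that, to my knowledge, the first property is spelled ResiliencyLevel on the cluster object.

```powershell
# Show the four cluster-wide Compute Resiliency settings in one view.
# Out-of-the-box values should match the defaults listed above:
# AlwaysIsolate (2), 240, 3, and 7200.
Get-Cluster |
    Format-List Name, ResiliencyLevel, ResiliencyDefaultPeriod, QuarantineThreshold, QuarantineDuration
```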

You can configure these cluster settings using the normal PowerShell method of configuring a cluster. In the example below, I am changing ResiliencyLevel to 1, disabling Compute Resiliency:

(Get-Cluster).ResiliencyLevel = 1

I can query the setting as follows:

(Get-Cluster).ResiliencyLevel

Why might I disable Compute Resiliency? If I am running a black-box solution, such as a cluster-in-a-box (CiB), where the cluster can communicate over copper connections inside of the enclosure, then there are no external dependencies that have transient errors, such as a switch. In a CiB, if a node fails to heartbeat, then there really is an issue. Compute Resiliency would delay a real failover.

With normal Hyper-V clusters, most heartbeat failures are probably transient networking issues. Therefore, we should leave Compute Resiliency enabled. There are other options, such as extending the length of quarantine. This would allow engineers to come into the office during regular office hours to handle an overnight issue.
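As a sketch of the quarantine extension mentioned above, the duration can be raised in the same way as any other cluster property. The 12-hour figure is purely illustrative and is not a recommendation from Microsoft. Pick a value that suits your own support hours.

```powershell
# Keep a reference to the cluster object so it is only looked up once.
$cluster = Get-Cluster

# Extend quarantine from the default 7200 seconds (2 hours)
# to 43200 seconds (12 hours). The value is an illustrative assumption.
$cluster.QuarantineDuration = 43200

# Confirm the new value.
$cluster.QuarantineDuration
```

A longer quarantine keeps a flapping host out of service overnight, at the cost of running with reduced cluster capacity until someone clears the quarantine or it expires.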

Configuring Virtual Machines

From the point of view of Failover Clustering, a virtual machine is a set of resources called a cluster group. The cluster group is named after the virtual machine.

We have one Compute Resiliency setting available to us on a per-virtual-machine basis. This setting is called ResiliencyPeriod, which is set to -1 by default. This means that the virtual machine inherits the ResiliencyDefaultPeriod value from the cluster. If you want some virtual machines to fail over more aggressively, then you can reduce ResiliencyPeriod in the machine's cluster group settings. For example, the line below configures VM01 to fail over after a heartbeat timeout of 60 seconds:

(Get-ClusterGroup "VM01").ResiliencyPeriod = 60
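To review which virtual machines have an override, you can list ResiliencyPeriod across all virtual machine cluster groups. This is a sketch that assumes the groups report a GroupType of VirtualMachine. The last line assumes that -1, the inherit value described above, can be written back to the property to remove an override.

```powershell
# List every virtual machine cluster group with its current owner
# and its per-VM resiliency period (-1 means "inherit from the cluster").
Get-ClusterGroup |
    Where-Object { $_.GroupType -eq 'VirtualMachine' } |
    Format-Table Name, OwnerNode, ResiliencyPeriod

# Return VM01 to the cluster-wide default.
# Assumption: -1 is accepted as a writable value for this property.
(Get-ClusterGroup "VM01").ResiliencyPeriod = -1
```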

Recommendation

Do not change these settings simply because you can. If your clustered virtual machines are behaving normally and your uptime is good, then leave things alone. If you do need to make changes, make them gradually, one at a time. Test thoroughly and document what you have modified. That way, you can reverse the changes if necessary.
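One simple way to document the starting point is to export the current values before you touch them. The sketch below writes two CSV files to the current directory. The file names are hypothetical, and the cluster property is assumed to be spelled ResiliencyLevel.

```powershell
# Timestamp used to keep successive snapshots apart.
$stamp = Get-Date -Format "yyyyMMdd-HHmmss"

# Record the cluster-wide Compute Resiliency settings.
Get-Cluster |
    Select-Object Name, ResiliencyLevel, ResiliencyDefaultPeriod, QuarantineThreshold, QuarantineDuration |
    Export-Csv -Path ".\cluster-resiliency-$stamp.csv" -NoTypeInformation

# Record the per-virtual-machine overrides.
Get-ClusterGroup |
    Where-Object { $_.GroupType -eq 'VirtualMachine' } |
    Select-Object Name, ResiliencyPeriod |
    Export-Csv -Path ".\vm-resiliency-$stamp.csv" -NoTypeInformation
```

Taking a fresh snapshot before each change gives you a dated record to compare against when you need to roll back.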

 


Aidan Finn, Microsoft Most Valuable Professional (MVP), has been working in IT since 1996. He has worked as a consultant and administrator for the likes of Innofactor Norway, Amdahl DMR, Fujitsu, Barclays and Hypo Real Estate Bank International, where he dealt with large and complex IT infrastructures, and MicroWarehouse Ltd., where he worked with Microsoft partners in the small/medium business space.