What Is Windows Server 2016 Hyper-V Compute Resiliency?

In this post, I will explain Compute Resiliency, a feature of Windows Server 2016 that makes Failover Clustering more tolerant of transient failures that can cause downtime to Hyper-V virtual machines.




Unnecessary Failovers

Many people, especially those new to high availability or to building complex environments, find Failover Clustering difficult. If you stick to the well-walked path, the designs are not that hard. The things that cause the most trouble are the things that should be dependable, such as drivers and firmware in network cards. Unfortunately, hardware faults and external events, such as switch reboots or operators pulling the wrong cables, can cause transient issues. Some of these faults are unpredictable; others are all too predictable, depending on the brand of the network interface. Either way, transient issues can be very difficult to troubleshoot and can lead to downtime.

Every node or host in a Hyper-V cluster sends a heartbeat to the cluster. This heartbeat lets the other nodes know that the sending host is still alive. If a host fails to send a heartbeat for a long enough period, then that host is assumed to be offline. The remaining nodes in the cluster seize the clustered roles, or virtual machines in the case of Hyper-V, from the assumed-dead node.
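The heartbeat logic described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not how Failover Clustering is implemented: the class name `HeartbeatMonitor`, the method names, and the 10-second timeout are all assumptions made for the example.

```python
import time
from typing import Dict, List, Optional

# Hypothetical timeout for illustration only; not a real cluster default.
HEARTBEAT_TIMEOUT_S = 10.0


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each node (illustrative model)."""

    def __init__(self, timeout_s: float = HEARTBEAT_TIMEOUT_S) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def record_heartbeat(self, node: str, now_s: Optional[float] = None) -> None:
        # Remember when this node last proved it was alive.
        self.last_seen[node] = time.monotonic() if now_s is None else now_s

    def presumed_dead(self, now_s: Optional[float] = None) -> List[str]:
        # A node is assumed offline once it has been silent longer than the timeout.
        now = time.monotonic() if now_s is None else now_s
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

In this model, a node that merely loses network connectivity for longer than the timeout looks exactly like a node that has crashed, which is the root of the unnecessary failovers discussed next.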

If a transient networking issue interferes with the heartbeat of a host, then the cluster assumes that there is a problem. It seizes the virtual machines from that host and boots them up on other nodes in the cluster. If there are complex dependencies, then booting up a large number of virtual machines might take a long time. In the meantime, the transient issue has gone away and the original host is back online. The problem with transient issues is that they repeat and they are extremely difficult to identify. If they happen often enough, people can lose confidence in the cluster. The cluster is reacting correctly to an external fault, but it still creates confidence issues.

Tolerance of Transient Issues

Microsoft studied its support calls and received tons of feedback from customers regarding issues with Hyper-V clusters. It was clear that issues outside of clustering were causing many problems. Software has the flexibility to overcome hardware issues. Therefore, Microsoft decided to build extra tolerance for transient external issues into Hyper-V failover clusters in the form of Compute Resiliency.

In short, Compute Resiliency slows down the aggressive failover actions of a Hyper-V cluster. Most actual host outages are caused by external problems. Microsoft did the math and decided that by default, a cluster will wait 4 minutes before responding to a host failing to heartbeat. The 4 minutes is enough time for an operator to realize that they have pulled the wrong cable or for a top-of-rack switch to restart after a crash. During this time, a non-responding host has a status of Isolated in the cluster and failovers will not occur.

If a host fails to return online after 4 minutes have passed, then the cluster will initiate a failover of every virtual machine on that host. While the host is Isolated, its virtual machines behave differently depending on your storage system:

  • SMB 3.0: If the host is still online and able to communicate with the storage, the virtual machines remain online.
  • CSV on Block Storage: The virtual machines are placed into a Paused-Critical state.

If a host returns online before 4 minutes have expired, then it rejoins the cluster. What if the host goes offline again? Once again, the host has a status of Isolated and failovers will not take place. The cluster does keep count, however, over a period that defaults to 2 hours. If the host becomes isolated for a third time in a 2-hour period, then the cluster will place that host into a Quarantined state and live migrate its virtual machines to more suitable hosts in the cluster.
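To make the Isolated and Quarantined rules easier to follow, here is a minimal Python sketch of the decision logic using the defaults given in this post (a 4-minute wait, and a third isolation within 2 hours). It is a model for reasoning only, not Microsoft's implementation; the class `HostTracker`, its method names, and the `"Down"` label for a host whose virtual machines are failed over are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List

RESILIENCY_PERIOD_S = 4 * 60       # default wait before failover, per this post
QUARANTINE_WINDOW_S = 2 * 60 * 60  # default counting window, per this post
QUARANTINE_THRESHOLD = 3           # third isolation in the window quarantines the host


@dataclass
class HostTracker:
    """Illustrative per-host state model (hypothetical, not a real API)."""

    isolation_starts: List[float] = field(default_factory=list)

    def on_isolated(self, now_s: float) -> str:
        """Record a missed-heartbeat event and return the host's new state."""
        # Forget isolation events that are older than the counting window.
        self.isolation_starts = [
            t for t in self.isolation_starts if now_s - t < QUARANTINE_WINDOW_S
        ]
        self.isolation_starts.append(now_s)
        if len(self.isolation_starts) >= QUARANTINE_THRESHOLD:
            return "Quarantined"
        return "Isolated"

    @staticmethod
    def while_silent(isolated_at_s: float, now_s: float) -> str:
        """State of a host that has not yet returned since it became isolated."""
        if now_s - isolated_at_s >= RESILIENCY_PERIOD_S:
            return "Down"  # hypothetical label: the cluster now fails over the VMs
        return "Isolated"
```

For example, a host that drops out at 0, 10, and 20 minutes is reported as Isolated twice and Quarantined on the third event, while a host whose third event arrives more than 2 hours after the first two is simply Isolated again.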

Note that the times mentioned in this post, 4 minutes and 2 hours, are defaults and can be overridden. The 4-minute wait can be modified on a per-virtual machine basis. Compute Resiliency can also be disabled on the cluster. This might make sense for clusters where transient issues are unlikely to isolate hosts, or for a completely self-contained cluster, such as a cluster-in-a-box.
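One way to picture how these overrides combine is the small Python sketch below. The function name `effective_wait_s` and its parameters are hypothetical, and the assumption that a disabled feature means no extra wait at all is mine, not something stated by Microsoft; the real settings live in cluster and virtual machine properties rather than in code like this.

```python
from typing import Optional

CLUSTER_DEFAULT_WAIT_S = 4 * 60  # default wait stated in this post


def effective_wait_s(
    cluster_enabled: bool,
    vm_override_s: Optional[int] = None,
    cluster_default_s: int = CLUSTER_DEFAULT_WAIT_S,
) -> int:
    """Return how long to wait before failing over one virtual machine."""
    if not cluster_enabled:
        # Assumption: with the feature disabled there is no extra wait.
        return 0
    if vm_override_s is not None:
        # A per-virtual machine setting takes precedence over the cluster default.
        return vm_override_s
    return cluster_default_s
```

Under these assumptions, a virtual machine with no override waits the cluster default of 240 seconds, and a latency-sensitive virtual machine could be given a shorter wait without changing the behavior of the rest of the cluster.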


Aidan Finn, Microsoft Most Valuable Professional (MVP), has been working in IT since 1996. He has worked as a consultant and administrator for the likes of Innofactor Norway, Amdahl DMR, Fujitsu, Barclays and Hypo Real Estate Bank International where he dealt with large and complex IT infrastructures and MicroWarehouse Ltd. where he worked with Microsoft partners in the small/medium business space.