In this post, I will explain Compute Resiliency, a feature of Windows Server 2016 that makes Failover Clustering more tolerant of transient failures that can cause downtime for Hyper-V virtual machines.
Many people, especially those new to high availability or building complex environments, find Failover Clustering difficult. If you stick to the well-walked path, the designs are not that hard. The things that cause the most trouble are the things that should be dependable, such as drivers and firmware in network cards. Unfortunately, unpredictable hardware faults and external issues, such as switch reboots or operators pulling the wrong cables, can cause transient issues. Keep in mind that, depending on the brand of the network interface, these faults can also be quite predictable. Regardless, transient issues can be very difficult to troubleshoot and can lead to downtime.
Every node or host in a Hyper-V cluster sends a heartbeat to the cluster. This heartbeat lets the other nodes know that the sending host is still alive. If a host fails to send a heartbeat for a long enough period, then that host is assumed to be offline. The remaining nodes in the cluster seize the clustered roles, or virtual machines in the case of Hyper-V, from the assumed-dead node.
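The heartbeat logic described above can be sketched as a toy model. This is only an illustration, not how the cluster service is implemented: the `HeartbeatMonitor` class, its method names, and the 10-second timeout are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Toy model of heartbeat-based failure detection (hypothetical, not the real service)."""

    timeout_seconds: float  # assumed threshold, chosen only for this example
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: float) -> None:
        # Remember the most recent time each node was heard from.
        self.last_seen[node] = now

    def assumed_offline(self, now: float) -> list:
        # A node is assumed offline once it has been silent longer than the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.record_heartbeat("host1", now=0)
monitor.record_heartbeat("host2", now=0)
monitor.record_heartbeat("host1", now=8)  # host2 stays silent after time 0

silent_nodes = monitor.assumed_offline(now=12)
print(silent_nodes)
```

In this model, `host2` is the only node reported as silent at time 12, and its virtual machines would be the ones seized by the remaining nodes.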
If a transient networking issue interferes with the heartbeat of a host, then the cluster assumes that there is a problem. It seizes the virtual machines from that host. The virtual machines are booted up on other nodes in the cluster. If there are complex dependencies, then booting up a large number of virtual machines might take a long time. In the meantime, the transient issue has gone away and the original host is back online. The problem with transient issues is that they repeat and they are extremely difficult to identify. If they happen often enough, people can lose confidence in the cluster. The cluster is reacting correctly to an external fault, but it still creates confidence issues.
Microsoft studied its support calls and received a lot of feedback from customers regarding issues with Hyper-V clusters. It was clear that issues outside of clustering were causing many problems. Software has the flexibility to overcome hard issues. Therefore, Microsoft decided to build extra tolerance for transient external issues into Hyper-V failover clusters in the form of Compute Resiliency.
In short, Compute Resiliency slows down the aggressive failover actions of a Hyper-V cluster. Most actual host outages are caused by external problems. Microsoft did the math and decided that by default, a cluster will wait 4 minutes before responding to a host failing to heartbeat. The 4 minutes is enough time for an operator to realize that they have pulled the wrong cable or for a top-of-rack switch to restart after a crash. During this time, a non-responding host has a status of Isolated in the cluster and failovers will not occur.
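The default wait can be expressed as a small decision function. This is a simplified sketch under stated assumptions: `node_state` and `seconds_unresponsive` are hypothetical names, `DOWN` is just an illustrative label for "the cluster now fails the virtual machines over", and the treatment of the exact 240-second boundary is a guess.

```python
from enum import Enum

RESILIENCY_PERIOD_SECONDS = 240  # the 4-minute default described in the post


class NodeState(Enum):
    UP = "Up"
    ISOLATED = "Isolated"
    DOWN = "Down"  # illustrative label: the tolerance window has expired


def node_state(seconds_unresponsive: float,
               resiliency_period: float = RESILIENCY_PERIOD_SECONDS) -> NodeState:
    """Classify a host by how long it has failed to heartbeat (simplified model)."""
    if seconds_unresponsive <= 0:
        return NodeState.UP
    if seconds_unresponsive < resiliency_period:
        # Inside the tolerance window: the host is Isolated and no failover occurs.
        return NodeState.ISOLATED
    # Window expired: the cluster responds by failing over the virtual machines.
    return NodeState.DOWN


for outage in (0, 90, 300):
    print(outage, node_state(outage).value)
```

A 90-second switch reboot leaves the host Isolated in this model, while a 300-second outage exceeds the window and triggers the failover response.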
If a host fails to return online after 4 minutes have passed, then the cluster will initiate a failover of every virtual machine. The virtual machines will behave differently depending on your storage system.
If a host returns online before 4 minutes have expired, then it rejoins the cluster. What if the host goes offline again? Once again, the host has a status of Isolated and failovers will not take place. However, the cluster keeps track of repeated isolations over a period that defaults to 2 hours. If the host becomes isolated for a third time in a 2 hour period, then the cluster will place that host into a Quarantined state and live migrate the virtual machines to more suitable hosts in the cluster.
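The repeat-offender rule can be sketched as a sliding-window counter. This is a minimal model based only on the numbers in this post (third isolation within 2 hours); the `IsolationTracker` class and its method are hypothetical, and the real cluster may count and expire events differently.

```python
from collections import deque

QUARANTINE_THRESHOLD = 3               # the third isolation triggers quarantine
TRACKING_WINDOW_SECONDS = 2 * 60 * 60  # the 2-hour period described in the post


class IsolationTracker:
    """Counts isolation events for one host inside a sliding time window."""

    def __init__(self, threshold=QUARANTINE_THRESHOLD, window=TRACKING_WINDOW_SECONDS):
        self.threshold = threshold
        self.window = window
        self.events = deque()

    def record_isolation(self, now: float) -> str:
        # Forget isolation events that are older than the tracking window.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        self.events.append(now)
        # Reaching the threshold inside the window means the host is quarantined.
        return "Quarantined" if len(self.events) >= self.threshold else "Isolated"


flapping = IsolationTracker()
flapping_states = [flapping.record_isolation(t) for t in (0, 1800, 3600)]

occasional = IsolationTracker()
occasional_states = [occasional.record_isolation(t) for t in (0, 3600, 8000)]

print(flapping_states)
print(occasional_states)
```

The first host is isolated three times in one hour and ends up Quarantined. The second host's first isolation has aged out of the window by the time the third one happens, so it stays Isolated.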
Note that the times mentioned in this post, 4 minutes and 2 hours, are defaults and can be overridden. The 4-minute wait can be modified on a per-virtual-machine basis. Compute Resiliency can also be disabled on the cluster. This might make sense for clusters where transient issues are unlikely to isolate hosts, or for a completely self-contained cluster, such as a cluster-in-a-box.
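To illustrate how a cluster-wide default, a per-virtual-machine override, and a cluster-wide off switch might interact, here is a small sketch. The `ClusterSettings` class and `failover_wait_seconds` function are hypothetical names that do not correspond to real cluster properties, and treating a disabled feature as a zero-second wait is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterSettings:
    resiliency_enabled: bool = True
    default_period_seconds: int = 240  # the 4-minute default from the post


def failover_wait_seconds(cluster: ClusterSettings,
                          vm_override_seconds: Optional[int] = None) -> int:
    """Return how long the cluster would wait before failing over one VM."""
    if not cluster.resiliency_enabled:
        # Assumption: with the feature off, failover is not delayed by this feature.
        return 0
    if vm_override_seconds is not None:
        # A per-VM value takes precedence over the cluster-wide default.
        return vm_override_seconds
    return cluster.default_period_seconds


default_wait = failover_wait_seconds(ClusterSettings())
critical_vm_wait = failover_wait_seconds(ClusterSettings(), vm_override_seconds=60)
disabled_wait = failover_wait_seconds(ClusterSettings(resiliency_enabled=False),
                                      vm_override_seconds=60)
print(default_wait, critical_vm_wait, disabled_wait)
```

The design choice here is that the cluster-wide switch wins over any per-VM value, which matches the idea of disabling the feature for a whole self-contained cluster.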