Learn What IT Pros Need to Know About Windows 11 - August 26th at 1 PM ET! Learn What IT Pros Need to Know About Windows 11 - August 26th at 1 PM ET!
Hyper-V

What Is Windows Server 2016 Hyper-V Compute Resiliency?

In this post, I will explain Compute Resiliency. This feature of Windows Server 2016, where Failover Clustering is more tolerant of transient failures, can cause downtime to Hyper-V virtual machines.

 

 

Sponsored Content

Read the Best Personal and Business Tech without Ads

Staying updated on what is happening in the technology sector is important to your career and your personal life but ads can make reading news, distracting. With Thurrott Premium, you can enjoy the best coverage in tech without the annoying ads.

Unnecessary Failovers

Many people, especially those new to high availability or building complex environments, find that Failover Clustering can be difficult. If you stick to the well-walked path, the designs are not that hard. The things that cause the most trouble are the things that should be dependable, such as drivers and firmware in network cards. Unfortunately, these unpredictable hardware faults and external issues, such as switch reboots or operators pulling the wrong cables, can cause transient issues. Keep in mind, these can also be predictable hardware faults and external issues depending on the brand of the network interface. Regardless, this can be very difficult to troubleshoot and can lead to downtime.

Every node or host in a Hyper-V cluster sends a heartbeat to the cluster. This heartbeat lets the other nodes know that the sending host is still alive. If a host fails to send a heartbeat for a long enough period, then that host is assumed to be offline. The remaining nodes in the cluster seize the clustered roles, or virtual machines in the case of Hyper-V, from the assumed-dead node.

If a transient networking issue interferes with the heartbeat of a host, then the cluster assumes that there is a problem. It seizes the virtual machines from that host. The virtual machines are booted up on other nodes in the cluster. If there are complex dependencies, then booting up a large number of virtual machines might take a long time. In the meantime, the transient issue has gone away and the original host is back online. The problem with transient issues is that they repeat and they are extremely difficult to identify. If they happen enough, people can lose confidence in the cluster. The cluster is reacting correctly to an external fault but it still creates confidence issues

Tolerance of Transient Issues

Microsoft studied its support calls and received tons of feedback from customers regarding issues with Hyper-V clusters. It was clear that issues outside of clustering was causing many problems. Software has the flexibility to overcome hard issues. Therefore, Microsoft decided to build extra tolerance for transient external issues into Hyper-V failover clusters in the form of Compute Resiliency.

In short, Compute Resiliency slows down the aggressive failover actions of a Hyper-V cluster. Most actual host outages are caused by external problems. Microsoft did the math and decided that by default, a cluster will wait 4 minutes before responding to a host failing to heartbeat. The 4 minutes is enough time for an operator to realize that they have pulled the wrong cable or for a top-of-rack switch to restart after a crash. During this time, a non-responding host has a status of Isolated in the cluster and failovers will not occur.

If a host fails to return online after 4 minutes have passed, then the cluster will initiate a failover of every virtual machine. The virtual machines will behave differently depending on your storage system:

  • SMB 3.0: If the host is online and able to communicate with the storage, the virtual machines remain online.
  • CSV on Block Storage: The virtual machine is placed into a Paused-Critical state.

If a host returns online before 4 minutes have expired, then it rejoins the cluster. What if the host goes offline again? Once again, the host has a status of Isolated and failovers will not take place. The default time is 2 hours. If the host becomes isolated for a third time in a 2 hour period, then the cluster will place that host into a Quarantined state. It will live migrate the virtual machines to more suitable hosts in the cluster.

Note that the times mentioned in this post, 4 minutes and 2 hours, are defaults and can be overridden. The 4-minute wait can be modified on a per-virtual machine basis. Compute Resiliency can be disabled on the cluster. This might make sense for clusters where transient issues are unlikely to isolate hosts or a completely self-contained cluster, such as a cluster-in-a-box.

Related Topics:

BECOME A PETRI MEMBER:

Don't have a login but want to join the conversation? Sign up for a Petri Account

Register
Comments (0)

Leave a Reply

Aidan Finn, Microsoft Most Valuable Professional (MVP), has been working in IT since 1996. He has worked as a consultant and administrator for the likes of Innofactor Norway, Amdahl DMR, Fujitsu, Barclays and Hypo Real Estate Bank International where he dealt with large and complex IT infrastructures and MicroWarehouse Ltd. where he worked with Microsoft partners in the small/medium business space.

Register for Advanced Microsoft 365 Day!

GET-IT: Advanced Microsoft 365 1-Day Virtual Conference - Live August 24th!

Join us on Tuesday, August 24th and hear from Microsoft MVPs and industry experts about how to take advantage of Microsoft 365 at a technical level and dive deep into the features and functionality that will make your environment more secure and compliant.

RSVP Now

Sponsored By