With many admins evaluating Hyper-V or encountering failover clustering for the first time, there appears to be a lot of confusion about what this Windows Server feature is used to accomplish. In this article I will explain what failover clustering is and what role it plays in enabling high availability (HA) in Hyper-V deployments.
The purpose of failover clustering is to provide high availability (HA), which gives a server infrastructure the ability to respond automatically to machine failures. For example, say you have a virtual machine running on Host1, which is just one of a number of nodes (members) of a failover cluster. Host1 suffers a sudden and catastrophic failure and stops operating. Every second, each node in the cluster sends a heartbeat test to every other node to confirm that it is still operational. A failure to respond to a sequence of these tests indicates a node failure, so the other nodes will detect within a few seconds that Host1 has failed.
No manual intervention is required. The cluster will automatically fail over (move and start) the resources that were on Host1. For example, the virtual machine that was running on Host1 might be relocated to Host3 and started up there.
Host1 fails, the heartbeats time out, and a VM is failed over to Host3.
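To make the sequence above concrete, here is a minimal Python sketch of heartbeat-based failure detection followed by failover. It is a toy simulation, not how Windows Server implements clustering. The one-second interval comes from the description above; everything else is an assumption made for illustration: the threshold of five missed heartbeats, the Node class, the function names, and the "place the VM on the least-loaded survivor" rule.

```python
from dataclasses import dataclass, field

HEARTBEAT_INTERVAL_S = 1  # from the article: a test is sent every second
MISSED_THRESHOLD = 5      # assumption: illustrative value, not a Windows default


@dataclass
class Node:
    name: str
    alive: bool = True
    vms: list = field(default_factory=list)
    missed: int = 0  # consecutive heartbeats this node has failed to answer


def heartbeat_round(nodes):
    """Run one heartbeat round; return nodes that just crossed the threshold."""
    newly_failed = []
    for node in nodes:
        if node.missed >= MISSED_THRESHOLD:
            continue  # already declared failed in an earlier round
        if node.alive:
            node.missed = 0
        else:
            node.missed += 1
            if node.missed == MISSED_THRESHOLD:
                newly_failed.append(node)
    return newly_failed


def fail_over(failed, nodes):
    """Move each VM from the failed node to the survivor hosting the fewest VMs."""
    survivors = [n for n in nodes if n.alive and n is not failed]
    if not survivors:
        raise RuntimeError("no surviving node can host the VMs")
    moved = {}
    for vm in list(failed.vms):
        target = min(survivors, key=lambda n: len(n.vms))
        target.vms.append(vm)
        failed.vms.remove(vm)
        moved[vm] = target.name
    return moved


def simulate(nodes, crash_name, crash_at_s, duration_s):
    """Crash one node at crash_at_s; return (detection_time_s, placements)."""
    for t in range(0, duration_s, HEARTBEAT_INTERVAL_S):
        if t == crash_at_s:
            next(n for n in nodes if n.name == crash_name).alive = False
        for failed in heartbeat_round(nodes):
            return t, fail_over(failed, nodes)
    return None, {}
```

With Host1 holding one VM, Host2 holding two, and Host3 holding none, calling simulate(nodes, "Host1", 3, 20) crashes Host1 at second 3, declares it failed at second 7 (the fifth consecutive missed heartbeat), and places the VM on Host3, with no manual step in between.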
Failover clustering provides Hyper-V virtual machines with HA with the following traits.
Many IT pros, including those who have been using Hyper-V, confuse live migration and HA. They are two different technologies that serve two different purposes. HA is all about minimizing the downtime caused by failure. Live migration is used to enable service mobility and flexibility while causing no perceivable service downtime:
Live migration was limited to within the hosts of a failover cluster in Windows Server 2008 R2. However, since Windows Server 2012, live migration has not required a cluster, nor is it limited to the scope of a cluster.
Everyone wants to maximize uptime. But even non-clustered or standalone hosts are very reliable (I’ve run hosts for years with only scheduled patching windows), and failover clustering comes with some costs (additional host capacity, supported shared storage such as a SAN, and additional networking). Are those costs worth it if the availability of VMs on non-clustered hosts is already high, and you can still have live migration without failover clustering? For some organizations the answer is no, which means that failover clustering is not for everyone.
Small and medium businesses might steer clear of clustering because of these costs, but this isn’t a “big versus small deployment” split. Some of the biggest customers will be averse to the costs of clustering too. For example, hosting companies (public clouds are often the biggest deployments) need to offer cost-competitive services. Investments in infrastructure must be recouped through customer payments, and adding HA at the infrastructure layer increases the fees charged to customers, making the service less attractive to maybe 80 percent of prospective tenants, at least in my experience. So hosting companies might deploy some clustered hosts but many more non-clustered hosts.
As for those small businesses, might Windows Server 2012 R2 Hyper-V, with Hyper-V Replica configured to use 30-second asynchronous replication windows, be a more economical alternative to a cluster?
I will be posting more on clustering in the Hyper-V world in the coming weeks, covering concepts like architecture, design, implementation, some of my personal best practices, and more.