Last Update: Sep 04, 2024 | Published: May 25, 2017
In this post, I will explain how storage resiliency reduces the downtime that transient storage issues cause for virtual machines running on Windows Server 2016 (WS2016) Hyper-V.
No matter how much money you spend on storage, outages will happen. Some folks think that because they have spent a fortune on switches, SAN controllers, disk trays, disks, and cables, downtime will never happen to them. I do love to burst bubbles! Sadly, no storage system is impervious to problems. I have known of a few sites, including a rumor of a certain large software and services company, that have had massive SAN outages. Outages like these can lead to corrupted data.
Those headline outages are few and far between. More commonly, you will see the transient error. This is the brief glitch in the controller software, a faulty switch port, or an operator pulling the wrong cable. This is the sort of error that, even though it lasts only a few seconds, can cause significant service disruption.
Let’s pretend that there is a storage glitch in your virtualization farm. Each of your virtual machines is performing reads/writes or inputs/outputs (IO) to the storage system. As soon as the glitch happens, the guest OS of each virtual machine will detect a failed IO. It will do what every operating system does. It will protect the integrity of itself and the hosted services by crashing.
After a few seconds, the glitch ends. This is no big deal, right? Wrong. Every virtual machine that was connected to the storage system has crash-dumped and is going to take several minutes to reboot. Most will be fine. Some will require manual intervention and some might even have more severe data or service issues. Those few seconds of a storage blip have just cost the business a ton of money.
Microsoft spent a lot of time talking to customers when planning for Windows Server 2016. Windows Server 2012 and 2012 R2 went a long way to win over hosting and large enterprise customers. Unfortunately, problems remained. Many of those problems resided outside of Hyper-V and Windows Server. Through meetings and analysis of support calls, Microsoft made some discoveries. Many of the problems that Hyper-V customers were experiencing were being caused by these transient issues in storage or networking. Hyper-V needed to become more tolerant of issues outside of the host.
WS2016 Hyper-V is resilient to these transient issues. There are some variations on how the feature works but the core concept is this:
In effect, the virtual machine is frozen until the problem goes away. This saves countless crash-dumps, reboot storms, and the downtime while services return to the business. What might have been a 10-15 minute outage is now a brief pause. Of course, this also avoids the administrative effort of resolving critical reboot failures.
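If you want to see whether any virtual machines on a host are currently frozen in this way, you could filter on the virtual machine state. This is a minimal sketch, assuming the Hyper-V PowerShell module is available on the host and that the paused-critical state is reported as `PausedCritical`:

```powershell
# Assumption: run on the Hyper-V host in an elevated PowerShell session.
# Lists any VMs that Hyper-V is currently holding in the paused-critical state.
Get-VM |
    Where-Object { $_.State -eq 'PausedCritical' } |
    Select-Object Name, State, Status
```

An empty result simply means that no virtual machine is paused-critical at that moment.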
We normally use Shared VHD with guest clusters to increase service availability. This works by using redundant virtual machines. If the service fails on one virtual machine, we move the service to another virtual machine. This is done as quickly as possible.
Therefore, it makes no sense to pause a guest cluster node when IO to the shared VHD file is timing out. Shared VHD leads to some different behavior:
The following are supported for Storage Resiliency:
The following are not supported:
Storage Resiliency is managed, via PowerShell, on a per-virtual machine basis. There are two settings to note:
Set-VM VM1 -AutomaticCriticalErrorAction None
The AutomaticCriticalErrorAction setting has the following possible values:
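As far as I know, the Set-VM documentation lists two accepted values for this parameter, `Pause` and `None`. The sketch below switches between them and reads the setting back. `VM1` is only a placeholder name:

```powershell
# Assumption: a VM named VM1 exists on the local WS2016 Hyper-V host.

# Pause: the VM is paused when its storage becomes unavailable.
Set-VM -Name VM1 -AutomaticCriticalErrorAction Pause

# None: the pause behavior is turned off for this VM.
Set-VM -Name VM1 -AutomaticCriticalErrorAction None

# Read the current value back to confirm the change.
Get-VM -Name VM1 | Select-Object Name, AutomaticCriticalErrorAction
```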
The AutomaticCriticalErrorActionTimeout setting controls how long a virtual machine can remain in a paused-critical state. The default is 30 minutes. You can set it to any value between 1 and 1440 minutes.
Set-VM VM1 -AutomaticCriticalErrorActionTimeout 1440
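If you do decide to change the timeout, you will probably want to apply it consistently and then check the result. The following is a rough sketch rather than a recommendation. The 60-minute value is an arbitrary example, and it assumes that every virtual machine on the host should get the same setting:

```powershell
# Assumption: run on the Hyper-V host; 60 minutes is an illustrative value only.
# Apply the same paused-critical timeout to every VM on this host.
Get-VM | Set-VM -AutomaticCriticalErrorActionTimeout 60

# Report both Storage Resiliency settings for each VM.
Get-VM |
    Select-Object Name, AutomaticCriticalErrorAction, AutomaticCriticalErrorActionTimeout |
    Sort-Object Name
```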
Hopefully, you will not notice Storage Resiliency in action. It is the sort of thing that should pause and resume virtual machines within very short amounts of time. You probably will not ever need to configure the feature. It is on by default. This is one of those things that will increase uptime for you without you having to do anything.