WHEA Errors on Hyper-V Hosts

Hyper-V is a very resilient hypervisor, designed to protect your workloads from corruption. Corruption can come in many forms. Maybe Hyper-V pauses your virtual machines because a Cluster Shared Volume is full, protecting your business or customer from dynamic virtual hard disks trying to consume space that is no longer there. Or maybe your host has failing memory DIMMs, and Hyper-V is aware of this and ensures that your virtual machines will not consume that RAM. In that second case, Hyper-V is using hardware error detection to protect you from your hardware. I recently encountered exactly this in person, and here’s what I found.

What is WHEA?

WHEA stands for Windows Hardware Error Architecture. MSDN describes WHEA as a mechanism where Windows and the firmware of the underlying hardware work together to detect hardware issues and deal with them.

The architecture of WHEA (Image Credit: Microsoft)

Since Windows Vista, Windows has maintained a list of discoverable hardware error sources. For each source that is discovered on a physical machine, Windows maintains a low-level hardware error handler (LLHEH). When a hardware error is reported to Windows, the LLHEH is the first piece of code to run in response. Microsoft places each LLHEH in the part of the operating system best suited to deal with errors from the related error source.

When an LLHEH runs, it will:

  • Acknowledge the error
  • Capture information related to the error
  • Report the error condition to the operating system
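
The three steps above can be sketched as a toy model in Python. This is only an illustration of the acknowledge, capture, report sequence; it is not how Windows implements an LLHEH, and every name in it (ErrorSource, ErrorRecord, LowLevelHandler, the "dimm" field) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ErrorRecord:
    """Information captured about one hardware error (hypothetical shape)."""
    source: str
    details: dict


@dataclass
class ErrorSource:
    """Toy stand-in for one discovered hardware error source."""
    name: str
    pending: Optional[dict] = None  # raw error data waiting to be handled

    def raise_error(self, details: dict) -> None:
        self.pending = details


class LowLevelHandler:
    """Toy handler: the first code to run when its source reports an error."""

    def __init__(self, source: ErrorSource, report: Callable[[ErrorRecord], None]):
        self.source = source
        self.report = report  # callback standing in for "the operating system"

    def handle(self) -> Optional[ErrorRecord]:
        raw = self.source.pending
        if raw is None:
            return None  # nothing to do
        # 1. Acknowledge the error so the source can signal new ones.
        self.source.pending = None
        # 2. Capture information related to the error.
        record = ErrorRecord(source=self.source.name, details=dict(raw))
        # 3. Report the error condition upward.
        self.report(record)
        return record


if __name__ == "__main__":
    reported = []
    memory = ErrorSource("memory")
    handler = LowLevelHandler(memory, reported.append)
    memory.raise_error({"dimm": "A1"})  # invented example data
    handler.handle()
    print(reported)
```

Passing the reporting step in as a callback keeps the handler small and makes the order of the three steps easy to follow.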

I recently saw the following WHEA error in the System Log of a Hyper-V host:

A system log error from WHEA (Image Credit: Aidan Finn)
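
If you would rather search for entries like this than scroll through the System log, one option is to export the log to CSV and filter it. The sketch below assumes an export whose columns include ProviderName (for example, from PowerShell's Get-WinEvent piped to Export-Csv) and assumes the events come from the Microsoft-Windows-WHEA-Logger provider. The helper names and the sample rows are invented for illustration.

```python
import csv
import io

# Assumed provider name for WHEA entries in the System log.
WHEA_PROVIDER = "Microsoft-Windows-WHEA-Logger"


def read_export(csv_text):
    """Parse exported event log text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def whea_events(rows):
    """Return only the rows whose ProviderName matches the WHEA provider."""
    wanted = WHEA_PROVIDER.lower()
    return [
        row for row in rows
        if (row.get("ProviderName") or "").strip().lower() == wanted
    ]


if __name__ == "__main__":
    # Invented sample data; a real export will have more columns and rows.
    sample = (
        "TimeCreated,ProviderName,LevelDisplayName,Message\n"
        "2015-01-01 10:00,Service Control Manager,Information,Service started\n"
        "2015-01-01 10:05,Microsoft-Windows-WHEA-Logger,Warning,Hardware error\n"
    )
    for event in whea_events(read_export(sample)):
        print(event["TimeCreated"], event["LevelDisplayName"], event["Message"])
```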

WHEA and Hyper-V

Traditionally, the most precious resource on a virtualization host has been RAM. Virtualization has given us tools to optimize how we use RAM to squeeze more virtual machines onto our host hardware. And over time, hosts have grown. Gone are the days of hosts with 64 GB of RAM being a sweet spot. These days, the smallest host my company sells has 128 GB of RAM, and we’re more likely to see hosts with 256 GB or 512 GB of RAM, thanks to the ever-falling price of DIMMs. But as we add more memory chips to hosts, we increase the probability that one of those chips will degrade.

Let me be clear about that language; I said degradation, not failure. An outright failure of a DIMM is easily handled: it’s dead and unusable. But what happens if part of a DIMM becomes faulty and untrustworthy without triggering an alarm?

Hyper-V is made aware of the fault thanks to WHEA. And Hyper-V is intelligent enough to protect you from this failure by isolating the affected memory and ensuring that virtual machines don’t use it anymore.
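
To make the idea of isolating memory concrete, here is a conceptual sketch in Python. It is not Hyper-V's actual mechanism, which this article does not describe in detail. It is a toy page pool in which a page flagged as faulty is quarantined and never handed to a virtual machine again. PagePool and all of its methods are hypothetical.

```python
class PagePool:
    """Toy memory pool that refuses to hand out quarantined pages."""

    def __init__(self, page_count):
        self.free = set(range(page_count))
        self.quarantined = set()
        self.owner = {}  # page number -> name of the VM using it

    def quarantine(self, page):
        """Mark a page as untrustworthy; return the VM using it, if any."""
        self.quarantined.add(page)
        self.free.discard(page)
        return self.owner.get(page)

    def allocate(self, vm, count):
        """Give `count` healthy pages to `vm`, lowest page numbers first."""
        if count > len(self.free):
            raise MemoryError("not enough healthy pages")
        pages = sorted(self.free)[:count]
        for page in pages:
            self.free.remove(page)
            self.owner[page] = vm
        return pages

    def release(self, vm):
        """Return a VM's pages to the pool, except quarantined ones."""
        for page in [p for p, o in self.owner.items() if o == vm]:
            del self.owner[page]
            if page not in self.quarantined:
                self.free.add(page)


if __name__ == "__main__":
    pool = PagePool(4)
    pool.quarantine(1)              # pretend a hardware error flagged page 1
    print(pool.allocate("vm1", 3))  # page 1 is skipped
```

The point of the sketch is that the faulty page simply drops out of the set of allocatable memory, so capacity shrinks slightly but nothing is placed on hardware that can no longer be trusted.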

What happened with the host that created the above alert? I knew to look for errors because there was a report of an alert on the front console of the server. I logged into the remote management interface of the physical server and the current status was healthy; there were no apparent hardware failures. But I had a WHEA error indicating memory degradation. I browsed the logs in the server’s remote management interface and found a hardware error report that proved there was a dodgy DIMM.

A log entry to prove DIMM degradation (Image Credit: Aidan Finn)

Armed with that information, I was able to request a memory swap-out to prevent any further issues with this host … and it was great to know that Windows and Hyper-V had my back.

Aidan Finn, Microsoft Most Valuable Professional (MVP), has been working in IT since 1996. He has worked as a consultant and administrator for the likes of Innofactor Norway, Amdahl DMR, Fujitsu, Barclays and Hypo Real Estate Bank International where he dealt with large and complex IT infrastructures and MicroWarehouse Ltd. where he worked with Microsoft partners in the small/medium business space.