Troubleshooting Azure ARM Virtual Machines

This post will show you how to use the built-in tools for troubleshooting faulty virtual machines in Azure.

Background

No matter what platform you use – physical servers, Hyper-V, vSphere, Azure, or AWS – sometimes a machine starts to act up a little. Veteran admins know the pain of working with a faraway server when remote desktop fails to connect – you either hope you still have the ability to reboot the machine remotely or that someone with administrator rights is nearby.

But what if an Azure virtual machine starts to misbehave? What do you do then? As a tenant in Azure, you have no access to the fabric. You cannot see if a localized host issue is breaking your virtual machine. Your only ways to connect to the machine are via PowerShell and remote desktop – you don’t have the Hyper-V Connect utility for console access. Luckily, Microsoft gave us a set of tools for troubleshooting virtual machines in Azure.

ARM versus ASM

Microsoft is not putting pressure on customers to migrate from ASM deployments, but it is clear from recent developments that Microsoft is focusing feature improvements on ARM, and is bringing either reduced functionality or none at all to ASM workloads. This is the case with troubleshooting virtual machines.
If you browse the settings of a classic (ASM) virtual machine then you will see the following troubleshooting options:

Troubleshooting options for Azure classic virtual machines [Image Credit: Aidan Finn]
But if you browse the settings of a Resource Manager (ARM) virtual machine then you will notice that there is (currently) one extra option (Redeploy):
Troubleshooting options for Azure Resource Manager virtual machines [Image Credit: Aidan Finn]

Diagnose and Solve Problems

The first option helps you to figure out an issue with a virtual machine. If you click Diagnose and solve problems, a blade opens up (below). This blade will tell you if there are any known issues that might affect the virtual machine. A set of common problems and solutions is listed.
For example, I can expand the solution for My VM is slow. The guidance from Microsoft will instruct me on how to diagnose the issue and resolve it. It includes instructions such as:

  • Inspect logs for glitches
  • Gather performance metrics
  • Restart the machine (turn it off and on again)
  • Change the machine spec
  • Consider using Premium Storage

Diagnose and solve Azure virtual machine problems [Image Credit: Aidan Finn]
If you reach the end of the trail, you are advised to consider contacting technical support. More on this later in the post.

Activity Logs

Another scenario that veteran admins have seen is the “it’s been fine for months but it just broke” problem. Very often this is because “somebody” (it’s always “somebody”) changed something but we don’t know who changed it or what they changed.
Azure logs all changes, and you can view these via the Activity Logs blade. You can filter the changes to a particular time period to see what changed, and then diagnose the issue. For example, if performance has dropped, and someone reduced the spec of the virtual machine around that time, then you know what the cause and solution are.
If you have been clever and avoided the use of a single generic user account, and forced every admin to sign in using their own account, then you will also see who caused the problem … and take the appropriate HR-approved actions.
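
The "what changed, and who changed it" check described above can be sketched in a few lines of Python. This is only an illustration of the filtering idea: the record fields (timestamp, caller, operation), the sample events, and the changes_near helper are all hypothetical, and the real Activity Log schema has more fields and different names.

```python
from datetime import datetime, timedelta

# Hypothetical, simplified records for illustration only.
# The real Activity Log schema is richer and uses different field names.
EVENTS = [
    {"timestamp": datetime(2016, 5, 2, 9, 15), "caller": "alice@example.com",
     "operation": "Start virtual machine"},
    {"timestamp": datetime(2016, 5, 3, 14, 40), "caller": "bob@example.com",
     "operation": "Resize virtual machine"},
    {"timestamp": datetime(2016, 5, 6, 11, 5), "caller": "alice@example.com",
     "operation": "Restart virtual machine"},
]


def changes_near(events, incident_time, window=timedelta(hours=24)):
    """Return the events logged in the `window` leading up to the incident, newest first."""
    start = incident_time - window
    hits = [e for e in events if start <= e["timestamp"] <= incident_time]
    return sorted(hits, key=lambda e: e["timestamp"], reverse=True)


# Performance dropped at 16:00 on 3 May: what changed in the previous 24 hours?
incident = datetime(2016, 5, 3, 16, 0)
suspects = changes_near(EVENTS, incident)
for e in suspects:
    print(e["timestamp"].isoformat(), e["caller"], e["operation"])
# prints: 2016-05-03T14:40:00 bob@example.com Resize virtual machine
```

The caller column is only useful if every admin signs in with their own account; with a shared account, every row would name the same user.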

Resource Health

As tenants of a cloud, we have no visibility behind the curtain; we cannot see if Azure is healthy. Microsoft does share some health information from the 10,000-foot level, but that doesn’t tell me if a single host or storage node might have an issue that affects my virtual machine. However, the Resource Health blade does offer information that can help at that level, offering the results of an Azure service that monitors the resources that my machine is depending on.

Resource health of an Azure virtual machine [Image Credit: Aidan Finn]

Boot Diagnostics

If you have enabled Boot Diagnostics in the Diagnostics settings of the virtual machine, then you can get a view of the machine as it boots up and is running – this is like watching the console of the machine without any interaction.
With Windows virtual machines, we can see if the machine has booted up normally or if the console is displaying an error. Linux displays a lot more information on the console during a boot up; you can see a trace of this information in the Boot Diagnostics troubleshooting blade.

Reset Password

I love this feature; we have the ability, via the VM Agent extension, to reset the local administrator username and password of a virtual machine. There are two options in this dialog, if using ARM virtual machines:

  • Reset Password: The remote desktop configuration (Windows) is reset, and you can change the local administrator username and password.
  • Reset Configuration Only: This only resets the remote desktop configuration. This is not available for ASM virtual machines.
Reset the password of an Azure virtual machine [Image Credit: Aidan Finn]

Redeploy (Only ARM Machines)

Sometimes you will find that the virtual machine just needs to be moved to a different host. Maybe the host is faulty. Maybe moving the machine will reset it. The redeploy action, which is not available for ASM virtual machines, will move a machine to a new host.

Redeploy an Azure Resource Manager virtual machine [Image Credit: Aidan Finn]
The machine will be restarted during the process. The temporary drive will be lost and recreated (but you knew not to keep valuable data there, right?), and the machine will start up with your OS and data disks on another host. This reset process will hopefully resolve any localized issues that you might have been affected by. Note that this process will take a few minutes to complete.

New Support Request

There are two types of support request in the world of Azure:

  • Billing support
  • Technical support

If you have acquired Azure via the Cloud Solutions Provider (CSP) channel, then your first point of contact for support should be the reseller. They have a means to escalate technical support requests to Microsoft, and because of the way that billing is done via the reseller’s private pricing file, all billing requests really need to go through the reseller too.

If you acquired Azure via any other means then your route to contact is dependent on a few variables. All channels of Azure (except CSP) include free billing support from Microsoft; that means if you have a query about consumption, billing reports, or your subscription/account, you get built-in support from Microsoft as a part of your agreement.
No server product from Microsoft comes with built-in technical support. If you wish to create a technical support ticket (non-CSP subscriptions) then you must purchase a support contract from Microsoft.
If you are not using a CSP subscription, then you can use this blade to open a support ticket.
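
The routing rules above can be summarized as a small decision function. This is a sketch of the logic as described in this post, not an Azure API: the support_route name, its parameters, and the returned strings are all made up for illustration.

```python
def support_route(channel, request_type, has_support_contract=False):
    """Suggest where a support request should go, per the rules in this post.

    channel: "csp" for Cloud Solutions Provider subscriptions, anything else otherwise.
    request_type: "billing" or "technical".
    All names and return values here are hypothetical labels.
    """
    if request_type not in ("billing", "technical"):
        raise ValueError("request_type must be 'billing' or 'technical'")

    # CSP customers go to the reseller for both billing and technical requests.
    if channel == "csp":
        return "contact your reseller"

    # Every other channel includes free billing support from Microsoft.
    if request_type == "billing":
        return "open a billing request with Microsoft"

    # Technical support requires a purchased support contract.
    if has_support_contract:
        return "open a technical request with Microsoft"
    return "purchase a support contract first"


print(support_route("csp", "technical"))
print(support_route("enterprise", "billing"))
print(support_route("enterprise", "technical"))
```
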