Avoid Storage I/O Bottlenecks With vCenter and Esxtop

Overview

Storage I/O bottlenecks can have a big impact on virtual environments and can wreak havoc on the performance of the virtual machines within them. The guest operating systems and applications running inside VMs are constantly reading from and writing to their virtual disks, and anything that delays this I/O can slow a VM to a crawl.

Of all the resources a host manages, traditional storage devices are typically the slowest because they rely on mechanical spinning hard disks. In addition, shared storage arrays are commonly used in virtual environments because so many features require shared storage, and as a result there is a longer path to storage resources: storage I/O must leave the host through an I/O adapter and traverse a storage or traditional network to reach the array. This long data path creates several potential choke points, or bottlenecks, that can reduce the capacity and speed of storage I/O.

Bottlenecks dictate the speed limit of your storage I/O. For example, you may have a very fast storage array, but if the path to that array has a bottleneck you will not be able to take advantage of its speed. On the flip side, you may have a fast connection to your storage array, but if the array is not optimally configured it can become a bottleneck itself. Either way, a bottleneck acts as a funnel that limits the speed of data between your hosts and your storage arrays.

[Figure 1]

Measuring Storage Array Performance

Two storage statistics are good indicators of how a storage array is performing: IOPS and latency. Let’s first take a look at IOPS.

IOPS

IOPS stands for I/O Operations Per Second, and it is a common measurement of the performance of a storage device. An I/O operation occurs for every read from or write to a disk, so on a busy host there can be thousands of IOPS occurring at any given moment. IOPS can show how much disk activity is occurring on each individual VM or the combined total for a host datastore. IOPS is an important measurement because every storage device supports a limited number of them. While a number of factors determine how many IOPS a storage device can handle, you can roughly estimate it with simple math: take the IOPS rating of a single drive, which is largely a function of its rotational speed, and multiply it by the number of drives in the RAID group. The higher the rotational speed of a drive, the more IOPS it can handle. A typical 15,000 rpm drive is capable of supporting around 175-210 IOPS; a typical 7,200 rpm drive is only capable of supporting around 75-100 IOPS. So a RAID group consisting of six 15K drives would be capable of around 1,050 IOPS (175 x 6). SSDs, which are becoming increasingly popular, are not bound by mechanical components and are capable of over 5,000 IOPS.
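
That back-of-the-envelope math is easy to script. A minimal sketch in shell, where the 175 IOPS per-drive figure is simply the assumed rating for a 15K rpm drive from above:

    # Rough raw IOPS estimate for a RAID group: per-drive IOPS x drive count
    # 175 is the assumed rating of a single 15K rpm drive
    DRIVE_IOPS=175
    DRIVE_COUNT=6
    echo "Estimated raw IOPS: $((DRIVE_IOPS * DRIVE_COUNT))"   # prints 1050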

RAID levels also play a role in IOPS, as there is a RAID penalty to factor in: the extra disk writes required for parity or mirroring reduce the number of IOPS available for writes. The greater the level of RAID protection, the higher this penalty is, because each write has to go to more disks. If the IOPS statistics on your hosts are high, it can indicate that more I/O is occurring than the storage device can handle. Re-arranging your workloads so they are balanced evenly across multiple datastores can help eliminate IOPS hot spots on individual datastores. Sometimes re-architecting your storage configuration by putting more drives in a RAID group can help ensure that the number of IOPS your VMs generate does not exceed what your storage device is capable of.
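
To see how the write penalty eats into the raw number, here is a minimal sketch; the 70/30 read/write mix and the RAID 5 write penalty of 4 are illustrative assumptions, not measurements:

    # Effective IOPS after applying a RAID write penalty
    # Assumes a 70/30 read/write mix and a RAID 5 write penalty of 4
    # effective = raw * read% + (raw * write%) / penalty
    awk 'BEGIN {
        raw = 1050; read = 0.70; write = 0.30; penalty = 4
        printf "Effective IOPS: %.0f\n", raw * read + (raw * write) / penalty
    }'

Run against the six-drive example above, the penalty drops the usable figure from 1,050 to roughly 814 IOPS.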

Latency

Where IOPS is focused on how much disk activity is occurring, latency is focused on how long it takes a host to read or write data to a storage device. Disk latency is the time it takes for a disk sector to be positioned under the drive head so it can be read from or written to. Any time a VM reads from or writes to its virtual disk, that request must follow a long path from the guest OS to the physical storage device. Along that path, bottlenecks can occur at different points as the data goes from the guest OS, to a virtual SCSI adapter, through the VMkernel, to a physical I/O adapter, and then across a storage network to the destination storage device. The total amount of time it takes I/O to make this trip is referred to as total guest latency and is measured in milliseconds. There are several different latency statistics that combine to form total guest latency, and they can help pinpoint which part of the storage subsystem a bottleneck is occurring in. The figure below illustrates the path that data takes from the VM to the storage device and shows the different latency statistics that form total guest latency.

[Figure 2]

Kernel latency – the average amount of time the VMkernel spends processing each SCSI command. This value should be as close to zero as possible and less than 1ms.

Queue latency – the average amount of time each SCSI command spends in the VMkernel queue. This value should also be as close to zero as possible and less than 1ms.

Device latency – the average amount of time it takes the physical device to complete a SCSI command. This is frequently the source of high latency; depending on the storage device type, this value should be between 0 and 10ms.

All of these latency statistics are further split into read and write sub-statistics, so you can see exactly which type of operation the latency is occurring on.
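
These counters correspond to the esxtop columns covered later (KAVG/cmd, QAVG/cmd and DAVG/cmd). As a rule of thumb, total guest latency (GAVG) is roughly device latency plus kernel latency, with queue latency already counted inside the kernel figure; a quick check with purely illustrative numbers:

    # Rule of thumb: GAVG/cmd ~= DAVG/cmd + KAVG/cmd (QAVG is included in KAVG)
    # Sample values in milliseconds, purely illustrative
    awk 'BEGIN { davg = 9.8; kavg = 0.4; printf "Expected GAVG: %.1f ms\n", davg + kavg }'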

Viewing IOPS and Latency Statistics

So latency is a good statistic for pinpointing where bottlenecks are occurring, and IOPS is a good statistic for pinpointing the source of storage I/O. These statistics can be viewed through either the esxtop command-line utility or the vCenter Server performance graphs. vCenter Server is good for general reporting, but because it relies on set sample periods and rolled-up statistics it does not provide especially thorough or detailed reporting. You can view IOPS and latency statistics in vCenter Server by selecting a host, clicking the Performance tab, and then clicking the Advanced button. If you click the Chart Options link, you can select the information you want displayed; under the Disk category you can select a time period (e.g. Real Time, Past Day). On the right side you can select the disk you want to report on under Objects and then select the individual statistic counters you want to display, as shown below.

[Figure 3]

The Description column is the user-friendly name of each statistic; the Internal Name column is the technical name VMware uses to refer to it. You can view IOPS by selecting the Commands Issued counter; you can select the various latency counters as well. Once you select the counters, the graph displays the information, allowing you to see the average, minimum, and maximum values for each counter as shown below.

[Figure 4]

This allows you to see how busy your host datastores are and how much latency is occurring, both of which can impact performance. You can also do this with VM objects instead of host objects to drill down further and see individual VM statistics.

Using Esxtop

Esxtop is a command-line utility for ESX and ESXi that provides continuous real-time statistics, which is handy when troubleshooting performance problems. Esxtop was first introduced in the ESX Service Console and is based on the top command used in Linux to display resource usage information. Esxtop displays information specific to virtualization hosts and VMs and, unlike top, can display information on all resources (CPU/memory/disk/network). The utility is not part of ESXi; to use it with ESXi you need one of the remote command-line options, such as the vSphere CLI (Linux only) or the vSphere Management Assistant (vMA).
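
For example, from the vMA or a Linux system with the vSphere CLI installed, you can run the remote version of esxtop, resxtop, against a host; the hostname and user name below are placeholders:

    # Run resxtop against a remote ESXi host (hostname and user are placeholders)
    resxtop --server esxi01.example.com --username root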

The commands for using esxtop are all single-keystroke commands; you can press ? or h to get a list of them all. You can add or remove fields (columns) by pressing f, or change the display order by pressing o. The relevant commands for the storage views are:

  • d – disk adapter (i.e. HBA)
  • u – disk device (i.e. LUN)
  • v – disk VM

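If you would rather capture these statistics over time than watch them interactively, esxtop also has a batch mode that writes every counter to a CSV file for later analysis; the sample interval, count, and file name below are just examples:

    # Batch mode: sample every 5 seconds for 120 iterations (~10 minutes)
    esxtop -b -d 5 -n 120 > esxtop-storage.csv
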
Once you are in a view, you can press f to make sure the latency statistics are displayed; you may need to scroll right or remove columns to see all the information. The disk adapter view shows IOPS and latency statistics as combined totals for each physical disk adapter. In the screens below we can see IOPS (CMDS/s) and latency statistics (DAVG/cmd, KAVG/cmd, QAVG/cmd and GAVG/cmd); in this case vmhba0 is a local storage adapter and vmhba34 is an iSCSI adapter.

[Figure 5]

[Figure 6]

The GAVG/cmd column is the total guest latency. The iSCSI adapter is very busy in this case, with high IOPS (922.72) but low total latency (2.68). The local adapter has low IOPS (2.92) but high total latency (26.14), which can be further narrowed down to high VMkernel latency (KAVG/cmd). This can be caused by queue depths that are set too low; in this case the queue latency (QAVG/cmd) is low, so that is not the cause of the high latency. The VMkernel manages all storage adapters, and since the iSCSI adapter is so busy, the VMkernel might be slow to respond to the local adapter.
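
If you do suspect the queue depth, one way to check what a device is configured for is with esxcli (the namespace shown is from ESXi 5.x and later; on earlier releases the command differs):

    # List per-device details; look for the "Device Max Queue Depth" field
    esxcli storage core device list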

You can drill down further and see statistics for individual LUNs by switching to the disk device view. Here you can see our iSCSI volume and our local disk listed by their device IDs. You can match the device IDs to their friendly names in the Storage view of the vSphere Client, as shown below.

[Figure 7]
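
If you prefer the command line for this mapping, one option is esxcfg-scsidevs, which lists logical devices along with their IDs and display names:

    # List all logical devices with their device IDs and friendly display names
    esxcfg-scsidevs -l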

The iSCSI volume is showing high IOPS, with the majority being disk writes, as shown below.

[Figure 8]

In this case our latencies are not too bad, with most of the latency occurring in the physical storage device (DAVG/cmd).

[Figure 9]

Finally, you can switch to the disk VM view, where you can drill down further to see individual VM statistics and identify which VMs have the highest IOPS and latency, as shown below.

[Figure 10]

Here we can see that the VM named Dodge City is generating almost all the I/O on the host. Note that the latency statistics (LAT/rd and LAT/wr) are different in the disk VM view, as statistics like DAVG/cmd only apply at the host level.

Conclusion

This should get you started with understanding storage I/O bottlenecks and how to identify them. vCenter Server and esxtop can provide good information when troubleshooting issues, but they can be complicated to use and fall short of a complete storage monitoring solution. Third-party tools such as SolarWinds Virtualization Manager and Storage Manager can provide richer, easier-to-understand reporting on storage resources so you can quickly resolve bottlenecks and proactively prevent them from occurring. The dashboards in Virtualization Manager can provide you with a wealth of critical information about your storage in a single pane of glass so you can quickly see the health of your environment. Storage I/O bottlenecks are perhaps the biggest threat to performance in a virtual environment; you can’t afford to let them choke off the storage resources that your VMs require. Bottlenecks are not always obvious, and you may not know you have one until you actually take the time to look for it. Understanding what they are and having the right tools to eliminate them is the key to a healthy, well-performing virtual environment.