It is my experience that storage has often been the most overlooked and least understood factor when assessing performance issues. Even with desktop PCs people tend to focus on the CPU and memory, while the hard drive is just a question of capacity (although the arrival of SSDs has rectified this somewhat).
In this article, I’ll present a common storage scenario. I’ll also discuss how the problem could have been helped with IOPS (Input/Output Operations per Second), which is a key performance indicator once you start getting into more serious storage specification.
For an IT manager with servers to look after, protecting storage is critical. A RAID system to guard data against drive failure becomes a must, as do options like battery-backed write cache. Most readers should know the difference between RAID 0, 1, and 5, and perhaps variants like 1+0 and 6. (For a RAID primer, refer to our overview of RAID storage levels.)
Hard drive statistics like access time, rotational speed, and bus transfer rate can be found on manufacturers’ websites and give some indication of drive performance, but how many of you actually consider those figures? Storage performance and specification have long been considered something of a black art; for years, the standard practice when purchasing a new server has been simply to ensure it has enough capacity along with hot-swap RAID. In the past this was often sufficient through simple overkill: the storage system for a single server provided far better performance than was actually required. The only time storage really became an issue was with big database and Exchange servers, which are rare in the average SME.
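Those spec-sheet figures can in fact be turned into a rough per-drive IOPS estimate: a random I/O costs roughly one average seek plus half a rotation. A minimal sketch in Python, using illustrative (not measured) seek times:

```python
def drive_iops(avg_seek_ms: float, rpm: int) -> float:
    """Rough random-IOPS estimate for a spinning disk:
    one random I/O takes an average seek plus half a rotation."""
    rotational_latency_ms = 0.5 * 60_000 / rpm  # half a revolution, in ms
    return 1000 / (avg_seek_ms + rotational_latency_ms)

# Typical spec-sheet figures (illustrative assumptions, not measurements):
print(round(drive_iops(4.0, 10_000)))  # 10k SCSI drive -> ~143 IOPS
print(round(drive_iops(8.5, 7_200)))   # 7.2k SATA/SAS  -> ~79 IOPS
```

Note how a 10k drive delivers nearly twice the random I/O of a 7.2k drive, a gap that will matter later in this article.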
Whether in a production or development environment, the first introduction of virtualization into an organization is as a single server, typically running ESXi or Hyper-V, which offer free basic versions. This is where the first mistakes involving storage occur. Let’s consider a typical scenario where an old production server is repurposed as a “virtual host” server, perhaps with some extra memory added. In this scenario, an IT admin — let’s call him Jeff — may have a dual quad core Xeon server with 32GB of RAM and four 146GB 10k SCSI drives in a RAID5 array plus another four 300GB SATA drives also in a RAID5 array. This gives it a total of 1.3TB or so of storage. Jeff’s planning to migrate his company network of 50 users to Windows 2008 with Exchange 2010. Sensibly, he tries a test build in his new virtual environment first.
Now, let’s take a moment to consider what this might require:
1) A Win2k8 R2 Domain Controller & File Server with DHCP role
2) A Win2k8 R2 Server running Exchange 2010
3) A Win2k8 R2 Remote Desktop Services Server
4) A Win2k8 R2 SharePoint Server
How do you think all these would perform on the virtual host server we described earlier? Each of the servers could have two virtual cores, 8GB RAM (although we’d probably give the DC a bit less and the RDS a bit more) plus a 100GB system drive and a 250GB data drive. On paper that’s not a bad specification, and we could run a couple of test client sessions on the RDS server using Office with Outlook. Performance would be pretty impressive, too, even with a couple of large mailboxes and running several applications at once. If we checked the performance monitors in Windows 2k8 and on the hypervisor client, they might show the odd peak in CPU load and the RDS server memory usage might get quite high, but there’s nothing unexpected.
Okay, there is one big negative with the above scenario: Jeff’s putting all the critical servers on one physical server, which is enough to make any IT admin nervous. Ideally, this is an opportunity for a high-availability solution like VMware’s vSphere Essentials Plus, but for the purpose of this article let’s assume that the possibility of a day’s downtime is acceptable (for instance, if a warranty repair is required).
At this point it wouldn’t be unreasonable for Jeff to think that all the hype he has heard about virtualization is justified — instead of having to budget for four new servers, he could make do with just one high-spec system. Currently an HP DL380p Gen8 2U rack server with dual 6 core 2.3GHz Xeon CPUs and 16GB memory sells for just under $5,000. Instead of buying four of those he could just get 32GB extra RAM for $600 and eight 1TB SAS hard drives for another $5,000. Going for a non-virtual solution would mean a minimum of four servers (albeit with a lower specification), but the cost would be at least $20,000. Start taking into account additional lifetime costs such as power and aircon for these servers, and Jeff has a compelling case for going virtual.
After some hard work, Jeff’s new servers go live and his users have been migrated. Here the first few doubts might start to creep in during the migration process, as copying over the file shares and mailboxes from the old servers to the new virtual ones takes longer than one might expect, but that’s easily blamed on other factors like the network links. However, over the following weeks a terrible reality sets in: The users are finding the “new system” much slower than the old one, and the CEO is asking why they spent all this money for no obvious gain.
Given the main topic of this article, you already have a pretty good idea of the culprit: the storage subsystem. This may not be immediately obvious, though, unless our IT admin Jeff is particularly familiar with the performance monitoring tools, as the problem can manifest itself with misleading symptoms like 100% CPU usage from certain processes.
Let’s take a step back for a moment and consider what we actually have running now: four virtual servers on one physical host server, providing a complete IT infrastructure for 50 users. On average, each virtual server has three 2.3GHz CPU cores and 12GB of memory, which should be plenty for even a demanding user base. The HP DL380p Gen8 server also has four 1Gb NICs, so network bandwidth is unlikely to be a major bottleneck, and that would be obvious from the performance monitor. However, we have eight 1TB 7.2k SAS drives (probably configured as RAID 5 with a hot spare to give the most capacity with good protection against drive failure) because that’s how we always used to set up our servers. After all, our IT admin has probably gone from an average of 300GB storage per server to over 1TB, which should provide plenty of room for future growth, shouldn’t it?
The problem is that although the total storage capacity has at least quadrupled, we have gone from perhaps four 10k SCSI drives per server to the equivalent of only 1.5 slower 7.2k SAS drives per server. Spindle count and spindle speed, not capacity, are what largely determine random I/O performance.
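To put rough numbers on that, here is a back-of-envelope comparison in Python. The per-drive IOPS figures (~140 for a 10k drive, ~75 for a 7.2k drive) and the 70/30 read/write mix are assumptions for illustration; the RAID 5 write penalty of four disk operations per host write is standard:

```python
def array_iops(n_drives, per_drive_iops, read_frac, write_penalty=4):
    """Effective host IOPS for a RAID 5 array under a given read/write mix.
    RAID 5 turns each host write into four disk operations (read data,
    read parity, write new data, write new parity)."""
    raw = n_drives * per_drive_iops
    return raw / (read_frac + (1 - read_frac) * write_penalty)

# Old setup: four servers, each with four 10k drives (assumed ~140 IOPS each).
old_total = 4 * array_iops(4, 140, read_frac=0.7)
# New setup: one array of seven active 7.2k drives (~75 IOPS each) plus hot spare.
new_total = array_iops(7, 75, read_frac=0.7)
print(round(old_total), round(new_total))
```

On these assumptions the consolidated host has roughly a quarter of the aggregate random I/O capability of the four old servers, which goes a long way toward explaining the user complaints.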
So, how do you avoid Jeff’s mistake? Unfortunately there aren’t really any hard and fast rules. If you search the web for guidance on storage requirements for servers, the best answers you will find (apart from minimum capacity specifications) are mentions of measuring IOPS. Hard drive specification sheets and “lab reviews” usually focus on read/write MB/sec, which is fine if you’re looking for a drive for your personal PC, as it gives you a good idea of how fast you can copy or open files.
However when it comes to larger storage systems these figures become pretty irrelevant, largely because only a small proportion of the work done by your hard drives will involve moving large files about. In fact, when supporting multiple servers, your storage will spend nearly all of its time handling small read/write operations, which are IOPS. Consider this diagram of a typical SME vSphere storage setup:
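To see why the sequential MB/sec figures mislead here, consider what a drive actually delivers under a random workload. Assuming ~80 random IOPS for a 7.2k drive and a 4 KB I/O size (both illustrative figures):

```python
# A drive rated at, say, 150 MB/s sequential delivers far less under
# small random I/O: throughput is simply IOPS times the I/O size.
iops = 80            # assumed random IOPS for a 7.2k drive
io_size_kb = 4       # typical small random I/O
random_mb_per_s = iops * io_size_kb / 1024
print(f"{random_mb_per_s:.2f} MB/s")  # about 0.31 MB/s
```

Under a fully random load, the headline transfer rate collapses by more than two orders of magnitude, which is why IOPS rather than MB/sec is the figure that matters for shared storage.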
Although the diagram shows VMware ESX servers, the principle is the same for all virtualization products – you have multiple virtual machines accessing your storage system simultaneously. At the large enterprise level we could spend many pages discussing the relative merits of different virtualization products’ file systems and storage APIs, but for the SME the differences aren’t important compared to getting your storage right in the first place.
At this point, the answer seems quite simple: You just need to get a good idea of the average and peak IOPs your virtual machines will generate and specify a storage system to suit it. Unfortunately, this is far easier said than done.
If you intend to virtualize an existing set of servers, tools are available to measure the IOPS of each one: Windows Server Performance Monitor includes the relevant counters (under the PhysicalDisk object, in particular Disk Transfers/sec). It’s best to log the load throughout a typical working day; usually you will find the servers hit their peak IOPS first thing in the morning, when users are logging in and starting work.
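Once you have that Performance Monitor log exported to CSV, summarizing it takes only a few lines. A sketch in Python; the column name below is the bare counter name, but a real export’s headers will include the machine and instance names, so adjust accordingly:

```python
import csv
import io

def iops_stats(csv_text, counter_col):
    """Return (average, peak) of one Performance Monitor counter
    exported to CSV, skipping blank samples."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(r[counter_col]) for r in rows if r[counter_col]]
    return sum(values) / len(values), max(values)

# Toy sample standing in for a full day's log of Disk Transfers/sec:
sample = """Time,Disk Transfers/sec
08:55,310
09:00,540
12:00,220
17:00,180
"""
avg, peak = iops_stats(sample, "Disk Transfers/sec")
print(f"average {avg:.1f} IOPS, peak {peak:.0f} IOPS")
```

The peak figure, not the average, is what your new storage system has to be sized against, since that morning logon spike is exactly when users notice slowness.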
When upgrading to a new server infrastructure at the same time as virtualizing, it becomes more of a challenge. For example, Exchange 2010 is far more efficient with storage IOPS than Exchange 2003, but to counter that you are likely to see a greatly increased rate of capacity usage due to the changes in the way the Exchange storage engine works. There are guides available for Exchange and SQL in particular that are intended to help you estimate your IOPS requirements, but they tend to be more enterprise-focused and the numbers don’t scale down well.
Even if you can come up with a ballpark figure for your IOPS requirements, the second problem you will encounter is more fundamental: designing a storage system that can meet them. This isn’t such an issue once you get into the enterprise SAN market, where the vendor will specify a system to suit. But even with the plunging cost of SAN storage in recent years, you will still be looking at systems costing $50k and up, well beyond the requirements and budget of most SMEs.
Confusing, isn’t it? Perhaps now you understand why I started the article by saying that storage is the least understood aspect of virtualization projects. The basic principles of IOPS are fundamental to understanding how a virtual infrastructure uses its storage and the performance you will get from it. Unfortunately, it’s usually not practical at the smaller scale to specify your storage system purely by IOPS. In a follow-up article, I’ll look at some basic SAN rules of thumb you can use.