Troubleshooting a Slow Database Server: Is SQL Server the Cause?
Troubleshooting a slow database server is one of the easiest and hardest things you’ll ever do in IT. While there are some anomalies that can take a real performance expert several days to track down, most of the time your performance issues are going to be fairly straight forward. The easy part is determining whether or not the issue is SQL-related and what form it’s taking. The hard part can be determining exactly what the root cause of the issue is inside the SQL Server itself. That is, does it stem from a bad query, parallelism, caching, too many compiles, bad execution plans, etc.
And while this article really applies to any database server, I’m going to discuss SQL Server specifically as a point of reference. Additionally, I’m going to discuss the high-level troubleshooting to determine whether the issue is SQL Server, and leave the internal database troubleshooting for another article.
Typically a production SQL Server will be on a dedicated box, but there are times when it’s not. You’ll need to try to determine this up front if you can. It may help guide your troubleshooting.
From an end user perspective, there’s no way of knowing where the slowdown is coming from. It could logically be CPU, memory, disk, network, or any of the more specific sub-components of each of those broad categories. So the first thing you need to do is to determine the offending component and then you can snipe the specific cause.
Let’s look at the easiest way to determine the over-arching cause of a system slowdown.
What is “Inside Microsoft Teams”?
“Inside Microsoft Teams” is a webcast series, now in Season 4 for IT pros hosted by Microsoft Product Manager, Stephen Rose. Stephen & his guests comprised of customers, partners, and real-world experts share best practices of planning, deploying, adopting, managing, and securing Teams. You can watch any episode at your convenience, find resources, blogs, reviews of accessories certified for Teams, bonus clips, and information regarding upcoming live broadcasts. Our next episode, “Polaris Inc., and Microsoft Teams- Reinventing how we work and play” will be airing on Oct. 28th from 10-11am PST.
Open Perfmon (I’m assuming in this article that you already know the extreme basics of how to work Perfmon) and you’ll notice the following counters are already loaded into the collection:
- Avg. Disk Queue Length
- % Processor Time
These are really good counters to start with because they cover 3 of the 5 major areas. The ones that are missing are memory and network. I’ll talk about network later, and memory isn’t that interesting to us here anyway so we really don’t need to investigate it in this first round. Why? Well because on a dedicated database box, SQL Server is going to use most of, if not all of the available memory by default so all you’ll see in Perfmon is that all the memory is being used. So it really doesn’t help in this case. You can get a hint that there may be something memory-related going on though by looking at the pages/sec counter.
I’m now going to talk about each one of these briefly to give you an idea of what to look for.
This is how much paging your system is doing in and out of virtual memory. If you’re doing a lot of paging then this could be an indication that you’ve got other applications on the box taking memory away from the database. There’s no real right answer on what this counter should be so you should have a baseline to tell you how it usually operates. It should be as close to 0 as possible though. It’s also notable that this counter alone won’t really do much for you if you’re on a dedicated database box as databases don’t tend to page much, if any at all. This is one reason why it’s good to know if your box is dedicated to SQL Server.
Avg. Disk Queue Length
This counter can be misleading because while you may be looking at what seems to be a high number, it may be perfectly fine for your system. The general rule is that a disk queue should be the number of spindles x 2. OK, so what does that mean? I’ll give you a really simple example. Let’s say you have only a single disk on your server; just a single SCSI hard drive and nothing else. Your acceptable disk queue is 2 because you only have 1 spindle (hard drive) and you’re generally allowed a queue of 2 per spindle. So if you have a SAN with 50 spindles in your array, your acceptable disk queue is 100 (that’s 50 x 2, right?).
So to get any real use out of this counter you’ll need to know something about the underlying storage, or you’ll have to have a baseline. And again, I can’t stress the baseline enough because even though rule of thumb may allow you a disk queue of 100, if your system typically runs with a queue of 10-15, then 100 goes back to being unacceptable. There’s simply no substitute for having a picture of normal conditions on your box.
% Processor Time
This counter can be taken at face value. If you’re running a high CPU, then it’s a pretty good indicator that CPU is the culprit. If this is the case then you’ll need to determine which process is using the lion’s share of the CPU though and that takes another counter. If you do determine you’re looking at a CPU issue, then you’ll need to add in the Process\% Processor Time counter and either add the counter for all processes (don’t use _Total) or pick the most likely suspects and keep adding them until you find the offender. If sqlserver.exe is the clear offender, then you can now either get your DBA involved or move to the internal SQL Server methods for troubleshooting performance issues. And again, it’s best to have a baseline. If your database is pegging the CPU at 93%, but it’s usually at 90-95%, then CPU probably isn’t your culprit because you’re operating within normal parameters. So know what your system looks like when it’s running normally.
The reason why we’re not really worried about monitoring networking counters is because the only counters that would be of use to us are the internal windows and NIC counters, and these are rarely at fault. If there is a network issue, it’s typically in the network itself which you can’t diagnose with any of these counters. And while this happens from time to time, it’s really something you’ll check in round 4 or 5 of your troubleshooting and isn’t really interesting to us here in round 1.
These are the very basics of how to determine whether your database is causing your system slowdown. And for every rule I gave above I can think of a handful of exceptions so nothing is written in stone. You can use Task Manager if you like, but any questions that surface force you to open up Perfmon anyway, so you’d might as well start there. Perfmon also allows you to save counter collections so you don’t have to redefine your measures every time. Don’t forget there’s no substitute for a baseline. Every troubleshooting session I’ve ever had has been made more difficult where I didn’t have anything to compare it to.