Linux and Unixes have excellent metric of system load called “loadavg”. In fact load average is is 3 numbers which correspond to “load average” calculated over one, five and 15 minutes. It is computed as exponential moving average, so the most recent load has more weight in the value than old ones.
What does Load Average correspond to ? At least on Linux it is the number of processes which are in a “running” state or in a “uninterruptable sleep” state, which typically corresponds to disk IO. You can also map LoadAvg to VMSTAT output – which is something like moving average of sum of “r” and “b” columns from VMSTAT.
Obviously the minimum value for LoadAvg is zero which corresponds to completely idle system, and there is no maximum 🙂
The first thing to understand about LoadAvg it does not really tell you if it is CPU bound load or IO bound load. For example, if you have a LoadAvg of 10 it may mean there are 10 processes/threads actively consuming CPU, or it could be there are the same 10 processes waiting on disk IO and you can see CPU utilization being close to zero.
The second thing is to understand is that LoadAvg values are relative to your system size. If you have a single CPU and 1 disk, a loadavg of 2 can be considered significant, while if you have 16 CPUs and 2 disks, a load of 4 can be light if it is CPU bound – because the system can execute much more CPU bound tasks in parallel or High if it is Disk Bound LoadAvg.
Low Load Average does not mean there are no performance problems, for example if you run a single batch job on a MySQL server, the Load Average is likely to be close to 1 even if there are a lot of CPUs and Disks – the system may be quite idle but performance will still poor because the application is not parallel enough. Similar situations can happen if there is a lot of network IO involved or if there are a lot of locks (table/row level locks) or other limiting factors such as innodb_thread_concurrency.
The most interesting question I think is how LoadAvg represents box load in terms of how much load it can handle before it starts to slow down or become completely unable to handle the load, and it is a tricky question. Both for CPUs and for Disk there are two stages that a request can be in. It can be ether currently executing or queued for further execution. The time which is needed to complete the request is the sum of time it was really executed in and the time it was spent in the queue. As the system is loaded, response time starts to increase mainly because of time requests spent waiting in various queues and waiting on locks, the time of true execution may well remain constant. This is a bit of a simplifications as there are a number of other effects coming in play, but it is good enough for the sake of explanation.
What does it mean from a LoadAvg standpoint ? You need to understand where parallel execution continues and where waiting in the queue starts. If you have a fully CPU bound workload which is rather parallel (ie many queries will run at once) and you have 4CPUs until your LoadAvg is below 4 you have low time spent waiting for CPUs to be free to do the work. There is some wait, but not much. So if you have LoadAvg of 1 and your workload scales linearly with the number of connections and CPUs (ie there are no row waits involved) you can assume the box can handle up to 3-4 times more load before response time starts to suffer.
If however the LoadAvg is already 4 it may take a rather insignificant increase to take it up to 8 and you will see some delays due to queuing. If there are 4CPUs (Cores) and the loadavg is 16 for CPU bound workload, it often means requests should take 4 times more to complete than they would on a idle box due to waiting in the queue.
The same true for pure Disk IO bound workload with a small difference for a disk not being replaceable (if you’re waiting on one drive you can’t use another drive instead), and the fact that disks can optimize multiple outstanding requests a bit better compared to requests coming one after another.
For a mixed workload, which is what we usually see in practice, you have to do some assumptions guesses or further analyzes if you want good estimates. Ie you may want to check mpstat, vmstat and iostat to see where load comes from. But the general rule remains the same – until you’re able to explore parallel abilities of the box it will perform well as soon as you need to do a lot of queuing performance starts to suffer.
Let us clarify the last point – how much more load can the box can handle before it overloads, loadavg skyrockets, and it becomes as good as down. First, for many applications the request inflow is not constant – ie a web site delivers a poor response time and so users do not spend so much time on it so the load drops. This is a temporary releif only because there are stubborn users who will not go away even with a slow responding site until their browsers actually timeout, which is as good as the site being down. There are too many variables to come with exact numbers but generally as soon as you have long queuing started it may take just 10-20% extra load to overload system, so it is better to keep loadavg low – below number of CPUs and/or disks you have.
I must note – the system load is not a perfect tool for the task. It is just almost always available unlike other metrics. It is best to have additional profiling information so you can see the response time for your requests starts to grow. As soon as it starts to grow with no good reason I would start to worry about whats happening.
P.S I acknowledge some of explanations here are simplifying things for explanation purposes.