What is CPU steal and how does it affect my Linode?

Linode Staff

My Linode is experiencing CPU steal. What does that mean for me?


CPU steal is the percentage of time that a virtual CPU spends waiting for the physical CPU while the hypervisor is servicing another virtual CPU. In short, CPU steal occurs when a Linode's shared CPU core is delayed in processing a request. This is typically a sign of resource contention on the host, but that is not always the case.
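To make the "percentage of time" definition concrete, here is a minimal sketch that derives a steal percentage from two samples of the aggregate cpu line in /proc/stat. The function name steal_between is hypothetical (it is not part of any tool), and the sketch assumes the usual Linux field order on that line: user, nice, system, idle, iowait, irq, softirq, steal.

```shell
# Hypothetical helper: compute %steal between two "cpu ..." lines taken
# from /proc/stat at different times.
# Assumed field order after the "cpu" label:
#   user nice system idle iowait irq softirq steal ...
steal_between() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user through steal
            steal = $9
        }
        NR == 1 { t0 = total; s0 = steal }
        NR == 2 {
            dt = total - t0
            if (dt > 0) printf "%.1f\n", 100 * (steal - s0) / dt
            else print "0.0"
        }
    '
}

# Example usage on a live system (5-second window):
#   a="$(grep "^cpu " /proc/stat)"; sleep 5; b="$(grep "^cpu " /proc/stat)"
#   steal_between "$a" "$b"
```

Tools such as iostat and mpstat report the same kind of figure for you, so this is only meant to show where the number comes from.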

My Linode's processes are not performing as expected. What can I do to see if CPU steal is occurring?

If you're noticing that your Linode's performance is suffering, there is a possibility that CPU steal is occurring. The best starting point to diagnose a potential CPU steal issue is to run the following commands.

iostat 1 10
for x in `seq 1 1 30`; do ps -eo state,pid,cmd | grep "^D"; echo "-"; sleep 2; done
top -bn 1 | head -15

Let's break down what each of these commands does, and what a happy output looks like.


iostat 1 10

The iostat command is used to determine whether your Linode is observing CPU steal. The arguments 1 10 tell iostat to print a report every 1 second, 10 times. Keep in mind that the first report shows averages since the system booted, while each later report covers only the preceding interval. The output below shows a Linode that is not experiencing CPU steal.

Note: This output has been cut to 3 iterations for spacing purposes.

Good output:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.22    0.00    0.21    0.01    0.08   99.48

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.08         0.11         6.65     367895   22208288
sdb               0.00         0.00         0.00       1172          0
sdc               0.00         0.00        21.56       2361   72064576

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.00    0.00    0.00  100.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.00    0.00    0.00    0.00    0.00   99.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0

As seen, there is little to no CPU steal occurring on this Linode. Next, we'll see an example of an iostat output that does show CPU steal.

Bad output:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.39    0.00   12.74    0.00   37.84   49.03

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.38    0.00    6.92    0.00   35.00   57.69

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.39    0.00    7.75    0.00   41.09   50.78

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0

Here, we see %steal values between 35.00 and 41.09, which indicates that this Linode is experiencing significant CPU steal.

Note: Linode considers spikes in CPU steal values less than 10-15% to be within an acceptable range in a virtualized environment.
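If you'd rather not scan the reports by eye, the check above can be sketched as a small filter. The function name steal_report is hypothetical, and the default limit of 15 is an assumption taken from the upper end of the 10-15% range in the note above; adjust it to taste.

```shell
# Hypothetical helper: read iostat output on stdin, print the %steal value
# from each avg-cpu report, and flag readings above a limit.
# $1 = limit in percent (assumed default: 15)
steal_report() {
    awk -v limit="${1:-15}" '
        /^avg-cpu:/ {
            # Locate the %steal column in the header. The values line has no
            # "avg-cpu:" label, so its columns are shifted left by one.
            for (i = 1; i <= NF; i++) if ($i == "%steal") col = i - 1
            want = 1
            next
        }
        want && NF > 0 {
            flag = ($col + 0 > limit + 0) ? "HIGH" : "ok"
            printf "%.2f %s\n", $col, flag
            want = 0
        }
    '
}
```

For example, running LC_ALL=C iostat 1 10 | steal_report prints one line per report, such as "37.84 HIGH" for the first bad sample above.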


for x in `seq 1 1 30`; do ps -eo state,pid,cmd | grep "^D"; echo "-"; sleep 2; done

This command checks your Linode for processes that are in the 'D' state. A 'D' state process is stuck in uninterruptible sleep. The loop takes 30 samples, pausing 2 seconds after each one, and prints a - after every sample so you can tell the samples apart.

Sample outputs:

Good output - No processes are in the state of 'D'

~$ for x in `seq 1 1 30`; do ps -eo state,pid,cmd | grep "^D"; echo "-"; sleep 2; done
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-

Bad output:

# for x in `seq 1 1 30`; do ps -eo state,pid,cmd | grep "^D"; echo "-"; sleep 2; done
-
D  2169 [jbd2/sda-8]
-
D  2169 [jbd2/sda-8]
-
D  2169 [jbd2/sda-8]
-
D  2169 [jbd2/sda-8]
-
D  2169 [jbd2/sda-8]
-
D  2169 [jbd2/sda-8]
-
D  2169 [jbd2/sda-8]
-
-
-
D  2169 [jbd2/sda-8]
-
D  2169 [jbd2/sda-8]
-
D  2169 [jbd2/sda-8]
-
D  2169 [jbd2/sda-8]
-
D  2169 [jbd2/sda-8]
-
D  2169 [jbd2/sda-8]
-
-
D  2169 [jbd2/sda-8]
-
-
-
D  2169 [jbd2/sda-8]
-
-
D  2169 [jbd2/sda-8]
-
-
D  2169 [jbd2/sda-8]
-
D  2169 [jbd2/sda-8]
-
-
D  2169 [jbd2/sda-8]
-
-
D  2169 [jbd2/sda-8]

In the second set of outputs, we see that [jbd2/sda-8] is in the 'D' state in most of the samples, meaning that this process is in uninterruptible sleep. jbd2 is a kernel thread that writes the filesystem journal to disk. If it's waiting on IO, the OS is having a hard time keeping up with journaling, and write-heavy applications on that disk (MySQL, on this Linode) are becoming write-bound.

'D' state processes are usually waiting on IO, but they can also indicate that a process is contributing to CPU steal. In that case, the stuck processes would be the reason that the Linode is experiencing CPU steal: all of the available resources are tied up in the 'D' state processes, leaving too little processing power for anything else.

'D' state processes cannot be killed by standard means. Instead, a full reboot will need to be initiated to end those processes. Another option is to wait until the IO catches up, at which point the 'D' state process resolves itself. The downside is that CPU steal could continue to cause performance issues, preventing the processes from ever having the resources they need.
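With 30 samples scrolling past, it can be hard to judge how often a process was actually stuck. As a rough sketch, the loop's output can be piped into a small summary. The function name summarize_d_state is hypothetical, and the sketch assumes the exact output format of the loop above: one - line per sample, plus "D pid cmd" lines.

```shell
# Hypothetical helper: summarize the sampling loop's output. For each
# process seen in 'D' state, print how many samples it appeared in.
summarize_d_state() {
    awk '
        $1 == "-" { samples++; next }      # each "-" line marks one sample
        $1 == "D" {
            key = $2 " " $3                # PID and first word of the command
            seen[key]++
        }
        END {
            for (k in seen)
                printf "%s: %d of %d samples\n", k, seen[k], samples
        }
    '
}

# Example usage with the loop from this post:
#   for x in `seq 1 1 30`; do ps -eo state,pid,cmd | grep "^D"; echo "-"; sleep 2; done | summarize_d_state
```

A process that shows up in only one or two samples is usually nothing to worry about; one that appears in most of them, like jbd2/sda-8 above, deserves a closer look.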


top -bn 1 | head -15

The top command is used to monitor the processes that are using your Linode's resources at any given time. Here, -b runs top in batch mode and -n 1 limits it to a single iteration. The output is then piped through the head command, which keeps the first 15 lines: the summary header followed by the processes using the most CPU.

Sample output:

# top -bn 1 | head -15
top - 18:27:41 up 1 day,  4:25,  1 user,  load average: 10.85, 11.04, 10.14
Tasks: 227 total,   1 running, 226 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.0 us,  1.0 sy,  0.0 ni, 80.0 id,  4.8 wa,  0.0 hi,  0.5 si,  1.7 st
KiB Mem:  49458404 total, 37431524 used, 12026880 free,   136744 buffers
KiB Swap:   262140 total,        0 used,   262140 free. 16347320 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4476 mysql     20   0 24.759g 0.019t  13352 S 718.2 40.3   2454:04 mysqld
 2169 root      20   0       0      0      0 D   6.3  0.0  39:52.77 jbd2/sda-8
 2172 root       0 -20       0      0      0 S   6.3  0.0  10:28.61 *******/0:1H
    1 root      20   0   29632   5536   3132 S   0.0  0.0   0:11.05 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.01 ********
    3 root      20   0       0      0      0 S   0.0  0.0   2:27.18 ksoftirqd/0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 *******/0:0H
    7 root      20   0       0      0      0 S   0.0  0.0   1:57.57 rcu_sched

In this particular sample, we see that the mysqld process is running at a %CPU value of 718.2%.

Please note that high CPU values are not always indicative of CPU steal, because top counts each virtual CPU core as 100%. For example, a plan with 8 CPU cores can reach 800% CPU usage.
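To put a figure like 718.2% in context, you can divide it by the number of cores. Here is a minimal sketch; the function name cpu_share is hypothetical, and the core count of 8 in the example is an assumption for illustration, since the sample output doesn't say how many cores that Linode has.

```shell
# Hypothetical helper: express a top %CPU figure as a share of total
# capacity, where each core counts as 100%.
# $1 = %CPU value from top
# $2 = number of cores (defaults to this machine's count via nproc)
cpu_share() {
    awk -v pct="$1" -v cores="${2:-$(nproc)}" \
        'BEGIN { printf "%.1f\n", pct / cores }'
}

# Example: on an 8-core plan (assumed), mysqld at 718.2% is using
# roughly 90% of the Linode's total CPU capacity:
#   cpu_share 718.2 8
```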

When interpreting the results of the top command, you'll want to determine which processes, if any, are using an unexpected share of your Linode's CPU resources. If a process's usage seems higher than expected, restarting or killing it may be the best course of action when troubleshooting potential CPU steal.

I've used the above commands to troubleshoot my Linode, but I'm still seeing high CPU steal percentages. What can I do next?

If you've taken the time to fully troubleshoot your Linode's performance issues, then your next step is to open a Support ticket so we can look into any potential contention issues on your Linode's host machine. When opening a ticket, please include the output of the above commands so we can assist with interpreting the results.

After Support has reviewed the information that you've provided, we will offer suggestions based on what we feel the best course of action will be to minimize the performance impact that the CPU steal is having on your Linode.

In some cases, we will recommend that your Linode be migrated to another host machine that has a lighter load. We will typically default to a live migration, which does not involve any downtime for your Linode. During this process, you will need to leave your Linode powered on, and Support will let you know when the migration is complete. Once your Linode is running on the new host, you'll want to rerun the troubleshooting commands to see if your performance has improved.

Please note that there are some cases where a cold migration will need to be performed to migrate your Linode. In such cases, we will confirm with you that you are ready and able to have your Linode powered down for the duration of the migration.

You may also run this command to assess the CPU activity on your Linode:

mpstat -P ALL 1 10

This command will provide one reading per second, 10 times, for each of your CPU cores. To adjust the frequency of the readings, change the 1 to your preferred value in seconds. To adjust the total number of readings, change the second number (or remove it to continue until you press Control-C).

This command will output several sections which look like this for each reading:

10:49:21 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:49:22 PM  all    0.99    0.00    0.99    0.00    0.00    0.00    0.00    0.00    0.00   98.02
10:49:22 PM    0    0.99    0.00    0.99    0.00    0.00    0.00    0.00    0.00    0.00   98.02

Once the command stops running, it will display an average of its results:

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all    0.20    0.00    0.20    0.00    0.20    0.10    0.00    0.00    0.00   99.30
Average:       0    0.20    0.00    0.20    0.00    0.20    0.10    0.00    0.00    0.00   99.30

The headings listed in these outputs will show the system's CPU utilization across the following categories:

  • %usr - user (application) level
  • %nice - user (application) level, with nice priority
  • %sys - system (kernel) level, excluding hardware usage and software interrupts
  • %iowait - waiting for disk I/O
  • %irq - hardware interrupt CPU time
  • %soft - software interrupt CPU time
  • %steal - CPU steal from hypervisor
  • %guest - CPU time dedicated to the system's virtual guests
  • %gnice - CPU time dedicated to the system's virtual guests (niced)
  • %idle - unused CPU time

The most important figure here for our purposes is %steal, but the other figures can have meaning for the system's performance depending on what types of workloads it is handling.
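Since %steal is the column of interest, here is a rough sketch of a filter that pulls just that figure out of the Average: lines for each core. The function name steal_per_cpu is hypothetical. It finds the column by its header name rather than by position, on the assumption that column order can differ between sysstat versions.

```shell
# Hypothetical helper: read mpstat output on stdin and print the average
# %steal for "all" and for each individual core.
steal_per_cpu() {
    awk '
        $1 == "Average:" && $2 == "CPU" {
            # Header row: remember which column holds %steal.
            for (i = 1; i <= NF; i++) if ($i == "%steal") col = i
            next
        }
        $1 == "Average:" && col { printf "%s %s\n", $2, $col }
    '
}

# Example usage:
#   LC_ALL=C mpstat -P ALL 1 10 | steal_per_cpu
```

If one core shows much higher steal than the others, that is worth mentioning when you open a Support ticket.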
