0xStubs


Debugging a flaky cpu steal time counter on a paravirtualized Xen guest

I recently noticed a strange phenomenon on a Debian Stretch server running as a paravirtualized guest on a Xen host: top showed the CPU as either 100% idle or 100% stolen, while user, system, nice and waiting times were stuck at 0%. I cross-checked with vmstat, which showed 0% for all CPU time counters as well. Both tools get their information from /proc/stat, which looked like the following:

cpu 5322 0 4376 12720669 37879 0 59 1198368772563 0 0
cpu0 5322 0 4376 12720669 37879 0 59 1198368772563 0 0
...

The third-to-last value is the steal time, denoting “stolen time, which is the time spent in other operating systems when running in a virtualized environment” [procfs(5)]. This value looked way too high and, in particular, it was counting backwards. So, before putting this system into production, some debugging was required…

After a reboot, the steal time in /proc/stat looked normal again, in the range of a few hundred and slowly counting upwards. To find out what was happening to it, I wrote a small script that polls /proc/stat and notifies me about unexpected behavior:

#!/bin/bash
read x x x x x x x x stolen x x < <(head -n1 /proc/stat)
triggered=0
ctr=0
while true; do
	lStolen=$stolen
	read x x x x x x x x stolen x x < <(head -n1 /proc/stat)
	diff=$(($stolen - $lStolen))
	if [ $triggered -eq 0 ]; then
		if [ $diff -lt 0 ] || [ $diff -gt 100000 ]; then
			triggered=1
			mailtext="time\tstolen\tdiff\n----\t------\t----\n0ms\t$lStolen\n"
		fi
	fi
	if [ $triggered -eq 1 ]; then
		ctr=$(($ctr + 1))
		mailtext="${mailtext}$((ctr*100))ms\t$stolen\t$diff\n"
	fi
	[ $ctr -eq 20 ] && break
	sleep 0.1
done
mailtext=$(echo -e $mailtext | column -t)
echo -e "$mailtext" | mail -s "Steal time jumped!" root

Started via systemd-run, the script could run in the background until a jump in the steal time was detected. After an hour I got an email. Here is part of the recorded trace:

time    stolen         diff
----    ------         ----
0ms     784
100ms   1844670130367  1844670129583
200ms   1844664564089  -5566278
300ms   1844659554439  -5009650
400ms   1844655101417  -4453022
...

There was the culprit: a large jump in the steal time, after which the counter ran backwards. A web search turns up similar reports, caused by buggy KVM or Xen hosts that report erroneous steal time values to their guests during migration. Since I have no influence on the host of this system, I had to find a way to deal with this behavior on the guest side. The guest in my case is a Debian Stretch server running the kernel provided by Debian, which is based on version 4.9.30.

The code responsible for accounting steal time in the Linux kernel can be found in kernel/sched/cputime.c and in version 4.9.30 looks like the following*:

void account_steal_time(cputime_t cputime)
{
        u64 *cpustat = kcpustat_this_cpu->cpustat;

        cpustat[CPUTIME_STEAL] += (__force u64) cputime;
}

static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
{
#ifdef CONFIG_PARAVIRT
        if (static_key_false(&paravirt_steal_enabled)) {
                cputime_t steal_cputime;
                u64 steal;

                steal = paravirt_steal_clock(smp_processor_id());
                steal -= this_rq()->prev_steal_time;

                steal_cputime = min(nsecs_to_cputime(steal), maxtime);
                account_steal_time(steal_cputime);
                this_rq()->prev_steal_time += cputime_to_nsecs(steal_cputime);

                return steal_cputime;
        }
#endif
        return 0;
}

paravirt_steal_clock() ultimately gets its value via xen_steal_clock() from drivers/xen/time.c if the Xen module is loaded. One important detail here is the conversion to and from cputime using nsecs_to_cputime() and cputime_to_nsecs(), which correspond to a division and a multiplication by some constant. So if steal for some reason becomes negative (i.e. wraps around to a huge number, since it is of type u64), steal_cputime does not overflow as well, but is instead simply increased by that huge number divided by the conversion constant.

My first attempt at fixing this was to force steal_cputime to overflow, and therefore decrease, as well*:

--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -262,14 +262,29 @@ static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
 #ifdef CONFIG_PARAVIRT
 	if (static_key_false(&paravirt_steal_enabled)) {
 		cputime_t steal_cputime;
-		u64 steal;
+		u64 steal_time;
+		s64 steal;
 
-		steal = paravirt_steal_clock(smp_processor_id());
-		steal -= this_rq()->prev_steal_time;
+		steal_time = paravirt_steal_clock(smp_processor_id());
+		steal = steal_time - this_rq()->prev_steal_time;
 
-		steal_cputime = min(nsecs_to_cputime(steal), maxtime);
-		account_steal_time(steal_cputime);
-		this_rq()->prev_steal_time += cputime_to_nsecs(steal_cputime);
+		if (likely(steal >= 0)) {
+			steal_cputime = min(nsecs_to_cputime(steal), maxtime);
+			account_steal_time(steal_cputime);
+			this_rq()->prev_steal_time += cputime_to_nsecs(steal_cputime);
+		} else {
+			/* steal_time jumped backwards which causes steal to overflow.
+			 * Force cpustat[CPUTIME_STEAL] to overflow as well so things
+			 * match up again. */
+			steal_cputime = min(nsecs_to_cputime(-steal), maxtime);
+			account_steal_time(-steal_cputime);
+			printk(KERN_WARNING "steal_time return by paravirt_steal_clock() for "
+			       "processor %d jumped backwards: %llu --> %llu, forcing "
+			       "CPUTIME_STEAL counter to overflow by adding %llu\n",
+			       smp_processor_id(), this_rq()->prev_steal_time, steal_time,
+			       -steal_cputime);
+			this_rq()->prev_steal_time -= cputime_to_nsecs(steal_cputime);
+		}
 
 		return steal_cputime;
 	}

Indeed this fixed the issue and after some time I had a kernel log full of messages like the following:

[18653.288714] steal_time return by paravirt_steal_clock() for processor 0 jumped backwards: 11200000000 --> 11199990143, forcing CPUTIME_STEAL counter to overflow by adding 0
[18653.312010] steal_time return by paravirt_steal_clock() for processor 0 jumped backwards: 11200000000 --> 11199957815, forcing CPUTIME_STEAL counter to overflow by adding 0
[18653.316010] steal_time return by paravirt_steal_clock() for processor 0 jumped backwards: 11200000000 --> 11199947899, forcing CPUTIME_STEAL counter to overflow by adding 0
[18653.320010] steal_time return by paravirt_steal_clock() for processor 0 jumped backwards: 11200000000 --> 11199948666, forcing CPUTIME_STEAL counter to overflow by adding 0

This was rather revealing: the steal clock does not jump up and then start running backwards; instead, the initial jump is itself caused by a small decrease of the steal clock and the resulting overflow of steal. Also interesting is that the decrease is always small enough to equal 0 when converted into cputime, so the patch can be simplified a lot. I also reduced the severity and frequency of the output so the patch can be used in production without spamming the logs too much*:

--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -262,10 +262,19 @@ static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
 #ifdef CONFIG_PARAVIRT
 	if (static_key_false(&paravirt_steal_enabled)) {
 		cputime_t steal_cputime;
-		u64 steal;
-
-		steal = paravirt_steal_clock(smp_processor_id());
-		steal -= this_rq()->prev_steal_time;
+		u64 steal_time;
+		s64 steal;
+
+		steal_time = paravirt_steal_clock(smp_processor_id());
+		steal = steal_time - this_rq()->prev_steal_time;
+
+		if (unlikely(steal < 0)) {
+			printk_ratelimited(KERN_DEBUG "cputime: steal_clock for "
+				"processor %d decreased: %llu -> %llu, "
+				"ignoring\n", smp_processor_id(),
+				this_rq()->prev_steal_time, steal_time);
+			return 0;
+		}
 
 		steal_cputime = min(nsecs_to_cputime(steal), maxtime);
 		account_steal_time(steal_cputime);

So, what kernel versions are affected by this issue? I found two commits in the kernel’s git history which are fundamental for this issue to show up. Both commits were first included in Linux kernel 4.8:

The first commit simplifies code by replacing a Xen-specific implementation of CPU steal time accounting with the generalized version for paravirtualized guests that is used now. The Xen-specific code included a check for (stolen < 0), which the general code no longer contains. The second commit removed a restriction on the amount by which the CPU time counters can increase, which would also have caught these overflows.

So, is the current kernel still affected by this issue? Since kernel version 4.11, steal_account_process_time() looks like the following*:

static __always_inline u64 steal_account_process_time(u64 maxtime)
{
#ifdef CONFIG_PARAVIRT
        if (static_key_false(&paravirt_steal_enabled)) {
                u64 steal;

                steal = paravirt_steal_clock(smp_processor_id());
                steal -= this_rq()->prev_steal_time;
                steal = min(steal, maxtime);
                account_steal_time(steal);
                this_rq()->prev_steal_time += steal;

                return steal;
        }
#endif
        return 0;
}

The conversions from and to cputime are gone. Therefore, if an overflow of steal occurs, the accounted steal time overflows, i.e. decreases, as well. this_rq()->prev_steal_time is updated accordingly, too, so subsequent erroneous behavior won’t occur. Consequently, only kernel versions 4.8, 4.9 and 4.10 are affected.

Still, this is an issue for Debian Stretch guests running on certain Xen hosts, so I reported the issue together with the simplified patch to Debian’s bugtracker: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=871608

In case you are running Stretch and are hit by this issue, I provide a repository where I push my patched Debian kernel packages. You can use it by adding the following line to /etc/apt/sources.list and importing this GPG key.

deb http://0xstubs.org/apt/ stretch/

Please note that I don’t take any responsibility for whatever happens when you use these packages. I will try to update them regularly, but may fail to do so or may stop updating the repository altogether at any time.


*: The Linux kernel as well as the patches shown in this article are licensed under the GNU General Public License Version 2 (GPLv2).


Followup Sep. 12 2017:
The patch shown here cannot be included in the mainline Linux kernel, since it does not apply to the current kernel version and only patches from Linus’s tree qualify for inclusion in LTS versions. See this brief discussion on the stable mailing list for more information.


Followup Oct. 13 2017:
There is now a discussion on LKML and xen-devel about this issue: https://lkml.org/lkml/2017/10/10/182.
