Timer issue #14

shenki · 2018-10-24T06:38:42Z

torvalds/linux@4451d3f causes the aspeed qemu machines to misbehave. The symptoms are extremely slow boot with openbmc userspace. It can be replicated with simpler userspace by typing 'dmesg' and observing the output 'stutter'.

This commit was backported to 4.18 as part of 4.18.16. Current dev-4.18 (d653b87e2a26221) contains this commit.

The text was updated successfully, but these errors were encountered:

shenki · 2018-10-24T09:53:30Z

I went looking to check that we were updating count with the reload value when starting the timer.

By mistake, I added some code to aspeed_timer_ctrl_enable appears to fix resolve the issue. Something as simple as a usleep(1) results in a cleanly booting openbmc userspace:

diff --git a/hw/timer/aspeed_timer.c b/hw/timer/aspeed_timer.c
index 54b400b94aa9..60b80c7fb349 100644
--- a/hw/timer/aspeed_timer.c
+++ b/hw/timer/aspeed_timer.c
@@ -289,6 +289,7 @@ static void aspeed_timer_ctrl_enable(AspeedTimer *t, bool enable)
     trace_aspeed_timer_ctrl_enable(t->id, enable);
     if (enable) {
         t->start = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
+        usleep(1);
         aspeed_timer_mod(t);
     } else {
         timer_del(&t->timer);

shenki · 2018-10-25T09:26:02Z

https://ozlabs.org/~joel/0001-aspeed_timer-Fix-behaviour-running-Linux.patch

This fix is more comprehensive. We should cause the aspeed timer to expire and reload instead of creating a new qemu timer in the future (the last few lines of the diff)

legoater · 2018-10-25T10:27:14Z

I tested the latest. Results are good but I haven't taken a close look at the problem yet.

amboar · 2018-10-25T11:29:27Z

We should cause the aspeed timer to expire and reload instead of creating a new qemu timer in the future (the last few lines of the diff)

I'd like to see the rework of the patch along these lines

bluecmd · 2018-11-11T18:18:56Z

When kexec'ing (and thus not resetting the device) I'm seeing what I believe is some memory corruption, see u-root/u-bmc#114 (comment). Note the value of aaaaaaae8074285b as t->start

EDIT: Not memory corruption, but I guess some signness and things getting negative where they where thought to be that they wouldn't.

What I'm observing is the following:

Kexec'd linux kernel writes 0x10 = 0, setting status of timer2 to 0.
This results in a negative delta, and muldiv64 is only capable of operating on unsigned numbers.
This results in a bogus value like 0xaaaaa...

This is my home-made dump of the action:

!! TIMER_REG_STATUS for timer 1:                                                 
 now: 3e37c1a3e, value: 0, ticks: eb85f3d5                                       
 delta: ffffffff147a0c2b, rate: 16e3600, t->start: aaaaaa84e37c1a66

I patched my local qemu with this:

diff --git a/hw/timer/aspeed_timer.c b/hw/timer/aspeed_timer.c
index 9acd1de485..1a54d85e9d 100644
--- a/hw/timer/aspeed_timer.c
+++ b/hw/timer/aspeed_timer.c
@@ -253,7 +253,7 @@ static void aspeed_timer_set_value(AspeedTimerCtrlState *s, int timer, int reg,
             int64_t delta = (int64_t) value - (int64_t) calculate_ticks(t, now);
             uint32_t rate = calculate_rate(t);
 
-            t->start += muldiv64(delta, NANOSECONDS_PER_SECOND, rate);
+            t->start = (int64_t)t->start + ((__int128_t)delta * NANOSECONDS_PER_SECOND / rate);
             aspeed_timer_mod(t);
         }
         break;

And now Linux boots :-)

shenki · 2018-11-13T05:06:50Z

Let's move @bluecmd's discussion to a different issue.

Back on the "go slow" issue; we committed a fix cb98173 that resolved the issue on x86 hosts.

The same code running on ppc64le still shows the "go slow" behaviour.

I believe we need to write some tests for the existing code, and perhaps re-work it to make the aspeed timer state clearer. In my mind we have the following:

ENABLE: timer is enabled and is counting "down" into the future
MATCH: timer has matched one of the two match registers, and should cause an irq to fire. It stays in this state until both matches have occurred
ZERO: timer has counted down to zero and should reload itself and go to the ENABLE state

Notes:

the ENABLE will always be set to reload when enabling
the MATCH states may be the same value
the MATCH state can be the same as zero

Linux uses TIMER2 with MATCH1 = MATCH2 = ZERO, and reload is 0xFFFFFFFF as a free running timer.

It uses TIMER1 with MATCH1 = MATCH2 = ZERO, and reload is set to:

24MHz/HZ when used as a periodic timer
0xFFFFFFFF when used as a oneshot timer (MATCH1 irq is enabled)
'cycles' when used as event timer

(For non-aspeed variants there is an irq enabled (on overflow) for the perioidc timer. It's not clear why the aspeed version shouldn't also have an IRQ)

geissonator · 2018-12-04T19:18:41Z

Some notes on things I've tried since this went in. On all of our builder (x86) systems, we saw an approximate increase in QEMU test time by 50%. We disabled our builder (ppc64le) system due to the issues noted above.

The interesting thing is that on some systems, like our Internal Red Hat, or even my laptop Mac running VirtualBox Ubuntu, we saw no differences in timing. I did a variety of things to help our builder (x86) systems. Mostly it was just updating the Ubuntu level and/or the Kernel on the systems. These changes helped and got things at least passing most of the time on our builder (x86) systems.

There are still weird discrepancies, for instance builder2 consistently passes, and takes the shortest amount of time, whereas builder1 consistently takes the longest and fails about 20% of the time due to performance related issues. I had builder1 with identical software levels as builder2 but saw these discrepancies. In fact, builder1 wouldn't work at all once the new c++ mapper came in until I updated it to Ubuntu 18.04. My only guess at this point is it has to do with the hardware (CPU model?) in the different systems. https://github.com/openbmc/openbmc/wiki/OpenBMC-Infrastructure-Workgroup#current-infrastructure has details on each system. https://openpower.xyz/job/openbmc-test-qemu-ci/buildTimeTrend can be used to look at how long each builder is taking and on which one's we're getting the intermittent failure.

Our builder4 (ppc64le) system was on Ubuntu 14 and a really old kernel. So I went ahead and updated it to Ubuntu 18 and 4.19.6-041906-generic kernel. With this, it can once again run QEMU, but there's a huge delay in the kernel startup still:

Dec 04 18:57:09 romulus systemd[1]: Startup finished in 2min 1.069s (kernel) + 1min 48.801s (userspace) = 3min 49.871s.

vs. this on builder1:

Dec 04 19:01:24 romulus systemd[1]: Startup finished in 14.458s (kernel) + 1min 29.867s (userspace) = 1min 44.325s

So still some optimizations to do. I tried grabbing the QEMU 3.1 rc branch but didn't see any improvements in using it vs. what we have out on openpower.xyz.

We are hitting a bug in the aspeed qemu model: openbmc/qemu#14 Signed-off-by: Joel Stanley <[email protected]>

amboar · 2019-03-19T01:19:10Z

This issue is resolved as of 151081f. The patches are making their way upstream thanks to @legoater

bluecmd mentioned this issue Nov 11, 2018

Aspeed QEMU locks up u-root/u-bmc#114

Closed

shenki mentioned this issue Nov 13, 2018

Timer register corruption when kexec'ing #16

Open

shenki added a commit to shenki/continuous-integration that referenced this issue Dec 14, 2018

driver: Increase qemu timeout to 2m

91a2a8c

We are hitting a bug in the aspeed qemu model: openbmc/qemu#14 Signed-off-by: Joel Stanley <[email protected]>

shenki added a commit to shenki/continuous-integration that referenced this issue Dec 14, 2018

driver: Increase qemu timeout to 2m

a8f062d

We are hitting a bug in the aspeed qemu model: openbmc/qemu#14 Signed-off-by: Joel Stanley <[email protected]>

shenki added a commit to shenki/continuous-integration that referenced this issue Dec 14, 2018

driver: Increase qemu timeout to 2m

10c21d3

We are hitting a bug in the aspeed qemu model: openbmc/qemu#14 Signed-off-by: Joel Stanley <[email protected]>

shenki added a commit to shenki/continuous-integration that referenced this issue Dec 28, 2018

driver: Increase qemu timeout to 2m

dadeb0a

We are hitting a bug in the aspeed qemu model: openbmc/qemu#14 Signed-off-by: Joel Stanley <[email protected]>

shenki added a commit to shenki/continuous-integration that referenced this issue Dec 29, 2018

driver: Increase qemu timeout to 2m

ac58cbc

We are hitting a bug in the aspeed qemu model: openbmc/qemu#14 Signed-off-by: Joel Stanley <[email protected]>

geissonator mentioned this issue Jan 10, 2019

Intermittent start up issues openbmc/openbmc#3369

Closed

amboar self-assigned this Mar 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Timer issue #14

Timer issue #14

shenki commented Oct 24, 2018

shenki commented Oct 24, 2018

shenki commented Oct 25, 2018

legoater commented Oct 25, 2018

amboar commented Oct 25, 2018

bluecmd commented Nov 11, 2018 •

edited

Loading

shenki commented Nov 13, 2018

geissonator commented Dec 4, 2018

amboar commented Mar 19, 2019

Timer issue #14

Timer issue #14

Comments

shenki commented Oct 24, 2018

shenki commented Oct 24, 2018

shenki commented Oct 25, 2018

legoater commented Oct 25, 2018

amboar commented Oct 25, 2018

bluecmd commented Nov 11, 2018 • edited Loading

shenki commented Nov 13, 2018

geissonator commented Dec 4, 2018

amboar commented Mar 19, 2019

bluecmd commented Nov 11, 2018 •

edited

Loading