Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timer issue #14

Open
shenki opened this issue Oct 24, 2018 · 8 comments
Open

Timer issue #14

shenki opened this issue Oct 24, 2018 · 8 comments
Assignees

Comments

@shenki
Copy link
Member

shenki commented Oct 24, 2018

torvalds/linux@4451d3f causes the aspeed qemu machines to misbehave. The symptoms are extremely slow boot with openbmc userspace. It can be replicated with simpler userspace by typing 'dmesg' and observing the output 'stutter'.

This commit was backported to 4.18 as part of 4.18.16. Current dev-4.18 (d653b87e2a26221) contains this commit.

@shenki
Copy link
Member Author

shenki commented Oct 24, 2018

I went looking to check that we were updating count with the reload value when starting the timer.

By mistake, I added some code to aspeed_timer_ctrl_enable appears to fix resolve the issue. Something as simple as a usleep(1) results in a cleanly booting openbmc userspace:

diff --git a/hw/timer/aspeed_timer.c b/hw/timer/aspeed_timer.c
index 54b400b94aa9..60b80c7fb349 100644
--- a/hw/timer/aspeed_timer.c
+++ b/hw/timer/aspeed_timer.c
@@ -289,6 +289,7 @@ static void aspeed_timer_ctrl_enable(AspeedTimer *t, bool enable)
     trace_aspeed_timer_ctrl_enable(t->id, enable);
     if (enable) {
         t->start = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
+        usleep(1);
         aspeed_timer_mod(t);
     } else {
         timer_del(&t->timer);

@shenki
Copy link
Member Author

shenki commented Oct 25, 2018

https://ozlabs.org/~joel/0001-aspeed_timer-Fix-behaviour-running-Linux.patch

This fix is more comprehensive. We should cause the aspeed timer to expire and reload instead of creating a new qemu timer in the future (the last few lines of the diff)

@legoater
Copy link

I tested the latest. Results are good but I haven't taken a close look at the problem yet.

@amboar
Copy link
Member

amboar commented Oct 25, 2018

We should cause the aspeed timer to expire and reload instead of creating a new qemu timer in the future (the last few lines of the diff)

I'd like to see the rework of the patch along these lines

@bluecmd
Copy link

bluecmd commented Nov 11, 2018

When kexec'ing (and thus not resetting the device) I'm seeing what I believe is some memory corruption, see u-root/u-bmc#114 (comment). Note the value of aaaaaaae8074285b as t->start

EDIT: Not memory corruption, but I guess some signness and things getting negative where they where thought to be that they wouldn't.

What I'm observing is the following:

  1. Kexec'd linux kernel writes 0x10 = 0, setting status of timer2 to 0.
  2. This results in a negative delta, and muldiv64 is only capable of operating on unsigned numbers.
  3. This results in a bogus value like 0xaaaaa...

This is my home-made dump of the action:

!! TIMER_REG_STATUS for timer 1:                                                 
 now: 3e37c1a3e, value: 0, ticks: eb85f3d5                                       
 delta: ffffffff147a0c2b, rate: 16e3600, t->start: aaaaaa84e37c1a66

I patched my local qemu with this:

diff --git a/hw/timer/aspeed_timer.c b/hw/timer/aspeed_timer.c
index 9acd1de485..1a54d85e9d 100644
--- a/hw/timer/aspeed_timer.c
+++ b/hw/timer/aspeed_timer.c
@@ -253,7 +253,7 @@ static void aspeed_timer_set_value(AspeedTimerCtrlState *s, int timer, int reg,
             int64_t delta = (int64_t) value - (int64_t) calculate_ticks(t, now);
             uint32_t rate = calculate_rate(t);
 
-            t->start += muldiv64(delta, NANOSECONDS_PER_SECOND, rate);
+            t->start = (int64_t)t->start + ((__int128_t)delta * NANOSECONDS_PER_SECOND / rate);
             aspeed_timer_mod(t);
         }
         break;

And now Linux boots :-)

@shenki
Copy link
Member Author

shenki commented Nov 13, 2018

Let's move @bluecmd's discussion to a different issue.

Back on the "go slow" issue; we committed a fix cb98173 that resolved the issue on x86 hosts.

The same code running on ppc64le still shows the "go slow" behaviour.

I believe we need to write some tests for the existing code, and perhaps re-work it to make the aspeed timer state clearer. In my mind we have the following:

ENABLE: timer is enabled and is counting "down" into the future
MATCH: timer has matched one of the two match registers, and should cause an irq to fire. It stays in this state until both matches have occurred
ZERO: timer has counted down to zero and should reload itself and go to the ENABLE state

Notes:

  • the ENABLE will always be set to reload when enabling
  • the MATCH states may be the same value
  • the MATCH state can be the same as zero

Linux uses TIMER2 with MATCH1 = MATCH2 = ZERO, and reload is 0xFFFFFFFF as a free running timer.

It uses TIMER1 with MATCH1 = MATCH2 = ZERO, and reload is set to:

  • 24MHz/HZ when used as a periodic timer
  • 0xFFFFFFFF when used as a oneshot timer (MATCH1 irq is enabled)
  • 'cycles' when used as event timer

(For non-aspeed variants there is an irq enabled (on overflow) for the perioidc timer. It's not clear why the aspeed version shouldn't also have an IRQ)

@geissonator
Copy link

Some notes on things I've tried since this went in. On all of our builder (x86) systems, we saw an approximate increase in QEMU test time by 50%. We disabled our builder (ppc64le) system due to the issues noted above.

The interesting thing is that on some systems, like our Internal Red Hat, or even my laptop Mac running VirtualBox Ubuntu, we saw no differences in timing. I did a variety of things to help our builder (x86) systems. Mostly it was just updating the Ubuntu level and/or the Kernel on the systems. These changes helped and got things at least passing most of the time on our builder (x86) systems.

There are still weird discrepancies, for instance builder2 consistently passes, and takes the shortest amount of time, whereas builder1 consistently takes the longest and fails about 20% of the time due to performance related issues. I had builder1 with identical software levels as builder2 but saw these discrepancies. In fact, builder1 wouldn't work at all once the new c++ mapper came in until I updated it to Ubuntu 18.04. My only guess at this point is it has to do with the hardware (CPU model?) in the different systems. https://github.com/openbmc/openbmc/wiki/OpenBMC-Infrastructure-Workgroup#current-infrastructure has details on each system. https://openpower.xyz/job/openbmc-test-qemu-ci/buildTimeTrend can be used to look at how long each builder is taking and on which one's we're getting the intermittent failure.

Our builder4 (ppc64le) system was on Ubuntu 14 and a really old kernel. So I went ahead and updated it to Ubuntu 18 and 4.19.6-041906-generic kernel. With this, it can once again run QEMU, but there's a huge delay in the kernel startup still:

Dec 04 18:57:09 romulus systemd[1]: Startup finished in 2min 1.069s (kernel) + 1min 48.801s (userspace) = 3min 49.871s.

vs. this on builder1:

Dec 04 19:01:24 romulus systemd[1]: Startup finished in 14.458s (kernel) + 1min 29.867s (userspace) = 1min 44.325s

So still some optimizations to do. I tried grabbing the QEMU 3.1 rc branch but didn't see any improvements in using it vs. what we have out on openpower.xyz.

shenki added a commit to shenki/continuous-integration that referenced this issue Dec 14, 2018
We are hitting a bug in the aspeed qemu model:

 openbmc/qemu#14

Signed-off-by: Joel Stanley <[email protected]>
shenki added a commit to shenki/continuous-integration that referenced this issue Dec 14, 2018
We are hitting a bug in the aspeed qemu model:

 openbmc/qemu#14

Signed-off-by: Joel Stanley <[email protected]>
shenki added a commit to shenki/continuous-integration that referenced this issue Dec 14, 2018
We are hitting a bug in the aspeed qemu model:

 openbmc/qemu#14

Signed-off-by: Joel Stanley <[email protected]>
shenki added a commit to shenki/continuous-integration that referenced this issue Dec 28, 2018
We are hitting a bug in the aspeed qemu model:

 openbmc/qemu#14

Signed-off-by: Joel Stanley <[email protected]>
shenki added a commit to shenki/continuous-integration that referenced this issue Dec 29, 2018
We are hitting a bug in the aspeed qemu model:

 openbmc/qemu#14

Signed-off-by: Joel Stanley <[email protected]>
@amboar
Copy link
Member

amboar commented Mar 19, 2019

This issue is resolved as of 151081f. The patches are making their way upstream thanks to @legoater

@amboar amboar self-assigned this Mar 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants