Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Romstage TPM measuring cause a 50 second delay on power on, S3 resume and reboot on all sandy/ivy devices except X230. #541

Closed
snmcmillan opened this issue Mar 21, 2019 · 37 comments
Assignees

Comments

@snmcmillan
Copy link
Contributor

snmcmillan commented Mar 21, 2019

Update: Completely redid this issue. Original text at the end.

Upon powering on, one is greeted with a long black screen. After waiting 50 seconds, they are finally greeted with a menu. After collaboration with @flammit, I've figured out that they delay comes from TPM measurement of the romstage. This delay is also seen on S3 resume and reboot.

This issue affects all sandy/ivy based devices EXCEPT X230.

This also leads me to wonder if romstage actually gets measured, as it seems like a timeout more than it being slow. Although I have not tested my theory yet, I worry that it may not actually be performing the measurements.

Original Issue Below:

Tested on X220 and T420 (#539), both exhibit this issue.

The time it takes to POST is around 1-1.5 minutes long. After that, the display is initialized and then heads loads almost instantly.

Payload isn't the issue, as ThinkLight is able to be toggled on/off anytime after POST, including during the Payload phase, thus isolating it to before the kernel loads.

It's seemingly a set delay as well, which makes me think that something's timing out. Doesn't seem to affect normal coreboot builds on T420 (I haven't run a previous coreboot build on my X220) or X230(t). I will build and flash heads onto my X230t to see if it exhibits the problem, and if it does, this problem is certainly platform independent.

Another bug report (#538) mentions a long pause before giving a beep code, although that build is likely faulty in itself, it still seemingly exhibits the same delay.

On stock firmware, when one runs a modified bios with TPM enabled, it has a long delay (not as long as this delay, if I remember correctly) and also gives the user a beep code indicating that the TPM sees a mismatched signature for the bios. But outside of that and, although incredibly unlikely, IME timing out is the only other thing that comes to my mind.

@snmcmillan
Copy link
Contributor Author

snmcmillan commented Mar 21, 2019

I've done some more testing with my T420, and it takes roughly 50 seconds to wake up from S3 state.
Let me reboot to test power-on time, and I'll edit this message to add the time I get with my stopwatch.
Edit: Getting 50 seconds for the time between the power button lighting up to seeing the bluetooth indicator light up, signifying that POST has finished. This is almost certainly something waiting until it times out.

@snmcmillan snmcmillan changed the title Incredibly slow POST times. Incredibly slow POST times and S3 Resume times. Mar 21, 2019
@tlaurion
Copy link
Collaborator

tlaurion commented Mar 21, 2019

@SebastianMcMillan my builds at https://gitlab.com/tlaurion/heads/-/jobs results in a black screen on my test x230t and x230. Testing to see if laptop is starting up after longer delays.

Thought I screwed up something but no. I've done a clean build from fresh git clone of my branch locally on QubesOS fedora-29 based qube without problems?!?

EDIT: heads-ci_fedora-29 is causing the problem. Initrd is xz compressed while the kernel tries to uncompress with gzip.

Sent from my Galaxy S3 using FastHub-Libre

@snmcmillan
Copy link
Contributor Author

snmcmillan commented Mar 21, 2019

I checked cbmem console logs and it seems that it is indeed related to management engine.

Edit: I've attached the on-boot messages as well as cbmem -c | more output.
IMG_20190321_142318
IMG_20190321_142133

@tlaurion is your cbmem output similar?

Also, does your repo have this issue as well?

@tlaurion
Copy link
Collaborator

tlaurion commented Mar 23, 2019

@SebastianMcMillan nope. I edited precedent comment to not mislead. Had a kernel panic observable through x230-flash image, since that build doesn't ask coreboot to not have GFX unit and actually shows output to the console, where the x230 image just shuts it down until Linux is fully booted inside heads.

Will come back with my own screen captures once I kill the culprit and have a heads-ci that actually builds bootable images.

Sent from my Galaxy S3 using FastHub-Libre

@paulmenzel
Copy link
Contributor

paulmenzel commented Mar 24, 2019 via email

@snmcmillan
Copy link
Contributor Author

snmcmillan commented Mar 24, 2019

The logs are attached above in an image file.

Edit: I do not get any timestamps found in the table. I manually configured it to print timestamps to console as well with make coreboot.menuconfig BOARD=t420 to no success.

On my X220, I get that delay right into a payload error (the constant, high pitch beep), that error I'm still trying to sort out, but I'll test on my X230t and see if it will work without delay.

Edit: Issue not present on X230t.

Another Edit: Keep in mind that my T420 is already running coreboot, and it isn't exhibiting these issues when it is running my builds. I am currently building 4.8.1 for the X220 and seeing if it's seeing any of these issues with clean coreboot.

Another Edit 2: X220 doesn't get delay when posting with clean coreboot 4.8.1.

@snmcmillan
Copy link
Contributor Author

snmcmillan commented Mar 27, 2019

I've figured out what is causing this issue. 0030-sandybridge.patch is causing this issue on all Sandy bridge devices from the looks of it. When I compiled heads without said patch, there isn't any delay.

It looks like those changes were made to ensure the TPM was initialized, but it doesn't seem to make any difference.

@snmcmillan
Copy link
Contributor Author

Issue is resolved by #539, which is awaiting merge.

@snmcmillan
Copy link
Contributor Author

OK, so uhhh on second thought, that patch will not do. However, I've isolated it to these lines being the problem:

extern char _romstage, _eromstage;
tlcl_measure(1, &_romstage, &_eromstage - &_romstage);

Now to figure out how on earth this isn't an issue on the X230, but an issue on literally any other sandybridge/ivybridge platform.

@snmcmillan
Copy link
Contributor Author

Instead of calling the pointers of _romstage and _eromstage, calling them not as pointers have resolved the delay. Whether or not this breaks the measurement of romstage is outside the scope of my knowledge.

@tlaurion
Copy link
Collaborator

I'll test those tomorrow. Have a x220 to test on.

@tlaurion tlaurion self-assigned this Mar 28, 2019
@snmcmillan
Copy link
Contributor Author

Well, I guess something happened and I didn't build with those changes, and when I build with changes, it needs the pointers. Comment them out and it should work, but you do not measure romstage, which would be a problem.

I cannot figure out why on earth this impacts all sandy/ivy boards except X230, as there are not any direct changes to X230 coreboot code that would affect it.

@tlaurion
Copy link
Collaborator

@SebastianMcMillan : not an option :)

I'll check the changes proposed by @flammit and play around in the next coming days.

@snmcmillan
Copy link
Contributor Author

Now that I think about it, is it actually measuring romstage on non-X230 platforms? If it's a timeout, then it could already be skipping the romstage measurements. If you can debug over serial or look at cbmem, I would add a printk showing a hash of the romstage, and take a look at that on two different romstages.

@snmcmillan snmcmillan changed the title Incredibly slow POST times and S3 Resume times. Romstage TPM measuring cause a 50 second delay on power on, S3 resume and reboot on all sandy/ivy devices except X230. Apr 1, 2019
@snmcmillan
Copy link
Contributor Author

Edited original post to include current information. This problem is high on my priority list.

@snmcmillan
Copy link
Contributor Author

snmcmillan commented Apr 9, 2019

Took a ton of fighting, but we finally got cbmem output.
VERBOSE.log

@flammit mentioning you so you can take a look at this.

edit: increasing CBMEM buffer size as recommended by @tlaurion, will reupload

@snmcmillan
Copy link
Contributor Author

@osresearch you may be able to provide some insight as well.

@snmcmillan
Copy link
Contributor Author

snmcmillan commented Apr 10, 2019

almost-complete-cbmem.log
@tlaurion @flammit @osresearch here is an almost complete log. Buffer size is max.
Edit: Just looked over it and it doesn't have all of romstage... defeating the purpose.

@tlaurion
Copy link
Collaborator

tlaurion commented Apr 10, 2019

Beagle bone logs are even less interesting, they start too late.

output.txt
coreboot-x220.config which outputs to beaglebone and cbmem with maximum buffer size

On beaglebone black screwdriver:
fix network config to access it from network , then from ssh:

rmmod g_multi -f && modprobe g_dbgp
cat /dev/ttyGS0 > /tmp/output

@tlaurion
Copy link
Collaborator

@SebastianMcMillan time to hack MAX_MEM or delete cbmem limitation of 256mb....

Sent from my Galaxy S3 using FastHub-Libre

@snmcmillan
Copy link
Contributor Author

hopefully it's as simple as increasing MAX_MEM from 1<<28 to 1<<56. I don't want to test it during the week as I do need my device for school, but I'll certainly test it out this weekend at the latest.

@snmcmillan
Copy link
Contributor Author

snmcmillan commented Apr 10, 2019

I can increase max size up to 1<<30 at the max. Any higher and it doesn't build.

Update:
It boots with the change, but it doesn't help. I discovered CONFIG_PRERAM_CBMEM_CONSOLE_SIZE and found the default value for that. I am upping that and hoping for the best.

Update 2: I increased to the max value for preram cbmem and it didn't change anything

@snmcmillan
Copy link
Contributor Author

snmcmillan commented Apr 10, 2019

I moved pcr 1 measurements towards the end of romstage, and it does in fact properly measure the romstage.

Now to figure out why it delays.

@snmcmillan
Copy link
Contributor Author

If anyone has a T430 that they are able to flash coreboot with or one that already has coreboot, please build heads from my repo for the T430 (I have t430 and t430-flash configs just like the X230).

If the T430 has this issue, then it'll help narrow down the issue a lot.

@tlaurion
Copy link
Collaborator

@SebastianMcMillan : look here

Put this into coreboot config:
CONFIG_COLLECT_TIMESTAMPS=y

The issue linked above involved delays resulting from improper IFD. Might be the cause here too. Or not. Just a thought.

@tlaurion
Copy link
Collaborator

tlaurion commented Apr 11, 2019

Multiple hypothesis/though exercices coming from the top of my head:

  • If it was TPM related, the issue wouldn't also be there at S3 resume, since the TPM is not "supposed" to be involved.
  • The MRC cache could, and yet again, the IFD parsing would occur to try to find the MRC cache, and if not found, retrain memory. There is an option to tackle in coreboot there. Might want to try that.

@tlaurion
Copy link
Collaborator

@SebastianMcMillan
Confirmed that the ME is the culprit on 68e2c76 with CONFIG_COLLECT_TIMESTAMPS=y set in coreboot config

cbmem -t | sort:


   0:1st timestamp                                     769
   1:start of romstage                                 66,579 (7,378,697,531,575,522)
   2:before ram initialization                         40,870,942 (7,378,697,574,641,644)
   3:after ram initialization                          91,412,517 (5,167,720)
   4:end of romstage                                   91,419,086 (6,568)
   8:starting to load ramstage                         91,430,782 (11,696)
   9:finished loading ramstage                         93,802,660 (93,801,890)
  10:start of ramstage                                 95,709,506 (1,906,846)
  15:starting LZMA decompress (ignore for x86)         91,435,896 (5,113)
  16:finished LZMA decompress (ignore for x86)         93,258,233 (1,822,336)
  30:device enumeration                                95,713,119 (3,612)
  40:device configuration                              95,987,757 (7,378,697,626,090,836)
  50:device enable                                     96,438,903 (96,372,323)
  60:device initialization                             96,526,939 (88,035)
  70:device setup done                                 97,938,300 (1,950,542)
  75:cbmem post                                        97,938,302 (2)
  80:write tables                                      97,938,304 (2)
  85:finalize chips                                    97,974,875 (36,570)
  90:load payload                                      97,974,878 (2)
  99:selfboot jump                                     99,380,742 (58,509,799)
 940:waiting for ME acknowledgement of raminit         86,244,797 (7,378,697,624,333,906)
 941:finished waiting for ME response                  91,394,711 (7,378,697,624,351,593)
21 entries total:
Total Time: 36,893,488,240,676,557

@snmcmillan
Copy link
Contributor Author

So it was as I originally thought. But why would the romstage measuring cause the ME to freak out? I mean, I removed that and it would boot normally.

Nonetheless, I'm verifying this myself and taking a more in-depth look at this.

@snmcmillan
Copy link
Contributor Author

Still doesn't fix it, even after I directly use me_cleaner to shrink me and modify ifd and ifdtool to get the layout.txt

@aphick
Copy link

aphick commented May 29, 2019

Is there a known working commit? I’ve tried a much much older commit that works but comparing it with master is really not useful.

@snmcmillan
Copy link
Contributor Author

snmcmillan commented May 31, 2019 via email

@csritu
Copy link

csritu commented Jun 1, 2019

Is there anything I can do to help with this? I have no familiarity with the code of either heads or coreboot, but I have an i5 x220 that I can test on, and I've disabled the measurements and gotten it to boot without the delay.

@snmcmillan
Copy link
Contributor Author

@csritu I'll gladly have another tester if you're up for it, I need others that can do some testing. As long as you're able to do troubleshooting or are willing to learn. Please contact me on the u-root slack, I can discuss this further their in private.

@snmcmillan
Copy link
Contributor Author

This issue is now solved in #578. Will close once changes make there way into master.

@snmcmillan
Copy link
Contributor Author

Fixes merged, closing.

@tlaurion
Copy link
Collaborator

Just as a closing note for this ticket, the problem was not TPM, but CBFS size in coreboot configuration of the boards in question. The IFD needed to be modified so that CBFS size in coreboot matched the IFD produced. I think the most relevant ticket, outside of PR associated with x220 support, would be #307.

Coreboot got stucked, in an undebuggable state per USB debug dongles, where coreboot finally left its uncertainty 50 seconds later to continue even though IFD and CBFS regions didnt matched.

@SebastianMcMillan : if you want to complete this closing statement, that may be useful in the future.

@snmcmillan
Copy link
Contributor Author

@tlaurion you've pretty much hit all the points regarding this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants