Tons of spurious failures due to build scripts failing to run rustc #663
Comments
I don't recall exactly how crater/rustwide work, but I believe they use rustup. If so, and if the crater machine has been updated to rustup 1.25, then this is likely caused by rust-lang/rustup#3031.
That sounds like a deterministic problem, whereas this build script failure only occurs for a tiny fraction of all crates that use libc. (At least I would be very surprised if only 600 crates used libc.)
Yes, and I manually verified that libc itself built fine on master and on try, so it's definitely not fully deterministic. It might be worth trying to run these through again to try and re-confirm and see if failures are deterministic for this subset of ~600.
Interestingly, based on unpacking the all.tar.gz from both #99447 and #100043, and running One run failed twice in libc (https://crater-reports.s3.amazonaws.com/pr-100043/try%23a8425075041d4b9af9bf32e2a267dfb9db2cfffb/gh/dalloriam.worktimer/log.txt), presumably there's a build-dep on libc and a regular dependency there? But not sure there's much information we can gain from that. |
In the rerun, the libc issue still occurs a few times, but the majority of crates pass now. So, definitely non-deterministic. To me this really looks like it just failed to spawn the process due to resource limits...
We run on a large number of machines and don't regularly collect detailed logs from each one right now, but I'm not sure what kind of resource limit could have caused this to just recently start happening. We are adding new machine types to our build fleet -- but those aren't being oversubscribed to any great extent, at least from my perspective. I'm going to try and locate which machines are having this error and pull logs, but our machines are ephemeral, so that might be a little hard.
I located one machine which had a few cases of this, but I don't see any adjacent suspicious events in the syslog (e.g., OOM kill or similar). It's possible it's another kind of resource exhaustion (file descriptor limits or whatever), though.

I also queried the all.tar logs for which workers had problems, and there is a suspicious correlation there, presuming I'm getting reasonable data. (Note that these workers may be on different machines: I'm not sure there's any good way to recover that from the current logs and other data we have, though it seems like a good idea to add some support for that).

worker-9 I think has to be on the larger, oversubscribed, Azure machines -- the GCP machines only have 8 workers. I definitely found 3 cases on one of the GCP machines though, so it's not strictly limited.

In any case, given these results for now I've turned off the Azure capacity -- currently it looks like I can't directly ssh onto those machines, but they are definitely oversubscribed so I'd not be too surprised if there's some resourcing issue there. I don't think it fully explains the problem (we should see errors on both master# and try#, IMO, and have seen a similar rate historically if not lower now since we have more GCP capacity than we did).
Edit: and the worker counts don't seem related to the number of jobs that worker executed, at least for worker-9 -- so I don't think we can explain this with just "that worker is executing more work". worker-6 though is more of an unknown.
rust-lang/rust#100046 (comment) is another crater run, mostly run after we shut down the Azure (oversubscribed) instances. It looks like it had ~9 libc issues; I didn't look for other cases.
My script has a few other patterns that will match other build scripts and found 38 instances in
Which is on par with what was seen previously.

Copying some comments from the zulip thread for posterity:

I scanned the old logs. This seems to have been an issue since at least February (the oldest one I checked).

One strange thing I noticed in pr-100043 (Ralf's PR) and pr-99880 is that there is one build script which prints the stderr when it fails to run rustc and it printed:

`/opt/rustwide/rustup-home/toolchains/a8425075041d4b9af9bf32e2a267dfb9db2cfffb/bin/rustc: symbol lookup error: /opt/rustwide/rustup-home/toolchains/a8425075041d4b9af9bf32e2a267dfb9db2cfffb/lib/librustc_driver-c25a516c3d8f3eb8.so: undefined symbol: _ZN9hashbrown3raw11Fallibility17capacity_overflow17hf1df634019334dcdE`

In all 7 cases it is the same symbol error. matthiaskrgr was also observing that same exact error at https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/master.20toolchain.20broken.3F/near/291873065 with icemaker.

Number of "failed to launch rustc -V"-like errors over time:

February is just the last one you gave me in the list, I assume it has been happening before then. Here's the frequency over time:

2022-02-10 31

Raw per-worker occurrence data: https://gist.github.com/ehuss/2a0ebce9fe9bc1440705f6fe86fe3094
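For anyone wanting to reproduce this kind of count, here is a minimal sketch of the tallying step. It is not the script used above: the pattern list is an assumption (only the libc message and the `symbol lookup error` text appear in this thread), and `count_by_run` along with its `(run label, log text)` input shape is hypothetical.

```rust
use std::collections::BTreeMap;

/// Substrings treated as "failed to launch rustc"-like errors.
/// Assumed list: the real script matches more build-script patterns.
const PATTERNS: &[&str] = &["Failed to get rustc version", "symbol lookup error"];

/// Returns true if any line of the log contains one of the patterns.
fn is_spurious(log: &str) -> bool {
    log.lines()
        .any(|line| PATTERNS.iter().any(|p| line.contains(p)))
}

/// Counts matching logs per run label (hypothetical input shape:
/// one `(run label, log text)` pair per crate log).
/// Runs without any matching log do not appear in the result.
fn count_by_run<'a>(logs: &[(&'a str, &str)]) -> BTreeMap<&'a str, usize> {
    let mut counts = BTreeMap::new();
    for &(run, log) in logs {
        if is_spurious(log) {
            *counts.entry(run).or_insert(0) += 1;
        }
    }
    counts
}
```

Keying by worker name instead of run label would give the per-worker view, provided the worker can be recovered from the logs.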
Report the actual error when failing to get the rustc version
This should help with pinpointing the issue in rust-lang/crater#663.
One crate which has incorrect parsing logic: https://gitlab.com/leonhard-llc/ops/-/blob/main/build-data/src/lib.rs#L379 (found in https://crater-reports.s3.amazonaws.com/beta-1.66-1.2/beta-2022-11-01/gh/c0repwn3r.mangrove/log.txt).
One unfortunate side-effect of this is that it can mask actual regressions -- e.g. in https://crater-reports.s3.amazonaws.com/pr-104429-1/index.html, skreborn.bluetooth-sys.9d5b3dcba1a98824bb765ca5f69a1a6c789589b4 should be listed as a regular regression but instead its build script failed.
build.rs: print rustc stderr if exit status != 0
I was trying to run benchmarks locally with rustc-perf and found that many of them failed to build with a message from libc's build.rs "Failed to get rustc version." I made this change locally to help debug, and I think it would be generally useful. In my case it quickly revealed that rustc was failing to find libLLVM and so `rustc --version` was emitting nothing on stdout. I think this may have been part of what was intended with #3000 and might help debug rust-lang/crater#663.
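The idea in that commit can be sketched roughly as follows. This is not libc's actual build.rs: the helper names `rustc_version` and `interpret` and the exact error wording are assumptions made for illustration. It only shows how a build script could surface rustc's stderr instead of a bare "Failed to get rustc version."

```rust
use std::process::Command;

/// Turns the pieces of a finished `rustc --version` run into a result.
/// On failure the error carries the exit status and rustc's stderr,
/// so problems like a symbol lookup error become visible in the log.
fn interpret(success: bool, status: &str, stdout: &[u8], stderr: &[u8]) -> Result<String, String> {
    let stderr = String::from_utf8_lossy(stderr);
    if !success {
        return Err(format!(
            "failed to get rustc version ({}); stderr:\n{}",
            status,
            stderr.trim_end()
        ));
    }
    let version = String::from_utf8_lossy(stdout).trim().to_string();
    if version.is_empty() {
        // rustc exited successfully but printed nothing on stdout.
        return Err(format!(
            "rustc --version printed nothing on stdout; stderr:\n{}",
            stderr.trim_end()
        ));
    }
    Ok(version)
}

/// Runs `<rustc> --version` and reports spawn failures separately
/// from a non-zero exit status.
fn rustc_version(rustc: &str) -> Result<String, String> {
    let output = Command::new(rustc)
        .arg("--version")
        .output()
        .map_err(|e| format!("failed to spawn `{} --version`: {}", rustc, e))?;
    interpret(
        output.status.success(),
        &output.status.to_string(),
        &output.stdout,
        &output.stderr,
    )
}
```

A build script's `main` could read the compiler path from the `RUSTC` environment variable that Cargo sets, call `rustc_version`, and `panic!` with the returned message on `Err`, which puts the stderr text into the crater log.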
The report at https://crater-reports.s3.amazonaws.com/pr-100043/index.html is 100% spurious regressions. Almost all of them are due to the libc build script failing to detect the rustc version, a bunch are due to other build scripts failing to run rustc, and then there are a few other random strange things (in syn some trait is supposedly not implemented any more, and the rest is handled by #664).