-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: frequent SIGSEGV in readgstatus in bootstrap toolchain #49686
Comments
Also, CC – @mknyszek |
I'll take a look. |
I poked a little bit more and there's one thing I don't understand: why is the failing address exactly zero? Shouldn't the address be slightly offset for the Also, the code around this is kind of fishy:
Note the unsynchronized access to It's unclear to me whether this is a beta blocker, though. As @cherrymui pointed out, it seems old. |
I looked into this a while ago. Yes, there is a nil check. At least on some architectures, atomic load instruction is not marked "fault on nil" in the compiler, so the compiler does not eliminate the nil check. We probably want to fix that (but that won't fix the issue). |
I can definitely see this failing on a weak memory machine. I don't think it even requires subtlety around STW.
On a weak memory machine, thread 2 could observe write B before write A (or, equivalently, the reads could be reordered), leading to thread 2 reading a nil from (Edited to add: Actually, I don't think the reads can be reordered because they're dependent, but the writes could be.) |
I've got a couple ideas for how to fix this. Probably the most straightforward is to save a copy of allgs during gcMarkRootPrepare and use that in markroot. |
Does it work if we use |
Probably, though then we also have to call |
Change https://golang.org/cl/368134 mentions this issue: |
2022-09-05T08:28:34-83b083e-bd5595d/linux-mips-rtrk This looks like it may be related? It's a nil pointer dereference in |
It looks like cmd/dist crashed, which is built from the bootstrap toolchain, which is go1.17.13, which doesn't seem to include the fix above. |
Given that it affects the bootstrap toolchain, should the fix be backported to |
I think we should backport the fix. Normally we just add a workaround to tip's toolchain to avoid bootstrap bugs, but I suspect there isn't a feasible workaround in this case. |
Go 1.17 is no longer supported. Not sure how we're going to backport. I guess we could cherry-pick to release-branch.go1.17. We still need to manually update the bootstrap toolchains on the builders, right? |
We can release another 1.17.x even though it is not supported. Didn't we do that for 1.4.x? Or maybe I am misremembering. |
We never did an updated 1.4.x release. We only backported things to the release branch and posted new src tar files. That said, we're in a brave new world and we can chart whatever course we think best. Since this seems to be entirely the relaxed memory model architectures, one option is to bump all those builders to Go 1.18 or Go 1.19. We'll still have plenty on Go 1.17 to keep us honest (a few builders are already bumped for other reasons, like the port being new since Go 1.17). If people generally agree with bumping the bootstrap toolchain forward on those systems, I'm happy to do the work or advise someone who wants to learn the process. |
Sounds like we have 3 paths forward here:
I think this just needs a decision and someone to do the work (or no one to do no work). |
I talked with Austin briefly about this. I think (2) is probably wrong; it's strictly more work than (3). |
Re "assuming flakes are still happening", taking a closer look at the history, it looks like there were many failures during run.bash originally, and that got fixed and released. Now we have a single flake where the build faulted in readgstatus in a similar way, during bootstrap, which suggests maybe older toolchains are problematic. But it's just one flake. It might be best to wait for a couple more to really be sure that this is the problem. I did a greplogs for runtime.readgstatus, and I found this fault too, which is from a commit back in May (not too old) and was during run.bash, so not bootstrap-dependent. It's interesting because it is pprof faulting on readgstatus, while the world is ostensibly stopped. 2022-05-09T20:57:58-3f43096/openbsd-arm-jsing Maybe there is something lurking here, or maybe not. Let's keep accumulating failures mentioning runtime.readgstatus and see what we find. |
Since we're waiting, I put this in WaitingForInfo. At the very least when it closes we'll have a signal that we might want to look back here. I'll keep it assigned to myself to make sure I get notified. |
Moved to #55304 to not confuse the old and new bug. Closing this. |
greplogs --dashboard -md -l -e '(?ms)^fatal error: .*^runtime\.readgstatus' --since=2021-01-01
2021-11-18T22:40:42-2375b6e/linux-mips64-rtrk
2021-11-18T19:16:38-c4aae23/linux-mips64le-rtrk
2021-11-15T23:01:05-d156101/linux-mips64-rtrk
2021-11-12T16:58:34-95d0657/linux-mips64le-rtrk
2021-11-11T11:47:33-c49627e/linux-mips64le-rtrk
2021-11-10T05:08:25-17980df/linux-mips64le-rtrk
2021-11-09T22:11:33-4aa0746/linux-arm64-packet
2021-11-09T16:20:19-e90fd9a/linux-arm64-packet
2021-11-04T20:43:07-8248152/linux-arm64-packet
2021-11-02T16:12:23-599de4b/linux-arm64-packet
2021-11-01T21:27:46-036812b-f9cb33c/linux-arm64-packet
2021-10-28T16:54:29-5fce1d9/linux-arm64-packet
2021-10-26T05:02:53-ec6c004/linux-arm64-packet
2021-10-15T07:13:43-1cbec68/linux-arm64-packet
2021-10-11T22:32:23-c1b0ae4/linux-arm64-packet
2021-10-08T17:58:41-59d4e92-16a3cef/android-amd64-emu
2021-10-08T00:24:52-59d4e92-78d749f/android-amd64-emu
2021-10-08T00:18:29-59d4e92-7cef831/android-amd64-emu
2021-10-07T21:39:17-59d4e92-6b4cf2b/darwin-arm64-11_0-toothrot
2021-10-07T18:58:33-7286502/darwin-arm64-11_0-toothrot
2021-10-07T18:58:33-7286502/linux-arm64-packet
2021-10-07T18:58:33-59d4e92-7286502/android-amd64-emu
2021-10-07T18:45:53-77f2750/android-arm64-corellium
2021-10-07T18:45:53-59d4e92-77f2750/android-amd64-emu
2021-10-07T14:01:52-ecb2f23/android-386-emu
2021-10-07T03:41:40-62292e8-6f74ed0/android-amd64-emu
2021-10-07T02:30:47-62292e8-39bbf08/darwin-arm64-11_0-toothrot
2021-10-07T00:39:17-62292e8-812a33d/aix-ppc64
2021-10-06T23:01:01-62292e8-4ffa2f1/android-amd64-emu
2021-10-06T22:42:28-62292e8-4002616/darwin-arm64-11_0-toothrot
2021-10-06T21:17:08-62292e8-2e107b4/android-amd64-emu
2021-10-06T20:30:12-62292e8-4a37a1d/android-amd64-emu
2021-10-06T19:59:27-62292e8-17c513e/android-amd64-emu
2021-10-06T19:59:27-62292e8-17c513e/darwin-arm64-11_0-toothrot
2021-10-06T17:38:01-d2e5035-415f0a3/android-amd64-emu
2021-10-06T16:46:09-d2e5035-3160571/android-amd64-emu
2021-10-05T23:58:29-d2e5035-ac60900/android-amd64-emu
2021-10-05T21:25:51-6ae3afa/darwin-arm64-11_0-toothrot
2021-10-05T20:53:02-d4b1ae0-695a59b/android-amd64-emu
2021-10-05T20:35:54-d4b1ae0-990c9c6/darwin-arm64-11_0-toothrot
2021-10-05T20:15:01-d4b1ae0-0b4d499/android-amd64-emu
2021-10-05T20:15:01-d4b1ae0-0b4d499/darwin-arm64-11_0-toothrot
2021-10-05T19:28:36-d4b1ae0-7ae83c8/android-amd64-emu
2021-09-12T01:06:53-915f620-0d8a4bf/linux-arm64-packet
2021-08-31T08:41:16-3e0d083-f118d14/linux-arm64-packet
2021-08-08T18:37:08-507cc34/linux-mips64-rtrk
2021-06-25T17:31:39-d01bc57/linux-mips64-rtrk
2021-06-07T21:12:46-909dd5e/linux-arm64-packet
2021-05-26T16:11:00-bfd7798/linux-mips64le-rtrk
2021-05-21T21:07:03-3c65644/linux-mips64-rtrk
2021-05-17T18:03:56-a2c07a9/linux-mips64-rtrk
2021-05-07T18:14:25-0714010-af6123a/linux-mips64-rtrk
2021-05-07T17:43:31-b44c78b/linux-arm64-packet
2021-04-23T20:57:54-691e1b8/linux-mips64le-rtrk
2021-04-16T21:20:31-02a2ff4/linux-mips64-rtrk
2021-03-25T13:57:08-d1beb07-9f4d5c9/linux-mips64-rtrk
2021-03-17T00:17:33-77fc1ea-68f8e1a/linux-mips64le-rtrk
2021-03-12T01:47:01-a607408/linux-mips64le-rtrk
2021-03-08T20:41:06-a08adda/linux-mips64le-rtrk
2021-03-05T18:46:36-51d8d35/linux-mips64le-rtrk
2021-02-23T10:04:13-a78b0e6/linux-mips64-rtrk
2021-01-27T12:18:00-cd176b3/linux-mips64-rtrk
2021-01-19T12:49:13-d047c91/linux-arm64-packet
Note that several of these failures are on
linux/arm64
anddarwin/arm64
, which are first-class ports.(CC @cherrymui @jeremyfaller)
The text was updated successfully, but these errors were encountered: