runtime: darwin memory corruption? #22988
Very weird that it failed only on darwin-amd64. I just saw this runtime flake on darwin-amd64: #22987. I wonder if there's a darwin-amd64 memory corruption issue. A trybot retry would be nice to see whether the failure was a flake or not.
darwin-amd64 is a pool of 10 physical machines running VMware, hosting at most 20 VMs. It's possible that it's bad memory, but that wouldn't be my first guess.
I was thinking software memory corruption, not hardware.
Ah, indeed.
The retry has passed darwin-amd64, so I'm thinking it's not a compiler issue.
Reassign to @aclements for runtime memory corruption?
Unexpected darwin segfault: https://storage.googleapis.com/go-build-log/d5865919/darwin-amd64-10_11_bcefed3a.log
It looks like either clang or testp1 crashed with a segmentation fault without printing anything. Unfortunately, the test's logs don't make it possible to tell which. Either way, something is very broken: either clang (provided by the system) segfaulted, or the Go test program's SIGSEGV handler misbehaved.
The darwin-amd64-10_11 builder has been pretty stable on build.golang.org. Are there any notable differences between the darwin-amd64-10_11 builders and trybots?
@mdempsky, they're identical (same VM images). The only difference is that trybot runs are sharded out over 4 VMs while build.golang.org runs are sharded out over 3 VMs.
Change https://golang.org/cl/83016 mentions this issue: |
Change https://golang.org/cl/83015 mentions this issue: |
heapBits.bits is used during bulkBarrierPreWrite via heapBits.isPointer, which means it must not be preempted. If it is preempted, several bad things can happen:

1. This could allow a GC phase change, and the resulting shear between the barriers and the memory writes could result in a lost pointer.

2. Since bulkBarrierPreWrite uses the P's local write barrier buffer, if it also migrates to a different P, it could try to append to the write barrier buffer concurrently with another write barrier. This can result in the buffer's next pointer skipping over its end pointer, which results in a buffer overflow that can corrupt arbitrary other fields in the Ps (or anything in the heap, really, but it'll probably crash from the corrupted P quickly).

Fix this by marking heapBits.bits go:nosplit. This would be the perfect use for a recursive no-preempt annotation (#21314).

This doesn't actually affect any binaries because this function was always inlined anyway. (I discovered it when I was modifying heapBits and made h.bits() no longer inline, which led to rampant crashes from problem 2 above.)

Updates #22987 and #22988 (but doesn't fix because it doesn't actually change the generated code).

Change-Id: I60ebb928b1233b0613361ac3d0558d7b1cb65610
Reviewed-on: https://go-review.googlesource.com/83015
Run-TryBot: Austin Clements <[email protected]>
Reviewed-by: Matthew Dempsky <[email protected]>
Reviewed-by: Rick Hudson <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
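For readers less familiar with the runtime's preemption machinery: the stack-growth check in a function's prologue is one of the points where a goroutine can be preempted, and `//go:nosplit` removes it. Below is a minimal, hypothetical sketch of that mechanism; the function name and bitmap layout are made up for illustration and this is not the runtime's actual heapBits code.

```go
package main

import "fmt"

// isPointerBit is a made-up stand-in for heapBits.bits/heapBits.isPointer;
// the bitmap layout here is purely illustrative.
//
//go:nosplit
func isPointerBit(bitmap []byte, word uintptr) bool {
	// The nosplit annotation means this function gets no stack-split
	// prologue, so calling it can never trigger stack growth (and so
	// never hits that preemption point).
	return bitmap[word/8]>>(word%8)&1 != 0
}

func main() {
	bitmap := []byte{0x05} // hypothetical: words 0 and 2 hold pointers
	fmt.Println(isPointerBit(bitmap, 0)) // true
	fmt.Println(isPointerBit(bitmap, 1)) // false
}
```

As the commit notes, the annotation changes nothing in practice here since the function was always inlined; it only guarantees the no-preemption property if inlining ever stops happening.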
Currently, wbBufFlush does nothing if the goroutine is dying, on the assumption that the system is crashing anyway and running the write barrier may crash it even more. However, it fails to reset the buffer's "next" pointer. As a result, if there are later write barriers on the same P, the write barrier will overflow the write barrier buffer and start corrupting other fields in the P or other heap objects. Often, this corrupts fields in the next allocated P since they tend to be together in the heap.

Fix this by always resetting the buffer's "next" pointer, even if we're not doing anything with the pointers in the buffer.

Updates #22987 and #22988. (May fix; it's hard to say.)

Change-Id: I82c11ea2d399e1658531c3e8065445a66b7282b2
Reviewed-on: https://go-review.googlesource.com/83016
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Rick Hudson <[email protected]>
Reviewed-by: Matthew Dempsky <[email protected]>
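The shape of that fix is easy to see in miniature. Here is a hedged sketch with hypothetical types: the real buffer lives in runtime/mwbbuf.go and tracks positions with pointers rather than indices; only the reset-on-every-path structure mirrors the description above.

```go
package main

import "fmt"

// wbBuf is a toy stand-in for the runtime's per-P write barrier
// buffer; next is an index here rather than a pointer into buf.
type wbBuf struct {
	next int
	buf  [512]uintptr
}

func (b *wbBuf) reset() { b.next = 0 }

// flush drains the buffer. The bug described above was an early
// return on the dying path that skipped the reset, so later write
// barriers on the same P kept appending past the end of buf.
func flush(b *wbBuf, dying bool) {
	if dying {
		// Discard the pointers, but still reset so subsequent
		// barriers start from a valid position.
		b.reset()
		return
	}
	// ... process b.buf[:b.next] here (elided) ...
	b.reset()
}

func main() {
	b := &wbBuf{next: 42}
	flush(b, true)
	fmt.Println(b.next) // 0: safe for the next barrier either way
}
```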
I ran 430 runs on the darwin-amd64-10_11 builder. It turns out after ~350 runs it will start reliably failing with
But if we ignore those, the few other failures I got all look like completely plausible flakes (net timeouts, etc.) and not like memory corruption. @bradfitz (or anyone): are there significant differences between how the trybots run all.bash on darwin-amd64-10_11 and me just firing up a gomote and running all.bash in a loop?
Well, they don't run But that's basically what
I've been running the following at 23aefcd for the past few days:
Of 738 runs, there are a few that may be memory corruption:
There are also 2 instances of
though I suspect that's just a network flake, plus 8 network timeouts, and 4 file system failures. The file system failures are interesting, since they manifest as missing files, but I think these are an infrastructure problem, since they include two
It's interesting that these are all in cmd/compile or cmd/link with the exception of https://storage.googleapis.com/go-build-log/7d48d376/darwin-amd64-10_11_f834bf71.log. It might be possible to reproduce this faster by just running lots of builds without tests.
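One way to act on that suggestion, sketched as a Go harness. The exact build command is an assumption on my part, not something from the thread:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	for i := 1; ; i++ {
		// -a forces a full rebuild each iteration so the compiler
		// and linker do real work every time. To also test with
		// concurrent compilation off (see the next comment), one
		// could add "-gcflags=-c=1"; treat that flag as an
		// assumption too.
		cmd := exec.Command("go", "build", "-a", "std", "cmd")
		if out, err := cmd.CombinedOutput(); err != nil {
			fmt.Printf("run %d failed: %v\n%s", i, err, out)
			os.Exit(1)
		}
		fmt.Printf("run %d ok\n", i)
	}
}
```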
Given that these are all cmd/compile and cmd/link, it might be worth running a stress test with concurrent compilation off, to rule that in or out. (A race condition in the compiler could in theory lead to object file data corruption that would cause a linker crash.)
(And/or run a stress test with a race-enabled toolchain.)
I believe this may be a hardware problem. I started another run, this time recording the host name of failures. I've reproduced three failures that appear to be memory corruption, and all three of them happened on host
I got two more failures that look like memory corruption, but weren't on So I'd still like to try stress testing on
ETIMEOUT |
What is going on here?
https://storage.googleapis.com/go-build-log/5914916e/darwin-amd64-10_11_7e43df8a.log