runtime,cmd/compile: exit status 0xc0000374 (STATUS_HEAP_CORRUPTION) on windows-amd64-longtest #52647

Closed
bcmills opened this issue May 2, 2022 · 38 comments
Labels
compiler/runtime, FrozenDueToAge, NeedsInvestigation, OS-Windows

Comments

@bcmills
Contributor

bcmills commented May 2, 2022

#!watchflakes
post <- builder ~ `windows` && `0xc0000374`
XXXBANNERXXX:Test execution environment.
# GOARCH: amd64
# CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
# GOOS: windows
# OS Version: 10.0.14393
go tool compile: exit status 0xc0000374

go tool dist: FAILED: go list -f={{if .Stale}}	STALE {{.ImportPath}}: {{.StaleReason}}{{end}} std: exit status 1

According to https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55, this exit code means:

0xC0000374
STATUS_HEAP_CORRUPTION
A heap has been corrupted.
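
As a rough illustration of how such exit codes can be read, the sketch below decodes the NTSTATUS severity field (the top two bits) and looks up a name. The `describe` helper and the `knownStatus` table are hypothetical, not part of any Go or Windows API, and the table covers only the two codes quoted in this thread.

```go
package main

import "fmt"

// knownStatus lists only the two NTSTATUS values quoted in this thread.
var knownStatus = map[uint32]string{
	0xC0000005: "STATUS_ACCESS_VIOLATION",
	0xC0000374: "STATUS_HEAP_CORRUPTION",
}

// severity returns the NTSTATUS severity field, stored in the top two bits.
func severity(code uint32) string {
	return [...]string{"success", "informational", "warning", "error"}[code>>30]
}

// describe is a hypothetical helper for annotating exit codes found in logs.
func describe(code uint32) string {
	name, ok := knownStatus[code]
	if !ok {
		name = "unknown"
	}
	return fmt.Sprintf("0x%08x %s (severity: %s)", code, name, severity(code))
}

func main() {
	fmt.Println(describe(0xC0000374))
	// prints "0xc0000374 STATUS_HEAP_CORRUPTION (severity: error)"
}
```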


greplogs --dashboard -md -l -e \(\?ms\)\\Awindows-.\*0xc0000374

2022-04-27T14:23:28-f0c0e0f/windows-amd64-longtest

Since this has only been seen once, leaving on the backlog to see whether this is a recurring pattern or a one-off fluke.
(CC @golang/runtime)

@bcmills bcmills added the OS-Windows and NeedsInvestigation labels May 2, 2022
@bcmills bcmills added this to the Backlog milestone May 2, 2022
@bcmills
Contributor Author

bcmills commented May 9, 2022

Re-running the scan due to the possibility of failures masked by #52591:

greplogs --dashboard -md -l -e \(\?ms\)\\Awindows-.\*0xc0000374
2022-05-06T14:33:48-119da63/windows-amd64-longtest
2022-04-27T14:23:28-f0c0e0f/windows-amd64-longtest
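
For readers unfamiliar with the pattern, here is a small sketch of what the expression selects once the shell has removed the backslash escapes. It assumes, as the `\A` anchor implies, that each log's text begins with the builder name; the sample log texts and the `matches` helper are made up for illustration and are not greplogs internals.

```go
package main

import (
	"fmt"
	"regexp"
)

// failurePattern is the greplogs expression after shell unescaping.
// (?s) lets "." cross newlines, and \A pins the match to the start of the log.
var failurePattern = regexp.MustCompile(`(?ms)\Awindows-.*0xc0000374`)

// matches reports whether a log's text would be selected by the pattern.
func matches(logText string) bool {
	return failurePattern.MatchString(logText)
}

func main() {
	// Hypothetical log texts, each starting with a builder name.
	logs := []string{
		"windows-amd64-longtest\n# GOOS: windows\ngo tool compile: exit status 0xc0000374\n",
		"linux-amd64\nsome output mentioning 0xc0000374\n",
		"windows-386-2008\nok\n",
	}
	for _, l := range logs {
		fmt.Println(matches(l))
	}
}
```

Only the first sample is selected: the second does not start with `windows-`, and the third never mentions the exit code.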

@bcmills
Contributor Author

bcmills commented May 9, 2022

That's two matching failures within the Go 1.19 cycle (and even within the past couple weeks!) on windows/amd64, which is a first-class port, and none before that. Looks like a release-blocking regression to me.

(CC @golang/windows)

@bcmills bcmills modified the milestones: Backlog, Go1.19 May 9, 2022
@bcmills
Contributor Author

bcmills commented May 11, 2022

One more:
greplogs --dashboard -md -l -e \(\?ms\)\\Awindows-.\*0xc0000374 --since=2022-05-07
2022-05-10T15:43:40-8fdd277/windows-amd64-longtest

@prattmic
Member

Three days of continuous testing on 25 windows gomotes has gotten me zero of these failures, so I suspect I am missing some required component of the failure.

@prattmic prattmic removed their assignment May 23, 2022
@bcmills
Contributor Author

bcmills commented May 23, 2022

None on the dashboard for the past week or so, although that's somewhat to be expected with the CL rate decreasing from the freeze.

greplogs --dashboard -md -l -e \(\?ms\)\\Awindows-.\*0xc0000374 --since=2022-05-11

(0 matching logs)


Note that this has only been observed in the -longtest configuration, which is surprising because these failures all occurred during compilation! 😅

Maybe it has something to do with the shape of the machine? IIRC the -longtest builders have more RAM and perhaps also more CPUs. 🤔

(Or maybe it's some sort of crosstalk between tests and builds somehow? But that seems even more weird.)

@prattmic prattmic added the okay-after-beta1 label May 25, 2022
@gopherbot gopherbot removed the okay-after-beta1 label Jun 10, 2022
@prattmic
Member

Still no new cases since 2022-05-10. I ran another set of 25 builders over the weekend, this time creating a new VM for each test run. Nothing. I am inclined to close this and reopen if anyone discovers new cases.

@prattmic prattmic closed this as not planned Jun 13, 2022
Repository owner moved this from Todo to Done in Go Compiler / Runtime Jun 13, 2022
@prattmic prattmic moved this to Done in Release Blockers Jun 13, 2022
@bcmills
Contributor Author

bcmills commented Jun 14, 2022

Still no new cases since 2022-05-10.

Since the freeze started 2022-05-07 and the rate of CLs (and thus dashboard test runs) is much lower during the freeze, it's not surprising, and not necessarily meaningful, to have fewer (or no) failures during that interval. Looking at the runtime and cmd/compile commit history since then, I don't see anything that particularly stands out as a likely fix.

Running gomote builders can help if it also reproduces the initial failure (perhaps at one of the commits at which the failure was observed) as a control case, but unfortunately the actual builder configuration is subtle enough that a failure to reproduce a rare test failure on gomote instances could just mean that we haven't captured some critical aspect of the configuration (compare #32430).

I would be more comfortable closing out this issue if we have a plausible (even if unconfirmed) theory for how it could have been fixed by a code or configuration change since the last failure.

@prattmic
Member

Fair enough, reopened. However, beyond simply waiting for builders, I'm out of ideas for trying to reproduce this.

Perhaps someone on @golang/windows has more context about this error and what may trigger it (I've been assuming memory corruption in the C allocator)?

@prattmic prattmic reopened this Jun 14, 2022
@prattmic prattmic moved this from Done to Todo in Go Compiler / Runtime Jun 14, 2022
@bcmills
Contributor Author

bcmills commented Jun 14, 2022

Looking for common factors in the three data points so far, I notice that they are all missing the environment and bootstrapping headers for the run (#51050), and all end with the following line:

go tool dist: FAILED: go list -f={{if .Stale}}	STALE {{.ImportPath}}: {{.StaleReason}}{{end}} std: exit status 1

That suggests that the cmd/compile failure occurred as a subprocess of go list, and I would hope that there aren't that many compile actions that could occur there. 🤔

@bcmills
Contributor Author

bcmills commented Jun 14, 2022

The staleness check for std occurs at the start of each cmd/dist test run:
https://cs.opensource.google/go/go/+/master:src/cmd/dist/test.go;l=1286;drc=c29be2d41c6c3ed78a76b4d8d8c1c22d7e0ad5b7
https://cs.opensource.google/go/go/+/master:src/cmd/dist/test.go;l=246;drc=c29be2d41c6c3ed78a76b4d8d8c1c22d7e0ad5b7

I believe that that function runs once per dist test invocation, which may help to explain why we see it on the dashboard builders but not in local runs: the -longtest builders invoke dist test once per test shard, and IIRC the test shards are very small (compare #49343), whereas a local all.bash run only invokes dist test something like once.
https://cs.opensource.google/go/x/build/+/master:cmd/coordinator/buildstatus.go;l=1426-1486;drc=6219a16d3f2922994f4e84212473904f942eb53b

@bcmills
Contributor Author

bcmills commented Jun 14, 2022

The go tool compile: log line may come from here:
https://cs.opensource.google/go/go/+/master:src/cmd/go/internal/tool/tool.go;l=109;drc=d922c0a8f5035b0533eb6e912ffd7b85487e3942

But I don't know how that line could possibly be executed as part of go list. 🤔

@prattmic
Member

One thing I haven't tried is testing at exactly one of the commits that previously failed. To that end, I'll test at f0c0e0f (commit from the 2022-04-27 failure).

I've instrumented checkNotStale and you are right that we don't run it very often in standard all.bash (once per ##### test block). With sharding it should be running every few packages, I believe, so I can try increasing the number of staleness checks. That said, by my back-of-the-envelope calculations I think I've done ~5000 all.bash runs, so I've still run the staleness check quite a bit. (I have 578 other windows test failure logs sitting in /tmp!)

@bcmills
Contributor Author

bcmills commented Sep 6, 2022

An intriguing clue (2022-08-23T03:09:07-0a52d80/windows-amd64-longtest):

# GOARCH: amd64
# CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
# GOOS: windows
# OS Version: 10.0.14393
fatal error: advapi32.dll not found
runtime: panic before malloc heap initialized
go: error obtaining buildID for go tool compile: exit status 0xc0000005

go tool dist: FAILED: go list -f={{if .Stale}}	STALE {{.ImportPath}}: {{.StaleReason}}{{end}} std: exit status 1

@aclements
Member

No failures since the last one @bcmills reported.

@mknyszek mknyszek added the WaitingForInfo label Nov 30, 2022
@gopherbot
Contributor

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@bcmills
Contributor Author

bcmills commented Aug 11, 2023

Another failure during compile -V=full. Looks like heap corruption but it didn't get to the point of exiting with STATUS_HEAP_CORRUPTION.

runtime: s.allocCount= 1 s.nelems= 5
fatal error: s.allocCount != s.nelems && freeIndex == s.nelems
fatal error: unexpected signal during runtime execution
panic during panic
[signal 0xc0000005 code=0x0 addr=0x4 pc=0x348aa5]

runtime stack:
runtime.throw({0xcc0608, 0x2a})
	runtime/panic.go:859 +0x4d fp=0x20f998 sp=0x20f984 pc=0x338c6d
runtime.sigpanic()
	runtime/signal_windows.go:358 +0x2bd fp=0x20f9bc sp=0x20f998 pc=0x350c6d
runtime.preemptall()
	runtime/proc.go:5774 +0x35 fp=0x20f9d8 sp=0x20f9bc pc=0x348aa5
runtime.freezetheworld()
	runtime/proc.go:962 +0xd6 fp=0x20f9f0 sp=0x20f9d8 pc=0x33d326
runtime.startpanic_m()
	runtime/panic.go:1113 +0x157 fp=0x20fa04 sp=0x20f9f0 pc=0x3393c7
runtime.fatalthrow.func1()
	runtime/panic.go:1018 +0x2d fp=0x20fa24 sp=0x20fa04 pc=0x3390cd
runtime.fatalthrow(0x2)
	runtime/panic.go:1013 +0x69 fp=0x20fa3c sp=0x20fa24 pc=0x339089
runtime.throw({0xcc4fde, 0x31})
	runtime/panic.go:859 +0x4d fp=0x20fa50 sp=0x20fa3c pc=0x338c6d
runtime.(*mcache).nextFree(...)
	runtime/malloc.go:921
runtime.mallocgc(0x11c0, 0xc8a720, 0x1)
	runtime/malloc.go:1110 +0xbdd fp=0x20fac4 sp=0x20fa50 pc=0x301cad
runtime.newobject(0xc8a720)
	runtime/malloc.go:1322 +0x2a fp=0x20fad8 sp=0x20fac4 pc=0x301dfa
runtime.procresize(0x8)
	runtime/proc.go:5263 +0x367 fp=0x20fb3c sp=0x20fad8 pc=0x3471e7
runtime.schedinit()
	runtime/proc.go:762 +0x208 fp=0x20fb60 sp=0x20fb3c pc=0x33cb38
runtime.rt0_go()
	runtime/asm_386.s:243 +0x15f fp=0x20fb64 sp=0x20fb60 pc=0x37019f
go: error obtaining buildID for go tool compile: exit status 2
go tool dist: Failed logging metadata: exit status 1


Error: tests failed: dist test failed: {go_test:go/types go/types}: exit status 1

(In a TryBot on https://go.dev/cl/518776.)
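
The first fatal error in the trace is a consistency check on span metadata: the search for a free slot reached the end of the span, yet the allocation count says the span is not full. The toy model below only illustrates that invariant. The `span` type and `checkFull` function are simplified stand-ins, not the runtime's code (in the runtime the free index comes from scanning allocation bits rather than being stored like this), and the slot values are taken from the log line `s.allocCount= 1 s.nelems= 5`.

```go
package main

import (
	"errors"
	"fmt"
)

// span is a toy stand-in for the runtime's span metadata.
type span struct {
	nelems     uint16 // number of object slots in the span
	allocCount uint16 // slots the span believes are allocated
	freeIndex  uint16 // next candidate slot; equal to nelems when none is free
}

var errInconsistent = errors.New("s.allocCount != s.nelems && freeIndex == s.nelems")

// checkFull reports an error when the free-slot search says "full" but the
// allocation count disagrees, which is the condition seen in the trace.
func checkFull(s span) error {
	if s.freeIndex == s.nelems && s.allocCount != s.nelems {
		return errInconsistent
	}
	return nil
}

func main() {
	fmt.Println(checkFull(span{nelems: 5, allocCount: 5, freeIndex: 5})) // consistent: genuinely full
	fmt.Println(checkFull(span{nelems: 5, allocCount: 1, freeIndex: 5})) // values from the log: inconsistent
}
```

Because the two fields are maintained together, a mismatch like this usually points at memory being overwritten or never properly initialized, which fits the "looks like heap corruption" reading above.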

@bcmills bcmills reopened this Aug 11, 2023
@gopherbot gopherbot closed this as not planned Aug 11, 2023
@bcmills bcmills removed the WaitingForInfo label Aug 11, 2023
@bcmills bcmills removed this from the Go1.20 milestone Aug 11, 2023
@bcmills bcmills reopened this Aug 11, 2023
@bcmills
Contributor Author

bcmills commented Aug 11, 2023

Given the signal 0xc0000005 in the trace above, I wonder if this is also related to #54187.

@mknyszek
Contributor

@bcmills I think that failure is different. This failure is about STATUS_HEAP_CORRUPTION which AFAICT is an error from some Windows memory manager, not us. It could be our fault, but given that we haven't seen this failure in a while, I think the failure you referenced probably needs a new issue. Closing this for now.

@bcmills
Contributor Author

bcmills commented Aug 16, 2023

Fair enough. Filed as #62079.

@mknyszek
Contributor

Thank you!

@golang golang locked and limited conversation to collaborators Aug 15, 2024

6 participants