image/gif: TestDecodeMemoryConsumption flakes #35166
Comments
I reproduced it by using a dragonfly gomote and repeatedly calling:
Reproduced about 4 times in 50 runs. I changed the test so that when a failure happens (because the heap is more than 30MB bigger at the end of decode than at the beginning), the test does a runtime.GC() and then measures the heap difference again. This new code shows that the GC fully recovers the 77MB, and actually 4MB more. So I'm guessing this is not a bug, just a case where memory is sometimes not quite scanned/freed in the same way by the initial GC call. If this happens a lot, we should probably just change the test threshold from 30MB to 100MB, or something like that.
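For readers unfamiliar with the test, the check described above boils down to comparing heap statistics before and after a decode, then forcing a collection when the delta looks too large. Here is a minimal, hypothetical sketch of that pattern; the helper names and the threshold parameter are illustrative, not the actual reader_test.go code:

```go
// Hypothetical sketch, not the actual image/gif test: measure the heap
// delta across a decode, and force a GC on a suspicious result to see
// whether the extra memory is really live or simply not yet collected.
package gif_test

import (
	"bytes"
	"image/gif"
	"runtime"
	"testing"
)

// heapInUse reports the bytes of heap memory currently in use.
func heapInUse() int64 {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return int64(ms.HeapInuse)
}

// checkDecodeMemoryDelta decodes data and fails if the heap grew by more
// than limit bytes, reporting how much a forced GC then recovers.
func checkDecodeMemoryDelta(t *testing.T, data []byte, limit int64) {
	before := heapInUse()
	if _, err := gif.DecodeAll(bytes.NewReader(data)); err != nil {
		t.Fatal(err)
	}
	growth := heapInUse() - before
	if growth <= limit {
		return
	}
	// Over the threshold: force a full GC and re-measure to see whether
	// the growth was real or just memory the GC hadn't reclaimed yet.
	runtime.GC()
	t.Errorf("heap grew by %d bytes (limit %d); %d bytes above baseline after runtime.GC",
		growth, limit, heapInUse()-before)
}
```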
There was a GC pacing bug floating around that I fixed in the 1.14 cycle (#28574), wherein assists didn't kick in often enough in GOMAXPROCS=1 situations and you ended up with accidental, unnecessary heap growth. There's a chance this is something in that vein?
Hmmm, I should have said that I reproduced it at the change where the bug was originally reported, which is 316fb95 (around Oct 25th). So the pacing fix for #28574 (submitted Sept 4th) would have been included, but the problem still occurred. But I just tried with the most recent master (today), and I can't seem to reproduce the problem. So, was there any other GC pacing change made recently (that is, since Oct. 25th)?
@danscales I should clarify: I figured this is unrelated to the fix from a while ago given the timelines, but I did recently land a GC pacing change as part of the allocator work, which may account for the difference you're seeing here. In particular, we prevent the mark phase from starting too early, which leads to more assists but prevents a pacing-related heap size explosion if you're allocating rapidly. I figured this wouldn't have much of an effect without my other patches, but maybe it's doing something here.
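As a rough illustration of what rapid allocation can do to the heap goal, one can watch HeapAlloc against NextGC (the pacer's goal) in a toy program like the one below. This is not a reproduction of the flake, just an assumed way to observe pacing behavior; whether it ever overshoots depends on the runtime version and the machine.

```go
// Toy program: allocate quickly on a single P and print how HeapAlloc
// compares to NextGC, the pacer's heap goal. Persistent, large overshoots
// are the kind of pacing-driven heap growth discussed above.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	runtime.GOMAXPROCS(1) // the GOMAXPROCS=1 scenario mentioned above

	var keep [][]byte
	var ms runtime.MemStats
	for i := 0; i < 500_000; i++ {
		keep = append(keep, make([]byte, 1024)) // ~1 KiB per iteration
		if i%50_000 == 0 {
			runtime.ReadMemStats(&ms)
			fmt.Printf("i=%6d HeapAlloc=%3d MiB goal(NextGC)=%3d MiB\n",
				i, ms.HeapAlloc>>20, ms.NextGC>>20)
			keep = nil // drop references so the live set resets
		}
	}
}
```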
@golang/runtime: this looks like possibly a GC pacing or heap-measurement bug?
If I recall correctly, this is one of the tests I looked into for #52433 as well. It doesn't quite allocate in a loop, but I think the same thing can happen because the generated GIF used for testing is trivial, so a lot of time is spent in allocation anyway. I'll confirm this.
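To make the point about a trivial generated GIF concrete, here is one hypothetical way to build such an input and decode it repeatedly; the frame count and frame size are invented, not the values the real test uses.

```go
// Hypothetical sketch: build a GIF whose frames are tiny and trivially
// compressible, then decode it repeatedly. Decoding such an input does
// little real work per frame, so allocation dominates.
package main

import (
	"bytes"
	"image"
	"image/color/palette"
	"image/gif"
	"log"
)

func main() {
	const frames = 1000 // invented count, not the test's actual value
	anim := &gif.GIF{}
	for i := 0; i < frames; i++ {
		frame := image.NewPaletted(image.Rect(0, 0, 10, 10), palette.Plan9)
		anim.Image = append(anim.Image, frame)
		anim.Delay = append(anim.Delay, 0)
	}

	var buf bytes.Buffer
	if err := gif.EncodeAll(&buf, anim); err != nil {
		log.Fatal(err)
	}

	// Decode the same bytes repeatedly; every iteration allocates fresh frames.
	for i := 0; i < 10; i++ {
		if _, err := gif.DecodeAll(bytes.NewReader(buf.Bytes())); err != nil {
			log.Fatal(err)
		}
	}
}
```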
Also got one on the (very slow) builder.
Change https://go.dev/cl/407139 mentions this issue: |
Well, here's the GC trace for this test on the NetBSD builder:
(There are only so many GC cycles because I was running with …) For instance, note the 10 ms wall-clock time on GC 1920, which is not reflected at all in the CPU time. That strongly suggests to me a ton of time spent trying to finish a GC cycle in mark termination, just like #52433. I'm not sure this is enough evidence to 100% confirm that that's the issue; I'd need to actually get a kernel-level scheduler trace to be sure. But this looks very platform-specific, at least for NetBSD.

That's not to mention the really, really high amount of time spent in sweep termination, which is reminiscent of what was going wrong on OpenBSD when I was trying to land #48409. And it looks like there's potentially a lot going wrong on NetBSD, performance-wise.

FWIW, it looks like the same fix for OpenBSD also helps NetBSD a lot (#52475). I sent a CL for that; it might help a little, but I still see some pretty large heap goal overruns. The easiest thing to do would be to skip this test on NetBSD (and maybe FreeBSD+ARM).
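If it does come to skipping, the guard would presumably look something like the sketch below; the test body, skip message, and exact platform list are illustrative, not an actual CL.

```go
// Hypothetical platform guard for the flaky test. Any real skip would be
// added to image/gif's reader_test.go with whatever wording reviewers prefer.
package gif_test

import (
	"runtime"
	"testing"
)

func TestDecodeMemoryConsumption(t *testing.T) {
	if runtime.GOOS == "netbsd" || (runtime.GOOS == "freebsd" && runtime.GOARCH == "arm") {
		t.Skipf("skipping on %s/%s: GC pacing makes the heap-growth check flaky (golang.org/issue/35166)",
			runtime.GOOS, runtime.GOARCH)
	}
	// ... existing test body would follow ...
}
```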
NetBSD appears to have the same issue OpenBSD had in runqgrab. See issue #52475 for more details.

For #35166.

Change-Id: Ie53192d26919b4717bc0d61cadd88d688ff38bb4
Reviewed-on: https://go-review.googlesource.com/c/go/+/407139
Run-TryBot: Michael Knyszek <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
OK actually, looking back over the logs for my NetBSD runs, the wall-clock times of the GC cycles are much less egregious, and there's a much lower chance that something goes wrong here. I'm inclined to call this fixed? I don't know about freebsd/arm, but I don't think we've seen similar issues on FreeBSD in general. Would it be reasonable to blame the builder? I can add a skip for that platform if it happens again.
Seems ok to close it as maybe-fixed and let it be reopened if the flake shows up again.
Seen on the dragonfly-amd64 builder (https://build.golang.org/log/40aa63372648fd03bc608538deaf94c00c314369):

Possibly a GC bug? (CC @aclements @mknyszek @danscales)