
Use GC.enable_logging(true) at start of test suite #1432

Merged: 1 commit merged from mh/HACK-gc-logging into master on Sep 29, 2023

Conversation

fingolfin
Member

We've been seeing crashes with Julia 1.10 and nightly, which may be GC
related. Enabling this logging may reveal whether abnormal GC behavior
is involved.

CC @thofma @benlorenz
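
For reference, the change itself is tiny. A minimal sketch of what enabling the logging in test/runtests.jl could look like (the placement shown here is an assumption for illustration, not the exact diff of this PR):

  # Sketch: turn on GC logging before any tests run, so every collection
  # prints its pause time and heap statistics to the CI log.
  using Test

  GC.enable_logging(true)

  # ... existing test includes follow here ...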

@codecov

codecov bot commented Sep 13, 2023

Codecov Report

Merging #1432 (1a21def) into master (146a03d) will decrease coverage by 2.24%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1432      +/-   ##
==========================================
- Coverage   84.69%   82.46%   -2.24%     
==========================================
  Files         110      110              
  Lines       29361    29316      -45     
==========================================
- Hits        24868    24174     -694     
- Misses       4493     5142     +649     

see 57 files with indirect coverage changes


test/runtests.jl: review comment (outdated, resolved)
@fingolfin fingolfin marked this pull request as draft September 13, 2023 06:36
@fingolfin fingolfin changed the title from "HACK: GC.enable_logging(true) at start of test suite" to "HACK: GC.enable_logging(true) at start of test suite (DO NOT MERGE)" on Sep 13, 2023
@fingolfin
Member Author

Looking at the log of the failed Julia nightly run, we note that the last reported GC runs occur here:

Test Summary:            | Pass  Total  Time
Generic.MPoly.exceptions |    3      3  0.0s

GC: pause 125.76ms. collected 864.270363MB. incr 
Heap stats: bytes_mapped 3520.86 MB, bytes_resident 1400.77 MB,
heap_size 1435.48 MB, heap_target 1697.98 MB, Fragmentation 3.536

GC: pause 274.34ms. collected 537.460175MB. incr 
Heap stats: bytes_mapped 3520.86 MB, bytes_resident 1407.23 MB,
heap_size 1508.68 MB, heap_target 1683.68 MB, Fragmentation 3.361

GC: pause 231.96ms. collected 398.666481MB. incr 
Heap stats: bytes_mapped 3520.86 MB, bytes_resident 1450.83 MB,
heap_size 1581.33 MB, heap_target 1843.83 MB, Fragmentation 3.324

GC: pause 325.54ms. collected 299.081589MB. incr 
Heap stats: bytes_mapped 3520.86 MB, bytes_resident 1537.69 MB,
heap_size 1776.08 MB, heap_target 2694.83 MB, Fragmentation 2.995

GC: pause 672.68ms. collected 222.312836MB. incr 
Heap stats: bytes_mapped 3520.86 MB, bytes_resident 1873.08 MB,
heap_size 2617.11 MB, heap_target 4629.61 MB, Fragmentation 2.146

GC: pause 3305.56ms. collected 534.305489MB. incr 
Heap stats: bytes_mapped 3520.86 MB, bytes_resident 2521.42 MB,
heap_size 4246.78 MB, heap_target 8140.53 MB, Fragmentation 1.559
Test Summary:                    |  Pass  Total   Time
Generic.LaurentMPoly.conformance | 10928  10928  44.0s
...

This is up to line 2305 of the logs; it then proceeds to run a lot of tests without any GC, and then in line 2439ff. we see:

Test Summary:                           | Pass  Total  Time
Generic.FactoredFrac.ZZ._bases_are_nice |  300    300  9.4s

[1828] signal (15): Terminated
in expression starting at /home/runner/work/_actions/julia-actions/julia-runtest/v1/test_harness.jl:7

So what we see is that the GC repeatedly doubles the heap_target until it reaches 8 GB. My suspicion is that the crash happens when it tries to double the heap_target again, to 16 GB, which the CI runner simply does not have.
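
A quick way to sanity-check this hypothesis is to compare the next doubling of the reported heap_target with the runner's physical memory (illustrative sketch; the 8140.53 MB value is copied from the excerpt above, and the doubling step is an assumption about the heuristic):

  heap_target_mb = 8140.53                    # last heap_target reported before the crash
  next_target_gb = 2 * heap_target_mb / 1024  # ~16 GB if the GC doubles once more
  total_gb = Sys.total_memory() / 2^30        # physical RAM of the runner
  println("next doubling needs ~", round(next_target_gb, digits=1),
          " GB, runner has ~", round(total_gb, digits=1), " GB")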

In contrast, in the passing 1.10 job log the heap grows more slowly. E.g. the section corresponding to the excerpt above reads:

Test Summary:            | Pass  Total  Time
Generic.MPoly.exceptions |    3      3  0.0s

GC: pause 100.40ms. collected 545.834167MB. incr 
Heap stats: bytes_mapped 3520.86 MB, bytes_resident 1346.64 MB,
heap_size 1387.46 MB, heap_target 1562.46 MB, Fragmentation 3.416

GC: pause 222.62ms. collected 487.214649MB. incr 
Heap stats: bytes_mapped 3520.86 MB, bytes_resident 1346.64 MB,
heap_size 1433.27 MB, heap_target 1652.02 MB, Fragmentation 3.323

GC: pause 248.17ms. collected 417.205734MB. incr 
Heap stats: bytes_mapped 3520.86 MB, bytes_resident 1401.66 MB,
heap_size 1533.22 MB, heap_target 1795.72 MB, Fragmentation 3.181

GC: pause 308.56ms. collected 485.730560MB. incr 
Heap stats: bytes_mapped 3520.86 MB, bytes_resident 1498.80 MB,
heap_size 1663.80 MB, heap_target 2451.30 MB, Fragmentation 3.003
Test Summary:                    |  Pass  Total   Time
Generic.LaurentMPoly.conformance | 10951  10951  23.1s

I.e. the heap target is just ~2.5 GB instead of ~8 GB.

There are multiple further GC runs before and after the point at which the Julia nightly log reports a crash. The heap size fluctuates a bit afterwards -- notably, it even decreases sometimes. See e.g. this excerpt:

Test Summary:                     | Pass  Total  Time
Generic.UnivPoly.unsafe_operators |   10     10  0.1s

GC: pause 655.54ms. collected 1850.825077MB. incr 
Heap stats: bytes_mapped 3520.86 MB, bytes_resident 1761.80 MB,
heap_size 1877.97 MB, heap_target 2227.97 MB, Fragmentation 2.996

GC: pause 378.54ms. collected 548.819504MB. incr 
Heap stats: bytes_mapped 3520.86 MB, bytes_resident 1687.61 MB,
heap_size 1838.37 MB, heap_target 2057.12 MB, Fragmentation 3.056
Test Summary:           | Pass  Total  Time
MPolyEvaluate.ZZ(ZZ[x]) | 1008   1008  4.8s
Test Summary:        | Pass  Total  Time
MPolyEvaluate.ZZ(QQ) |  918    918  0.3s
Test Summary:        | Pass  Total  Time
MPolyEvaluate.QQ(ZZ) |  918    918  0.1s
Test Summary: | Pass  Total  Time
MPolyFactor   |    5      5  1.8s

GC: pause 391.83ms. collected 569.662781MB. incr 
Heap stats: bytes_mapped 3520.86 MB, bytes_resident 1722.41 MB,
heap_size 1889.83 MB, heap_target 2589.83 MB, Fragmentation 3.028
Test Summary: |  Pass  Total  Time

which is from before the point where the Julia nightly run crashed.

Now we can speculate about what is causing this. I am guessing some difference in the GC heuristics? But of course there could also be non-GC compounding factors, e.g. perhaps something takes up more heap space on nightly than on 1.10/1.9 (but what)?

@lgoettgens
Collaborator

I would disagree that this is a pure nightly thing. In the last few runs I found two cases where nightly succeeded but 1.10 failed: https://github.com/Nemocas/AbstractAlgebra.jl/actions/runs/6165888371/job/16734460315 and https://github.com/Nemocas/AbstractAlgebra.jl/actions/runs/6165878185/job/16734433308

@fingolfin
Member Author

I see. In that case I'll restart the 1.10 job here a couple times to see if it will fail eventually...

@fingolfin
Member Author

(No worries, the logs for older runs are then still available)

@gbaraldi

Could you set the --heap-size-hint to something like 5GB? I'm slightly confused why it's not detecting that it is in a constrained environment and instead sets a heap target of 8 GB, when I think the runner only has about 7 GB.

@fingolfin
Member Author

@gbaraldi hmmm, is there a way to tell the julia-actions/julia-runtest action to pass additional arguments to Julia? Otherwise we'd have to hack something up, I guess.

Is there a way to get Julia to print what it thinks are the heap limits?

@benlorenz
Collaborator

You can get and set the limits with ccalls instead:

  jlmax = @ccall jl_gc_get_max_memory()::UInt64
  totalmem = @ccall uv_get_total_memory()::UInt64
  constrmem = @ccall uv_get_constrained_memory()::UInt64
  println("Memory:   max: ", Base.format_bytes(jlmax))
  println("        total: ", Base.format_bytes(totalmem))
  println("       constr: ", Base.format_bytes(constrmem))

  memenv = parse(Int, get(ENV, "OSCARCI_MAX_MEM_GB", "5")) * 2^30
  println("Setting heap size hint to ", Base.format_bytes(memenv))
  @ccall jl_gc_set_max_memory(memenv::UInt64)::Cvoid

And then use the default of 5 or pass OSCARCI_MAX_MEM_GB in the github environment.
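
A quick way to verify that the limit actually took effect (illustrative; it just reuses the ccall from the snippet above after it has run):

  jlmax = @ccall jl_gc_get_max_memory()::UInt64
  println("effective max heap: ", Base.format_bytes(jlmax))  # expect ~5 GiB with the default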

The defaults are set like this:

#ifdef _P64
    total_mem = uv_get_total_memory();
    uint64_t constrained_mem = uv_get_constrained_memory();
    if (constrained_mem > 0 && constrained_mem < total_mem)
        jl_gc_set_max_memory(constrained_mem - 250*1024*1024); // LLVM + other libraries need some amount of memory
#endif
    if (jl_options.heap_size_hint)
        jl_gc_set_max_memory(jl_options.heap_size_hint - 250*1024*1024);
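
So with e.g. --heap-size-hint=5G one would expect the effective limit to be the hint minus the 250 MiB reserve; this can be checked the same way (illustrative, reusing the ccall from above):

  # Start Julia with `julia --heap-size-hint=5G`, then:
  hint  = 5 * 2^30
  jlmax = @ccall jl_gc_get_max_memory()::UInt64
  println("hint - 250 MiB = ", Base.format_bytes(hint - 250 * 2^20))
  println("reported max   = ", Base.format_bytes(jlmax))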

@fingolfin
Member Author

OK thanks @benlorenz I've now added this to this PR, let's see

@fingolfin
Member Author

With the 5GB limit, 1.10 passes just fine, but nightly gets stuck with a full heap, collecting tiny amounts of memory in never-ending collections:

...
Heap stats: bytes_mapped 3840.94 MB, bytes_resident 1555.33 MB,
heap_size 5129.76 MB, heap_target 5120.00 MB, Fragmentation 348.055
GC: pause 449.67ms. collected 0.551529MB. full 
Heap stats: bytes_mapped 3840.94 MB, bytes_resident 1555.33 MB,
heap_size 5123.29 MB, heap_target 5120.00 MB, Fragmentation 349.884
GC: pause 486.68ms. collected 0.322746MB. full 
Heap stats: bytes_mapped 3840.94 MB, bytes_resident 1555.33 MB,
heap_size 5120.00 MB, heap_target 5120.00 MB, Fragmentation 351.261
GC: pause 485.41ms. collected 0.235336MB. full 
Heap stats: bytes_mapped 3840.94 MB, bytes_resident 1555.33 MB,
heap_size 5125.18 MB, heap_target 5120.00 MB, Fragmentation 352.193
GC: pause 460.01ms. collected 0.345070MB. full 
Heap stats: bytes_mapped 3840.94 MB, bytes_resident 1555.33 MB,
heap_size 5120.42 MB, heap_target 5120.00 MB, Fragmentation 353.894
GC: pause 467.79ms. collected 0.347397MB. full 
Heap stats: bytes_mapped 3840.94 MB, bytes_resident 1555.33 MB,
heap_size 5122.13 MB, heap_target 5120.00 MB, Fragmentation 355.076
GC: pause 445.63ms. collected 0.510376MB. full 
...

That suggests memory usage has grown considerably from Julia 1.10 to nightly/master. It would be good to find out why; perhaps we should file a separate JuliaLang issue for that?
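
One way to narrow this down (a sketch, not part of this PR) would be to log the live heap around individual testsets and compare the numbers between 1.10 and nightly:

  # Sketch: report live-heap growth across a block of tests (the file name is hypothetical).
  function report_live_heap(f, name)
      GC.gc(); before = Base.gc_live_bytes()
      f()
      GC.gc(); after = Base.gc_live_bytes()
      println(name, ": live heap ", Base.format_bytes(before), " -> ", Base.format_bytes(after))
  end

  report_live_heap("Generic.LaurentMPoly") do
      include("generic/LaurentMPoly-test.jl")
  end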

@gbaraldi

Yeah, this doesn't look like a GC issue but a GC symptom ;). If you can minimize something that reproduces the regression, or at least bisect it, that would be awesome!

@benlorenz
Collaborator

benlorenz commented Sep 14, 2023

I tried to look at the memory issue but right now this is rather difficult because of the random nature of many testsets. This would easily explain why it sometimes fails, but it doesn't really explain why it happens more often with 1.10 and 1.11.
See #1437 for more details.

Edit: For a similar crash with Julia 1.9 see for example: https://github.com/Nemocas/AbstractAlgebra.jl/actions/runs/6275271908/job/17042406344#step:6:1289

@fingolfin
Member Author

@gbaraldi I am not so sure I agree that "this doesn't look like a GC issue but a GC symptom", at least not entirely. After all we also saw/see crashes in 1.10 without the 5GB heap limit. We can of course manually set such a limit, but it seems like a GC bug that it tries to exceed the RAM...

Case in point: here is the output of the debug code given to me by @benlorenz, which I inserted:

In Julia 1.9:

Memory:   max: 4.171 GiB
        total: 6.760 GiB
       constr: 16384.000 PiB

In Julia 1.10 and nightly:

Memory:   max: 2.000 PiB
        total: 6.769 GiB
       constr: 16384.000 PiB
Setting heap size hint to 5.000 GiB

Note how 1.9 uses ~4 GB as "max" while 1.10 and nightly use 2 petabytes. Is that intentional?
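
For what it's worth, both odd-looking numbers decode to exact powers of two, which to me looks like "effectively unlimited" sentinel values rather than anything derived from the actual machine (my interpretation, not confirmed against the Julia sources):

  println(Base.format_bytes(2^51))             # "2.000 PiB"     -- the "max" on 1.10/nightly
  println(Base.format_bytes(typemax(UInt64)))  # "16384.000 PiB" -- the "constr" value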

@fingolfin fingolfin changed the title from "HACK: GC.enable_logging(true) at start of test suite (DO NOT MERGE)" to "Use GC.enable_logging(true) at start of test suite" on Sep 29, 2023
@fingolfin fingolfin marked this pull request as ready for review September 29, 2023 12:35
@fingolfin
Member Author

Perhaps we should just merge this (possibly with the max heap adjustment disabled): getting this GC output would possibly be helpful in debugging further CI failures.

Thoughts?

@benlorenz (Collaborator) left a comment:

I think it would be good to have this by default, as it produces more runs with the debug output to compare against. (And usually it isn't that much output, unless the GC is thrashing and about to die anyway)

And I would turn the jl_gc_set_max_memory call off for now; I think this needs more testing.

- call `GC.enable_logging(true)` at start of test suite
- print information about what the GC thinks is the available memory

We've been seeing crashes with Julia 1.10 and nightly, which may be GC
related. Enabling this logging may reveal whether abnormal GC behavior
is involved.
@fingolfin
Member Author

I've disabled the call to jl_gc_set_max_memory. So from my POV we could merge this now.

I'll wait a bit to see if @thofma has concerns.

@thofma
Member

thofma commented Sep 29, 2023

Looks good. I agree that we should not mess with the heap-size-hint, in the hope that this will eventually be figured out by the Julia devs.

@thofma thofma merged commit cf9289f into master Sep 29, 2023
13 of 15 checks passed
@thofma thofma deleted the mh/HACK-gc-logging branch September 29, 2023 15:11