Memory allocations #686

Open · 2 of 3 tasks
simonbyrne opened this issue Jul 21, 2022 · 8 comments

Comments

@simonbyrne (Member) commented Jul 21, 2022

Memory allocations currently cause performance issues, as they trigger the garbage collector. This is worse when using MPI, since the GC can be triggered at different times on different processes, causing the communication to get out of sync. For example, in this profile:
[Screenshot: profiler timeline, 2022-07-21, showing GC pauses and MPI_Waitall on two processes]
You can see that the GC pause (in red) on the top process means that the MPI_Waitall on the bottom process takes longer (as it is waiting on data to be sent from the top process). This effect will get worse at higher core counts, hurting scaling.

Potential solutions

Minimize memory allocations

The memory allocations seem to be caused by a few things:

  • LU factorization in the linear solver (memory allocations in tridiagonal LU factorization #648)
  • Use of `Ref`s inside `bycolumn` closures. For some reason these aren't being elided. The easiest fix seems to be to use an immutable object as a scalar container for broadcasting (will require some changes to ClimaCore); see the sketch after this list.
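
A minimal sketch of such an immutable scalar container (illustrative only; not ClimaCore's actual design):

```julia
# Illustrative sketch, not ClimaCore's implementation: an immutable,
# zero-dimensional container that broadcasts like a scalar, so closures don't
# need to capture a heap-allocated Ref.
struct Scalar{T} <: AbstractArray{T, 0}
    val::T
end
Base.size(::Scalar) = ()
Base.getindex(s::Scalar) = s.val
Base.getindex(s::Scalar, ::Integer) = s.val

x = Scalar(2.0)
[1.0, 2.0, 3.0] .* x   # x behaves as a scalar under broadcasting; no Ref needed
```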

Synchronize calls to the garbage collector
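
A minimal sketch of the second idea, assuming a fixed step interval (the interval would need tuning to avoid out-of-memory errors; this is not an actual implementation):

```julia
# Sketch only: trigger GC collectively every `every` steps so that all MPI
# ranks pause at the same point, instead of at arbitrary rank-dependent times.
using MPI

function maybe_synced_gc(step; every = 100)
    if step % every == 0
        GC.gc()                      # every rank reaches this at the same step
        MPI.Barrier(MPI.COMM_WORLD)  # optional: re-align ranks after the pause
    end
    return nothing
end
```

In practice one might also disable the automatic collector (`GC.enable(false)`) between these collective calls, which is where the tuning and out-of-memory risk comes in.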

@tapios (Contributor) commented Jul 21, 2022

I suggest you focus on a fully explicit model first, before dealing with the implicit part (linear solver).

@charleskawczynski (Member) commented Sep 18, 2022

Our allocation table in #836 for each `step!` looks like:

┌───────────────────────────────────────────────────────────────────────┬─────────────┬───────────────┐
│ <file>:<line number>                                                  │ Allocations │ Allocations % │
│                                                                       │   (bytes)   │    (xᵢ/∑x)    │
├───────────────────────────────────────────────────────────────────────┼─────────────┼───────────────┤
│ ClimaCore/jUYzD/src/Fields/broadcast.jl:82                            │   1853472   │      84       │
│ ClimaAtmos.jl/examples/hybrid/schur_complement_W.jl:248               │   221184    │      10       │
│ ClimaAtmos.jl/examples/hybrid/sphere/baroclinic_wave_utilities.jl:171 │    56000    │       3       │
│ ClimaAtmos.jl/examples/hybrid/staggered_nonhydrostatic_model.jl:780   │    55296    │       3       │
│ SciMLBase/chsnh/src/scimlfunctions.jl:1608                            │    8064     │       0       │
│ OrdinaryDiffEq/QXAKd/src/perform_step/rosenbrock_perform_step.jl:43   │    5376     │       0       │
│ OrdinaryDiffEq/QXAKd/src/perform_step/rosenbrock_perform_step.jl:69   │    4032     │       0       │
│ ClimaAtmos.jl/examples/hybrid/sphere/baroclinic_wave_utilities.jl:435 │    2720     │       0       │
│ OrdinaryDiffEq/QXAKd/src/perform_step/rosenbrock_perform_step.jl:65   │    2688     │       0       │
│ ClimaCore/jUYzD/src/Fields/fieldvector.jl:228                         │    1344     │       0       │
│ ClimaAtmos.jl/examples/hybrid/sphere/baroclinic_wave_utilities.jl:121 │     256     │       0       │
└───────────────────────────────────────────────────────────────────────┴─────────────┴───────────────┘

I looked at each of these lines, and here's the summary:

  • `linsolve!` needs to be slightly modified to fix allocations due to broadcasting `Ref`s for the identity. I think this is slightly different from the other examples, and I have an idea of what form is needed to fix it.
  • `vertical∫_col` (see vertical∫_col is allocating, and is called inside a broadcast expression #714)
  • `to_scalar_coefs`, which uses `map`, seems to be allocating
  • SurfaceFluxes call (not sure what's wrong here):
            surface_conditions .=
                constant_T_saturated_surface_conditions.(
                    T_sfc,
                    Spaces.level(ᶜts, 1),
                    Geometry.UVVector.(Spaces.level(Y.c.uₕ, 1)),
                    Fields.Field(ᶜz_interior, axes(z_bottom)),
                    Fields.Field(z_surface, axes(z_bottom)),
                    Cd,
                    Ch,
                    params,
                )
  • Surprisingly, `Base.fill!(dest::FieldVector, value) = dest .= value` allocates. Maybe a `return nothing` is missing?
  • `p_sfc = reshape(parent(p_sfc), (1, p_int_size[2:end]...))`: the `reshape` probably incurs allocations
  • Upstream allocations:
Lines                                                                allocs (bytes)
SciMLBase/chsnh/src/scimlfunctions.jl:1608                           8064
OrdinaryDiffEq/QXAKd/src/perform_step/rosenbrock_perform_step.jl:43  5376
OrdinaryDiffEq/QXAKd/src/perform_step/rosenbrock_perform_step.jl:69  4032
OrdinaryDiffEq/QXAKd/src/perform_step/rosenbrock_perform_step.jl:65  2688

The OrdinaryDiffEq lines: the first one may very well be due to the call to `fill!`; the other two seem to be related to `@..` on `FieldVector`s.
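
A minimal sketch of that speculated tweak (unverified; `FieldVector` is ClimaCore's type, and the current one-liner is quoted above):

```julia
# Unverified sketch of the speculation above: same broadcast fill, but return
# `nothing` instead of the FieldVector.
import ClimaCore.Fields: FieldVector

function Base.fill!(dest::FieldVector, value)
    dest .= value
    return nothing
end
```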

The SciMLBase line: now that I've looked a bit more closely, there are still a few places where we are allocating fields (e.g., in `held_suarez_tendency!`).

bors bot added a commit to CliMA/ClimaCore.jl that referenced this issue Sep 28, 2022
962: Add support for definite column integrals r=simonbyrne a=charleskawczynski

This PR adds `column_integral_definite!`, a non-allocating definite integral for columns. The idea is that it will replace the existing (allocating) definite integral [here](https://github.com/CliMA/ClimaAtmos.jl/blob/44cee55def2a51433c5dcdcdae010b7777cd741b/examples/hybrid/sphere/baroclinic_wave_utilities.jl#L167-L186). The new implementation has several advantages:
 - It's compatible with `bycolumn`, so the computation can occur in parallel
 - It's allocation-free
 - It's _tested_ on single column and sphere configurations

Looking at the flame graph for allocations (using the Julia 1.8 Allocs module with `sample_rate=1`), this is responsible for most of the remaining allocations in ClimaAtmos:
<img width="1993" alt="Screen Shot 2022-09-26 at 7 57 05 AM" src="https://user-images.githubusercontent.com/1880641/192353757-03101c41-2c0b-4ccb-a8b9-43faa78680f8.png">
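
For reference, a sketch of that Julia 1.8 allocation-profiler invocation (`integrator` is a placeholder):

```julia
# Sketch of the Julia ≥ 1.8 allocation profiler mentioned above; `integrator`
# stands in for the actual simulation object.
using Profile

Profile.Allocs.@profile sample_rate = 1 step!(integrator)
results = Profile.Allocs.fetch()   # inspect directly or visualize, e.g. with PProf.jl
```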


The interface for the added function captures two cases:

```julia
function column_integral_definite!(col∫field::Field, field::Field)
    bycolumn(axes(field)) do colidx
        column_integral_definite!(col∫field[colidx], field[colidx])
        nothing
    end
    return nothing
end
```
and
```julia
function column_integral_definite!(
    col∫field::PointField,
    field::ColumnField,
)
    @inbounds col∫field[] = column_integral_definite(field)
    return nothing
end
```

A step towards closing #943, #748, [CA#686](CliMA/ClimaAtmos.jl#686).

## A note on an alternative approach
Ideally, we would write this function as `column_integral_definite!(fn, args...)` where we might be able to write a broadcast statement like:

```julia
@. f2D = column_integral_definite(f3D) do z
     f3D.x * z^2
end
```

however, this would require us to define broadcasting between planar and 3D domains, which is not trivial (or maybe not possible) because of ambiguities. The ambiguities arise because (2D, 3D) broadcasting may want different things in different cases, for example:

 - (f2D, f3D) -> f3D: mul full 3D field by planar surface value
 - (f2D, f3D) -> f2D: perform reduction over z-coordinate to yield 2D field

The situation is similar when thinking about what happens when we make views. For example,

```julia
Fields.bycolumn(axes(f3D)) do colidx
     @. f2D[colidx] = column_integral_definite(f3D[colidx]) do z
         f3D[colidx].x * z^2
      end
end
```

Now, we have to define how `DataF` data layouts broadcast with `VF`. Again, we have two cases:

 - (f0D, f1D) -> f1D: mul full 1D field by 0D field
 - (f0D, f1D) -> f0D: perform reduction over z-coordinate to yield 0D field

My vote/preference is to support the first case (which is partially supported already) and write custom functions (e.g., reductions) that operate on single fields for the second case.

Co-authored-by: Charles Kawczynski <[email protected]>
Co-authored-by: Simon Byrne <[email protected]>
bors bot added a commit to CliMA/ClimaCore.jl that referenced this issue Oct 1, 2022
962: Add support for definite column integrals r=charleskawczynski a=charleskawczynski

Co-authored-by: Charles Kawczynski <[email protected]>
bors bot added a commit that referenced this issue Oct 1, 2022
880: Cache flux BCs r=charleskawczynski a=charleskawczynski

This PR caches flux boundary conditions, for 
 - `dif_flux_uₕ`
 - `dif_flux_energy`
 - `dif_flux_ρq_tot`
to address [this comment](#686 (comment)) (`Operators.SetValue(.-dif_flux_uₕ)` is allocating).

Note that it seems that the coupler may be impacted by this, due to the `coupled` flag. cc `@LenkaNovak`

A step towards #686.

Co-authored-by: Charles Kawczynski <[email protected]>
bors bot added a commit that referenced this issue Oct 3, 2022
821: make GC deterministic in distributed r=simonbyrne a=simonbyrne

# PULL REQUEST

## Purpose and Content
This should reduce MPI Waitall time by manually triggering the GC across all processes at the same time.

## Benefits and Risks
The number of steps will require some tuning to avoid out-of-memory errors

## Linked Issues
- Item 3 of #635 
- Mentioned in #686
- Supersedes #687


## PR Checklist
- [x] This PR has a corresponding issue OR is linked to an SDI.
- [x] I have followed CliMA's codebase [contribution](https://clima.github.io/ClimateMachine.jl/latest/Contributing/) and [style](https://clima.github.io/ClimateMachine.jl/latest/DevDocs/CodeStyle/) guidelines OR N/A.
- [x] I have followed CliMA's [documentation policy](https://github.com/CliMA/policies/wiki/Documentation-Policy).
- [x] I have checked all issues and PRs and I certify that this PR does not duplicate an open PR.
- [x] I linted my code on my local machine prior to submission OR N/A.
- [x] Unit tests are included OR N/A.
- [x] Code used in an integration test OR N/A.
- [x] All tests ran successfully on my local machine OR N/A.
- [x] All classes, modules, and function contain docstrings OR N/A.
- [x] Documentation has been added/updated OR N/A.


Co-authored-by: Simon Byrne <[email protected]>
@charleskawczynski (Member) commented Oct 5, 2022

Using the same allocation script, our main branch has the following allocs table:

[ Info: allocs_perf_target_unthreaded: 1 unique allocating sites, 43008 total bytes
┌─────────────────────────────────────────────────────────────────────┬─────────────┬───────────────┐
│ <file>:<line number>                                                │ Allocations │ Allocations % │
│                                                                     │   (bytes)   │    (xᵢ/∑x)    │
├─────────────────────────────────────────────────────────────────────┼─────────────┼───────────────┤
│ ClimaAtmos.jl/examples/hybrid/staggered_nonhydrostatic_model.jl:330 │    43008    │      100      │
└─────────────────────────────────────────────────────────────────────┴─────────────┴───────────────┘

This line points to an `@nvtx` macro, which I think is mis-attributed. I've added a PR (#894) that widens the allocation monitoring to more packages (including ClimaTimeSteppers); this is the updated table:

[ Info: allocs_perf_target_unthreaded: 20 unique allocating sites, 10918647 total bytes (truncated)
┌─────────────────────────────────────────────────────────────────────┬─────────────┬───────────────┐
│ <file>:<line number>                                                │ Allocations │ Allocations % │
│                                                                     │   (bytes)   │    (xᵢ/∑x)    │
├─────────────────────────────────────────────────────────────────────┼─────────────┼───────────────┤
│ ClimaTimeSteppers/y3D2E/src/ClimaTimeSteppers.jl:44                 │   5840151   │      53       │
│ ClimaTimeSteppers/y3D2E/src/solvers/newtons_method.jl:46            │   3014144   │      28       │
│ ClimaTimeSteppers/y3D2E/src/integrators.jl:147                      │   1510496   │      14       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:398                 │   341440    │       3       │
│ ClimaTimeSteppers/y3D2E/src/integrators.jl:54                       │    62400    │       1       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:172                 │    28480    │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:159                 │    20528    │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:257                 │    16128    │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/newtons_method.jl:60            │    12096    │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:143                 │    11392    │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:219                 │    10592    │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:167                 │    9360     │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:351                 │    7328     │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:227                 │    7280     │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:166                 │    6240     │       0       │
│ ClimaAtmos.jl/examples/hybrid/staggered_nonhydrostatic_model.jl:344 │    5376     │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:225                 │    4160     │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:185                 │    4096     │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:211                 │    3504     │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:216                 │    3456     │       0       │
└─────────────────────────────────────────────────────────────────────┴─────────────┴───────────────┘

Sometimes (I'm not sure when or how, despite using `Profile.clear()` and `Profile.clear_malloc_data()`) Coverage picks up allocations from load time, so we can ignore those lines. However, imex_ark.jl:351 is inside `step!`, and it is the leading allocator over anything in ClimaAtmos.
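
For reference, a sketch of that malloc-tracking workflow (directory names and `run_workload` are placeholders), assuming the process was started with `julia --track-allocation=user`:

```julia
# In the tracked process: run the workload once to compile, discard those
# counts, then run again so only runtime allocations are recorded.
using Profile
run_workload()                # placeholder for the actual step!/solve! calls
Profile.clear_malloc_data()   # drop allocations attributed to compilation
run_workload()
exit()                        # *.mem files are written next to the sources on exit

# In a fresh session: aggregate the per-line counts with Coverage.jl.
using Coverage
results = Coverage.analyze_malloc(["src", "examples"])  # sorted by allocated bytes
```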

@charleskawczynski (Member) commented Oct 5, 2022

This build seems to have properly filtered out the load times:

[ Info: allocs_perf_target_unthreaded: 10 unique allocating sites, 410944 total bytes
┌─────────────────────────────────────────────────────────────────────┬─────────────┬───────────────┐
│ <file>:<line number>                                                │ Allocations │ Allocations % │
│                                                                     │   (bytes)   │    (xᵢ/∑x)    │
├─────────────────────────────────────────────────────────────────────┼─────────────┼───────────────┤
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:398                 │   338464    │      82       │
│ ClimaAtmos.jl/examples/hybrid/staggered_nonhydrostatic_model.jl:330 │    43008    │      10       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:257                 │    16128    │       4       │
│ ClimaTimeSteppers/y3D2E/src/solvers/newtons_method.jl:60            │    12096    │       3       │
│ ClimaTimeSteppers/y3D2E/src/integrators.jl:103                      │     464     │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:384                 │     448     │       0       │
│ ClimaTimeSteppers/y3D2E/src/integrators.jl:101                      │     144     │       0       │
│ ClimaTimeSteppers/y3D2E/src/integrators.jl:106                      │     96      │       0       │
│ ClimaTimeSteppers/y3D2E/src/integrators.jl:102                      │     64      │       0       │
│ ClimaTimeSteppers/y3D2E/src/integrators.jl:138                      │     32      │       0       │
└─────────────────────────────────────────────────────────────────────┴─────────────┴───────────────┘

This points to ClimaTimeSteppers 0.2.4.

@charleskawczynski (Member) commented:

Opened CTS#68

bors bot added a commit that referenced this issue Oct 5, 2022
898: Suppress allocation table r=charleskawczynski a=charleskawczynski

This PR suppresses the allocation table job to reduce our CI times, since it takes almost twice as long as the other jobs. For now, we can use the allocation tests in the flame script to help ensure that we keep an eye on allocations. Most of the remaining allocations (at least in the performance target) seem to be due to allocations in dependencies. I'll open issues for each of those items from the table (related issue #686).

Co-authored-by: Charles Kawczynski <[email protected]>
@simonbyrne (Member, Author) commented:

The main remaining item is making the GC deterministic. We had to revert the initial attempt at this in #821, as the heuristic it was using wasn't correct: it seems that the values reported by `Sys.free_memory` and `Sys.total_memory` don't take into account the cgroup limits imposed by Slurm.

We can query the current memory cgroup of a process by:

```sh
MEM_CGROUP=$(sed -n 's/.*:memory:\(.*\)/\1/p' /proc/$$/cgroup)
```

(`$$` expands to the current PID)

and then query the limits:

```sh
cat /sys/fs/cgroup/memory/$MEM_CGROUP/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/$MEM_CGROUP/memory.usage_in_bytes
```

The question is how this works with multiple tasks and MPI launchers.
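
A sketch of the same query from Julia (cgroup-v1 paths, matching the shell commands above; cgroup v2 exposes `memory.max` / `memory.current` instead):

```julia
# Read the cgroup memory limit/usage for this process, which
# Sys.free_memory()/Sys.total_memory() currently ignore (cgroup-v1 layout).
function cgroup_memory()
    cgroup = match(r".*:memory:(.*)", read("/proc/self/cgroup", String)).captures[1]
    dir = joinpath("/sys/fs/cgroup/memory", lstrip(cgroup, '/'))
    limit = parse(Int, readchomp(joinpath(dir, "memory.limit_in_bytes")))
    usage = parse(Int, readchomp(joinpath(dir, "memory.usage_in_bytes")))
    return (; limit, usage)
end
```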

@simonbyrne (Member, Author) commented Oct 6, 2022

For reference, the problem is that libuv (which Julia uses) doesn't yet support cgroups; however, support was added recently: libuv/libuv#3744, JuliaLang/julia#46796

@charleskawczynski (Member) commented:

@simonbyrne, is this still an issue? Also, I recently noticed that the solve function first calls `step!`, then GC, then `solve!`. I assume this is to collect allocations made while compiling methods? If so, one thing I realized is that this doesn't (necessarily) call the callbacks. We could (and perhaps should) add a `trigger_callbacks` function and call it before GC-ing; a sketch is below.
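
A hypothetical sketch of that suggestion (`trigger_callbacks!` does not exist; all names are illustrative):

```julia
# Hypothetical: exercise the callbacks during warm-up so their compilation
# garbage is also collected before the production steps begin.
function warmup_then_solve!(integrator)
    step!(integrator)               # compile the tendency/step machinery
    trigger_callbacks!(integrator)  # hypothetical helper: compile callbacks too
    GC.gc()                         # collect compilation-time allocations
    return solve!(integrator)
end
```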
