
Reduce memory consumption for buffer pool #1150

Merged (7 commits) · Aug 22, 2024
Conversation

@pgrete (Collaborator) commented Aug 8, 2024

PR Summary

Tries to reduce the memory footprint of the object pool by precalculating the number of buffers required for each type.
Also fixes an integer overflow for very large buffer sizes (combined with the default of 200 buffers).
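
For context, a minimal sketch of the kind of 32-bit wraparound such a fix addresses; the numbers, types, and names are illustrative assumptions, not the actual Parthenon code:

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Hypothetical large buffer: 2^29 elements (e.g., a big block with
  // several variables); the exact numbers are made up for illustration.
  const std::uint32_t elements_per_buffer = 512u * 512u * 512u * 4u;
  const std::uint32_t default_pool_size = 200u;

  // The 32-bit product wraps: the true value (~1.07e11) exceeds 2^32 - 1.
  // Here it happens to wrap to exactly 0, since 2^29 * 200 = 25 * 2^32.
  const std::uint32_t wrapped = elements_per_buffer * default_pool_size;

  // Widening one operand to 64 bits before multiplying preserves the value.
  const std::uint64_t correct =
      static_cast<std::uint64_t>(elements_per_buffer) * default_pool_size;

  std::cout << "32-bit: " << wrapped << "\n"   // 0
            << "64-bit: " << correct << "\n";  // 107374182400
}
```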

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary API changes at the top of the PR (plus optionally with an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@pgrete (Collaborator, Author) commented Aug 8, 2024

@lroberts36 I came up with a different approach (not using a user input variable and keeping the allocations in large chunks).
Is there anything wrong with this approach (i.e., is something going wrong in extending an existing pool this way)?
I haven't tested this yet and just want to make sure this is conceptually possible given the object pool approach.

@lroberts36 (Collaborator) commented:

@pgrete: I think this will work, but it is maybe not ideal. First, I think this counts all buffers even for variables that are not allocated, so when running with pack_size = -1 this will allocate buffers for every field whether or not it is allocated. Second, the number of buffers allocated will depend on the partition of the block list, since this routine operates per MeshData.

One other thing I remembered is that the buffer sizes are not calculated as accurately as they could be. Rather than calculate the actual index range, it just uses a heuristic for the size that is guaranteed to be larger but is possibly too large by a factor of a couple.
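
For illustration, a toy calculation (not Parthenon's actual sizing code; all extents are assumptions) of how a conservative heuristic can overshoot the exact index-range size by a factor of a couple:

```cpp
#include <iostream>

int main() {
  const int ny = 16, nz = 16;  // transverse interior extents of a block (assumed)
  const int ng = 2;            // number of ghost zones (assumed)

  // Exact size of one x-face exchange: ng slabs of the interior cross-section.
  const long exact = static_cast<long>(ng) * ny * nz;  // 512

  // A "guaranteed large enough" heuristic: ng slabs of the ghost-padded
  // cross-section, covering edges and corners too.
  const long heuristic =
      static_cast<long>(ng) * (ny + 2 * ng) * (nz + 2 * ng);  // 800

  std::cout << "exact: " << exact << ", heuristic: " << heuristic << " ("
            << static_cast<double>(heuristic) / exact << "x)\n";  // 1.5625x
}
```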

@pgrete (Collaborator, Author) commented Aug 9, 2024

> @pgrete: I think this will work, but it is maybe not ideal. First, I think this counts all buffers even for variables that are not allocated, so when running with pack_size = -1 this will allocate buffers for every field whether or not it is allocated. Second, the number of buffers allocated will depend on the partition of the block list, since this routine operates per MeshData.

How about the updated version? I originally changed it to only count buffers for allocated variables, but then reverted to counting buffers for all vars, since IIRC we always allocate the comm buffers (as they're also used to communicate the allocation status of sparse vars, or did this change?).

Regarding the second point, I don't think this is going to be a huge issue in practice, because the extreme case (a small pack_size with many small blocks per rank) only occurs on host runs, for which memory allocations are much less of an issue than on device runs.
Similarly, in cases with many buffers I can imagine that even the default 200 could easily have been exhausted, so that all subsequent buffer requests were serial. In the worst case they remain serial now (with the slight overhead of additional initial allocations, though fewer than 200, since I think it's safe to assume that in practice several buffers share the same size), and in the best case the overhead is even reduced, as more buffers are created in one go.

Finally, I think this is generally a tricky issue to optimize, which is why I'm going for a default that covers most use cases rather than introducing a parameter that an (inexperienced) end user might easily miss (or misconfigure).

What do you think?

> One other thing I remembered is that the buffer sizes are not calculated as accurately as they could be. Rather than calculate the actual index range, it just uses a heuristic for the size that is guaranteed to be larger but is possibly too large by a factor of a couple.

I don't think it's an issue. The differences should be small-ish (especially for large block sizes and for very small block sizes the overhead in terms of active to passive cells is huge anyway).

@pgrete (Collaborator, Author) commented Aug 9, 2024

Okay, I need more eyes/input here as I don't understand what's going on.
The current code crashes.
The stack trace points back to line 75 (the lambda function in the pool_map), which eventually goes back through some object_pool Get() to the buf.Allocate() call in line 155 of bvals_utils.hpp.
How are those related? I'm starting to wonder if my picture of how the object pools work is completely off.
Also, if I count the number of buffers required twice (i.e. doing nbufs[buf_size] += 2) everything works as usual again.

@pgrete (Collaborator, Author) commented Aug 12, 2024

Alright, I think I figured out (and understand) more now.
The original BuildBoundaryBufferSubset never actually added objects to the pool; it only created a pool with a callback lambda that would eventually add objects to the pool dynamically, right?

So I updated the logic so that the pool creation remains (with a default number of buffers to be added dynamically, scaled by the number of packs per rank) and added a routine that immediately adds objects to the pools (matching the required number of buffers for the given MeshData container).
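
To illustrate the scheme, here is a minimal sketch; BufferPool, Get, Reserve, and the chunk size are assumptions for illustration, not the actual parthenon object_pool interface:

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <utility>
#include <vector>

// Sketch only: a pool that grows dynamically via a callback, plus a new
// reservation step that eagerly tops it up to a precalculated count.
struct BufferPool {
  std::size_t buf_size = 0;
  std::vector<std::vector<double>> free_list;
  std::size_t total = 0;                   // buffers ever created in this pool
  std::function<void(BufferPool &)> grow;  // callback that adds a chunk

  std::vector<double> Get() {
    if (free_list.empty()) grow(*this);  // dynamic-growth path
    auto buf = std::move(free_list.back());
    free_list.pop_back();
    return buf;
  }

  // The new step: eagerly add buffers so that 'required' exist in total.
  void Reserve(std::size_t required) {
    while (total < required) {
      free_list.emplace_back(buf_size);
      ++total;
    }
  }
};

int main() {
  BufferPool pool;
  pool.buf_size = 126;  // one of the sizes from the log below
  pool.grow = [](BufferPool &p) {
    const std::size_t chunk = 16;  // dynamic growth granularity
    for (std::size_t i = 0; i < chunk; ++i) {
      p.free_list.emplace_back(p.buf_size);
      ++p.total;
    }
  };
  pool.Reserve(64);        // precalculated requirement for this MeshData
  auto buf = pool.Get();   // served from the reserve; no dynamic growth
  std::cout << "got buffer of size " << buf.size() << "; pool holds "
            << pool.total << " buffers in total\n";
}
```

In this sketch, Reserve covers the precalculated demand up front, so Get only falls back to the grow callback when the estimate is exceeded (the "Dynamically adding" lines in the logs below).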

For ./example/advection/parthinput.advection this now results in the following output:

  • pack_size=-1:
Setting up a new pool for buffers of size 18
Reserving 64 new buffers of size 18 to pool with 0 buffers because 64 are required in total.
Setting up a new pool for buffers of size 126
Reserving 64 new buffers of size 126 to pool with 0 buffers because 64 are required in total.
Reserving 32 new buffers of size 18 to pool with 64 buffers because 96 are required in total.
Reserving 56 new buffers of size 126 to pool with 64 buffers because 120 are required in total.
Setting up a new pool for buffers of size 378
Reserving 16 new buffers of size 378 to pool with 0 buffers because 16 are required in total.
Reserving 208 new buffers of size 18 to pool with 96 buffers because 304 are required in total.
Reserving 256 new buffers of size 126 to pool with 120 buffers because 376 are required in total.
Reserving 32 new buffers of size 378 to pool with 16 buffers because 48 are required in total.
### Warning in Mesh::Initialize
The number of MeshBlocks increased more than twice during initialization.
More computing power than you expected may be required.
...
This run is a restart: 0

cycle=0 time=0.0000000000000000e+00 dt=1.7578125000000000e-03 zone-cycles/wsec_step=0.00e+00 wsec_total=1.17e-02 wsec_step=2.72e+00 zone-cycles/wsec=0.00e+00 wsec_AMR=0.00e+00
---------------------- Current Mesh structure ----------------------
Number of Trees = 16
Total number of MeshBlocks = 88
Number of physical refinement levels = 2
Number of logical  refinement levels = 2
  Physical level = 0 (logical level = 0): 4 MeshBlocks, cost = 4
  Physical level = 1 (logical level = 1): 36 MeshBlocks, cost = 36
  Physical level = 2 (logical level = 2): 48 MeshBlocks, cost = 48
--------------------------------------------------------------------
cycle=1 time=1.7578125000000000e-03 dt=1.7578125000000000e-03 zone-cycles/wsec_step=1.66e+06 wsec_total=2.53e-02 wsec_step=1.36e-02 zone-cycles/wsec=1.66e+06 wsec_AMR=3.88e-06
cycle=2 time=3.5156250000000001e-03 dt=1.7578125000000000e-03 zone-cycles/wsec_step=3.72e+06 wsec_total=3.14e-02 wsec_step=6.06e-03 zone-cycles/wsec=3.72e+06 wsec_AMR=2.81e-06
Reserving 24 new buffers of size 18 to pool with 304 buffers because 328 are required in total.
Reserving 24 new buffers of size 126 to pool with 376 buffers because 400 are required in total.
cycle=3 time=5.2734375000000003e-03 dt=1.7578125000000000e-03 zone-cycles/wsec_step=3.30e+06 wsec_total=4.42e-02 wsec_step=6.83e-03 zone-cycles/wsec=1.76e+06 wsec_AMR=5.99e-03
...
cycle=565 time=9.9316406249998979e-01 dt=1.7578125000000000e-03 zone-cycles/wsec_step=2.51e+06 wsec_total=6.96e+00 wsec_step=1.27e-02 zone-cycles/wsec=2.48e+06 wsec_AMR=1.38e-04
Reserving 20 new buffers of size 18 to pool with 460 buffers because 480 are required in total.
Reserving 8 new buffers of size 126 to pool with 532 buffers because 540 are required in total.
cycle=566 time=9.9492187499998974e-01 dt=1.7578125000000000e-03 zone-cycles/wsec_step=2.55e+06 wsec_total=6.98e+00 wsec_step=1.25e-02 zone-cycles/wsec=1.37e+06 wsec_AMR=1.07e-02
  • pack_size=1:
Setting up a new pool for buffers of size 18
Reserving 4 new buffers of size 18 to pool with 0 buffers because 4 are required in total.
Setting up a new pool for buffers of size 126
Reserving 4 new buffers of size 126 to pool with 0 buffers because 4 are required in total.
Dynamically adding 16 buffers of size 18 to a pool with current size 4 and future size 20
Dynamically adding 16 buffers of size 126 to a pool with current size 4 and future size 20
Dynamically adding 16 buffers of size 18 to a pool with current size 20 and future size 36
Dynamically adding 16 buffers of size 126 to a pool with current size 20 and future size 36
Dynamically adding 16 buffers of size 18 to a pool with current size 36 and future size 52
Dynamically adding 16 buffers of size 126 to a pool with current size 36 and future size 52
Dynamically adding 16 buffers of size 18 to a pool with current size 52 and future size 68
Dynamically adding 16 buffers of size 126 to a pool with current size 52 and future size 68
Setting up a new pool for buffers of size 378
Reserving 2 new buffers of size 378 to pool with 0 buffers because 2 are required in total.
Dynamically adding 16 buffers of size 126 to a pool with current size 68 and future size 84
Dynamically adding 16 buffers of size 126 to a pool with current size 84 and future size 100
Dynamically adding 16 buffers of size 18 to a pool with current size 68 and future size 84
...
Dynamically adding 16 buffers of size 126 to a pool with current size 372 and future size 388
### Warning in Mesh::Initialize
The number of MeshBlocks increased more than twice during initialization.
More computing power than you expected may be required.
...
This run is a restart: 0

cycle=0 time=0.0000000000000000e+00 dt=1.7578125000000000e-03 zone-cycles/wsec_step=0.00e+00 wsec_total=1.47e-02 wsec_step=2.73e+00 zone-cycles/wsec=0.00e+00 wsec_AMR=0.00e+00
---------------------- Current Mesh structure ----------------------
Number of Trees = 16
Total number of MeshBlocks = 88
Number of physical refinement levels = 2
Number of logical  refinement levels = 2
  Physical level = 0 (logical level = 0): 4 MeshBlocks, cost = 4
  Physical level = 1 (logical level = 1): 36 MeshBlocks, cost = 36
  Physical level = 2 (logical level = 2): 48 MeshBlocks, cost = 48
--------------------------------------------------------------------
Dynamically adding 28 buffers of size 378 to a pool with current size 4 and future size 32
Dynamically adding 28 buffers of size 378 to a pool with current size 32 and future size 60
cycle=1 time=1.7578125000000000e-03 dt=1.7578125000000000e-03 zone-cycles/wsec_step=9.20e+05 wsec_total=3.92e-02 wsec_step=2.45e-02 zone-cycles/wsec=9.20e+05 wsec_AMR=4.52e-06
cycle=2 time=3.5156250000000001e-03 dt=1.7578125000000000e-03 zone-cycles/wsec_step=1.48e+06 wsec_total=5.44e-02 wsec_step=1.52e-02 zone-cycles/wsec=1.48e+06 wsec_AMR=5.91e-06
Dynamically adding 16 buffers of size 18 to a pool with current size 308 and future size 324
Dynamically adding 16 buffers of size 126 to a pool with current size 388 and future size 404
Dynamically adding 16 buffers of size 18 to a pool with current size 324 and future size 340
cycle=3 time=5.2734375000000003e-03 dt=1.7578125000000000e-03 zone-cycles/wsec_step=1.47e+06 wsec_total=7.82e-02 wsec_step=1.53e-02 zone-cycles/wsec=9.48e+05 wsec_AMR=8.48e-03
...
cycle=565 time=9.9316406249998979e-01 dt=1.7578125000000000e-03 zone-cycles/wsec_step=1.37e+06 wsec_total=2.00e+01 wsec_step=2.31e-02 zone-cycles/wsec=1.36e+06 wsec_AMR=2.34e-04
Dynamically adding 16 buffers of size 18 to a pool with current size 468 and future size 484
Dynamically adding 16 buffers of size 126 to a pool with current size 532 and future size 548
cycle=566 time=9.9492187499998974e-01 dt=1.7578125000000000e-03 zone-cycles/wsec_step=1.43e+06 wsec_total=2.00e+01 wsec_step=2.22e-02 zone-cycles/wsec=9.09e+05 wsec_AMR=1.28e-02

In total, we have 25 separate allocation events (all reserving, none dynamic) over the runtime for pack_size=-1 and 70 (4 reserving and 66 dynamic) for pack_size=1.

How does that solution look?
If people are happy, I'll clean up the PR (i.e., add a changelog entry and remove the debug std::cerr output).

Finally, a small related question: what's going on with this line? (I understand the first part of the conditional, but not the second -- or, more specifically, under what circumstances is the second one false?)
if (pmb->gid == nb.gid && nb.offsets.IsCell()) buf_size = 0;

@lroberts36 (Collaborator) left a comment:

This approach makes sense to me.

Comment on lines 95 to 97:

```cpp
std::cerr << "Reserving " << new_buffers_req << " new buffers of size " << buf_size
          << " to pool with " << pool.NumBuffersInPool() << " buffers because "
          << nbufs.at(buf_size) << " are required in total.\n";
```
Maybe add a flag to possibly suppress the std::cerr outputs?
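
For illustration, one minimal shape such a flag could take (PARTHENON_DEBUG_POOL_OUTPUT is a hypothetical name, not an existing Parthenon macro):

```cpp
// Hypothetical compile-time guard around the debug output; not the actual
// Parthenon code. A runtime parameter would be an alternative shape.
#ifdef PARTHENON_DEBUG_POOL_OUTPUT
  std::cerr << "Reserving " << new_buffers_req << " new buffers of size " << buf_size
            << " to pool with " << pool.NumBuffersInPool() << " buffers because "
            << nbufs.at(buf_size) << " are required in total.\n";
#endif
```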

@lroberts36 (Collaborator) commented:

> Alright, I think I figured out (and understand) more now. The original BuildBoundaryBufferSubset never actually added objects to the pool; it only created a pool with a callback lambda that would eventually add objects to the pool dynamically, right?

Yes, this is correct.

> How does that solution look? If people are happy, I'll clean up the PR (i.e., add a changelog entry and remove the debug std::cerr output).

This solution looks reasonable to me. Sorry, I saw the comment about removing the std::cerr output only after reviewing.

> Finally, a small related question: what's going on with this line? (I understand the first part of the conditional, but not the second -- or, more specifically, under what circumstances is the second one false?) if (pmb->gid == nb.gid && nb.offsets.IsCell()) buf_size = 0;

Multigrid logic requires blocks to send messages to themselves (since the same block can show up on two multigrid levels). This doesn't require any data transfer, so the message size can be zero. It is essentially just a flag indicating that the block is done being used on one level and can be used on the next.
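
Annotated for illustration (the comments are an assumed reading based on the explanation above, not from the source):

```cpp
if (pmb->gid == nb.gid &&   // the block is sending a message to itself
    nb.offsets.IsCell()) {  // and the neighbor offset is the block itself
                            // (the same block on another multigrid level)
  buf_size = 0;             // no payload needed; the zero-size message acts
                            // as a "done on this level" flag
}
```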

@pgrete (Collaborator, Author) commented Aug 22, 2024

I cleaned up the PR. This is ready for a second review now.

@Yurlungur (Collaborator) left a comment:

This seems reasonable to me. I appreciate the new detailed comments.

@Yurlungur merged commit 43380cd into develop on Aug 22, 2024
53 checks passed