Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

base: Refactor base/random to avoid memory leak #1945

Merged
merged 6 commits into from
Mar 5, 2025

Conversation

BobbyRBruce
Copy link
Member

@BobbyRBruce BobbyRBruce commented Jan 23, 2025

There was a memory leak occurring prior to this fix which can be reproduced with:

scons build/VEGA_X86/gem5.opt -j `nproc`

/usr/bin/time -l ./build/VEGA_X86/gem5.opt configs/example/ruby_gpu_random_test.py --WB_L2 --test-length 5000000 --num-dmas 0

The /usr/bin/time -l will output the maximum memory consumption of this process. On my machines It was getting to 27GB.

With this change gem5 now peaks at around 6GB of memory usage.

This incorporates two changes:

  1. Instances of static RandomPtr were not creating singular RNG
    instances like assumed. Ergo in casees where this was depended
    upon to stop the over creation of RNGs, multiple were being
    created. The RandomPtr object was static but the RNG it pointed
    towards was always changing on every use. This has been fixed.
  2. The RNG instance list has been refactored to avoid accidental
    inclusion of the same RNG more than once and better checks have
    been added as to explicitly remove RNGs from this instance list
    when they are deleted.

Tests updated accordingly.

@BobbyRBruce BobbyRBruce added the base Regards gem5's base code. Found in "src/base" label Jan 23, 2025
@powerjg
Copy link
Contributor

powerjg commented Jan 23, 2025

Two things:

  1. This wasn't a "leak" (in that it wasn't detected by valgrind --leak-check) because pointers to the random numbers were never out of scope. However, it was a "leak" in the sense that the heap was growing indefinitely when running the ruby_gpu_random_test. This was found using the massif tool in valgrind.
  2. tagging the people involved in the previous refactor: @aperais @odanrc. Tagging @giactra for visibility.

Copy link
Contributor

@mattsinc mattsinc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell this does not specifically involve a problem with the GPU, but rather VEGA is just being used to demonstrate a specific problem. So I'll let @powerjg lead the review, but let me know if you need something from me.

if (instances == nullptr)
instances = new Instances();

auto randpoint = r ? std::shared_ptr<Random>(r, [](Random*){})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a call to the constructor with a deleter? I think this line deserves some documentation of why we provide a nop deleter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BobbyRBruce make sure you address this concern :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR as it currently stands doesn't provide a nop deleter (I did in a previous iteration but I simplified it). However I have added a comment explaining this if-else setting of randpoint to alleviate any confusion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is purpose of the second argument to the constructor of shared_ptr here? It looks like an empty deleter, but maybe I'm not understanding correctly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a lambda function which servers as the deletion operator pointer. The purpose here is for the case where the deletion of the pointer is managed elsewhere, which in this case it is as this is.

@powerjg
Copy link
Contributor

powerjg commented Feb 13, 2025

@BobbyRBruce can we get this in soon?

BobbyRBruce and others added 3 commits February 20, 2025 06:17

Verified

This commit was signed with the committer’s verified signature.
BobbyRBruce Bobby R. Bruce
There was a memory leak occuring prior to this fix which can be
reproduced with:

```sh
scons build/VEGA_X86/gem5.opt -j12 `nproc`

/usr/bin/time -l ./build/VEGA_X86/gem5.opt configs/example/ruby_gpu_random_test.py --WB_L2 --test-length 5000000 --num-dmas 0
```

The `/usr/bin/time -l` will output the maximum memory consumption of this
process. On my machines It was getting to 27GB.

With this change gem5 now peaks at around 6GB of memory usage.

This incorporates two changes:

1) Instances of `static RandomPtr` were not creating singular RNG
   instances like assumed. Ergo in casees where this was depended
   upin to stop the over creation of RNGs, multiple were being
   creates. The `RandomPtr` object was static but the RNG it pointed
   towards was always changing on every use.
2) The RNG instance list has been refactored to avoid accidental
   inclusion of the same RNG more than once and better checks have
   been added as to explicitly remove RNGs from this instance list
   when they are deleted.

Tests updated accordingly.

Verified

This commit was signed with the committer’s verified signature.
BobbyRBruce Bobby R. Bruce
This commit attempts to fix the compilation failure by restoring
an overloaded function that was previously deleted.

Verified

This commit was signed with the committer’s verified signature.
BobbyRBruce Bobby R. Bruce

Verified

This commit was signed with the committer’s verified signature.
BobbyRBruce Bobby R. Bruce
@BobbyRBruce BobbyRBruce marked this pull request as draft February 20, 2025 14:29
@BobbyRBruce BobbyRBruce marked this pull request as ready for review February 20, 2025 14:41
if (instances == nullptr)
instances = new Instances();

auto randpoint = r ? std::shared_ptr<Random>(r, [](Random*){})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BobbyRBruce make sure you address this concern :)

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
@BobbyRBruce
Copy link
Member Author

I think i addressed all the comments and consider this PR ready for final review and merging. This should fix the Daily Tests.

@erin-le erin-le merged commit 70ea818 into gem5:develop Mar 5, 2025
38 checks passed
erin-le added a commit that referenced this pull request Mar 10, 2025

Unverified

This commit is not signed, but one or more authors requires that any commit attributed to them is signed.
There was a memory leak occurring prior to this fix which can be
reproduced with:

```sh
scons build/VEGA_X86/gem5.opt -j `nproc`

/usr/bin/time -l ./build/VEGA_X86/gem5.opt configs/example/ruby_gpu_random_test.py --WB_L2 --test-length 5000000 --num-dmas 0
```

The `/usr/bin/time -l` will output the maximum memory consumption of
this process. On my machines It was getting to 27GB.

With this change gem5 now peaks at around 6GB of memory usage.

This incorporates two changes:

1) Instances of `static RandomPtr` were not creating singular RNG
   instances like assumed. Ergo in casees where this was depended
   upon to stop the over creation of RNGs, multiple were being
   created. The `RandomPtr` object was static but the RNG it pointed
   towards was always changing on every use. This has been fixed.
2) The RNG instance list has been refactored to avoid accidental
   inclusion of the same RNG more than once and better checks have
   been added as to explicitly remove RNGs from this instance list
   when they are deleted.

Tests updated accordingly.

---------

Co-authored-by: Erin Le <ejle@ucdavis.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
base Regards gem5's base code. Found in "src/base"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants