Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hotfix(core): don't hold on Data #3926

Merged
merged 3 commits into from
Nov 14, 2024
Merged

hotfix(core): don't hold on Data #3926

merged 3 commits into from
Nov 14, 2024

Conversation

Wondertan
Copy link
Member

@Wondertan Wondertan commented Nov 8, 2024

The Data field of core.ResultSignedBlock was retained even after we constructed the respective header and moved on. This was due to the header consisting of pointers pointing to fields of core.ResultSignedBlock, retaining the whole structure, including the Data field. A BN, by default, has a header store cache of size 4096; thus, holding on Data with 8 MB blocks took around 32GiB of RAM.

Initially, we believed that the issue was with JSON unmarshalling. However, we were wrong about this being the whole story, as while profiles did confirm that the allocations did originate there, the allocated data wasn't cleaned up. Kudos to @rach-id for helping us figuring this out!

This is a hotfix and is meant to be replaced with a better solution described on TODO. Also, the origin of this bug is yet to be confirmed by @rach-id by testing RAM usage with disabled header cache.


🎱mb certified

@Wondertan Wondertan self-assigned this Nov 8, 2024
@Wondertan Wondertan added the kind:fix Attached to bug-fixing PRs label Nov 8, 2024
@Wondertan Wondertan force-pushed the core/dangling-pointer branch from b798cab to 445992a Compare November 8, 2024 02:06
Copy link
Member

@rach-id rach-id left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Wondertan
Copy link
Member Author

Wondertan commented Nov 8, 2024

The idea is to make hotfix for this in a nearest patch, as releasing grpc in core and here is gonna take a bunch of time.

@rach-id
Copy link
Member

rach-id commented Nov 8, 2024

The idea is to make hotfix for this in a nearest patch, as releasing grpc in core and here is gonna take a bunch of time.

Makes sense 👍 Then I guess applying the change in #3926 (comment) is better to avoid messing with memory ourselves. Instead, make the copy and let the GC do its job without us forcing the freeing of memory. Also, GC might collect additional objects that might be linked to SignedBlock instead of collecting just the Data.

renaynay
renaynay previously approved these changes Nov 8, 2024
Copy link
Member

@renaynay renaynay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as per #3926 (comment), I am okay with this hotfix until @rach-id 's solution lands.

Copy link
Member

@renaynay renaynay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also okay with the explicit approach in #3926 (comment)

@Wondertan
Copy link
Member Author

Wondertan commented Nov 12, 2024

@rach-id, I applied your suggestion to get you as a coauthor on the PR. It doesn't matter which solution to go with for the hotfix

@Wondertan Wondertan enabled auto-merge (squash) November 12, 2024 10:50
@Wondertan Wondertan changed the title fix(core): don't hold on Data fix(core): copy ValidatorSet and Commit to not hold on Data Nov 12, 2024
@Wondertan Wondertan disabled auto-merge November 12, 2024 11:29
vgonkivs
vgonkivs previously approved these changes Nov 12, 2024
@Wondertan
Copy link
Member Author

Well, the suggestion has a bug. It does not copy the RawHeader, so the problem preserves.

Copying every field of RawHeader is gonna be bulky, so I will reverse to the old fix

@rach-id
Copy link
Member

rach-id commented Nov 12, 2024

You dont need to copy the raw header. Did you try running it and see the memory usage?

@Wondertan
Copy link
Member Author

You need to copy it as well, because it is also a value that we create pointer to, keeping the ResultSignedBlock around

@Wondertan Wondertan enabled auto-merge (squash) November 12, 2024 11:50
@rach-id
Copy link
Member

rach-id commented Nov 12, 2024

If I remember correctly, I tried that change and ran a bridge node, and it didn't have the issue.

Could you double check please in case I missed something when running?

@Wondertan
Copy link
Member Author

Wondertan commented Nov 12, 2024

Well, this is how memory and pointers work. We don't need to test that.

@Wondertan
Copy link
Member Author

But I also simply don't have time to run the test during the devcon, and we should go with the solution we know works well, which is setting empty data.

@Wondertan Wondertan changed the title fix(core): copy ValidatorSet and Commit to not hold on Data hotfix(core): don't hold on Data Nov 14, 2024
@Wondertan Wondertan disabled auto-merge November 14, 2024 09:53
@Wondertan Wondertan enabled auto-merge (squash) November 14, 2024 09:53
@Wondertan Wondertan merged commit 14cd8d6 into main Nov 14, 2024
24 checks passed
@Wondertan Wondertan deleted the core/dangling-pointer branch November 14, 2024 10:56
Wondertan added a commit that referenced this pull request Nov 15, 2024
The `Data` field of `core.ResultSignedBlock` was retained even after we constructed the respective header and moved on. This was due to the header consisting of pointers pointing to fields of `core.ResultSignedBlock`, retaining the whole structure, including the `Data` field. A BN, by default, has a header store cache of size 4096; thus, holding on `Data` with 8 MB blocks took around 32GiB of RAM.

Initially, we believed that the issue was with JSON unmarshalling. However, we were wrong about this being the whole story, as while profiles did confirm that the allocations did originate there, the allocated data wasn't cleaned up.  Kudos to @rach-id for helping us figuring this out!

This is a hotfix and is meant to be replaced with a better solution described on TODO. Also, the origin of this bug is yet to be confirmed by @rach-id by testing RAM usage with disabled header cache. 

---

🎱mb certified
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:fix Attached to bug-fixing PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants