-
Notifications
You must be signed in to change notification settings - Fork 261
[Pal] Loudly fail if depleted POOL_SIZE in slab memory allocator #1768
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv)
Pal/src/slab.c, line 60 at r1 (raw file):
#if 1 /* FIXME: At this point, we depleted the pre-allocated memory pool of POOL_SIZE. Previously,
Please add this link somewhere in here: https://github.com/oscarlab/graphene/issues/1072
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv)
Pal/src/slab.c, line 45 at r1 (raw file):
/* This function is protected by g_slab_mgr_lock. */ static inline void* __malloc(int size) { assert(!SYSTEM_LOCKED());
please also assert that g_slab_mgr_lock
is taken (and remove the comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @boryspoplawski and @mkow)
Pal/src/slab.c, line 45 at r1 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
please also assert that
g_slab_mgr_lock
is taken (and remove the comment)
Done. However, you understood it wrong: !SYSTEM_LOCKED()
verifies that g_slab_mgr_lock
was released, and I also added lock/unlock below when working on a global g_bump
. For our beautiful macros of SYSTEM_LOCKED
, SYSTEM_LOCK
and SYSTEM_UNLOCK
see the beginning of this file.
The confusion stems from me. In our Slack discussion, I wanted to re-work the caller of this function (slabmgr.h
) to not release the lock, but it is used and assumed in couple other places in Graphene, so I gave up and decided to re-lock here. In the future, slab manager logic should be de-macrofied and re-factored.
Pal/src/slab.c, line 60 at r1 (raw file):
Previously, boryspoplawski (Borys Popławski) wrote…
Please add this link somewhere in here: https://github.com/oscarlab/graphene/issues/1072
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r2.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (1 more required), "fixup! " found in commit messages' one-liners (waiting on @mkow)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r2.
Reviewable status: all files reviewed, all discussions resolved, "fixup! " found in commit messages' one-liners
Pal/src/slab.c, line 45 at r1 (raw file):
However, you understood it wrong: !SYSTEM_LOCKED() verifies that g_slab_mgr_lock was released
I rather thought that these are two distinct locks, because the comment used a different name and that it should be locked, not unlocked.
Anyways, looks good now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r2.
Reviewable status: all files reviewed, all discussions resolved, "fixup! " found in commit messages' one-liners
51b8f73
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 files reviewed, all discussions resolved, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski, @mkow, and @yamahata)
Pal/src/slab.c, line 45 at r1 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
However, you understood it wrong: !SYSTEM_LOCKED() verifies that g_slab_mgr_lock was released
I rather thought that these are two distinct locks, because the comment used a different name and that it should be locked, not unlocked.
Anyways, looks good now.
Oops, Jenkins caught my stupid mistake. It is wrong to write asserts on "is global lock released?", because some other thread may grab this lock in the meantime, and assert fails. I removed this assert and simply left a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3.
Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (1 more required), "fixup! " found in commit messages' one-liners
Pal/src/slab.c, line 45 at r1 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Oops, Jenkins caught my stupid mistake. It is wrong to write asserts on "is global lock released?", because some other thread may grab this lock in the meantime, and assert fails. I removed this assert and simply left a comment.
Hah, lol, this was quite obvious but I also missed it :)
If PAL memory allocator depleted the pre-allocated memory pool of POOL_SIZE, then previously PAL would allocate more pages (for its internal purposes e.g. for event objects), but LibOS had no idea about it. This led to LibOS allocating pages at same addresses and ultimately to subtle memory corruptions. Fixing it requires complete re-write of PAL memory allocation. This commit simply modifies slab allocator to loudly terminate. It also fixes small bug of lock-unprotected search in memory pool.
51b8f73
to
0e98fc4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3.
Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! all files reviewed, all discussions resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! all files reviewed, all discussions resolved
Description of the changes
If PAL memory allocator depleted the pre-allocated memory pool of
POOL_SIZE
, then previously PAL would allocate more pages (for its internal purposes e.g. for event objects), but LibOS had no idea about it. This led to LibOS allocating pages at same addresses and ultimately to subtle memory corruptions.Fixing it requires complete re-write of PAL memory allocation. This PR simply modifies slab allocator to loudly terminate.
It also fixes small bug of lock-unprotected search in memory pool.
See gramineproject/gramine#50 for additional context.
How to test this PR?
This was found (again) on a very intensive PyTorch workload with many threads created and destroyed many-many times. This is hard to reproduce.
This change is