Skip to content
This repository has been archived by the owner on Jan 20, 2022. It is now read-only.

[Pal] Loudly fail if depleted POOL_SIZE in slab memory allocator #1768

Merged
merged 1 commit into from
Aug 19, 2020

Conversation

dimakuv
Copy link

@dimakuv dimakuv commented Aug 18, 2020

Description of the changes

If PAL memory allocator depleted the pre-allocated memory pool of POOL_SIZE, then previously PAL would allocate more pages (for its internal purposes e.g. for event objects), but LibOS had no idea about it. This led to LibOS allocating pages at same addresses and ultimately to subtle memory corruptions.

Fixing it requires complete re-write of PAL memory allocation. This PR simply modifies slab allocator to loudly terminate.

It also fixes small bug of lock-unprotected search in memory pool.

See gramineproject/gramine#50 for additional context.

How to test this PR?

This was found (again) on a very intensive PyTorch workload with many threads created and destroyed many-many times. This is hard to reproduce.


This change is Reviewable

dimakuv pushed a commit that referenced this pull request Aug 19, 2020
Copy link
Contributor

@boryspoplawski boryspoplawski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv)


Pal/src/slab.c, line 60 at r1 (raw file):

#if 1
    /* FIXME: At this point, we depleted the pre-allocated memory pool of POOL_SIZE. Previously,

Please add this link somewhere in here: https://github.com/oscarlab/graphene/issues/1072

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv)


Pal/src/slab.c, line 45 at r1 (raw file):

/* This function is protected by g_slab_mgr_lock. */
static inline void* __malloc(int size) {
    assert(!SYSTEM_LOCKED());

please also assert that g_slab_mgr_lock is taken (and remove the comment)

Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @boryspoplawski and @mkow)


Pal/src/slab.c, line 45 at r1 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

please also assert that g_slab_mgr_lock is taken (and remove the comment)

Done. However, you understood it wrong: !SYSTEM_LOCKED() verifies that g_slab_mgr_lock was released, and I also added lock/unlock below when working on a global g_bump. For our beautiful macros of SYSTEM_LOCKED, SYSTEM_LOCK and SYSTEM_UNLOCK see the beginning of this file.

The confusion stems from me. In our Slack discussion, I wanted to re-work the caller of this function (slabmgr.h) to not release the lock, but it is used and assumed in couple other places in Graphene, so I gave up and decided to re-lock here. In the future, slab manager logic should be de-macrofied and re-factored.


Pal/src/slab.c, line 60 at r1 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

Please add this link somewhere in here: https://github.com/oscarlab/graphene/issues/1072

Done.

boryspoplawski
boryspoplawski previously approved these changes Aug 19, 2020
Copy link
Contributor

@boryspoplawski boryspoplawski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r2.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (1 more required), "fixup! " found in commit messages' one-liners (waiting on @mkow)

mkow
mkow previously approved these changes Aug 19, 2020
Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r2.
Reviewable status: all files reviewed, all discussions resolved, "fixup! " found in commit messages' one-liners


Pal/src/slab.c, line 45 at r1 (raw file):

However, you understood it wrong: !SYSTEM_LOCKED() verifies that g_slab_mgr_lock was released

I rather thought that these are two distinct locks, because the comment used a different name and that it should be locked, not unlocked.

Anyways, looks good now.

yamahata
yamahata previously approved these changes Aug 19, 2020
Copy link
Contributor

@yamahata yamahata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r2.
Reviewable status: all files reviewed, all discussions resolved, "fixup! " found in commit messages' one-liners

@dimakuv dimakuv dismissed stale reviews from yamahata, mkow, and boryspoplawski via 51b8f73 August 19, 2020 17:12
Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, all discussions resolved, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @boryspoplawski, @mkow, and @yamahata)


Pal/src/slab.c, line 45 at r1 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

However, you understood it wrong: !SYSTEM_LOCKED() verifies that g_slab_mgr_lock was released

I rather thought that these are two distinct locks, because the comment used a different name and that it should be locked, not unlocked.

Anyways, looks good now.

Oops, Jenkins caught my stupid mistake. It is wrong to write asserts on "is global lock released?", because some other thread may grab this lock in the meantime, and assert fails. I removed this assert and simply left a comment.

mkow
mkow previously approved these changes Aug 19, 2020
Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3.
Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (1 more required), "fixup! " found in commit messages' one-liners


Pal/src/slab.c, line 45 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Oops, Jenkins caught my stupid mistake. It is wrong to write asserts on "is global lock released?", because some other thread may grab this lock in the meantime, and assert fails. I removed this assert and simply left a comment.

Hah, lol, this was quite obvious but I also missed it :)

If PAL memory allocator depleted the pre-allocated memory pool of
POOL_SIZE, then previously PAL would allocate more pages (for its
internal purposes e.g. for event objects), but LibOS had no idea
about it. This led to LibOS allocating pages at same addresses and
ultimately to subtle memory corruptions.

Fixing it requires complete re-write of PAL memory allocation.
This commit simply modifies slab allocator to loudly terminate.
It also fixes small bug of lock-unprotected search in memory pool.
@dimakuv dimakuv force-pushed the dimakuv/loudly-fail-when-pal-mempool-depleted branch from 51b8f73 to 0e98fc4 Compare August 19, 2020 18:39
Copy link
Contributor

@yamahata yamahata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3.
Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel)

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

Copy link
Contributor

@boryspoplawski boryspoplawski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

@mkow mkow merged commit 0e98fc4 into master Aug 19, 2020
@mkow mkow deleted the dimakuv/loudly-fail-when-pal-mempool-depleted branch August 19, 2020 21:37
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants