node: cm: don't share containerMap instances between managers #128657
Conversation
/test pull-crio-cgroupv1-node-e2e-resource-managers

/cc @SergeyKanzhelev

/lgtm

LGTM label has been added. Git tree hash: 8228c9743572fdfa082c8ca60f8edf45a9f95c48
Since the GA graduation of memory manager in kubernetes#128517 we are sharing the initial container map across managers. The intention of this sharing was not to actually share a data structure, but to:
1. save the relatively expensive relisting from the runtime
2. have all the managers share a consistent view - even though the chance of misalignment tends to be tiny
The unwanted side effect, though, is that all the managers now race to modify a shared, non-thread-safe data structure. The fix is to clone (deepcopy) the computed map when passing it to each manager. This restores the old semantics of the code. This raises the topic of managers possibly going out of sync, since each of them maintains a private view of the world. The risk is real, yet this is how the code worked for most of its lifetime, so the plan is to look at this and evaluate possible improvements later on.
Signed-off-by: Francesco Romani <[email protected]>
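For illustration only, here is a minimal, self-contained Go sketch of the pattern the commit message describes: a simplified container map with a deep-copy clone, handed out per manager instead of shared. All names below (containerMap, containerRef, clone, manager) are illustrative assumptions, not the actual kubelet code.

```go
// Hedged sketch: a simplified stand-in for the kubelet's container map and
// the "clone per manager" fix. None of these names are the real kubelet API.
package main

import "fmt"

// containerRef identifies which pod/container a runtime container ID belongs to.
type containerRef struct {
	podUID        string
	containerName string
}

// containerMap maps container IDs to references. Plain Go maps are not safe
// for concurrent writers, which is why sharing one instance is a problem.
type containerMap map[string]containerRef

// clone returns an independent copy so each manager can mutate its own view.
func (cm containerMap) clone() containerMap {
	out := make(containerMap, len(cm))
	for id, ref := range cm {
		out[id] = ref
	}
	return out
}

type manager struct {
	name       string
	containers containerMap // private view, written to by this manager only
}

func main() {
	// Built once from the (relatively expensive) runtime relist.
	initial := containerMap{
		"cid-1": {podUID: "pod-a", containerName: "app"},
	}

	// Each manager gets its own deep copy: this restores the old semantics,
	// since no two managers ever write to the same map instance.
	cpuMgr := manager{name: "cpu", containers: initial.clone()}
	memMgr := manager{name: "memory", containers: initial.clone()}

	fmt.Println(cpuMgr.name, len(cpuMgr.containers), memMgr.name, len(memMgr.containers))
}
```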
Force-pushed from 4b6bfae to 2a99bfc
wanted to add another test to exercise the new workhorse

/lgtm

LGTM label has been added. Git tree hash: 5a434c477fb4d8dac0226d0e5266de301a06845b

/test pull-crio-cgroupv1-node-e2e-resource-managers
/triage accepted
again pretty evident from the context

The addition of the …
/approve

/test pull-kubernetes-integration
failure seems to be unrelated
@ffromani why are any of those managers writing to this map?

because of the reconciliation loops, and to keep their copy up to date when containers come and go

oh, I see. This is just an initial list. I'm surprised this is the first time we see it panic.
/lgtm
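To make the reconciliation point above concrete, here is a hedged sketch of a reconciliation pass keeping a manager's private copy current as containers come and go. The types and names (manager, reconcile, runningContainer) are invented for illustration and are not the kubelet's API.

```go
// Hedged sketch of a reconciliation pass over a manager's private map.
// The types and names are illustrative only, not the kubelet's API.
package main

import "fmt"

type containerRef struct {
	podUID        string
	containerName string
}

type runningContainer struct {
	id, podUID, name string
}

type manager struct {
	containers map[string]containerRef // the manager's own copy of the map
}

// reconcile adds entries for containers that appeared and drops entries for
// containers that went away, keeping the private view up to date.
func (m *manager) reconcile(running []runningContainer) {
	seen := make(map[string]bool, len(running))
	for _, c := range running {
		seen[c.id] = true
		if _, ok := m.containers[c.id]; !ok {
			m.containers[c.id] = containerRef{podUID: c.podUID, containerName: c.name}
		}
	}
	for id := range m.containers {
		if !seen[id] {
			delete(m.containers, id)
		}
	}
}

func main() {
	m := &manager{containers: map[string]containerRef{}}
	m.reconcile([]runningContainer{{id: "cid-1", podUID: "pod-a", name: "app"}})
	m.reconcile(nil) // the container went away, so the private copy shrinks
	fmt.Println(len(m.containers)) // 0
}
```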
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dims, ffromani, SergeyKanzhelev The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
previously, each manager used its own instance of the map returned by … When we GA'd the memory manager we removed the feature gate and the duplicated call to …
As pointed out in kubernetes#128657 (comment) Signed-off-by: Francesco Romani <[email protected]>
What type of PR is this?
/kind bug
/kind failing-test
/kind regression
What this PR does / why we need it:
Since the GA graduation of memory manager in #128517 we are sharing the initial container map across managers.
The intention of this sharing was not to actually share a data structure, but to:
1. save the relatively expensive relisting from the runtime
2. have all the managers share a consistent view - even though the chance of misalignment tends to be tiny
The unwanted side effect, though, is that all the managers now race to modify a shared, non-thread-safe data structure.
The fix is to clone (deepcopy) the computed map when passing it to each manager. This restores the old semantics of the code.
This raises the topic of managers possibly going out of sync, since each of them maintains a private view of the world. This risk is real, yet very likely tiny, and this is how the code worked for most of its lifetime, so the plan is to look at this and evaluate possible improvements later on.
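As a minimal illustration of the failure mode described above (again a standalone sketch, not kubelet code), two goroutines writing to the same plain Go map can abort the process with the runtime's concurrent-map-write fault, and `go run -race` reports the data race explicitly:

```go
// Standalone sketch of the failure mode: unsynchronized writers sharing one
// plain Go map. This can abort with "fatal error: concurrent map writes".
package main

import "sync"

func main() {
	shared := map[string]string{} // one instance shared by both writers

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				shared["key"] = "value" // racy write to the shared map
			}
		}()
	}
	wg.Wait()
}
```

Giving each writer its own copy, as the clone-based fix does, removes the shared state and therefore the race.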
Which issue(s) this PR fixes:
Fixes #128638
Special notes for your reviewer:
N/A
Does this PR introduce a user-facing change?