libct/cg/sd/v1: Fix unnecessary freeze/thaw #3143
Conversation
Force-pushed from 1ab12c9 to db65ebc.
libcontainer/cgroups/systemd/v1.go (outdated diff):

```go
return
devAllow, e := getUnitTypeProperty(m.dbus, unitName, getUnitType(unitName), "DeviceAllow")
if e == nil {
	rv := reflect.ValueOf(devAllow.Value.Value())
```
Do you mean that `devAllow.Value == dbus.MakeVariant([]deviceAllowEntry{})` did not work? Or that your version is more readable? (or both?)
> Do you mean that `devAllow.Value == dbus.MakeVariant([]deviceAllowEntry{})` did not work? Or that your version is more readable? (or both?)
Well, it doesn't seem to work for me. Does it work for you? In that case there is something wrong with my setup 🤔
And, I do think my version is less readable. 😅
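A plausible reason the direct comparison misbehaves — not confirmed in this thread — is that Go cannot compare interface values whose dynamic type is a slice, and `dbus.Variant` wraps one. A minimal sketch, using `[]string` as a stand-in for `deviceAllowEntry`:

```go
package main

import (
	"fmt"
	"reflect"

	"github.com/godbus/dbus/v5"
)

func main() {
	a := dbus.MakeVariant([]string{})
	b := dbus.MakeVariant([]string{})

	// Variant is a struct holding an interface{}; comparing two
	// variants that wrap slices panics at run time, because slices
	// are not comparable ("comparing uncomparable type []string").
	func() {
		defer func() { fmt.Println("recovered:", recover()) }()
		fmt.Println(a == b)
	}()

	// The reflect-based check avoids that: it only asks whether the
	// wrapped value is a slice of length zero.
	rv := reflect.ValueOf(a.Value())
	fmt.Println(rv.Kind() == reflect.Slice && rv.Len() == 0) // true
}
```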
I think I improved the readability a notch in #3148:
```go
devAllow, e := getUnitTypeProperty(m.dbus, unitName, unitType, "DeviceAllow")
if e == nil {
	if rv := reflect.ValueOf(devAllow.Value.Value()); rv.Kind() == reflect.Slice && rv.Len() == 0 {
		needsFreeze = false
		needsThaw = false
		return
	}
}
```
Guess I can do that instead if the others agree, even though the line is a bit too long. I'll update the PR.
If that's too long, what about this:
```go
if e == nil {
	rv := reflect.ValueOf(devAllow.Value.Value())
	if rv.Kind() == reflect.Slice && rv.Len() == 0 {
		needsFreeze = false
		needsThaw = false
		return
	}
}
```
To me it reads slightly better than the switch/case (although I agree that none of the versions we've come up with so far is super readable).
I guess what we need first is a unit test that would create a unit (of various types) with a particular property, get the property, and check it is as expected. The last commit of #3082 is an integration test (…).
Yeah, some tests would be OK I guess. It is fairly hard to test, though. But keep reading.
Well, it only verifies that there are no errors with the devices. It does not, however, check whether the slice is frozen during the update; it should not be frozen at all during the update. The test only verifies the state afterwards. To test this, one can simply keep a watcher loop running while the tests run, as sketched below.
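A minimal sketch of such a watcher, assuming cgroup v1 with the freezer controller mounted at the usual path; the scope path here is hypothetical:

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

func main() {
	// Hypothetical cgroup v1 freezer path of the scope under test;
	// adjust to the slice/scope the update touches.
	const state = "/sys/fs/cgroup/freezer/system.slice/test.scope/freezer.state"

	for {
		data, err := os.ReadFile(state)
		if err == nil {
			if s := strings.TrimSpace(string(data)); s != "THAWED" {
				// FREEZING or FROZEN was observed during the update.
				fmt.Println(time.Now().Format(time.RFC3339Nano), s)
			}
		}
		time.Sleep(time.Millisecond)
	}
}
```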
As seen here, only the scope/container is frozen with this branch, while on master both the pod/slice and the container/scope are frozen. Does that make sense to you @kolyshkin?
Hmm. Is something wrong with CI, or?
Yup; see https://githubstatus.com/. Seems over now; do a force push to trigger CI.
Or we can close/reopen to re-kick CI (usually that works).
Does this fix kubernetes/kubernetes#104280?
Yes. But as I have said a few times already, for k8s a flag for disabling this would be better and simpler. I guess we (in k8s, not here in runc) have to discuss whether we should avoid using …
Force-pushed from 2644106 to a45ffdd.
@odinuge I was playing with the test case and ended up doing #3148. The other thing I did was slightly simplify the deviceAllow check (#3143 (comment)). We can either use #3148, or cherry-pick the test case from it to here.
```diff
@@ -518,3 +518,10 @@ func addCpuset(cm *dbusConnManager, props *[]systemdDbus.Property, cpus, mems st
 	}
 	return nil
 }
+
+func getUnitType(unitName string) string {
```
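The hunk is cut off here. A plausible body — a sketch consistent with getUnitName, which gives slices a `.slice` suffix and everything else a `.scope` suffix, not the verbatim PR code — might be:

```go
import "strings"

// getUnitType infers the systemd unit type from the unit name suffix.
// Per the review comment below, it must stay in sync with getUnitName.
func getUnitType(unitName string) string {
	if strings.HasSuffix(unitName, ".slice") {
		return "Slice"
	}
	return "Scope"
}
```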
Can you please add something like
// This code should be in sync with getUnitName.
Ack 👍
I can cherry-pick the test commit here if they should be in the same PR. I prefer to keep this PR; otherwise I cannot update and change the PR you created based off this one, which makes it a bit cumbersome to work on. Having a separate PR with the test works as well.
Please do (otherwise the CI is semi-useless).
Something like …
Once ready and merged, we'll do a backport to 1.0 and will try to get 1.0.2 released ASAP.
@odinuge your concern about this code path being slow is valid; it is also quite complicated. I was looking into how to make it fast and simple -- PTAL #3151. Surely a simple "SkipFreeze" flag would be faster, but this seems lightweight enough to not cause performance concerns.
Yeah, #3151 would probably work, but I don't like going around systemd the way that approach does. Systemd should be the source of truth in this case, as it is now. (Again, I am mostly a fan of a flag for k8s; otherwise we have to look for alternatives. k8s does these calls very often, and with an increasing number of pods this will only increase further. Same for short-lived containers. A flag such as …
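For what such a flag could look like — a sketch only; the field name `SkipFreezeOnSet` and its placement in libcontainer's configs are assumptions, not the runc API at the time of this thread:

```go
// Hypothetical knob, sketched for the discussion above.
type Resources struct {
	// ... existing resource fields elided ...

	// SkipFreezeOnSet makes the systemd cgroup driver skip the
	// freeze/thaw around SetUnitProperties when the caller knows the
	// device rules stay permissive before and after the update.
	SkipFreezeOnSet bool
}
```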
Force-pushed from a45ffdd to 9dbf27b.
This fixes the behavior intended to avoid freezing containers/control groups when it is not necessary. This is important for end users of libcontainer who rely on the no-freeze behavior. The previous implementation would always get an error when trying to get DevicePolicy from the Unit via dbus, since the Unit interface doesn't contain DevicePolicy. Signed-off-by: Odin Ugedal <[email protected]>
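To illustrate the distinction the commit message describes — a sketch using go-systemd's dbus bindings; the unit name `foo.scope` is hypothetical:

```go
package main

import (
	"fmt"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
)

func main() {
	conn, err := systemdDbus.New()
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// DevicePolicy lives on the typed interface
	// (org.freedesktop.systemd1.Scope here), not on the generic
	// org.freedesktop.systemd1.Unit interface, so the first call
	// errors out while the second one succeeds.
	if _, err := conn.GetUnitProperty("foo.scope", "DevicePolicy"); err != nil {
		fmt.Println("Unit interface:", err)
	}
	p, err := conn.GetUnitTypeProperty("foo.scope", "Scope", "DevicePolicy")
	fmt.Println(p, err)
}
```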
Force-pushed from 9dbf27b to efbf0ae.
/hold I am running node e2e + systemd + crio tests locally using the runc binary built from this PR. The first run was successful, but I will do more runs to verify that this PR indeed fixes kubernetes/kubernetes#104280 (comment).
It is the vendoring of runc into k8s that is important for fixing kubernetes/kubernetes#104280, not only the runc binary used by the CRI runtime.
True, but the failures I was seeing in CI were specifically coming from the runc binary, indicating that CRI-O was able to start the container but it ended up in a paused state. /hold cancel
Yes, but that is due to kubelet (via this code) freezing …
Oh OK, that means the success I am seeing in node e2e (2 out of 2 runs so far) is due to me updating runc from rc95 to your latest PR (which includes v1.0.1).
```go
// a non-existent unit returns default properties,
// and settings in (2) are the defaults.
```
The tests in this PR do not check whether this is true. I have added those checks last night, plus another test case -- @odinuge can you please re-cherry-pick the last commit from https://github.com/kolyshkin/runc/tree/fix-freeze-before-set
Ack 😄 Test looks good!
So, we have to admit that this is a kludge already. We need a way to figure out whether systemd is going to do a disruptive device-rules update. Any way to figure that out is fine as long as it works. Checking unit properties is one way; checking devices.list is another. Sure, #3151 assumes that devices.list in the kernel and systemd's Device* properties are in sync. I think that assumption is valid (if not, it would mean systemd is not doing its job, ignoring device property updates or some such).
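The devices.list check mentioned above could look roughly like this (cgroup v1 only; the path is illustrative): an unrestricted cgroup exposes the single wildcard rule `a *:* rwm`.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// Illustrative path; the real code would build it from the
	// unit's cgroup path.
	data, err := os.ReadFile("/sys/fs/cgroup/devices/system.slice/devices.list")
	if err != nil {
		panic(err)
	}
	// A single "a *:* rwm" line means every device is allowed, so a
	// device-rules update cannot be disruptive and the freeze can
	// be skipped.
	allowAll := strings.TrimSpace(string(data)) == "a *:* rwm"
	fmt.Println("allow-all:", allowAll)
}
```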
Yes, this is also a way to figure out whether a freeze is needed, and it's fine as long as it works. The only thing I don't like about it is that it requires an additional knob to set. Generally speaking, if there is a way to achieve a goal without adding an explicit knob, and it works reliably and fast, I think such a way is better. Currently I see #3151 as such a way.
Yeah, that solution does also work, and yeah, this is all pretty kludgy, I agree about that 😅 k8s and the runc binary also work differently, with different purposes, making it even more kludgy. But yeah, then we can instead just look at the provided resources, and if the devices allow access to all, drop the freeze. If we don't rely on systemd, and don't want a separate knob, I think checking devices would be better and simpler to understand, together with better performance. E.g. something like:

```go
oldDevices := []specs.LinuxDeviceCgroup{
	{
		Allow:  true,
		Access: "rwm",
	},
}
newDevices := []specs.LinuxDeviceCgroup{
	{
		Allow:  true,
		Access: "rwm",
	},
}
if allowsAll(oldDevices) && allowsAll(newDevices) {
	// Skip freezing
}
```

(although we might need to poke systemd or cgroupfs in …) Any thoughts about that instead? Wouldn't that work for …
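`allowsAll` is referenced above but not defined in the comment; a minimal sketch of what such a hypothetical helper might check:

```go
import (
	"strings"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// allowsAll reports whether the rules contain a wildcard entry that
// allows rwm access to every device (sketched to match the example
// above; not code from the PR).
func allowsAll(devices []specs.LinuxDeviceCgroup) bool {
	for _, d := range devices {
		if d.Allow && d.Major == nil && d.Minor == nil &&
			strings.Contains(d.Access, "r") &&
			strings.Contains(d.Access, "w") &&
			strings.Contains(d.Access, "m") {
			return true
		}
	}
	return false
}
```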
We can also change this to a single dbus call by using …
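The reference is truncated; if the idea is go-systemd's GetUnitTypeProperties, which fetches all of a typed unit's properties in one round trip (an assumption, given the "tens of properties" remark further down), a sketch might be:

```go
package main

import (
	"fmt"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
)

func main() {
	conn, err := systemdDbus.New()
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// One call returns every property of the typed unit, so
	// DevicePolicy and DeviceAllow can be read from the same map
	// instead of issuing two separate property gets.
	props, err := conn.GetUnitTypeProperties("system.slice", "Slice")
	if err != nil {
		panic(err)
	}
	fmt.Println(props["DevicePolicy"], props["DeviceAllow"])
}
```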
Add a test for freezeBeforeSet, checking various scenarios including those that were failing before the fix in the previous commit. [v2: add more cases, add a check before creating a unit.] Signed-off-by: Kir Kolyshkin <[email protected]>
Force-pushed from efbf0ae to 987ed97.
It's an interesting idea, which I also considered but gave up on, since this returns all (tens of, I guess) properties in a … OTOH, if the call itself is relatively heavy, the cost of the saved call may outweigh the GC overhead, and this might be worth doing. I guess the way to check which way is better is to benchmark both implementations, using runtime.GC() inside the benchmarked code to make sure GC overhead is accounted for. IOW, something like:

```go
func BenchmarkXXX(b *testing.B) {
	// ... any prep code
	runtime.GC()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// ... whatever we test
	}
	runtime.GC()
}
```
Well, this is conceptually the same as what we do here and/or in #3151. In terms of #3151, …
Ha! This is easy peasy -- become a runc maintainer, and you'll be able to shove your commits into contributors' PRs 🕺🏻 On a serious note, though, I think I misunderstood you; I thought you were against cherry-picking a commit. 😀
OK, I ended up writing this benchmark for … Here are the results (first line is this PR, second is #3151): …
To have some reference points, I did a few quick benchmarks: …
From that we can estimate the Set() overhead for the kubelet case. With the current approach (#3082 (commit f2db879) + #3143): slowdown from 200 to 500 µs (150% worse). With the devices.list approach (#3151): slowdown from 200 to 210 µs (5% worse).
Ah, here's the benchmarking code: https://gist.github.com/kolyshkin/20e8f43fedde4989033941be8bef9fc6. It needs some tweaks to measure Set(); I was using these placeholders for the Set measurements:

```go
func (m *legacyManager) freezeBeforeSetNo(unitName string, r *configs.Resources) (needsFreeze, needsThaw bool, err error) {
	return false, false, nil
}

func (m *legacyManager) freezeBeforeSetYes(unitName string, r *configs.Resources) (needsFreeze, needsThaw bool, err error) {
	return true, true, nil
}
```
Nice. I guess that is more or less as predicted. Freezing with a lot of procs, e.g. 110 pods on 100 logical Linux CPUs, I assume the overhead will be quite noticeable, and probably quite large (so people should not conclude that the unconditional freeze works OK).
Well, looking at it that way, all the solutions are conceptually the same. But with this one we would only look at the OCI data to see if we can avoid freezing, without any external calls.
Problem is, the runc cgroup manager does not have persistent storage, and kubelet instantiates a new cgroup manager to call Set().
Yeah, but we do supply the cgroup config when creating the manager. But yeah, if the only way to avoid this "freeze dance" is for k8s to keep the manager around, I guess we can do that; it will require some more code, though. (Not familiar with how the runc binary works in detail, btw.)
It's (usually) a short-lived binary, so we also instantiate a new manager for e.g. …
You can supply anything, meaning this solution would be a skip-freeze flag in disguise. Oh well, computers are tough. 😩
And that is exactly what I want! 😄
Yeee, they are... Ack. +1 for that one.
My testing shows that #3082 (commit f2db879) doesn't work as expected. This causes big issues in k8s.
PTAL @kolyshkin 😄
Finally back again after having some time off, so I haven't had time to look at it before now.