-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Runc init process step on DISK Sleep Status When freezen cgroup #2753
Comments
runc init parent process maybe runc exec xxx yyy 15758 is in sleep status ..... there another process about runc update resource 15801 si in sleep status ...... we strace 15801 .... it is wait cgroup freezer change "FREEZING" to last status ....
|
exec init + set cgroup freeze will make cgroup step on FREEZING status as code note said ,,, the final status really never come,, loop run infinitely ....
|
Linux kernel 4.19.160-0419160-generic |
If you can come up with a repro that does not involve docker, kubernetes etc. that'd be great. |
as code note said ,,, the final status really never come,, loop run infinitely .... we also should reset the status ,,, not block getState forever.....
|
we also found an description about this on "Red Hat bugzilla", https://bugzilla.redhat.com/show_bug.cgi?id=1903228 I think there also meet this issues ..... |
we used runc rc-10 with docker 19 ,,, and never hit "DISK Sleep" until we upgrade to docker20 with runc rc92 , we try to replace runc rc92 to rc-10 , and nothing happen ,,,, I think rc92 update command add freezen cgroup step ,, increase the probability |
i write freezer.state file and use runc exec to reproduce the failure package main
import (
"flag"
"fmt"
"github.com/cyphar/filepath-securejoin"
"github.com/pkg/errors"
"golang.org/x/sys/unix"
"io/ioutil"
"log"
"math/rand"
"os"
"os/exec"
"strings"
"time"
)
func main() {
var path string
var cid string
flag.StringVar(&path, "p", "", "")
flag.StringVar(&cid, "c", "", "")
flag.Parse()
if path == "" || cid == "" {
panic("path is empty")
}
for i := 0; i < 15; i++ {
go func() {
for {
start := time.Now()
cmd := exec.Command("bash", "exec.sh", cid)
cmd.Stdout = os.Stdout
err := cmd.Run()
if err != nil {
log.Print(err)
}
fmt.Println(time.Now().Sub(start))
num := rand.Intn(100)
time.Sleep(time.Duration(num) * time.Millisecond)
}
}()
}
for i := 0; i < 4; i++ {
for {
err := freezeAndThawed(path)
if err != nil {
log.Printf("freeze and thawed %v", err)
} else {
log.Printf("freeze and thawed success")
}
time.Sleep(time.Duration(rand.Intn(400)+200) * time.Millisecond)
}
}
}
func freezeAndThawed(path string) (err error) {
err = freezeWithTimeout(path, string(Frozen))
if err != nil {
return err
}
err = freezeWithTimeout(path, string(Thawed))
if err != nil {
return err
}
err = freezeWithTimeout(path, string(Thawed))
if err != nil {
return err
}
return err
}
func freezeWithTimeout(path, freezer string) (err error) {
ch := make(chan error)
go func() {
ch <- freeze(path, freezer)
}()
for {
select {
case <-time.After(10 * time.Second):
log.Printf("time out to set freeze %v %v", path, freezer)
case err := <-ch:
log.Printf("set path %v,%v end %v", path, freezer, err)
return err
}
}
}
type FreezerState string
const (
Undefined FreezerState = ""
Frozen FreezerState = "FROZEN"
Thawed FreezerState = "THAWED"
)
func getFreezeWithTimeout(path string) (state FreezerState, err error) {
type r struct {
state FreezerState
err error
}
ch := make(chan r)
go func() {
state, err := getFreeze(path)
ch <- r{state, err}
}()
for {
select {
case <-time.After(10 * time.Second):
log.Printf("time out to get freeze %v", path)
case r := <-ch:
log.Printf("get from path %v,%v %v", path, r.state, r.err)
return r.state, r.err
}
}
}
func getFreeze(path string) (state FreezerState, err error) {
for {
state, err := ReadFile(path, "freezer.state")
if err != nil {
// If the kernel is too old, then we just treat the freezer as
// being in an "undefined" state.
if os.IsNotExist(err) || errors.Is(err, unix.ENODEV) {
err = nil
}
return Undefined, err
}
switch strings.TrimSpace(state) {
case "THAWED":
return Thawed, nil
case "FROZEN":
return Frozen, nil
case "FREEZING":
// Make sure we get a stable freezer state, so retry if the cgroup
// is still undergoing freezing. This should be a temporary delay.
time.Sleep(1 * time.Millisecond)
continue
default:
return Undefined, fmt.Errorf("unknown freezer.state %q", state)
}
}
}
func freeze(path string, freezer string) (err error) {
for {
// In case this loop does not exit because it doesn't get the expected
// state, let's write again this state, hoping it's going to be properly
// set this time. Otherwise, this loop could run infinitely, waiting for
// a state change that would never happen.
if err := WriteFile(path, "freezer.state", freezer); err != nil {
return err
}
state, err := getFreezeWithTimeout(path)
if err != nil {
return err
}
if strings.TrimSpace(string(state)) == freezer {
break
}
time.Sleep(1 * time.Millisecond)
}
return nil
}
func WriteFile(dir, file, data string) error {
if dir == "" {
return errors.Errorf("no directory specified for %s", file)
}
path, err := securejoin.SecureJoin(dir, file)
if err != nil {
return err
}
if err := ioutil.WriteFile(path, []byte(data), 0700); err != nil {
return errors.Wrapf(err, "failed to write %q to %q", data, path)
}
return nil
}
func ReadFile(dir, file string) (string, error) {
if dir == "" {
return "", errors.Errorf("no directory specified for %s", file)
}
path, err := securejoin.SecureJoin(dir, file)
if err != nil {
return "", err
}
data, err := ioutil.ReadFile(path)
return string(data), err
} exec.sh
|
when exec execute concurrently with freezen will reproduce the failure |
@niuyueyang1996 @gaopeiliang thanks, reproduced locally (on cgroup v1 -- apparently v2 is not vulnerable), Ubuntu 20.04, kernel 5.4.0-42-generic. |
Simplified repro in bash # Run exec and pause/resume in parallel
(
while :; do
$RUNC exec $CID printf .
done
) &
PID_EXEC=$!
(
while :; do
$RUNC pause $CID && $RUNC resume $CID && date
done
) &
PID_FUF=$!
sleep 1m
kill -9 $PID_FUF
kill -9 $PID_EXEC
runc exec $CID true && echo "success" Usually 10s or so is enough to repro. When runc init is stuck in D state, here's how it looks like:
|
Since the reporter saw it during the concurrent use of |
The trace is just showing that it's frozen via the freezer cgroup (hence the creatively-named Unfortunately if you have concurrent So we have to either mutex
We have to do this because systemd will modify the cgroup configuration even if you don't touch the device policy, meaning you get spurious errors in containers when writing to |
Ah, you're also hitting it with stock |
Nah, I think what happens is a new process (from runc exec) is added to the cgroup while we're trying to freeze it (for whatever reason -- pause or update), and in this case we're required to repeat the freeze attempt as per the last paragraph of freezer-subsystem.txt: It's important to note that freezing can be incomplete. In that case we return EBUSY. This means that some tasks in the cgroup are busy doing something that prevents us from completely freezing the cgroup at this time. After EBUSY, the cgroup will remain partially frozen -- reflected by freezer.state reporting "FREEZING" when read. The state will remain "FREEZING" until one of these things happens: |
Ah okay. But we do retry it -- that's what the |
So, my current plan is:
|
Ah I just realised the mistake I made -- yeah, If you can find a way to eliminate (2) that sounds good but based on my memory from working on this last year, the only way to eliminate (2) is to fix systemd (which I did take a look at but their cgroup management code is quite complicated to say the least). There's a reason we only do this for systemd and not for cgroupfs -- our cgroupfs manager handles devices cgroup updates properly because of #2391. |
No, this is against EINTR bug caused by go 1.14+ runtime (which sends SIGURG to improve preemptiveness, and for some reason writing to cgroupv1 is not SA_RESTARTable, see discussion at #2258 (comment)). This was fixed in golang 1.15 (see golang/go#38033). In addition, despite what the kernel doc says, I haven't seen |
That's what the diff --git a/libcontainer/cgroups/fs/freezer.go b/libcontainer/cgroups/fs/freezer.go
index 11cb1646f482..828bae2bead9 100644
--- a/libcontainer/cgroups/fs/freezer.go
+++ b/libcontainer/cgroups/fs/freezer.go
@@ -62,7 +62,7 @@ func (s *FreezerGroup) GetStats(path string, stats *cgroups.Stats) error {
}
func (s *FreezerGroup) GetState(path string) (configs.FreezerState, error) {
- for {
+ for retries := 0; retries < 5; retries++ {
state, err := fscommon.ReadFile(path, "freezer.state")
if err != nil {
// If the kernel is too old, then we just treat the freezer as
@@ -78,12 +78,16 @@ func (s *FreezerGroup) GetState(path string) (configs.FreezerState, error) {
case "FROZEN":
return configs.Frozen, nil
case "FREEZING":
- // Make sure we get a stable freezer state, so retry if the cgroup
- // is still undergoing freezing. This should be a temporary delay.
+ // Try to get a stable freezer state, so retry if the cgroup
+ // is still undergoing freezing. This should be a temporary delay,
+ // but since this function is called in Set() it's possible that
+ // FREEZING is permanent -- so we limit the number of retries.
time.Sleep(1 * time.Millisecond)
continue
default:
return configs.Undefined, fmt.Errorf("unknown freezer.state %q", state)
}
}
+ // We ran out of retries -- return FREEZING.
+ return configs.Freezing, nil
}
diff --git a/libcontainer/configs/cgroup_linux.go b/libcontainer/configs/cgroup_linux.go
index aada5d62f199..01eb1a4b7c92 100644
--- a/libcontainer/configs/cgroup_linux.go
+++ b/libcontainer/configs/cgroup_linux.go
@@ -11,6 +11,7 @@ const (
Undefined FreezerState = ""
Frozen FreezerState = "FROZEN"
Thawed FreezerState = "THAWED"
+ Freezing FreezerState = "FREEZING"
)
type Cgroup struct { |
I think we should not return FREEZING as a state as no one expects it. Here's my attempt at solving this: diff --git a/libcontainer/cgroups/fs/freezer.go b/libcontainer/cgroups/fs/freezer.go
index 11cb1646..dc265d7f 100644
--- a/libcontainer/cgroups/fs/freezer.go
+++ b/libcontainer/cgroups/fs/freezer.go
@@ -28,33 +28,40 @@ func (s *FreezerGroup) Apply(path string, d *cgroupData) error {
func (s *FreezerGroup) Set(path string, cgroup *configs.Cgroup) error {
switch cgroup.Resources.Freezer {
- case configs.Frozen, configs.Thawed:
- for {
- // In case this loop does not exit because it doesn't get the expected
- // state, let's write again this state, hoping it's going to be properly
- // set this time. Otherwise, this loop could run infinitely, waiting for
- // a state change that would never happen.
- if err := fscommon.WriteFile(path, "freezer.state", string(cgroup.Resources.Freezer)); err != nil {
+ case configs.Frozen:
+ // As per kernel doc (freezer-subsystem.txt), if FREEZING
+ // is seen, userspace should either retry or thaw.
+ for i := 0; i < 10; i++ {
+ if err := fscommon.WriteFile(path, "freezer.state", string(configs.Frozen)); err != nil {
return err
}
- state, err := s.GetState(path)
+ state, err := fscommon.ReadFile(path, "freezer.state")
if err != nil {
return err
}
- if state == cgroup.Resources.Freezer {
- break
+ state = strings.TrimSpace(state)
+ switch state {
+ case "FREEZING":
+ time.Sleep(time.Duration(i) * time.Millisecond)
+ continue
+ case string(configs.Frozen):
+ return nil
+ default:
+ return fmt.Errorf("unexpected state %s while freezing", strings.TrimSpace(state))
}
-
- time.Sleep(1 * time.Millisecond)
}
+ // It got stuck in FREEZING. Try to thaw it back
+ // (which will most probably succeed) and error out.
+ _ = fscommon.WriteFile(path, "freezer.state", string(configs.Thawed))
+ return errors.New("unable to freeze")
+ case configs.Thawed:
+ return fscommon.WriteFile(path, "freezer.state", string(configs.Thawed))
case configs.Undefined:
return nil
default:
return fmt.Errorf("Invalid argument '%s' to freezer.state", string(cgroup.Resources.Freezer))
}
-
- return nil
}
func (s *FreezerGroup) GetStats(path string, stats *cgroups.Stats) error { This way, GetState() does not need to be changed as it will eventually succeed. |
Now playing with paralel exec / update as it's more realistic use case. With the variation of the above solution, have the following results running my test for one minute:
I played with the retry parameters (sleep time, number of retries) and some values appears to result in better rates than other, but obviously we can't entirely eliminate the possibility (of being unable to freeze) using this method. Ultimately, it needs to be solved in the kernel (which has all the controls, and I guess it's already done in cgroup v2). In runc, we can avoid updating devices in case no change in device perms are requested. We could also introduce a lock between freeze and exec (so no new execs are performed while freeze is in progress), but that won't help the "internal threat" (something like a limited fork bomb inside a container). |
I'm a bit confused because the latest kernel documentation (at least the one on |
we upgrade docker-ce version to 20:10 from 19.3, And now docker component version:
when Kubernetes terminal POD , we found container process step into DISK Sleep status ,,,, docker inspect container hang forever ,, kubelet pleg not healthy , kubelet not ready ......
when we ps container process .... we found runc init process also DISK Sleep ....
The text was updated successfully, but these errors were encountered: