Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runc init process step on DISK Sleep Status When freezen cgroup #2753

Closed
gaopeiliang opened this issue Jan 19, 2021 · 24 comments · Fixed by #2774
Closed

Runc init process step on DISK Sleep Status When freezen cgroup #2753

gaopeiliang opened this issue Jan 19, 2021 · 24 comments · Fixed by #2774
Milestone

Comments

@gaopeiliang
Copy link

we upgrade docker-ce version to 20:10 from 19.3, And now docker component version:

Client: Docker Engine - Community
 Version:           20.10.1
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        831ebea
 Built:             Tue Dec 15 04:35:01 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.1
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       f001486
  Built:            Tue Dec 15 04:32:57 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

when Kubernetes terminal POD , we found container process step into DISK Sleep status ,,,, docker inspect container hang forever ,, kubelet pleg not healthy , kubelet not ready ......

when we ps container process .... we found runc init process also DISK Sleep ....

root     15758 57587  0 16:39 ?        00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/452a153f1775d37f5813c1b71ff46df6e0da76ce5e8450d3df46d1de3cc97b2c/log.json --log-format json --systemd-cgroup exec --process /tmp/runc-process627966115 --detach --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/452a153f1775d37f5813c1b71ff46df6e0da76ce5e8450d3df46d1de3cc97b2c/27e5850550ae06545bde2abb0470316e95938c54ad673bb9b1e0686129d10ae1.pid 452a153f1775d37f5813c1b71ff46df6e0da76ce5e8450d3df46d1de3cc97b2c
root     15795 15758  0 16:39 ?        00:00:00 runc init
root     15801 57587  4 16:39 ?        00:12:50 runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/452a153f1775d37f5813c1b71ff46df6e0da76ce5e8450d3df46d1de3cc97b2c/log.json --log-format json --systemd-cgroup update --resources - 452a153f1775d37f5813c1b71ff46df6e0da76ce5e8450d3df46d1de3cc97b2c

# runc init process step in "D" status...
 
root@bigo:/home/xyz# cat /proc/15795/status
Name:   runc:[2:INIT]
Umask:  0022
State:  D (disk sleep)
Tgid:   15795
Ngid:   0
Pid:    15795
PPid:   15758
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 64
Groups:  
NStgid: 15795   230
NSpid:  15795   230
NSpgid: 15795   230
NSsid:  15795   230
VmPeak:   279548 kB
VmSize:   230660 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:     14876 kB
VmRSS:     14876 kB
RssAnon:            9064 kB
RssFile:            5812 kB
RssShmem:              0 kB

@gaopeiliang
Copy link
Author

runc init parent process maybe runc exec xxx yyy 15758 is in sleep status .....

there another process about runc update resource 15801 si in sleep status ......

we strace 15801 .... it is wait cgroup freezer change "FREEZING" to last status ....


newfstatat(AT_FDCWD, "/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6ee54f7c_8b7b_49a6_ad29_5fa0e1d953b3.slice/docker-452a153f1775d37f5813c1b71ff46df6e0da76ce5e8450d3df46d1de3cc97b2c.scope/freezer.state", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
openat(AT_FDCWD, "/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6ee54f7c_8b7b_49a6_ad29_5fa0e1d953b3.slice/docker-452a153f1775d37f5813c1b71ff46df6e0da76ce5e8450d3df46d1de3cc97b2c.scope/freezer.state", O_RDONLY|O_CLOEXEC) = 7
epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3228278120, u64=139778643959144}}) = 0
fcntl(7, F_GETFL)                       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 0
fstat(7, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
read(7, "FREEZING\n", 512)              = 9
read(7, "", 1527)                       = 0
epoll_ctl(4, EPOLL_CTL_DEL, 7, 0xc0005647dc) = 0
close(7)                                = 0
futex(0xc0000432c8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x5616715d98e0, FUTEX_WAIT_PRIVATE, 0, {0, 975302}) = -1 ETIMEDOUT (Connection timed out)
futex(0x5616715d5b10, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x5616715d5a10, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc0000432c8, FUTEX_WAKE_PRIVATE, 1) = 1
newfstatat(AT_FDCWD, "/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6ee54f7c_8b7b_49a6_ad29_5fa0e1d953b3.slice/docker-452a153f1775d37f5813c1b71ff46df6e0da76ce5e8450d3df46d1de3cc97b2c.scope/freezer.state", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
openat(AT_FDCWD, "/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6ee54f7c_8b7b_49a6_ad29_5fa0e1d953b3.slice/docker-452a153f1775d37f5813c1b71ff46df6e0da76ce5e8450d3df46d1de3cc97b2c.scope/freezer.state", O_RDONLY|O_CLOEXEC) = 7
epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3228278120, u64=139778643959144}}) = 0
fcntl(7, F_GETFL)                       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 0
fstat(7, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
read(7, "FREEZING\n", 512)              = 9
read(7, "", 1527)                       = 0
epoll_ctl(4, EPOLL_CTL_DEL, 7, 0xc0005647dc) = 0
close(7)                                = 0
futex(0xc0000432c8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x5616715d98e0, FUTEX_WAIT_PRIVATE, 0, {0, 975585}) = -1 ETIMEDOUT (Connection timed out)
futex(0xc0000432c8, FUTEX_WAKE_PRIVATE, 1) = 1
newfstatat(AT_FDCWD, "/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6ee54f7c_8b7b_49a6_ad29_5fa0e1d953b3.slice/docker-452a153f1775d37f5813c1b71ff46df6e0da76ce5e8450d3df46d1de3cc97b2c.scope/freezer.state", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
openat(AT_FDCWD, "/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6ee54f7c_8b7b_49a6_ad29_5fa0e1d953b3.slice/docker-452a153f1775d37f5813c1b71ff46df6e0da76ce5e8450d3df46d1de3cc97b2c.scope/freezer.state", O_RDONLY|O_CLOEXEC) = 7
epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3228278120, u64=139778643959144}}) = 0
fcntl(7, F_GETFL)                       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 0
fstat(7, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
read(7, "FREEZING\n", 512)              = 9
read(7, "", 1527)                       = 0
epoll_ctl(4, EPOLL_CTL_DEL, 7, 0xc0005647dc) = 0
close(7)       

@gaopeiliang
Copy link
Author

gaopeiliang commented Jan 26, 2021

exec init + set cgroup freeze will make cgroup step on FREEZING status

as code note said ,,, the final status really never come,, loop run infinitely ....

func (s *FreezerGroup) Set(path string, cgroup *configs.Cgroup) error {
	switch cgroup.Resources.Freezer {
	case configs.Frozen, configs.Thawed:
		for {
			// In case this loop does not exit because it doesn't get the expected
			// state, let's write again this state, hoping it's going to be properly
			// set this time. Otherwise, this loop could run infinitely, waiting for
			// a state change that would never happen.
			if err := fscommon.WriteFile(path, "freezer.state", string(cgroup.Resources.Freezer)); err != nil {
				return err
			}

			state, err := s.GetState(path)
			if err != nil {
				return err
			}
			if state == cgroup.Resources.Freezer {
				break
			}

			time.Sleep(1 * time.Millisecond)
		}
	case configs.Undefined:
		return nil
	default:
		return fmt.Errorf("Invalid argument '%s' to freezer.state", string(cgroup.Resources.Freezer))
	}

	return nil
}



@gaopeiliang
Copy link
Author

#2767

@gaopeiliang
Copy link
Author

Linux kernel 4.19.160-0419160-generic
Linux kernel 4.9.70-040970-generic

@kolyshkin
Copy link
Contributor

If you can come up with a repro that does not involve docker, kubernetes etc. that'd be great.

@gaopeiliang gaopeiliang changed the title Runc init process step on DISK Sleep Status When Kill Container Runc init process step on DISK Sleep Status When freezen cgroup Jan 27, 2021
@gaopeiliang
Copy link
Author

as code note said ,,, the final status really never come,, loop run infinitely ....

we also should reset the status ,,, not block getState forever.....

func (s *FreezerGroup) Set(path string, cgroup *configs.Cgroup) error {
	switch cgroup.Resources.Freezer {
	case configs.Frozen, configs.Thawed:
		for {
			// In case this loop does not exit because it doesn't get the expected
			// state, let's write again this state, hoping it's going to be properly
			// set this time. Otherwise, this loop could run infinitely, waiting for
			// a state change that would never happen.
			if err := fscommon.WriteFile(path, "freezer.state", string(cgroup.Resources.Freezer)); err != nil {
				return err
			}

			state, err := s.GetState(path)
			if err != nil {
				return err
			}
			if state == cgroup.Resources.Freezer {
				break
			}

			time.Sleep(1 * time.Millisecond)
		}
	case configs.Undefined:
		return nil
	default:
		return fmt.Errorf("Invalid argument '%s' to freezer.state", string(cgroup.Resources.Freezer))
	}

	return nil
}


@gaopeiliang
Copy link
Author

we also found an description about this on "Red Hat bugzilla",

https://bugzilla.redhat.com/show_bug.cgi?id=1903228

I think there also meet this issues .....

@gaopeiliang
Copy link
Author

we used runc rc-10 with docker 19 ,,, and never hit "DISK Sleep" until we upgrade to docker20 with runc rc92 , we try to replace runc rc92 to rc-10 , and nothing happen ,,,,

I think rc92 update command add freezen cgroup step ,, increase the probability

@niuyueyang1996
Copy link

niuyueyang1996 commented Jan 27, 2021

i write freezer.state file and use runc exec to reproduce the failure

package main

import (
	"flag"
	"fmt"
	"github.com/cyphar/filepath-securejoin"
	"github.com/pkg/errors"
	"golang.org/x/sys/unix"
	"io/ioutil"
	"log"
	"math/rand"
	"os"
	"os/exec"
	"strings"
	"time"
)

func main() {
	var path string
	var cid string
	flag.StringVar(&path, "p", "", "")
	flag.StringVar(&cid, "c", "", "")
	flag.Parse()
	if path == "" || cid == "" {
		panic("path is empty")
	}
	for i := 0; i < 15; i++ {
		go func() {
			for {
				start := time.Now()
				cmd := exec.Command("bash", "exec.sh", cid)
				cmd.Stdout = os.Stdout
				err := cmd.Run()
				if err != nil {
					log.Print(err)
				}
				fmt.Println(time.Now().Sub(start))
				num := rand.Intn(100)
				time.Sleep(time.Duration(num) * time.Millisecond)
			}
		}()
	}
	for i := 0; i < 4; i++ {
		for {
			err := freezeAndThawed(path)
			if err != nil {
				log.Printf("freeze and thawed %v", err)
			} else {
				log.Printf("freeze and thawed success")
			}
			time.Sleep(time.Duration(rand.Intn(400)+200) * time.Millisecond)
		}
	}

}

func freezeAndThawed(path string) (err error) {
	err = freezeWithTimeout(path, string(Frozen))
	if err != nil {
		return err
	}
	err = freezeWithTimeout(path, string(Thawed))
	if err != nil {
		return err
	}
	err = freezeWithTimeout(path, string(Thawed))
	if err != nil {
		return err
	}
	return err
}

func freezeWithTimeout(path, freezer string) (err error) {
	ch := make(chan error)
	go func() {
		ch <- freeze(path, freezer)
	}()
	for {
		select {
		case <-time.After(10 * time.Second):
			log.Printf("time out to set freeze %v %v", path, freezer)
		case err := <-ch:
			log.Printf("set path %v,%v end %v", path, freezer, err)
			return err
		}
	}
}

type FreezerState string

const (
	Undefined FreezerState = ""
	Frozen    FreezerState = "FROZEN"
	Thawed    FreezerState = "THAWED"
)

func getFreezeWithTimeout(path string) (state FreezerState, err error) {
	type r struct {
		state FreezerState
		err   error
	}
	ch := make(chan r)
	go func() {
		state, err := getFreeze(path)
		ch <- r{state, err}
	}()
	for {
		select {
		case <-time.After(10 * time.Second):
			log.Printf("time out to get freeze %v", path)
		case r := <-ch:
			log.Printf("get from path %v,%v %v", path, r.state, r.err)
			return r.state, r.err
		}
	}
}
func getFreeze(path string) (state FreezerState, err error) {
	for {
		state, err := ReadFile(path, "freezer.state")
		if err != nil {
			// If the kernel is too old, then we just treat the freezer as
			// being in an "undefined" state.
			if os.IsNotExist(err) || errors.Is(err, unix.ENODEV) {
				err = nil
			}
			return Undefined, err
		}
		switch strings.TrimSpace(state) {
		case "THAWED":
			return Thawed, nil
		case "FROZEN":
			return Frozen, nil
		case "FREEZING":
			// Make sure we get a stable freezer state, so retry if the cgroup
			// is still undergoing freezing. This should be a temporary delay.
			time.Sleep(1 * time.Millisecond)
			continue
		default:
			return Undefined, fmt.Errorf("unknown freezer.state %q", state)
		}
	}
}
func freeze(path string, freezer string) (err error) {
	for {
		// In case this loop does not exit because it doesn't get the expected
		// state, let's write again this state, hoping it's going to be properly
		// set this time. Otherwise, this loop could run infinitely, waiting for
		// a state change that would never happen.
		if err := WriteFile(path, "freezer.state", freezer); err != nil {
			return err
		}
		state, err := getFreezeWithTimeout(path)
		if err != nil {
			return err
		}
		if strings.TrimSpace(string(state)) == freezer {
			break
		}

		time.Sleep(1 * time.Millisecond)
	}
	return nil
}
func WriteFile(dir, file, data string) error {
	if dir == "" {
		return errors.Errorf("no directory specified for %s", file)
	}
	path, err := securejoin.SecureJoin(dir, file)
	if err != nil {
		return err
	}
	if err := ioutil.WriteFile(path, []byte(data), 0700); err != nil {
		return errors.Wrapf(err, "failed to write %q to %q", data, path)
	}
	return nil
}
func ReadFile(dir, file string) (string, error) {
	if dir == "" {
		return "", errors.Errorf("no directory specified for %s", file)
	}
	path, err := securejoin.SecureJoin(dir, file)
	if err != nil {
		return "", err
	}
	data, err := ioutil.ReadFile(path)
	return string(data), err
}

exec.sh

runc --root=/var/run/docker/runtime-runc/moby/ exec $1 stat /data

@niuyueyang1996
Copy link

when exec execute concurrently with freezen will reproduce the failure
we hope getState and freeze can exit with timeout

@kolyshkin
Copy link
Contributor

@niuyueyang1996 @gaopeiliang thanks, reproduced locally (on cgroup v1 -- apparently v2 is not vulnerable), Ubuntu 20.04, kernel 5.4.0-42-generic.

@kolyshkin
Copy link
Contributor

Simplified repro in bash

# Run exec and pause/resume in parallel
(
	while :; do
		$RUNC exec $CID printf .
	done
) &
PID_EXEC=$!

(
	while :; do
		$RUNC pause $CID && $RUNC resume $CID && date
	done
) &
PID_FUF=$!

sleep 1m
kill -9 $PID_FUF
kill -9 $PID_EXEC

runc exec $CID true && echo "success"

Usually 10s or so is enough to repro.

When runc init is stuck in D state, here's how it looks like:

root@ubu2004:/home/kir/git/runc# cat /proc/337830/stack 
[<0>] __refrigerator+0x50/0x150
[<0>] get_signal+0x86f/0x890
[<0>] do_signal+0x34/0x6c0
[<0>] exit_to_usermode_loop+0xbf/0x160
[<0>] do_syscall_64+0x163/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

@kolyshkin
Copy link
Contributor

Since the reporter saw it during the concurrent use of runc update and runc exec, it looks like a regression caused by commit b810da1 (PR #2391). In short, when system cgroup driver is used, it freezes the container on every runc update for the sake of setting systemd unit device properties (even if they are not set/changed) // cc @cyphar

@cyphar
Copy link
Member

cyphar commented Jan 28, 2021

The trace is just showing that it's frozen via the freezer cgroup (hence the creatively-named __refrigerator). I think the race probably just boils down to how we handle "undoing" the freeze while we apply cgroup rules under systemd -- we fetch the current frozen state then reset it after we're done applying cgroup rules.

Unfortunately if you have concurrent runc updates running, you'll end up in a situation where one runc update freezes the cgroup, another reads the state, then the first unfreezes it and finally the second re-freezes it (not knowing that it had observed an interstitial state).

So we have to either mutex runc update or do some other kind of check for what we should do. I think if we check the container state directly (rather than the current cgroup state) that might be less racy since we don't touch the container state during the frozen section (so it should still be "running" even though it's temporarily frozen).

In short, when system cgroup driver is used, it freezes the container on every runc update for the sake of setting systemd unit device properties (even if they are not set/changed)

We have to do this because systemd will modify the cgroup configuration even if you don't touch the device policy, meaning you get spurious errors in containers when writing to /dev/null during runc update. All of this code is working around a systemd bug -- if systemd had code like we have for cgroupfs none of this would be necessary.

@cyphar
Copy link
Member

cyphar commented Jan 28, 2021

Ah, you're also hitting it with stock runc pause ; runc resume. So the above explanation doesn't explain that behaviour as well...

@kolyshkin
Copy link
Contributor

I think the race probably just boils down to how we handle "undoing" the freeze while we apply cgroup rules under systemd -- we fetch the current frozen state then reset it after we're done applying cgroup rules.

Nah, I think what happens is a new process (from runc exec) is added to the cgroup while we're trying to freeze it (for whatever reason -- pause or update), and in this case we're required to repeat the freeze attempt as per the last paragraph of freezer-subsystem.txt:

It's important to note that freezing can be incomplete. In that case we return EBUSY. This means that some tasks in the cgroup are busy doing something that prevents us from completely freezing the cgroup at this time. After EBUSY, the cgroup will remain partially frozen -- reflected by freezer.state reporting "FREEZING" when read. The state will remain "FREEZING" until one of these things happens:
1) Userspace cancels the freezing operation by writing "THAWED" to
	the freezer.state file
2) Userspace retries the freezing operation by writing "FROZEN" to
	the freezer.state file (writing "FREEZING" is not legal
	and returns EINVAL)
3) The tasks that blocked the cgroup from entering the "FROZEN"
	state disappear from the cgroup's set of tasks.

@cyphar
Copy link
Member

cyphar commented Jan 28, 2021

Ah okay. But we do retry it -- that's what the fscommon.WriteFile loop is doing? (Also we're ignoring the possible -EBUSY error here as well?)

@kolyshkin
Copy link
Contributor

So, my current plan is:

  1. Fix the (*FreezerGroup).Set(configs.Frozen) to retry if it sees "FREEZING".

  2. Try to eliminate the unnecessary freeze/thaw on start and update (currently we do 1 freeze and 2 thaws on every runc --systemd-cgroup update).

  3. Fix (*FreezerGroup).GetState() to error out, not being stuck forever.

@cyphar
Copy link
Member

cyphar commented Jan 28, 2021

Ah I just realised the mistake I made -- yeah, GetState shouldn't loop the way it does (or we need to add a new configs.Freezing state). (1) is already the case, the problem is that (3) needs to be fixed because GetState is stuck in a loop. I'm pretty sure that when I transplanted the logic into GetState it was already a loop and I just moved it, but whatever -- that's where the bug is now.

If you can find a way to eliminate (2) that sounds good but based on my memory from working on this last year, the only way to eliminate (2) is to fix systemd (which I did take a look at but their cgroup management code is quite complicated to say the least). There's a reason we only do this for systemd and not for cgroupfs -- our cgroupfs manager handles devices cgroup updates properly because of #2391.

@kolyshkin
Copy link
Contributor

But we do retry it -- that's what the fscommon.WriteFile loop is doing? (Also we're ignoring the possible -EBUSY error here as well?)

No, this is against EINTR bug caused by go 1.14+ runtime (which sends SIGURG to improve preemptiveness, and for some reason writing to cgroupv1 is not SA_RESTARTable, see discussion at #2258 (comment)). This was fixed in golang 1.15 (see golang/go#38033).

In addition, despite what the kernel doc says, I haven't seen EBUSY on write to freezer.state, so I guess we should read it back and retry on getting FREEZING.

@cyphar
Copy link
Member

cyphar commented Jan 28, 2021

I guess we should read it back and retry on getting FREEZING.

That's what the WriteFile+GetState loop does though (we don't check for "freezing" specifically but I don't think it hurts to check for "is this the same state as the one we want"). I was thinking of something simple like:

diff --git a/libcontainer/cgroups/fs/freezer.go b/libcontainer/cgroups/fs/freezer.go
index 11cb1646f482..828bae2bead9 100644
--- a/libcontainer/cgroups/fs/freezer.go
+++ b/libcontainer/cgroups/fs/freezer.go
@@ -62,7 +62,7 @@ func (s *FreezerGroup) GetStats(path string, stats *cgroups.Stats) error {
 }
 
 func (s *FreezerGroup) GetState(path string) (configs.FreezerState, error) {
-	for {
+	for retries := 0; retries < 5; retries++ {
 		state, err := fscommon.ReadFile(path, "freezer.state")
 		if err != nil {
 			// If the kernel is too old, then we just treat the freezer as
@@ -78,12 +78,16 @@ func (s *FreezerGroup) GetState(path string) (configs.FreezerState, error) {
 		case "FROZEN":
 			return configs.Frozen, nil
 		case "FREEZING":
-			// Make sure we get a stable freezer state, so retry if the cgroup
-			// is still undergoing freezing. This should be a temporary delay.
+			// Try to get a stable freezer state, so retry if the cgroup
+			// is still undergoing freezing. This should be a temporary delay,
+			// but since this function is called in Set() it's possible that
+			// FREEZING is permanent -- so we limit the number of retries.
 			time.Sleep(1 * time.Millisecond)
 			continue
 		default:
 			return configs.Undefined, fmt.Errorf("unknown freezer.state %q", state)
 		}
 	}
+	// We ran out of retries -- return FREEZING.
+	return configs.Freezing, nil
 }
diff --git a/libcontainer/configs/cgroup_linux.go b/libcontainer/configs/cgroup_linux.go
index aada5d62f199..01eb1a4b7c92 100644
--- a/libcontainer/configs/cgroup_linux.go
+++ b/libcontainer/configs/cgroup_linux.go
@@ -11,6 +11,7 @@ const (
 	Undefined FreezerState = ""
 	Frozen    FreezerState = "FROZEN"
 	Thawed    FreezerState = "THAWED"
+	Freezing  FreezerState = "FREEZING"
 )
 
 type Cgroup struct {

@kolyshkin
Copy link
Contributor

I think we should not return FREEZING as a state as no one expects it. Here's my attempt at solving this:

diff --git a/libcontainer/cgroups/fs/freezer.go b/libcontainer/cgroups/fs/freezer.go
index 11cb1646..dc265d7f 100644
--- a/libcontainer/cgroups/fs/freezer.go
+++ b/libcontainer/cgroups/fs/freezer.go
@@ -28,33 +28,40 @@ func (s *FreezerGroup) Apply(path string, d *cgroupData) error {
 
 func (s *FreezerGroup) Set(path string, cgroup *configs.Cgroup) error {
        switch cgroup.Resources.Freezer {
-       case configs.Frozen, configs.Thawed:
-               for {
-                       // In case this loop does not exit because it doesn't get the expected
-                       // state, let's write again this state, hoping it's going to be properly
-                       // set this time. Otherwise, this loop could run infinitely, waiting for
-                       // a state change that would never happen.
-                       if err := fscommon.WriteFile(path, "freezer.state", string(cgroup.Resources.Freezer)); err != nil {
+       case configs.Frozen:
+               // As per kernel doc (freezer-subsystem.txt), if FREEZING
+               // is seen, userspace should either retry or thaw.
+               for i := 0; i < 10; i++ {
+                       if err := fscommon.WriteFile(path, "freezer.state", string(configs.Frozen)); err != nil {
                                return err
                        }
 
-                       state, err := s.GetState(path)
+                       state, err := fscommon.ReadFile(path, "freezer.state")
                        if err != nil {
                                return err
                        }
-                       if state == cgroup.Resources.Freezer {
-                               break
+                       state = strings.TrimSpace(state)
+                       switch state {
+                       case "FREEZING":
+                               time.Sleep(time.Duration(i) * time.Millisecond)
+                               continue
+                       case string(configs.Frozen):
+                               return nil
+                       default:
+                               return fmt.Errorf("unexpected state %s while freezing", strings.TrimSpace(state))
                        }
-
-                       time.Sleep(1 * time.Millisecond)
                }
+               // It got stuck in FREEZING. Try to thaw it back
+               // (which will most probably succeed) and error out.
+               _ = fscommon.WriteFile(path, "freezer.state", string(configs.Thawed))
+               return errors.New("unable to freeze")
+       case configs.Thawed:
+               return fscommon.WriteFile(path, "freezer.state", string(configs.Thawed))
        case configs.Undefined:
                return nil
        default:
                return fmt.Errorf("Invalid argument '%s' to freezer.state", string(cgroup.Resources.Freezer))
        }
-
-       return nil
 }
 
 func (s *FreezerGroup) GetStats(path string, stats *cgroups.Stats) error {

This way, GetState() does not need to be changed as it will eventually succeed.

@kolyshkin
Copy link
Contributor

Now playing with paralel exec / update as it's more realistic use case. With the variation of the above solution, have the following results running my test for one minute:

 === execs total:		         1123
 === updates total:		       2254
 === frozen after retries:	  111
 === unable to freeze:    	   18

I played with the retry parameters (sleep time, number of retries) and some values appears to result in better rates than other, but obviously we can't entirely eliminate the possibility (of being unable to freeze) using this method. Ultimately, it needs to be solved in the kernel (which has all the controls, and I guess it's already done in cgroup v2).

In runc, we can avoid updating devices in case no change in device perms are requested. We could also introduce a lock between freeze and exec (so no new execs are performed while freeze is in progress), but that won't help the "internal threat" (something like a limited fork bomb inside a container).

@cyphar
Copy link
Member

cyphar commented Jan 29, 2021

I'm a bit confused because the latest kernel documentation (at least the one on HEAD) implies that the kernel does do internal retries, but yeah this solution seems reasonable. But I think even if we looped forever it would still work for most cases because the issue with the current implementation is we loop in GetState but don't write to the file -- so the state never changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants