os: StartProcess ETXTBSY race on Unix systems #22315

rsc · 2017-10-18T02:27:59Z

Modern Unix systems appear to have a fundamental design flaw in the interaction between multithreaded programs, fork+exec, and the prohibition on executing a program if that program is open for writing.

Below is a simple multithreaded C program. It creates 20 threads all doing the same thing: write an exit 0 shell script to /var/tmp/fork-exec-N (for different N), and then fork and exec that script. Repeat ad infinitum. Note that the shell script fds are opened O_CLOEXEC, so that an fd being written by one thread does not leak into the fork+exec's shell script of a different thread.

On my Linux workstation, this program produces a never-ending stream of ETXTBSY errors. The problem is that O_CLOEXEC is not enough. The fd being written by one thread can leak into the forked child of a second thread, and it stays there until that child calls exec. If the first thread closes the fd and calls exec before the second thread's child does exec, then the first thread's exec will get ETXTBSY, because somewhere in the system (specifically, in the child of the second thread), there is an fd still open for writing the first thread's shell script, and according to modern Unix rules, one must not exec a program if there exists any fd anywhere open for writing that program.

Five years ago this bit us because cmd/go installed cmd/cgo (that is, copied the binary from a temporary location to somewhere/bin/cgo) and then executed it. To fix this we put a sleep+retry loop around the fork+exec of cgo when it gets ETXTBSY. Now (as of last week or so) we don't ever install cmd/cgo and execute it in the same cmd/go process, so that specific race is gone, although as I write this cmd/go still has the sleep+retry loop, which I intend to remove.

Last week this bit us again because cmd/go updated a build stamp in the binary, closed it, and executed it. The resulting flaky ETXTBSY failures were reported as #22220. A pending CL fixes this by not updating the build stamp in temporary binaries, which are the main ones we execute. There's still one case where we write+execute a program, which is go test -cpuprofile x.prof pkg. The cpuprofile flag (and a few others) cause cmd/go to leave the pkg.test in the current directory for debugging purposes but also run the test. Luckily running the test is currently the final thing cmd/go does, and it waits for any other fork+exec'ed programs to finish before fork+exec'ing the test. So the race cannot happen in this case.

In general this race is going to happen every time anyone writes a program that both writes and executes a program. It's easy to imagine other build systems running into this, but also programs that do things like unzip a zip file and then run a program inside it - think a program supervisor or mini container runtime. As soon as there are multiple threads doing fork+exec at the same time, and one of them is doing fork+exec of a program that was previously open for write in the same process, you have a mysterious flaky problem.

It seems like maybe Go should take care of this, if possible. We've now hit it twice in cmd/go, five years apart, and at least this past time it took the better part of a day to figure out. (I don't remember how long it took five years ago, in part because I don't remember anything about discovering it five years ago. I also don't want to rediscover all this five years from now.)

There are a few hacks we could use:

In os.StartProcess, if we see ETXTBSY, sleep 100ms and try again, maybe a few times, up to say 1 second of sleeping. In general we don't know how long to sleep.
Arrange with a locking mechanism that close must never complete during a fork+exec sequence. The end of the fork+exec sequence needs to be the point where we know the close-on-exec fds have been closed. Unfortunately there is no portable way to identify that point.
- If the exec fails and the child tells us and exits, we can wait for the exit. That's easy.
- If the exec succeeds, we find out because the exec closes the child's end of the status pipe, and we get EOF.
  - If we know that an OS does close-on-exec work in increasing fd order, then we could also track the maximum fd we've opened and move the status pipe above that. Then seeing the status pipe close would mean all other fds are closed too.
  - If the OS had a "close all fds above x", we could use that. (I don't know of any that do, but it sure would help.)
- It may not be OK to block all closes on a wedged fork+exec (in general an exec'ed program may be loaded from some slow network server).
Note that vfork(2) is not a solution. Vfork is defined as the parent does not continue executing until the child is no longer using the parent's memory image. In the case of a successful exec, at least on Linux, vfork releases the memory image before doing any of the close-on-exec work, so the parent continues running before the child has closed the fds we care about.

None of these seem great. The ETXTBSY sleep, up to 1 second, might be the best option. It would certainly reduce the flake rate and in many cases would probably make it undetectable. It would not help exec of very slow-to-load programs, but that's not the common case.

I wondered how Java deals with this, and the answer seems to be that Java doesn't deal with this. https://bugs.openjdk.java.net/browse/JDK-8068370 was filed in 2014 and is still open.

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <errno.h>
#include <stdint.h>

void* runner(void*);

int
main(void)
{
	int i;
	pthread_t pid[20];

	for(i=1; i<20; i++)
		pthread_create(&pid[i], 0, runner, (void*)(uintptr_t)i);
	runner(0);
	return 0;
}

char script[] = "#!/bin/sh\nexit 0\n";

void*
runner(void *v)
{
	int i, fd, pid, status;
	char buf[100], *argv[2];
	
	i = (int)(uintptr_t)v;
	snprintf(buf, sizeof buf, "/var/tmp/fork-exec-%d", i);
	argv[0] = buf;
	argv[1] = 0;
	for(;;) {
		fd = open(buf, O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0777);
		if(fd < 0) {
			perror("open");
			exit(2);
		}
		write(fd, script, strlen(script));
		close(fd);
		pid = fork();
		if(pid < 0) {
			perror("fork");
			exit(2);
		}
		if(pid == 0) {
			execve(buf, argv, 0);
			exit(errno);
		}
		if(waitpid(pid, &status, 0) < 0) {
			perror("waitpid");
			exit(2);
		}
		if(!WIFEXITED(status)) {
			perror("waitpid not exited");
			exit(2);
		}
		status = WEXITSTATUS(status);
		if(status != 0)
			fprintf(stderr, "exec: %d %s\n", status, strerror(status));
	}
	return 0;
}

The text was updated successfully, but these errors were encountered:

rsc · 2017-10-18T02:29:20Z

/cc @aclements @ianlancetaylor @crawshaw

gopherbot · 2017-10-18T02:30:25Z

Change https://golang.org/cl/71570 mentions this issue: cmd/go: skip updateBuildID on binaries we will run

gopherbot · 2017-10-18T02:36:49Z

Change https://golang.org/cl/71571 mentions this issue: cmd/go: delete ETXTBSY hack that is no longer needed

RalphCorderoy · 2017-10-18T11:23:42Z

Userspace workarounds seem flawed or less than ideal. This is a kernel
problem, like O_CLOEXEC. Perhaps lobby for a O_CLOFORK that's similar
but close on fork instead. The writer would open, write, close, fork,
exec so wouldn't make use of it, but any other thread that forks
wouldn't carry the FD with it so the writer's close would succeeding in
nailing the sole, final, reference to the "open file description", as
POSIX calls it.

ianlancetaylor · 2017-10-18T13:51:28Z

O_CLOFORK is a good idea. Does anybody want to suggest that to the Linux kernel maintainers? I expect that if someone can get it into Linux it will flow through to the other kernels.

I'm going to repeat a hack I described elsewhere that I believe would work for pure Go programs.

record the highest file descriptor returned by syscall.Open, syscall.Socket, syscall.Dup, etc.
add a new RWMutex in syscall: forkMutex
during syscall.Close, acquire a read lock on forkMutex
in syscall.forkAndExecInChild acquire a write lock on forkMutex, and
open a pipe in the parent (as we already do if UidMappings is set), and
in the child, loop through the descriptors up to the highest one,
closing each one that is marked close-on-exec, then close the pipe to the parent
in the parent, when the pipe is closed, release the forkMutex lock

The effect of this should be that when syscall.Close returns, we know for sure that there is no forked child that has an open copy of the descriptor.

The disadvantages are that all forks are serialized, and that all forks waste time closing descriptors that will shortly be closed anyhow. Also, of course, forks temporarily block closes, but that is unlikely to be significant.

RalphCorderoy · 2017-10-18T14:34:32Z

O_CLOFORK is a good idea. Does anybody want to suggest that to the Linux kernel maintainers?

I'm happy to have a go, but I'm a nobody on that list. I was assuming folks here might have the ear of a Google kernel developer or two in that area that would vet the idea and suggest it to the list if worthy. :-)

during syscall.Close, acquire a read lock on forkMutex

And syscall.Dup2 and Dup3 as they may cause newfd to close.

Do syscall.Open et al also synchronise with forkMutex somehow? I'm wondering if they can be creating more FDs, either above or below the highwater mark, whilst forkAndExecInChild is looping, closing close-on-exec ones.

ianlancetaylor · 2017-10-18T14:39:36Z

Is there a place to file a feature request against the Linux kernel? I know nothing about the kernel development process. I hear it uses git.

Agree about Dup2 and Dup3.

As far as I can see it doesn't matter if syscall.Open and friends create a new FD while the child is looping, because the child won't see the new descriptor anyhow.

rsc · 2017-10-18T15:13:06Z

@ianlancetaylor thanks, yes, the explicit closes would solve the problem with slow execs, which would be nice. That might make this actually palatable. You also don't even need the extra pipe if you use vfork in this approach.

I agree with @RalphCorderoy that there's a race between the "maintain the max" and "fork", in that Open might create a new fd, then fork runs in a different thread before Open can update the max. But since fds are created lowest-available, it should suffice for the child to assume that max is, say, 10 larger than it is.

Also note that this need not be an RWMutex (and for that matter the current syscall.ForkMutex need not be an RWMutex either). It just needs to be an "either-or" mutex. An RWMutex allows N readers or 1 writer. The mutex we need would allow N of type A or N of type B, just never a mix. If we built that (not difficult, I don't think), then programs that never fork would not serialize any of their closes, and programs that fork a lot but don't close things would not serialize any of their forks.

O_CLOFORK would require having fcntl F_SETFL/F_GETFL support for that bit too, and it would complicate fork a little more than it already is. An alternative that would be equally fine for us would be a "close all fd's above" or "tell me the maximum fd of my process" syscall. I don't know if a new bit or a new syscall is more likely.

rsc · 2017-10-18T15:28:35Z

I should maybe also note that macOS fixes this problem by putting #if 0 around the ETXTBSY check in the kernel implementation of exec. That would be a third option for Linux although probably less likely than the other two.

RalphCorderoy · 2017-10-18T15:35:36Z

I've emailed [email protected]. Will reference an archive once it appears.
If they're unpersuaded, then there's the POSIX folks at Open Group; they have a bug tracker.

RalphCorderoy · 2017-10-18T16:48:10Z

linux-kernel mailing-list archive of post: https://marc.info/?l=linux-kernel&m=150834137201488

On modern Unix systems it is basically impossible for a multithreaded program to open a binary for write, close it, and then fork+exec that same binary. So don't write the binary if we're going to fork+exec it. This fixes the ETXTBSY flakes. Fixes #22220. See also #22315. Change-Id: I6be4802fa174726ef2a93d5b2f09f708da897cdb Reviewed-on: https://go-review.googlesource.com/71570 Run-TryBot: Russ Cox <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]>

This hack existed because cmd/go used to install (write) and then run cmd/cgo in the same invocation, and writing and then running a program is a no-no in modern multithreaded Unix programs (see #22315). As of CL 68338, cmd/go no longer installs any programs that it then tries to use. It never did this for any program other than cgo, and CL 68338 removed that special case for cgo. Now this special case, added for #3001 long ago, can be removed too. Change-Id: I338f1f8665e9aca823e33ef7dda9d19f665e4281 Reviewed-on: https://go-review.googlesource.com/71571 Run-TryBot: Russ Cox <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]>

bradfitz · 2017-11-30T16:26:13Z

What's the plan here for Go 1.10?

@RalphCorderoy, looks like you never got a reply, eh?

ianlancetaylor · 2018-12-20T01:43:25Z

Looks like Solaris and macOS and OpenBSD have O_CLOFORK already. Hopefully it will catch on further.

gopherbot · 2024-02-01T15:28:47Z

Change https://go.dev/cl/560415 mentions this issue: cmd/go: avoid copying a binary to be exec'd in TestScript/gotoolchain_path

…_path Runinng 'go build' writes the binary in a separate process, so avoids the race described in #22315. However, the script engine's 'cp' command currently executes in-process, so it does not avoid that bug and may retain stale file descriptors when running tests in parallel. Avoid the race in this particular test by giving the final binary location in the '-o' argument instead of copying it there after the fact. Fixes #64019. Change-Id: I96d276f33c09e39f465e9877356f1d8f2ae55062 Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-longtest,gotip-windows-amd64-longtest Reviewed-on: https://go-review.googlesource.com/c/go/+/560415 Auto-Submit: Bryan Mills <[email protected]> LUCI-TryBot-Result: Go LUCI <[email protected]> Reviewed-by: Michael Matloob <[email protected]>

…_path Runinng 'go build' writes the binary in a separate process, so avoids the race described in golang#22315. However, the script engine's 'cp' command currently executes in-process, so it does not avoid that bug and may retain stale file descriptors when running tests in parallel. Avoid the race in this particular test by giving the final binary location in the '-o' argument instead of copying it there after the fact. Fixes golang#64019. Change-Id: I96d276f33c09e39f465e9877356f1d8f2ae55062 Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-longtest,gotip-windows-amd64-longtest Reviewed-on: https://go-review.googlesource.com/c/go/+/560415 Auto-Submit: Bryan Mills <[email protected]> LUCI-TryBot-Result: Go LUCI <[email protected]> Reviewed-by: Michael Matloob <[email protected]>

Since #2325, we learned that the "text file busy" error reported by the customer was probably golang/go#22315. This change makes the retry and log pathway more specific to that error, and retries more times more frequently.

On Wed, May 29, 2024 at 04:41:32PM -0400, Josef Bacik wrote: > NOTE: > This is compile tested only. It's also based on an out of tree feature branch > from Amir that I'm extending to add page fault content events to allow us to > have on-demand binary loading at exec time. If you need more context please let > me know, I'll push my current branch somewhere and wire up how I plan to use > this patch so you can see it in better context, but hopefully I've described > what I'm trying to accomplish enough that this leads to useful discussion. > > > Currently we have ->i_writecount to control write access to an inode. > Callers may deny write access by calling deny_write_access(), which will > cause ->i_writecount to go negative, and allow_write_access() to push it > back up to 0. > > This is used in a few ways, the biggest user being exec. Historically > we've blocked write access to files that are opened for executing. > fsverity is the other notable user. > > With the upcoming fanotify feature that allows for on-demand population > of files, this blanket policy of denying writes to files opened for > executing creates a problem. We have to issue events right before > denying access, and the entire file must be populated before we can > continue with the exec. > > This creates a problem for users who have large binaries they want to > populate on demand. Inside of Meta we often have multi-gigabyte > binaries, even on the order of tens of gigabytes. Pre-loading these > large files is costly, especially when large portions of the binary may > never be read (think debuginfo). > > In order to facilitate on-demand loading of binaries we need to have a > way to punch a hole in this exec related write denial. Hm. I suggest we try to tackle this in a completely different way. Maybe I mentioned it during LSFMM but instead of doing this dance we should try and remove the deny_write_access() mechanisms for exec completely. Back in 2021 we removed the remaining VM_DENYWRITE bits but left in deny_write_access() for exec. Back then I had thought that this was a bit risky for not too much gain. But looking at this code here I think we now have an even stronger reason to try and get rid of this restriction. And I've since changed my mind. See my notes on the _completely untested_ RFC patch I appended. Ofc depends on whether Linus still agrees that removing this might be something we could try. > This patch accomplishes this by doing something similar to the freeze > levels on the super block. We have two levels, one for all write > denial, and one for exec. People wishing to deny writes will specify > the level they're denying. Users wishing to get write access must go > through all of the levels and increment each levels counter until it > increments them all, or encounters a level that is currently denied, at > which point they undo their increments and return an error. > > Future patches will use the get_write_access_level() helper in order to > skip the level they wish to be allowed to access. Any inode being > populated via the pre-content fanotify mechanism will be marked with a > special flag, and the open path will use get_write_access_level() in > order to bypass the exec restriction. > > This is a significant behavior change, as it allows us to write to a > file that is currently being exec'ed. However this will be limited to a > very narrow use case. > > Signed-off-by: Josef Bacik <[email protected]> > --- > drivers/md/md.c | 2 +- > fs/binfmt_elf.c | 4 +- > fs/exec.c | 6 +-- > fs/ext4/file.c | 4 +- > fs/inode.c | 3 +- > fs/kernel_read_file.c | 4 +- > fs/locks.c | 2 +- > fs/quota/dquot.c | 2 +- > fs/verity/enable.c | 4 +- > include/linux/fs.h | 90 +++++++++++++++++++++++++++---- > include/trace/events/filelock.h | 2 +- > kernel/fork.c | 11 ++-- > security/integrity/evm/evm_main.c | 2 +- > security/integrity/ima/ima_main.c | 2 +- > 14 files changed, 104 insertions(+), 34 deletions(-) > > diff --git a/drivers/md/md.c b/drivers/md/md.c > index aff9118ff697..134cefbd7bef 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -7231,7 +7231,7 @@ static int set_bitmap_file(struct mddev *mddev, int fd) > pr_warn("%s: error: bitmap file must open for write\n", > mdname(mddev)); > err = -EBADF; > - } else if (atomic_read(&inode->i_writecount) != 1) { > + } else if (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) != 1) { > pr_warn("%s: error: bitmap file is already in use\n", > mdname(mddev)); > err = -EBUSY; > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c > index a43897b03ce9..9a6fcf8ba17c 100644 > --- a/fs/binfmt_elf.c > +++ b/fs/binfmt_elf.c > @@ -1216,7 +1216,7 @@ static int load_elf_binary(struct linux_binprm *bprm) > } > reloc_func_desc = interp_load_addr; > > - allow_write_access(interpreter); > + allow_write_access(interpreter, INODE_DENY_WRITE_EXEC); > fput(interpreter); > > kfree(interp_elf_ex); > @@ -1308,7 +1308,7 @@ static int load_elf_binary(struct linux_binprm *bprm) > kfree(interp_elf_ex); > kfree(interp_elf_phdata); > out_free_file: > - allow_write_access(interpreter); > + allow_write_access(interpreter, INODE_DENY_WRITE_EXEC); > if (interpreter) > fput(interpreter); > out_free_ph: > diff --git a/fs/exec.c b/fs/exec.c > index 18f057ba64a5..6b2900ee448d 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -971,7 +971,7 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags) > if (err) > goto exit; > > - err = deny_write_access(file); > + err = deny_write_access(file, INODE_DENY_WRITE_EXEC); > if (err) > goto exit; > > @@ -1545,7 +1545,7 @@ static void do_close_execat(struct file *file) > { > if (!file) > return; > - allow_write_access(file); > + allow_write_access(file, INODE_DENY_WRITE_EXEC); > fput(file); > } > > @@ -1865,7 +1865,7 @@ static int exec_binprm(struct linux_binprm *bprm) > bprm->file = bprm->interpreter; > bprm->interpreter = NULL; > > - allow_write_access(exec); > + allow_write_access(exec, INODE_DENY_WRITE_EXEC); > if (unlikely(bprm->have_execfd)) { > if (bprm->executable) { > fput(exec); > diff --git a/fs/ext4/file.c b/fs/ext4/file.c > index c89e434db6b7..6196f449649c 100644 > --- a/fs/ext4/file.c > +++ b/fs/ext4/file.c > @@ -171,8 +171,8 @@ static int ext4_release_file(struct inode *inode, struct file *filp) > } > /* if we are the last writer on the inode, drop the block reservation */ > if ((filp->f_mode & FMODE_WRITE) && > - (atomic_read(&inode->i_writecount) == 1) && > - !EXT4_I(inode)->i_reserved_data_blocks) { > + (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) == 1) && > + !EXT4_I(inode)->i_reserved_data_blocks) { > down_write(&EXT4_I(inode)->i_data_sem); > ext4_discard_preallocations(inode); > up_write(&EXT4_I(inode)->i_data_sem); > diff --git a/fs/inode.c b/fs/inode.c > index 3a41f83a4ba5..fb6155412252 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -173,7 +173,8 @@ int inode_init_always(struct super_block *sb, struct inode *inode) > inode->i_opflags |= IOP_XATTR; > i_uid_write(inode, 0); > i_gid_write(inode, 0); > - atomic_set(&inode->i_writecount, 0); > + for (int i = 0; i < INODE_DENY_WRITE_LEVEL; i++) > + atomic_set(&inode->i_writecount[i], 0); > inode->i_size = 0; > inode->i_write_hint = WRITE_LIFE_NOT_SET; > inode->i_blocks = 0; > diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c > index c429c42a6867..9af82d20aa1f 100644 > --- a/fs/kernel_read_file.c > +++ b/fs/kernel_read_file.c > @@ -48,7 +48,7 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf, > if (!S_ISREG(file_inode(file)->i_mode)) > return -EINVAL; > > - ret = deny_write_access(file); > + ret = deny_write_access(file, INODE_DENY_WRITE_ALL); > if (ret) > return ret; > > @@ -119,7 +119,7 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf, > } > > out: > - allow_write_access(file); > + allow_write_access(file, INODE_DENY_WRITE_ALL); > return ret == 0 ? copied : ret; > } > EXPORT_SYMBOL_GPL(kernel_read_file); > diff --git a/fs/locks.c b/fs/locks.c > index 90c8746874de..3e6a62f56528 100644 > --- a/fs/locks.c > +++ b/fs/locks.c > @@ -1763,7 +1763,7 @@ check_conflicting_open(struct file *filp, const int arg, int flags) > else if (filp->f_mode & FMODE_READ) > self_rcount = 1; > > - if (atomic_read(&inode->i_writecount) != self_wcount || > + if (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) != self_wcount || > atomic_read(&inode->i_readcount) != self_rcount) > return -EAGAIN; > > diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c > index 627eb2f72ef3..fa470d76f0b3 100644 > --- a/fs/quota/dquot.c > +++ b/fs/quota/dquot.c > @@ -1031,7 +1031,7 @@ static int add_dquot_ref(struct super_block *sb, int type) > list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { > spin_lock(&inode->i_lock); > if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) || > - !atomic_read(&inode->i_writecount) || > + !inode_is_open_for_write(inode) || > !dqinit_needed(inode, type)) { > spin_unlock(&inode->i_lock); > continue; > diff --git a/fs/verity/enable.c b/fs/verity/enable.c > index c284f46d1b53..a0e0d49c6ccc 100644 > --- a/fs/verity/enable.c > +++ b/fs/verity/enable.c > @@ -371,7 +371,7 @@ int fsverity_ioctl_enable(struct file *filp, const void __user *uarg) > if (err) /* -EROFS */ > return err; > > - err = deny_write_access(filp); > + err = deny_write_access(filp, INODE_DENY_WRITE_ALL); > if (err) /* -ETXTBSY */ > goto out_drop_write; > > @@ -397,7 +397,7 @@ int fsverity_ioctl_enable(struct file *filp, const void __user *uarg) > * allow_write_access() is needed to pair with deny_write_access(). > * Regardless, the filesystem won't allow writing to verity files. > */ > - allow_write_access(filp); > + allow_write_access(filp, INODE_DENY_WRITE_ALL); > out_drop_write: > mnt_drop_write_file(filp); > return err; > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 0283cf366c2a..f0720cd0ab45 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -620,6 +620,22 @@ is_uncached_acl(struct posix_acl *acl) > #define IOP_XATTR 0x0008 > #define IOP_DEFAULT_READLINK 0x0010 > > +/* > + * These are used with the *write_access helpers. > + * > + * INODE_DENY_WRITE_ALL - Do not allow writes at all. > + * INODE_DENY_WRITE_EXEC - Do not allow writes because we are > + * exec'ing the file. This can be bypassed > + * in cases where the file needs to be > + * filled in via the fanotify pre-content > + * hooks. > + */ > +enum inode_write_level { > + INODE_DENY_WRITE_ALL, > + INODE_DENY_WRITE_EXEC, > + INODE_DENY_WRITE_LEVEL, > +}; > + > /* > * Keep mostly read-only and often accessed (especially for > * the RCU path lookup and 'stat' data) fields at the beginning > @@ -701,7 +717,7 @@ struct inode { > atomic64_t i_sequence; /* see futex */ > atomic_t i_count; > atomic_t i_dio_count; > - atomic_t i_writecount; > + atomic_t i_writecount[INODE_DENY_WRITE_LEVEL]; > #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING) > atomic_t i_readcount; /* struct files open RO */ > #endif > @@ -2920,38 +2936,90 @@ static inline void kiocb_end_write(struct kiocb *iocb) > * put_write_access() releases this write permission. > * deny_write_access() denies write access to a file. > * allow_write_access() re-enables write access to a file. > + * __get_write_access_level() gets write permission for a file for the given > + * level. > + * put_write_access_level() releases the level write permission. > * > - * The i_writecount field of an inode can have the following values: > + * The level indicates which level we're denying writes for. Some levels are > + * allowed to be bypassed in special circumstances. In the cases that the write > + * level needs to be bypassed the user must use the > + * get_write_access_level()/put_write_access_level() pairs of the helpers, which > + * will allow the user to bypass the given level, but none of the others. > + * > + * The i_writecount[level] field of an inode can have the following values: > * 0: no write access, no denied write access > - * < 0: (-i_writecount) users that denied write access to the file. > - * > 0: (i_writecount) users that have write access to the file. > + * < 0: (-i_writecount[level]) users that denied write access to the file. > + * > 0: (i_writecount[level]) users that have write access to the file. > * > * Normally we operate on that counter with atomic_{inc,dec} and it's safe > * except for the cases where we don't hold i_writecount yet. Then we need to > * use {get,deny}_write_access() - these functions check the sign and refuse > * to do the change if sign is wrong. > */ > +static inline int __get_write_access(struct inode *inode, > + enum inode_write_level skip_level) > +{ > + int i; > + > + /* We are not allowed to skip INODE_DENY_WRITE_ALL. */ > + WARN_ON_ONCE(skip_level == INODE_DENY_WRITE_ALL); > + > + for (i = 0; i < INODE_DENY_WRITE_LEVEL; i++) { > + if (i == skip_level) > + continue; > + if (!atomic_inc_unless_negative(&inode->i_writecount[i])) > + goto fail; > + } > + return 0; > +fail: > + while (--i >= 0) { > + if (i == skip_level) > + continue; > + atomic_dec(&inode->i_writecount[i]); > + } > + return -ETXTBSY; > +} > static inline int get_write_access(struct inode *inode) > { > - return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBSY; > + return __get_write_access(inode, INODE_DENY_WRITE_LEVEL); > } > -static inline int deny_write_access(struct file *file) > +static inline int get_write_access_level(struct inode *inode, > + enum inode_write_level skip_level) > +{ > + return __get_write_access(inode, skip_level); > +} > +static inline int deny_write_access(struct file *file, > + enum inode_write_level level) > { > struct inode *inode = file_inode(file); > - return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY; > + return atomic_dec_unless_positive(&inode->i_writecount[level]) ? 0 : -ETXTBSY; > +} > +static inline void __put_write_access(struct inode *inode, > + enum inode_write_level skip_level) > +{ > + for (int i = 0; i < INODE_DENY_WRITE_LEVEL; i++) { > + if (i == skip_level) > + continue; > + atomic_dec(&inode->i_writecount[i]); > + } > } > static inline void put_write_access(struct inode * inode) > { > - atomic_dec(&inode->i_writecount); > + __put_write_access(inode, INODE_DENY_WRITE_LEVEL); > } > -static inline void allow_write_access(struct file *file) > +static inline void put_write_access_level(struct inode *inode, > + enum inode_write_level skip_level) > +{ > + __put_write_access(inode, skip_level); > +} > +static inline void allow_write_access(struct file *file, enum inode_write_level level) > { > if (file) > - atomic_inc(&file_inode(file)->i_writecount); > + atomic_inc(&file_inode(file)->i_writecount[level]); > } > static inline bool inode_is_open_for_write(const struct inode *inode) > { > - return atomic_read(&inode->i_writecount) > 0; > + return atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) > 0; > } > > #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING) > diff --git a/include/trace/events/filelock.h b/include/trace/events/filelock.h > index b8d1e00a7982..ad076e87a956 100644 > --- a/include/trace/events/filelock.h > +++ b/include/trace/events/filelock.h > @@ -187,7 +187,7 @@ TRACE_EVENT(generic_add_lease, > TP_fast_assign( > __entry->s_dev = inode->i_sb->s_dev; > __entry->i_ino = inode->i_ino; > - __entry->wcount = atomic_read(&inode->i_writecount); > + __entry->wcount = atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]); > __entry->rcount = atomic_read(&inode->i_readcount); > __entry->icount = atomic_read(&inode->i_count); > __entry->owner = fl->c.flc_owner; > diff --git a/kernel/fork.c b/kernel/fork.c > index 99076dbe27d8..b6dc7aed2ebf 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -620,7 +620,7 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm) > * We depend on the oldmm having properly denied write access to the > * exe_file already. > */ > - if (exe_file && deny_write_access(exe_file)) > + if (exe_file && deny_write_access(exe_file, INODE_DENY_WRITE_EXEC)) > pr_warn_once("deny_write_access() failed in %s\n", __func__); > } > > @@ -1417,13 +1417,14 @@ int set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file) > * We expect the caller (i.e., sys_execve) to already denied > * write access, so this is unlikely to fail. > */ > - if (unlikely(deny_write_access(new_exe_file))) > + if (unlikely(deny_write_access(new_exe_file, > + INODE_DENY_WRITE_EXEC))) > return -EACCES; > get_file(new_exe_file); > } > rcu_assign_pointer(mm->exe_file, new_exe_file); > if (old_exe_file) { > - allow_write_access(old_exe_file); > + allow_write_access(old_exe_file, INODE_DENY_WRITE_EXEC); > fput(old_exe_file); > } > return 0; > @@ -1464,7 +1465,7 @@ int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file) > return ret; > } > > - ret = deny_write_access(new_exe_file); > + ret = deny_write_access(new_exe_file, INODE_DENY_WRITE_EXEC); > if (ret) > return -EACCES; > get_file(new_exe_file); > @@ -1476,7 +1477,7 @@ int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file) > mmap_write_unlock(mm); > > if (old_exe_file) { > - allow_write_access(old_exe_file); > + allow_write_access(old_exe_file, INODE_DENY_WRITE_EXEC); > fput(old_exe_file); > } > return 0; > diff --git a/security/integrity/evm/evm_main.c b/security/integrity/evm/evm_main.c > index 62fe66dd53ce..b83dfb295d85 100644 > --- a/security/integrity/evm/evm_main.c > +++ b/security/integrity/evm/evm_main.c > @@ -1084,7 +1084,7 @@ static void evm_file_release(struct file *file) > if (!S_ISREG(inode->i_mode) || !(mode & FMODE_WRITE)) > return; > > - if (iint && atomic_read(&inode->i_writecount) == 1) > + if (iint && atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) == 1) > iint->flags &= ~EVM_NEW_FILE; > } > > diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c > index f04f43af651c..7aed80a70692 100644 > --- a/security/integrity/ima/ima_main.c > +++ b/security/integrity/ima/ima_main.c > @@ -164,7 +164,7 @@ static void ima_check_last_writer(struct ima_iint_cache *iint, > return; > > mutex_lock(&iint->mutex); > - if (atomic_read(&inode->i_writecount) == 1) { > + if (atomic_read(&inode->i_writecount[INODE_DENY_WRITE_ALL]) == 1) { > struct kstat stat; > > update = test_and_clear_bit(IMA_UPDATE_XATTR, > -- > 2.43.0 > From ba4a40635236fef36663bcfeca13dd816f9de69a Mon Sep 17 00:00:00 2001 From: Christian Brauner <[email protected]> Date: Thu, 30 May 2024 09:35:44 +0200 Subject: [PATCH] [RFC UNTESTED]: fs: don't deny writes to executables Back in 2021 we already discussed removing deny_write_access() for executables. Back then I was hesistant because I thought that this might cause issues in userspace. But even back then I had started taking some notes on what could potentially depend on this and I didn't come up with a lot so I've changed my mind and I would like to try this. Here are some of the notes that I took: (1) The deny_write_access() mechanism is causing really pointless issues such as [1]. If a thread in a thread-group opens a file writable, then writes some stuff, then closing the file descriptor and then calling execve() they can fail the execve() with ETXTBUSY because another thread in the thread-group could have concurrently called fork(). Multi-threaded libraries such as go suffer from this. (2) There are userspace attacks that rely on overwriting the binary of a running process. These attacks are _mitigated_ but _not at all prevented_ from ocurring by the deny_write_access() mechanism. I'll go over some details. The clearest example of such attacks was the attack against runC in CVE-2019-5736 (cf. [3]). An attack could compromise the runC host binary from inside a _privileged_ runC container. The malicious binary could then be used to take over the host. (It is crucial to note that this attack is _not_ possible with unprivileged containers. IOW, the setup here is already insecure.) The attack can be made when attaching to a running container or when starting a container running a specially crafted image. For example, when runC attaches to a container the attacker can trick it into executing itself. This could be done by replacing the target binary inside the container with a custom binary pointing back at the runC binary itself. As an example, if the target binary was /bin/bash, this could be replaced with an executable script specifying the interpreter path #!/proc/self/exe. As such when /bin/bash is executed inside the container, instead the target of /proc/self/exe will be executed. That magic link will point to the runc binary on the host. The attacker can then proceed to write to the target of /proc/self/exe to try and overwrite the runC binary on the host. However, this will not succeed because of deny_write_access(). Now, one might think that this would prevent the attack but it doesn't. To overcome this, the attacker has multiple ways: * Open a file descriptor to /proc/self/exe using the O_PATH flag and then proceed to reopen the binary as O_WRONLY through /proc/self/fd/<nr> and try to write to it in a busy loop from a separate process. Ultimately it will succeed when the runC binary exits. After this the runC binary is compromised and can be used to attack other containers or the host itself. * Use a malicious shared library annotating a function in there with the constructor attribute making the malicious function run as an initializor. The malicious library will then open /proc/self/exe for creating a new entry under /proc/self/fd/<nr>. It'll then call exec to a) force runC to exit and b) hand the file descriptor off to a program that then reopens /proc/self/fd/<nr> for writing (which is now possible because runC has exited) and overwriting that binary. To sum up: the deny_write_access() mechanism doesn't prevent such attacks in insecure setups. It just makes them minimally harder. That's all. The only way back then to prevent this is to create a temporary copy of the calling binary itself when it starts or attaches to containers. So what I did back then for LXC (and Aleksa for runC) was to create an anonymous, in-memory file using the memfd_create() system call and to copy itself into the temporary in-memory file, which is then sealed to prevent further modifications. This sealed, in-memory file copy is then executed instead of the original on-disk binary. Any compromising write operations from a privileged container to the host binary will then write to the temporary in-memory binary and not to the host binary on-disk, preserving the integrity of the host binary. Also as the temporary, in-memory binary is sealed, writes to this will also fail. The point is that deny_write_access() is uselss to prevent these attacks. (3) Denying write access to an inode because it's currently used in an exec path could easily be done on an LSM level. It might need an additional hook but that should be about it. (4) deny_write_access() doesn't help with hardlinks. (5) The MAP_DENYWRITE flag for mmap() has been deprecated a long time ago so while we do protect the main executable the bigger portion of the things you'd think need protecting such as the shared libraries aren't. IOW, we let anyone happily overwrite shared libraries. (6) We removed all remaining uses of VM_DENYWRITE in [2]. That means: (6.1) We removed the legacy uselib() protection for preventing overwriting of shared libraries. Nobody cared in 3 years. (6.2) We allow write access to the elf interpreter after exec completed treating it on a par with shared libraries. (7) Yes, someone in userspace could potentially be relying on this. It's not completely out of the realm of possibility but let's find out if that's actually the case and not guess. Link: golang/go#22315 [1] Link: 49624ef ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2] Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3] Link: https://lwn.net/Articles/866493 Signed-off-by: Christian Brauner <[email protected]>

Back in 2021 we already discussed removing deny_write_access() for executables. Back then I was hesistant because I thought that this might cause issues in userspace. But even back then I had started taking some notes on what could potentially depend on this and I didn't come up with a lot so I've changed my mind and I would like to try this. Here are some of the notes that I took: (1) The deny_write_access() mechanism is causing really pointless issues such as [1]. If a thread in a thread-group opens a file writable, then writes some stuff, then closing the file descriptor and then calling execve() they can fail the execve() with ETXTBUSY because another thread in the thread-group could have concurrently called fork(). Multi-threaded libraries such as go suffer from this. (2) There are userspace attacks that rely on overwriting the binary of a running process. These attacks are _mitigated_ but _not at all prevented_ from ocurring by the deny_write_access() mechanism. I'll go over some details. The clearest example of such attacks was the attack against runC in CVE-2019-5736 (cf. [3]). An attack could compromise the runC host binary from inside a _privileged_ runC container. The malicious binary could then be used to take over the host. (It is crucial to note that this attack is _not_ possible with unprivileged containers. IOW, the setup here is already insecure.) The attack can be made when attaching to a running container or when starting a container running a specially crafted image. For example, when runC attaches to a container the attacker can trick it into executing itself. This could be done by replacing the target binary inside the container with a custom binary pointing back at the runC binary itself. As an example, if the target binary was /bin/bash, this could be replaced with an executable script specifying the interpreter path #!/proc/self/exe. As such when /bin/bash is executed inside the container, instead the target of /proc/self/exe will be executed. That magic link will point to the runc binary on the host. The attacker can then proceed to write to the target of /proc/self/exe to try and overwrite the runC binary on the host. However, this will not succeed because of deny_write_access(). Now, one might think that this would prevent the attack but it doesn't. To overcome this, the attacker has multiple ways: * Open a file descriptor to /proc/self/exe using the O_PATH flag and then proceed to reopen the binary as O_WRONLY through /proc/self/fd/<nr> and try to write to it in a busy loop from a separate process. Ultimately it will succeed when the runC binary exits. After this the runC binary is compromised and can be used to attack other containers or the host itself. * Use a malicious shared library annotating a function in there with the constructor attribute making the malicious function run as an initializor. The malicious library will then open /proc/self/exe for creating a new entry under /proc/self/fd/<nr>. It'll then call exec to a) force runC to exit and b) hand the file descriptor off to a program that then reopens /proc/self/fd/<nr> for writing (which is now possible because runC has exited) and overwriting that binary. To sum up: the deny_write_access() mechanism doesn't prevent such attacks in insecure setups. It just makes them minimally harder. That's all. The only way back then to prevent this is to create a temporary copy of the calling binary itself when it starts or attaches to containers. So what I did back then for LXC (and Aleksa for runC) was to create an anonymous, in-memory file using the memfd_create() system call and to copy itself into the temporary in-memory file, which is then sealed to prevent further modifications. This sealed, in-memory file copy is then executed instead of the original on-disk binary. Any compromising write operations from a privileged container to the host binary will then write to the temporary in-memory binary and not to the host binary on-disk, preserving the integrity of the host binary. Also as the temporary, in-memory binary is sealed, writes to this will also fail. The point is that deny_write_access() is uselss to prevent these attacks. (3) Denying write access to an inode because it's currently used in an exec path could easily be done on an LSM level. It might need an additional hook but that should be about it. (4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time ago so while we do protect the main executable the bigger portion of the things you'd think need protecting such as the shared libraries aren't. IOW, we let anyone happily overwrite shared libraries. (5) We removed all remaining uses of VM_DENYWRITE in [2]. That means: (5.1) We removed the legacy uselib() protection for preventing overwriting of shared libraries. Nobody cared in 3 years. (5.2) We allow write access to the elf interpreter after exec completed treating it on a par with shared libraries. Yes, someone in userspace could potentially be relying on this. It's not completely out of the realm of possibility but let's find out if that's actually the case and not guess. Link: golang/go#22315 [1] Link: 49624ef ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2] Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3] Link: https://lwn.net/Articles/866493 Link: golang/go#22220 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61 Link: buildkite/agent#2736 Link: rust-lang/rust#114554 Link: https://bugs.openjdk.org/browse/JDK-8068370 Link: dotnet/runtime#58964 Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]>

nadiamoe · 2024-06-25T11:31:23Z

I've worked around this issue on a piece of code that writes to a file that will be executed shortly after by locking syscall.ForkLock for reading while the file is open.

If my understanding is correct, the runtime always locks syscall.ForkLock for writing before forking, so locking it for reading before opening a file prevents "leaking" its file descriptor to forked processes, as that fork cannot happen while syscall.ForkLock is locked.

This is of course not a solution to the general problem, but should be a viable workaround for pieces of code which can assume that a file they are writing will be executed later.

Circumvents golang/go#22315

Back in 2021 we already discussed removing deny_write_access() for executables. Back then I was hesistant because I thought that this might cause issues in userspace. But even back then I had started taking some notes on what could potentially depend on this and I didn't come up with a lot so I've changed my mind and I would like to try this. Here are some of the notes that I took: (1) The deny_write_access() mechanism is causing really pointless issues such as [1]. If a thread in a thread-group opens a file writable, then writes some stuff, then closing the file descriptor and then calling execve() they can fail the execve() with ETXTBUSY because another thread in the thread-group could have concurrently called fork(). Multi-threaded libraries such as go suffer from this. (2) There are userspace attacks that rely on overwriting the binary of a running process. These attacks are _mitigated_ but _not at all prevented_ from ocurring by the deny_write_access() mechanism. I'll go over some details. The clearest example of such attacks was the attack against runC in CVE-2019-5736 (cf. [3]). An attack could compromise the runC host binary from inside a _privileged_ runC container. The malicious binary could then be used to take over the host. (It is crucial to note that this attack is _not_ possible with unprivileged containers. IOW, the setup here is already insecure.) The attack can be made when attaching to a running container or when starting a container running a specially crafted image. For example, when runC attaches to a container the attacker can trick it into executing itself. This could be done by replacing the target binary inside the container with a custom binary pointing back at the runC binary itself. As an example, if the target binary was /bin/bash, this could be replaced with an executable script specifying the interpreter path #!/proc/self/exe. As such when /bin/bash is executed inside the container, instead the target of /proc/self/exe will be executed. That magic link will point to the runc binary on the host. The attacker can then proceed to write to the target of /proc/self/exe to try and overwrite the runC binary on the host. However, this will not succeed because of deny_write_access(). Now, one might think that this would prevent the attack but it doesn't. To overcome this, the attacker has multiple ways: * Open a file descriptor to /proc/self/exe using the O_PATH flag and then proceed to reopen the binary as O_WRONLY through /proc/self/fd/<nr> and try to write to it in a busy loop from a separate process. Ultimately it will succeed when the runC binary exits. After this the runC binary is compromised and can be used to attack other containers or the host itself. * Use a malicious shared library annotating a function in there with the constructor attribute making the malicious function run as an initializor. The malicious library will then open /proc/self/exe for creating a new entry under /proc/self/fd/<nr>. It'll then call exec to a) force runC to exit and b) hand the file descriptor off to a program that then reopens /proc/self/fd/<nr> for writing (which is now possible because runC has exited) and overwriting that binary. To sum up: the deny_write_access() mechanism doesn't prevent such attacks in insecure setups. It just makes them minimally harder. That's all. The only way back then to prevent this is to create a temporary copy of the calling binary itself when it starts or attaches to containers. So what I did back then for LXC (and Aleksa for runC) was to create an anonymous, in-memory file using the memfd_create() system call and to copy itself into the temporary in-memory file, which is then sealed to prevent further modifications. This sealed, in-memory file copy is then executed instead of the original on-disk binary. Any compromising write operations from a privileged container to the host binary will then write to the temporary in-memory binary and not to the host binary on-disk, preserving the integrity of the host binary. Also as the temporary, in-memory binary is sealed, writes to this will also fail. The point is that deny_write_access() is uselss to prevent these attacks. (3) Denying write access to an inode because it's currently used in an exec path could easily be done on an LSM level. It might need an additional hook but that should be about it. (4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time ago so while we do protect the main executable the bigger portion of the things you'd think need protecting such as the shared libraries aren't. IOW, we let anyone happily overwrite shared libraries. (5) We removed all remaining uses of VM_DENYWRITE in [2]. That means: (5.1) We removed the legacy uselib() protection for preventing overwriting of shared libraries. Nobody cared in 3 years. (5.2) We allow write access to the elf interpreter after exec completed treating it on a par with shared libraries. Yes, someone in userspace could potentially be relying on this. It's not completely out of the realm of possibility but let's find out if that's actually the case and not guess. Link: golang/go#22315 [1] Link: 49624ef ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2] Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3] Link: https://lwn.net/Articles/866493 Link: golang/go#22220 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61 Link: buildkite/agent#2736 Link: rust-lang/rust#114554 Link: https://bugs.openjdk.org/browse/JDK-8068370 Link: dotnet/runtime#58964 Signed-off-by: Christian Brauner <[email protected]>

geofft · 2024-09-15T18:56:13Z

FYI, Linux 6.11 (just released today) has dropped the ETXTBSY "feature" entirely. The commit making this change links to this issue, and also has an argument that the security purpose of this feature didn't actually work. See also this 2021 LWN discussion of the issue and previous changes to reduce the cases that trigger ETXTBSY.

It'll be months/years before the distros that real-world users are actually using get to 6.11, but it's good to know that at least on Linux this issue will completely go away at some point.

Maybe we can also use this to argue to Oracle, Apple, and the BSDs that they should make the equivalent change.

kmeaw · 2025-01-13T23:14:38Z

If maintaining specific parent-child relationships is not required, the "zygote" concept - commonly used in systems like Chromium and Android - could be applied to address file descriptor leaks. The "zygote" process serves as a dedicated origin for process creation, ensuring that every process either forks or is multithreaded, but never both.

This separation avoids the complexity and pitfalls of mixing forking with Go's multithreaded runtime. Communication between the multithreaded environment and the "zygote" process could occur over a Unix domain socket, allowing the explicit passing of fork/exec requests and any file descriptors intended for inheritance. By centralizing control in this way, the risk of file descriptor leaks is minimized, as inheritance becomes explicit and tightly managed.

It may be useful to use this approach if O_CLOFORK is not available.

egorpugin · 2025-01-13T23:37:55Z

Interesting.

on fedora41
linux 6.11.8 - the issue is fixed
linux 6.12.8 - it is back again

Any reverts of that commit?
torvalds/linux@2a010c412853

Updated example from the first post with more heavy write()s and more threads.

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <errno.h>
#include <stdint.h>

void* runner(void*);

int
main(void)
{
  int i;
  pthread_t pid[40];

  for(i=1; i<40; i++)
    pthread_create(&pid[i], 0, runner, (void*)(uintptr_t)i);
  runner(0);
  return 0;
}

char script[] = "#!/bin/sh\nexit 0\n";
char script2[] = "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ";

void*
runner(void *v)
{
  int i, fd, pid, status, len2, n2;
  char buf[100], *argv[2];
  
  i = (int)(uintptr_t)v;
  len2 = strlen(script2);
  snprintf(buf, sizeof buf, "/var/tmp/fork-exec-%d", i);
  argv[0] = buf;
  argv[1] = 0;
  for(;;) {
    fd = open(buf, O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0777);
    if(fd < 0) {
      perror("open");
      exit(2);
    }
    write(fd, script, strlen(script));
    n2 = 1000;
    while (n2--) {
      write(fd, script2, len2);
    }
    close(fd);
    pid = fork();
    if(pid < 0) {
      perror("fork");
      exit(2);
    }
    if(pid == 0) {
      execve(buf, argv, 0);
      exit(errno);
    }
    if(waitpid(pid, &status, 0) < 0) {
      perror("waitpid");
      exit(2);
    }
    if(!WIFEXITED(status)) {
      perror("waitpid not exited");
      exit(2);
    }
    status = WEXITSTATUS(status);
    if(status != 0)
      fprintf(stderr, "exec: %d %s\n", status, strerror(status));
  }
  return 0;
}

mvdan · 2025-01-14T00:05:13Z

@egorpugin I am also seeing this error again on Linux 6.12.9, and I was indeed scratching my head wondering what was going on.

mvdan · 2025-01-14T00:06:49Z

A quick github search did unfortunately uncover a revert: torvalds/linux@3b83203

ianlancetaylor · 2025-01-14T04:13:18Z

@kmeaw Thanks, it's an interesting suggestion. Unfortunately I'm confident that it would break some existing programs, even if we changed the Wait method to contact the zygote.

rsc added this to the Go1.10 milestone Oct 18, 2017

rsc mentioned this issue Oct 18, 2017

cmd/go: "fork/exec foo.test: text file busy" flakes #22220

Closed

ianlancetaylor modified the milestones: Go1.10, Go1.11 Dec 6, 2017

mmdriley mentioned this issue May 10, 2018

Downloaded plugin can fail to run with ETXTBUSY pulumi/pulumi#1353

Closed

ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 27, 2018

ianlancetaylor modified the milestones: Go1.11, Go1.12 Jun 27, 2018

ianlancetaylor modified the milestones: Go1.12, Go1.13 Dec 20, 2018

travisperson mentioned this issue Mar 18, 2019

FAST payment channel tests filecoin-project/venus#2319

Closed

travisperson mentioned this issue Apr 8, 2019

IPTB: Work around for golang bug #22315 filecoin-project/venus#2518

Merged

andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019

rsc removed this from the Go1.14 milestone Oct 9, 2019

Kiskae mentioned this issue Nov 2, 2023

ETXTBSY on execute_install() DeterminateSystems/nix-installer-action#51

Closed

DrJosh9000 mentioned this issue Nov 6, 2023

Fix write-exec-ETXTBSY race buildkite/bintest#18

Merged

RebeccaMahany mentioned this issue Nov 13, 2023

Prevent "text file busy" errors in tests kolide/launcher#1454

Merged

bcmills mentioned this issue Jan 18, 2024

cmd/go: test failures due to ETXTBSY in toolchain execs #64019

Closed

getvictor mentioned this issue Jan 20, 2024

Remove Parallel Test Runners fleetdm/fleet#16241

Closed

DrJosh9000 mentioned this issue Apr 22, 2024

Tweak ETXTBSY retry, and be helpful for ENOENT buildkite/agent#2736

Merged

2 tasks

nadiamoe added a commit to grafana/crocochrome that referenced this issue Jul 9, 2024

heartbeat: lock ForkLock for reading

11a582c

Circumvents golang/go#22315

nadiamoe added a commit to grafana/crocochrome that referenced this issue Jul 9, 2024

heartbeat: lock ForkLock for reading

e5faa0d

Circumvents golang/go#22315

nadiamoe added a commit to grafana/crocochrome that referenced this issue Jul 10, 2024

heartbeat: lock ForkLock for reading

b131eb3

Circumvents golang/go#22315

nadiamoe added a commit to grafana/crocochrome that referenced this issue Jul 10, 2024

heartbeat: lock ForkLock for reading

e0a14a2

Circumvents golang/go#22315

slydiman mentioned this issue Jul 29, 2024

[lldb] Improved lldb-server stability for remote launching llvm/llvm-project#100659

Closed

mvdan mentioned this issue Sep 19, 2024

cmd/go: cache link output binaries in the build cache #69290

Closed

big-guy mentioned this issue Nov 13, 2024

Running executables sporadically fails with ETXTBSY (Text file busy) gradle/gradle#31282

Closed

smoelius mentioned this issue Dec 7, 2024

Bump serde from 1.0.213 to 1.0.214 trailofbits/necessist#1346

Closed

ianlancetaylor mentioned this issue Dec 18, 2024

runtime:cpu1: TestLockRankGenerated failures #70793

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

os: StartProcess ETXTBSY race on Unix systems #22315

os: StartProcess ETXTBSY race on Unix systems #22315

rsc commented Oct 18, 2017

rsc commented Oct 18, 2017

gopherbot commented Oct 18, 2017

gopherbot commented Oct 18, 2017

RalphCorderoy commented Oct 18, 2017

ianlancetaylor commented Oct 18, 2017

RalphCorderoy commented Oct 18, 2017

ianlancetaylor commented Oct 18, 2017

rsc commented Oct 18, 2017

rsc commented Oct 18, 2017

RalphCorderoy commented Oct 18, 2017

RalphCorderoy commented Oct 18, 2017

bradfitz commented Nov 30, 2017

ianlancetaylor commented Dec 20, 2018

gopherbot commented Feb 1, 2024

nadiamoe commented Jun 25, 2024 •

edited

Loading

geofft commented Sep 15, 2024

kmeaw commented Jan 13, 2025

egorpugin commented Jan 13, 2025

mvdan commented Jan 14, 2025

mvdan commented Jan 14, 2025

ianlancetaylor commented Jan 14, 2025

os: StartProcess ETXTBSY race on Unix systems #22315

os: StartProcess ETXTBSY race on Unix systems #22315

Comments

rsc commented Oct 18, 2017

rsc commented Oct 18, 2017

gopherbot commented Oct 18, 2017

gopherbot commented Oct 18, 2017

RalphCorderoy commented Oct 18, 2017

ianlancetaylor commented Oct 18, 2017

RalphCorderoy commented Oct 18, 2017

ianlancetaylor commented Oct 18, 2017

rsc commented Oct 18, 2017

rsc commented Oct 18, 2017

RalphCorderoy commented Oct 18, 2017

RalphCorderoy commented Oct 18, 2017

bradfitz commented Nov 30, 2017

ianlancetaylor commented Dec 20, 2018

gopherbot commented Feb 1, 2024

nadiamoe commented Jun 25, 2024 • edited Loading

geofft commented Sep 15, 2024

kmeaw commented Jan 13, 2025

egorpugin commented Jan 13, 2025

mvdan commented Jan 14, 2025

mvdan commented Jan 14, 2025

ianlancetaylor commented Jan 14, 2025

nadiamoe commented Jun 25, 2024 •

edited

Loading