-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nsenter: overwrite glibc's internal tid cache on clone() #4247
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While looking quite scary, this resolves my concern about systems with CONFIG_CHECKPOINT_RESTORE not set, so SGTM.
Was thinking of using a list of known offsets first, before resorting to linear scan. Can try different distros to get those offsets (I'm pretty sure there are only 2 or 3).
512 as a upper size limit is also looking scary (withput haven't checked the actual struct pthread size).
Maybe move the kludge to a separate function, and #ifdef it out unless we're using GLIBC.
So maybe we can ifdef the kludge with something like |
Yeah, I was thinking of that too but wasn't sure how variable all the structures are before
Yeah, unfortunately we can't be sure of the size of
Will do.
What matters is the runtime version, so we'd need to parse |
libcontainer/nsenter/nsexec.c
Outdated
|
||
pid_t *fake_tid_array = (pid_t *) pthread_self(); | ||
/* On my machine, fake_tid_array[180] is PTHREAD_SELF->tid. */ | ||
for (size_t i = 0; i < 512; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What a wonderful solution!
I think this looks like Credential-Stuffing attack
? Is there a chance that some other var's value happens equal to actual_tid
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I'd call it "wonderful" 😅
It is possible for another field to have the same value, and bad things might happen if we overwrite it. I've to implemented @kolyshkin's suggestion of trying some known field offsets first, and only doing a linear scan if the known offsets don't match (which should never happen in practice, unless glibc changes their structure layouts).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I'd call it "wonderful" 😅
I have tried CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID
, but I didn't think out using it to rewrite the tid value in pthread_t. 👍
4cd85ef
to
d9f1e23
Compare
@kolyshkin I went with writing dummy versions of the structure layouts so we can just use |
d90f74e
to
fcb0926
Compare
libcontainer/nsenter/nsexec.c
Outdated
static pid_t *find_tls_tid_address(void) | ||
{ | ||
/* | ||
* Since glibc 2.25 (see c579f48edba88380635ab98cb612030e3ed8691e), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
libcontainer/nsenter/nsexec.c
Outdated
#if __GLIBC__ == 2 && __GLIBC_MINOR__ < 14 | ||
# define _GNU_SOURCE | ||
# include "syscall.h" | ||
#if GLIBC_VERSION < 2014 /* < 2.14 */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just keep #if __GLIBC__ == 2 && __GLIBC_MINOR__ < 14
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the latest version I'm about to push, I switched them to using __GLIBC_PREREQ
which is even simpler.
While nobody is going to be using glibc 1.x and glibc has said there will never be a 3.x, technically doing the version check this way is not correct.
On musl, it seems they:
So for musl we would need to do the linear scan and then set the address manually using |
7027163
to
69ffa06
Compare
I did some bare bone testing using the following Dockerfile (had to use alpine edge as the latest tagged version still uses go 1.21):
With the above Dockerfile, I built runc and ran some tests ( For now, I think, it does not make sense to fix what's not broken. |
Adapted from: opencontainers/runc#4247 Execution of a container using a PID namespace can fail on certain versions of glibc when Singularity is built with Go 1.22. This is due to Go 1.22 performing calls using pthread_self which, from glibc 2.25, is not updated for the current TID on clone. Fixes sylabs#2677 ----- Original runc explanation: Since glibc 2.25, the thread-local cache of the current TID is no longer updated in the child when calling clone(2). This results in very unfortunate behaviour when Go does pthread calls using pthread_self(), which has the wrong TID stored. The "simple" solution is to forcefully overwrite this cached value. Unfortunately (and unsurprisingly), the layout of "struct pthread" is strictly private and can change without warning. Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks (with the child_tid set to the cached &PTHREAD_SELF->tid), meaning that as long as runc is using glibc, when "runc init" is spawned the child process will have a pointer directly to the cached value we want to change. With CONFIG_CHECKPOINT_RESTORE=y kernels on Linux 3.5 and later, we can simply use prctl(PR_GET_TID_ADDRESS). For older kernels we need to memory scan the TLS structure (pthread_self() returns a pointer to the start of the structure so we can "just" scan it for a field containing the current TID and assume that it is the correct field). Obviously this is all very horrific, and if you are reading this in the future, it almost certainly has caused some horrific bug that I did not forsee. Sorry about that. As far as I can tell, there is no other workable solution that doesn't also depend on the CLONE_CHILD_CLEARTID behaviour of glibc in some way. We cannot "just" do a re-exec after clone(2) for security reasons. Fixes opencontainers/runc#4233 Signed-off-by: Aleksa Sarai [email protected]
Adapted from: opencontainers/runc#4247 Execution of a container using a PID namespace can fail on certain versions of glibc when Singularity is built with Go 1.22. This is due to Go 1.22 performing calls using pthread_self which, from glibc 2.25, is not updated for the current TID on clone. Fixes sylabs#2677 ----- Original runc explanation: Since glibc 2.25, the thread-local cache of the current TID is no longer updated in the child when calling clone(2). This results in very unfortunate behaviour when Go does pthread calls using pthread_self(), which has the wrong TID stored. The "simple" solution is to forcefully overwrite this cached value. Unfortunately (and unsurprisingly), the layout of "struct pthread" is strictly private and can change without warning. Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks (with the child_tid set to the cached &PTHREAD_SELF->tid), meaning that as long as runc is using glibc, when "runc init" is spawned the child process will have a pointer directly to the cached value we want to change. With CONFIG_CHECKPOINT_RESTORE=y kernels on Linux 3.5 and later, we can simply use prctl(PR_GET_TID_ADDRESS). For older kernels we need to memory scan the TLS structure (pthread_self() returns a pointer to the start of the structure so we can "just" scan it for a field containing the current TID and assume that it is the correct field). Obviously this is all very horrific, and if you are reading this in the future, it almost certainly has caused some horrific bug that I did not forsee. Sorry about that. As far as I can tell, there is no other workable solution that doesn't also depend on the CLONE_CHILD_CLEARTID behaviour of glibc in some way. We cannot "just" do a re-exec after clone(2) for security reasons. Fixes opencontainers/runc#4233 Signed-off-by: Aleksa Sarai [email protected]
Adapted from: opencontainers/runc#4247 Execution of a container using a PID namespace can fail on certain versions of glibc when Singularity is built with Go 1.22. This is due to Go 1.22 performing calls using pthread_self which, from glibc 2.25, is not updated for the current TID on clone. Fixes sylabs#2677 ----- Original runc explanation: Since glibc 2.25, the thread-local cache of the current TID is no longer updated in the child when calling clone(2). This results in very unfortunate behaviour when Go does pthread calls using pthread_self(), which has the wrong TID stored. The "simple" solution is to forcefully overwrite this cached value. Unfortunately (and unsurprisingly), the layout of "struct pthread" is strictly private and can change without warning. Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks (with the child_tid set to the cached &PTHREAD_SELF->tid), meaning that as long as runc is using glibc, when "runc init" is spawned the child process will have a pointer directly to the cached value we want to change. With CONFIG_CHECKPOINT_RESTORE=y kernels on Linux 3.5 and later, we can simply use prctl(PR_GET_TID_ADDRESS). For older kernels we need to memory scan the TLS structure (pthread_self() returns a pointer to the start of the structure so we can "just" scan it for a field containing the current TID and assume that it is the correct field). Obviously this is all very horrific, and if you are reading this in the future, it almost certainly has caused some horrific bug that I did not forsee. Sorry about that. As far as I can tell, there is no other workable solution that doesn't also depend on the CLONE_CHILD_CLEARTID behaviour of glibc in some way. We cannot "just" do a re-exec after clone(2) for security reasons. Fixes opencontainers/runc#4233 Signed-off-by: Aleksa Sarai [email protected]
Adapted from: opencontainers/runc#4247 Execution of a container using a PID namespace can fail on certain versions of glibc when Singularity is built with Go 1.22. This is due to Go 1.22 performing calls using pthread_self which, from glibc 2.25, is not updated for the current TID on clone. Fixes sylabs#2677 ----- Original runc explanation: Since glibc 2.25, the thread-local cache of the current TID is no longer updated in the child when calling clone(2). This results in very unfortunate behaviour when Go does pthread calls using pthread_self(), which has the wrong TID stored. The "simple" solution is to forcefully overwrite this cached value. Unfortunately (and unsurprisingly), the layout of "struct pthread" is strictly private and can change without warning. Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks (with the child_tid set to the cached &PTHREAD_SELF->tid), meaning that as long as runc is using glibc, when "runc init" is spawned the child process will have a pointer directly to the cached value we want to change. With CONFIG_CHECKPOINT_RESTORE=y kernels on Linux 3.5 and later, we can simply use prctl(PR_GET_TID_ADDRESS). For older kernels we need to memory scan the TLS structure (pthread_self() returns a pointer to the start of the structure so we can "just" scan it for a field containing the current TID and assume that it is the correct field). Obviously this is all very horrific, and if you are reading this in the future, it almost certainly has caused some horrific bug that I did not forsee. Sorry about that. As far as I can tell, there is no other workable solution that doesn't also depend on the CLONE_CHILD_CLEARTID behaviour of glibc in some way. We cannot "just" do a re-exec after clone(2) for security reasons. Fixes opencontainers/runc#4233 Signed-off-by: Aleksa Sarai [email protected]
@kolyshkin It probably works because Go 1.22 still ignores the error from I would also prefer to not merge this code if it isn't necessary, but I'm worried there is the possibility of some part of Go's runtime hitting an issue on production machines. Then again, as far as I can tell from looking into it, in practice things seem to just happen to work out.
As far as I can tell, both uses just happen to work:
Now, it is entirely possible that some other aspect of Go could have issues with our |
libcontainer/nsenter/nsexec.c
Outdated
/* #if TLS_DTV_AT_TP */ | ||
struct pthread__dtv_at_tp { | ||
union { | ||
struct { | ||
int multiple_threads, gscope_flag; | ||
} header; | ||
void *__padding[24]; | ||
}; | ||
struct { | ||
void *prev, *next; | ||
} list; | ||
pid_t tid; /* the field we are looking for! */ | ||
}; | ||
|
||
/* #if !TLS_DTV_AT_TP */ | ||
struct pthread__tcbhead { | ||
union { | ||
struct tcbhead_t header; | ||
void *__padding[24]; | ||
}; | ||
struct { | ||
void *prev, *next; | ||
} list; | ||
pid_t tid; /* the field we are looking for! */ | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like TLS_DTV_AT_TP / TLS_TCB_AT_TP is arch-specific:
- i386, x86_64, s390 and sparc have TLS_TCB_AT_TP
- all other architectures have TLS_DTV_AT_TP
- this has never changed as far as I can see from the git history
We can probably use this and only try one offset. Or do you think that it may change in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, do something like this:
// The following is a summary from glibc <sysdeps/.../ntpl/tls.h>.
#if defined(__x86_64__) || defined(__i386__) || defined(__s390__) || defined(__s390x__) || defined(__sparc__)
#define TLS_TCB_AT_TP 1
#define TLS_DTV_AT_TP 0
#else
#define TLS_DTV_AT_TP 1
#define TLS_TCB_AT_TP 0
#endif
#ifdef TLS_TCB_AT_TP
// XXX define struct tcbhead_t as you did above.
// XXX No need to include powerpc version as it's not used.
#endif
// This is adopted from glibc <nptl/descr.h>.
struct _glibc_pthread {
union {
#if TLS_TCB_AT_TP // !TLS_DTV_AT_TP
tcbhead_t header;
#else
struct {
int multiple_threads;
int gscope_flag;
} header;
#endif
void *__padding[24];
};
struct {
void *prev, *next;
} list;
pid_t tid; /* the field we are looking for! */
}
libcontainer/nsenter/nsexec.c
Outdated
#endif | ||
}; | ||
|
||
/* #if TLS_DTV_AT_TP */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a note this is taken from glibc's nptl/descr.h
.
The only question remains, do we need this kludge for all Go versions of only for Go 1.22+. I tentatively vote for the latter, based on "don't fix if not broken" principle. Can do this adding a go file like this: //go:build go1.22
package nsenter
// #cgo CFLAGS: -DRUNC_GLIBC_TID_KLUDGE=1
import "C" |
@cyphar shall we move this out of draft? |
69ffa06
to
b0654c7
Compare
else | ||
write_log(DEBUG, "clone: find_glibc_tls_tid_address: PR_GET_TID_ADDRESS failed: %m"); | ||
write_log(WARNING, | ||
"clone: find_glibc_tls_tid_address: falling back to scanning pthread_self(). Please use a kernel with CONFIG_CHECKPOINT_RESTORE=y."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we need to add a ci test for CONFIG_CHECKPOINT_RESTORE!=y
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard to implement as for that we need a kernel with CONFIG_CHECKPOINT_RESTORE not set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand. Maybe we can use an env to skip prctl(PR_GET_TID_ADDRESS)
when testing, for example:
950ff28
* cover every architecture without a huge risk of SIGSEGV. | ||
*/ | ||
int i; | ||
for (i = 0; i < 1024; i += sizeof(pid_t)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should remove this fail back? Because this may get wrong offset.
Since glibc 2.25, the thread-local cache of the current TID is no longer updated in the child when calling clone(2). This results in very unfortunate behaviour when Go does pthread calls using pthread_self(), which has the wrong TID stored. The "simple" solution is to forcefully overwrite this cached value. Unfortunately (and unsurprisingly), the layout of "struct pthread" is strictly private and could change without warning. Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks (with the child_tid set to the cached &PTHREAD_SELF->tid), meaning that as long as runc is using glibc, when "runc init" is spawned the child process will have a pointer directly to the cached value we want to change. With CONFIG_CHECKPOINT_RESTORE=y kernels on Linux 3.5 and later, we can simply use prctl(PR_GET_TID_ADDRESS). For older kernels we need to memory scan the TLS structure (pthread_self() is a pointer to the head of the TLS structure). However, to avoid false positives we first try known-correct offsets based on the current structure layouts. If that fails, we scan the 1K block for any fields that might match. When doing the scan, we assume that the first field we find that contains the actual TID of the current process is the field we want. Obviously this is all very horrific, and if you are reading this in the future, it almost certainly has caused some horrific bug that I did not forsee. Sorry about that. As far as I can tell, there is no other workable solution that doesn't also depend on the CLONE_CHILD_CLEARTID behaviour of glibc in some way. We cannot "just" do a re-exec after clone(2) for security reasons. Sadly, this is all glibc-specific. musl doesn't even allow you to use CLONE_CHILD_CLEARTID (and they use a different address for the TID anyway). We could do the memory scan and manually overwrite the address after clone(2), but we can deal with that in the future if it turns out people use non-glibc builds and need this fix. Signed-off-by: Aleksa Sarai <[email protected]>
b0654c7
to
e10daeb
Compare
Is this still WIP? |
Let's go ahead with #4292 for now. |
An alternative to #4193.
I will also take a look in a separate PR as to whether we can remove a stage from our three-stage structure (such as moving the mapping requests to the Go parent, maybe?) to remove the need to
PR_SET_CHILD_SUBREAPER
in #4193. If we do it that way then switching tofork()
would be far less of an issue.Since glibc 2.25, the thread-local cache of the current TID is no longer updated in the child when calling clone(2). This results in very unfortunate behaviour when Go does pthread calls using pthread_self(), which has the wrong TID stored.
The "simple" solution is to forcefully overwrite this cached value. Unfortunately (and unsurprisingly), the layout of "struct pthread" is strictly private and can change without warning.
Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks (with the child_tid set to the cached &PTHREAD_SELF->tid), meaning that as long as runc is using glibc, when "runc init" is spawned the child process will have a pointer directly to the cached value we want to change. With CONFIG_CHECKPOINT_RESTORE=y kernels on Linux 3.5 and later, we can simply use prctl(PR_GET_TID_ADDRESS). For older kernels we need to memory scan the TLS structure (pthread_self() returns a pointer to the start of the structure so we can "just" scan it for a field containing the current TID and assume that it is the correct field).
Obviously this is all very horrific, and if you are reading this in the future, it almost certainly has caused some horrific bug that I did not forsee. Sorry about that. As far as I can tell, there is no other workable solution that doesn't also depend on the CLONE_CHILD_CLEARTID behaviour of glibc in some way. We cannot "just" do a re-exec after clone(2) for security reasons.
Fixes #4233
Signed-off-by: Aleksa Sarai [email protected]