Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use go 1.22 #4278

Closed
wants to merge 4 commits into from
Closed

Conversation

lifubang
Copy link
Member

@lifubang lifubang commented May 12, 2024

An alternative to #4193.
Fixes #4233

As mentioned in golang/go#65625 (comment), the commit golang/go@52e17c2 (CL 519457) uses x_cgo_getstackbound to set g->stacklo based on the stack size. It will cause some backward compatibility issues with older versions of the glibc because of the tid cache update issue.

I think golang runtime should provide this backward compatibility(I think we should submit a PR to golang), but unfortunately they didn't implement it. Before golang support this feature, I think we can wrap x_cgo_getstackbound function to ensure we can have this backward compatibility.

lifubang added 3 commits May 12, 2024 14:57
Signed-off-by: lifubang <[email protected]>
Signed-off-by: lfbzhm <[email protected]>
This reverts commit ac31da6.

Signed-off-by: lifubang <[email protected]>
Signed-off-by: lfbzhm <[email protected]>
This reverts commit e377e16.

Signed-off-by: lifubang <[email protected]>
Signed-off-by: lfbzhm <[email protected]>
@AkihiroSuda AkihiroSuda added the backport/1.1-todo A PR in main branch which needs to be backported to release-1.1 label May 12, 2024
@lifubang lifubang force-pushed the fix-go-1.22-tid-cache branch 2 times, most recently from 2339a19 to 19cd152 Compare May 12, 2024 12:29
// Since Go 1.21 <https://github.com/golang/go/commit/c426c87012b5e>, the Go
// runtime will try to call pthread_getattr_np(pthread_self()). This causes
// issues with nsexec and requires backward compatibility with old versions
// of the glibc.

This comment was marked as off-topic.

@kolyshkin
Copy link
Contributor

As pointed out by @cyphar https://github.com/golang/go/issues/65625#issuecomment-2036320298:

Ultimately I think the core issue is that we are violating signal-safety(7) in runc, as outlined in above comments.

Go maintainers (@cherrymui @prattmic) concur:

the ultimate bug is the use of clone(2) in runc isn't safe, which should be addressed, instead of being worked around. In theory a stale tid cache can lead to any weird behavior, not just the crashing.

I agree. I think attempting to continue in this state is asking for trouble. In golang/go#65625 (comment), I noted my opinion that we should abort the process entirely on these errors. IMO, that still seems like the safest approach.

If @cyphar still thinks the same way, maybe we should try switching to fork?

@lifubang
Copy link
Member Author

lifubang commented May 14, 2024

I’ll reply in go PR thread later after I finished read all cgo code. Before that I want to say that tid dirty cache issue is only influence someone who writing multi thread programs(with clone(2)), but both golang and runc are not, golang only use pthread to get stack address and size, we have another way to do this, not just only using pthread.

BTW, I also don’t know what’s the purpose of this commit, it seems that there’s no detail comment in the commit message: golang/go@52e17c2

@lifubang
Copy link
Member Author

maybe we should try switching to fork?

I think in this time, if we switch to fork, we should remove runc run --no-subreaper. Maybe it will introduce some backward compatibility issues? Please see #4193.

So I strongly recommend this PR(#4278), it's easy to review. Though we don't solve any signal-safety(7) issues, but we can let runc work with go 1.22, like the situation before.
Then we can have a time to find out other solution, for example:

  1. overwrite glibc's internal tid cache on clone(); (nsenter: overwrite glibc's internal tid cache on clone() #4247)
  2. switch to fork in nsexec;
  3. move c code to go code.

@kolyshkin
Copy link
Contributor

if we switch to fork, we should remove runc run --no-subreaper

This might not be a bad thing. There is some runc functionality that is plain wrong (I'm not saying --no-subreaper is wrong, but it very well may be, need to take a closer look).

So I strongly recommend this PR(#4278), it's easy to review.

I totally agree we have to have something sooner than later.

@lifubang lifubang force-pushed the fix-go-1.22-tid-cache branch 7 times, most recently from ea2580a to 7c6452d Compare May 20, 2024 17:41
}

pthread_attr_getstack(&attr, &addr, &size); // low address
#elif defined(__illumos__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said earlier, it makes sense to remove solaris/illumos support, for clarity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if the change is small, I want to keep it as is, because we just have only added/changed about 5 lines code.
Then people will read the code more clearly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

// pthread_getattr_np is a GNU extension supported in glibc.
// Solaris is not glibc but does support pthread_getattr_np
// (and the fallback doesn't work...). Illumos does not.
err = pthread_getattr_np(pthread_self(), &attr); // GNU extension
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

attr is not intialized?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, before glibc 2.32, attr are not initialized with all the field filled with zero!
After glibc 2.31, it is initialized in pthread_getattr_np.
https://github.com/bminor/glibc/blob/glibc-2.32/nptl/pthread_getattr_np.c#L38

I have no time to find out which commit include this change.

@lifubang lifubang force-pushed the fix-go-1.22-tid-cache branch from 7c6452d to 45d37ac Compare May 20, 2024 23:41
@kolyshkin
Copy link
Contributor

For the sake of review, here's the diff against the upstream source:

--- ../golang/go/src/runtime/cgo/gcc_stack_unix.c	2024-05-21 13:14:57.977264422 -0700
+++ libcontainer/nsenter/nsenter_go122.go	2024-05-21 13:09:38.552902378 -0700
@@ -1,28 +1,47 @@
-// Copyright 2023 The Go Authors. All rights reserved.
-// Use of this source code is governed by a BSD-style
-// license that can be found in the LICENSE file.
+//go:build linux && !gccgo && go1.22
+// +build linux,!gccgo,go1.22
 
-//go:build unix && !darwin
+package nsenter
 
-#ifndef _GNU_SOURCE // pthread_getattr_np
+/*
+#cgo LDFLAGS: -Wl,--wrap=x_cgo_getstackbound
 #define _GNU_SOURCE
-#endif
-
 #include <pthread.h>
-#include "libcgo.h"
+#include <stdint.h>
 
-void
-x_cgo_getstackbound(uintptr bounds[2])
+typedef uintptr_t uintptr;
+
+// Since Go 1.22 <https://github.com/golang/go/commit/52e17c2>, the
+// go runtime will try to call pthread_getattr_np(pthread_self()) in cgo.
+// This will cause issues in nsexec with glibc < 2.32. so it requires
+// us to provide a backward compatibility with old versions of glibc.
+// The core issue of glibc is that, before 2.32, `pthread_getattr_np`
+// did not call `__pthread_attr_init (attr)`, we need to init the attr
+// in go runtime. Fortunately, cgo exports a function named `x_cgo_getstackbound`,
+// we can wrap it in the compile phrase. Please see:
+// https://github.com/golang/go/blob/52e17c2/src/runtime/cgo/gcc_stack_unix.c
+// Fix me: this hack looks ugly, once we have removed `clone(2)` in nsexec,
+// please remember to remove this hack.
+void __wrap_x_cgo_getstackbound(uintptr bounds[2])
 {
 	pthread_attr_t attr;
 	void *addr;
 	size_t size;
+	int err;
 
 #if defined(__GLIBC__) || (defined(__sun) && !defined(__illumos__))
 	// pthread_getattr_np is a GNU extension supported in glibc.
 	// Solaris is not glibc but does support pthread_getattr_np
 	// (and the fallback doesn't work...). Illumos does not.
-	pthread_getattr_np(pthread_self(), &attr);  // GNU extension
+	err = pthread_getattr_np(pthread_self(), &attr);  // GNU extension
+	// The main change begin
+	if (err != 0) {
+		// This is to have a backward compatibility with glibc < 2.32
+		pthread_attr_init(&attr);
+		pthread_getattr_np(pthread_self(), &attr);
+	}
+	// The main change end
+
 	pthread_attr_getstack(&attr, &addr, &size); // low address
 #elif defined(__illumos__)
 	pthread_attr_init(&attr);
@@ -37,10 +56,8 @@
 #endif
 	pthread_attr_destroy(&attr);
 
-	// bounds points into the Go stack. TSAN can't see the synchronization
-	// in Go around stack reuse.
-	_cgo_tsan_acquire();
 	bounds[0] = (uintptr)addr;
 	bounds[1] = (uintptr)addr + size;
-	_cgo_tsan_release();
 }
+*/
+import "C"

Because for some old versions glibc, there is a tid cache update
issue, it will cause cgo with clone(2) in go1.21 and above error
when calling `pthread_getattr_np`, so we need to add a backward
compatibility with old versions glibc.

Signed-off-by: lifubang <[email protected]>
@lifubang lifubang force-pushed the fix-go-1.22-tid-cache branch from 45d37ac to 741b045 Compare May 23, 2024 15:14
@kolyshkin
Copy link
Contributor

Here's the alternative: https://go-review.googlesource.com/c/go/+/587919

@lifubang
Copy link
Member Author

Here's the alternative: https://go-review.googlesource.com/c/go/+/587919

I think only adding init is not enough:

I suggest to handle this error, or else other contributors will introduce this error again when working around this. https://go.dev/cl/563379

At least I think we should add a comment about this bug from glibc to let other people know we can’t just only check err!=0 when adding error assert.

@kolyshkin
Copy link
Contributor

@lifubang should we close this one in favor or #4292 ?

@lifubang lifubang closed this Jun 4, 2024
@kolyshkin kolyshkin removed the backport/1.1-todo A PR in main branch which needs to be backported to release-1.1 label Jun 11, 2024
@lifubang lifubang deleted the fix-go-1.22-tid-cache branch October 15, 2024 05:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

runc doesn't work with go1.22
3 participants