Add optional startup --linux_bazel_path_from_getauxval #14391

Closed

emidln-imc

When using qemu-user-static + binfmt_misc on Linux (e.g.
running docker run --platform linux/amd64 on ARM), bazel
fails to self-extract with a mysterious lseek failure. When
self-extracting using "/proc/self/exe", the referred binary
is the qemu-user-static emulator, not the bazel process. Instead,
we use an alternative API, getauxval(3), which is properly
populated when running normally on the native host platform
as well as when using the qemu + binfmt_misc pattern.

Practically, this allows x86_64 versions of bazel to
self-extract and run under Docker hosted by Linux ARM or M1 Macs.

jreid pushed a commit to jreid/bazel-on-arm that referenced this pull request Dec 30, 2021
jrajahalme added a commit to cilium/proxy that referenced this pull request Jan 11, 2022
Add support for multiarch docker builds for tests. Test execution still fails for non-native targets
due to an Bazel bug that is fixed in bazelbuild/bazel#14391. No Bazel
releases contain this fix for now, so will have to wait a bit to make use of non-native test
execution.

Signed-off-by: Jarno Rajahalme <[email protected]>
@ianbrex

ianbrex commented Jan 12, 2022

Wow!
Could you please share that binary for linux-x86_64?

jrajahalme added a commit to cilium/proxy that referenced this pull request Jan 12, 2022
Add support for multiarch docker builds for tests. Test execution still fails for non-native targets
due to an Bazel bug that is fixed in bazelbuild/bazel#14391. No Bazel
releases contain this fix for now, so will have to wait a bit to make use of non-native test
execution.

Signed-off-by: Jarno Rajahalme <[email protected]>
@johnnybigert

I'm running into the FATAL: Failed to open '/proc/self/exe' as a zip file: (error: 9): Bad file descriptor on my Mac, can we please have this merged? @emidln-imc

jrajahalme added a commit to cilium/proxy that referenced this pull request Jan 12, 2022
Add support for multiarch docker builds for tests. Test execution
still fails for non-native targets due to an Bazel bug that is fixed
in bazelbuild/bazel#14391. No Bazel releases
contain this fix for now, so will have to wait a bit to make use of
non-native test execution.

Signed-off-by: Jarno Rajahalme <[email protected]>
jrajahalme added a commit to cilium/proxy that referenced this pull request Jan 18, 2022
Add support for multiarch docker builds for tests. Test execution
still fails for non-native targets due to an Bazel bug that is fixed
in bazelbuild/bazel#14391. No Bazel releases
contain this fix for now, so will have to wait a bit to make use of
non-native test execution.

Signed-off-by: Jarno Rajahalme <[email protected]>
@aiuto aiuto requested a review from haxorz January 25, 2022 05:32
@@ -86,7 +87,7 @@ string GetSelfPath(const char* argv0) {
   // The file to which this symlink points could change contents or go missing
   // concurrent with execution of the Bazel client, so we don't eagerly resolve
   // it.
-  return "/proc/self/exe";
+  return std::string((char *)getauxval(AT_EXECFN));
Contributor

supernit: getauxval can technically return 0, which becomes nullptr after the cast (maybe check that). Also consider a more C++-style reinterpret_cast<const char*>.

Sorry for the silly question, but you mentioned running on M1 Mac in the description -- is the change in blaze_util_linux.cc sufficient for that? Or is that the case of docker running a Linux image on M1?
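
For illustration, a null-checked version of that call could look like the sketch below. ExecFnFromAuxv is a hypothetical helper name, not something from the PR; the only library call it relies on is getauxval(3) from <sys/auxv.h>, which returns 0 when the requested entry is absent.

```cpp
#include <sys/auxv.h>  // getauxval, AT_EXECFN (Linux-specific)

#include <string>

// Hypothetical helper: returns the pathname recorded in the auxiliary vector,
// or an empty string if the AT_EXECFN entry is missing. getauxval() returns 0
// for an entry it cannot find, and 0 would become a null pointer after the
// cast, so it is checked before constructing the string.
std::string ExecFnFromAuxv() {
  const unsigned long value = getauxval(AT_EXECFN);
  if (value == 0) {
    return "";
  }
  return std::string(reinterpret_cast<const char*>(value));
}
```

An empty result lets the caller fall back to "/proc/self/exe" instead of building a std::string from a null pointer.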

Contributor

I think the main question here is about potentially relative vs. absolute paths (and also why /proc/self/exe is not Bazel in this case). AT_EXECFN is the path used to start the process, so it can be the absolute path of the binary, but it could also be a symlink or a relative path. The danger here is that if we change the working directory at any point, the result of GetSelfPath may no longer point to a usable copy of ourselves. Granted, I don't think we do that, nor that we need that path many times, but it is something to keep in mind.
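
To make the concern concrete: a relative AT_EXECFN value only means something relative to the working directory at exec time, so one option is to anchor it to that directory once, early in startup. The sketch below assumes that approach; AbsoluteFrom and CurrentDirectory are hypothetical helpers written for this example, not Bazel functions.

```cpp
#include <limits.h>  // PATH_MAX
#include <unistd.h>  // getcwd

#include <string>

// Returns the current working directory, or "" if it cannot be determined.
std::string CurrentDirectory() {
  char buf[PATH_MAX];
  return getcwd(buf, sizeof(buf)) != nullptr ? std::string(buf) : std::string();
}

// Anchors a possibly relative path to base_dir. Absolute paths are returned
// unchanged. The result is not normalized: ".." components and symlinks are
// left as they are.
std::string AbsoluteFrom(const std::string& path, const std::string& base_dir) {
  if (path.empty() || path[0] == '/' || base_dir.empty()) {
    return path;
  }
  if (base_dir.back() == '/') {
    return base_dir + path;
  }
  return base_dir + "/" + path;
}
```

Calling AbsoluteFrom(execfn, CurrentDirectory()) before anything changes the working directory would keep the path usable later, at the cost of one getcwd call.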

Author

Sorry for the silly question, but you mentioned running on M1 Mac in the description -- is the change in blaze_util_linux.cc sufficient for that? Or is that the case of docker running a Linux image on M1?

This is running a Linux/x86_64 container on Linux/arm. This is how Docker for Mac works under the hood: it runs a VM running arm Linux and then runs the container inside it, with a binfmt_misc handler set up to run x86_64 binaries through qemu-user-static. This is useful because you usually want to run the same container and environment on your dev host as you do in your CI environment, rather than rebuilding everything for an extra platform. Technically, this isn't limited to running x86 on arm; it would also allow you to run an x86_64 bazel on, say, MIPS, or an arm bazel on x86_64 using binfmt_misc.

The underlying issue is that when you're using binfmt_misc, /proc/self/exe is qemu-user-static and not the bazel binary. getauxval(3) is a different API that provides similar information, but it reads user-space memory that qemu-user-static can properly set up prior to handing off control to the binary (bazel in this case). argv[0] is another source that qemu-user-static will patch, but it's liable to contain links or path traversal (../../bazel, etc.), so resolving it is more work for us.

@haxorz
Contributor

haxorz commented Jan 26, 2022

@emidln-imc Sorry, I've never heard of qemu-user-static before. Am I correct that you would still be asking for this PR even if commit d79ec03 had never happened, because target of symlink /proc/self/exe is not that of the Bazel binary when it's being run on qemu-user-static?

@emidln-imc
Author

@haxorz correct, we'd have to do it with or without that patch because /proc/self/exe simply isn't the binary being executed by the kernel.

A deeper explanation (with links to code) follows. Apologies to any glibc, kernel, or qemu hackers reading this. linux_binprm is the salient struct from the kernel for binary loading: https://github.com/torvalds/linux/blob/master/include/linux/binfmts.h#L17

qemu-user-static as shipped by https://github.com/multiarch/qemu-user-static is how Docker supports multiarch binaries. Docker for Mac uses a Linux arm VM under the hood, and then layers in qemu-user-static handlers for non-native containers (like those containing x86_64 Linux ELF binaries). This is also how Docker for Linux/arm supports x86_64 containers on things like AWS Graviton. When qemu-user-static is the loader for binfmt_misc, qemu-user-static is actually the binary in /proc/self/exe. qemu-user-static is responsible for reading the ELF binary you're requesting to run, setting up the appropriate environment info (argc, argv, envp, auxv, etc.), doing some light configuration of the qemu virtual machine, and then starting the VM running your binary. The actual source code for how this is loaded is https://github.com/qemu/qemu/blob/master/linux-user/main.c, with most of the interesting bits in https://github.com/qemu/qemu/blob/master/linux-user/elfload.c and https://github.com/qemu/qemu/blob/master/linux-user/linuxload.c#L127. This is what happens for binfmt_misc with qemu-user-static registered. binfmt_misc is a loader that allows the user to register an interpreter, and it is primarily implemented by https://github.com/torvalds/linux/blob/master/fs/binfmt_misc.c#L132 and surrounding code.

In the usual case, a native x86_64 Linux ELF binary on x86_64 Linux is set up using https://github.com/torvalds/linux/blob/master/fs/binfmt_elf.c#L172 (whose output is what getauxval(3) from glibc reads).

All of this comes home if you examine the implementation of /proc/self/exe, which is implemented by https://github.com/torvalds/linux/blob/master/fs/proc/base.c#L1723. Roughly speaking, it looks up the current kernel task (https://github.com/torvalds/linux/blob/master/kernel/fork.c#L1273), finds the backing mm and then the executable, which it can deref to get the file path (https://github.com/torvalds/linux/blob/master/fs/proc/base.c#L1735), and then increments the refcount in the VFS for the file (conceptually a hard link). This is what allows us to use /proc/self/exe even if the underlying file has been moved or has disappeared. While this is desirable, it's also broken if you're using a loader like binfmt_misc, since /proc/self/exe is the wrong thing.

If we want to keep the original behavior, it seems reasonable to do one of two things: (1) provide a flag or environment variable that can use getauxval(AT_EXECFN) for users who need it, or (2) change the way we call into ijar such that we can catch the failure (through exceptions or errors) and fall back to getauxval(AT_EXECFN) if we can't load via /proc/self/exe. My approach was chosen to minimize the diff. Alternate solutions that might work but are hacks would be to try to recognize when /proc/self/exe is qemu and take the correct path. I didn't like this, and I couldn't find a consistent way to detect running under binfmt_misc, since the entries in proc, to my knowledge, only convey that binfmt_misc is possible, not that the current process is running using it. I don't think the current loader is exposed to the user, but maybe a more experienced kernel dev would know.

@haxorz
Contributor

haxorz commented Jan 26, 2022

@emidln-imc Thank you very much for the explanation!

The true background of d79ec03 is that we were working around a Kernel bug (possibly just in Google's internal Kernel; idk) with FUSE paths. Sorry for not giving that background in the commit description or code comment!

I don't want to undo d79ec03. I do acknowledge our situation is a bit weird, so as a compromise I'd be willing to make that change internal to Google's fork of Bazel and then let you proceed with this PR as-is. On the other hand, perhaps some Bazel users actually benefit from that change (perhaps they are putting the Bazel binary in a FUSE filesystem).

Similarly, I think the qemu-user-static situation is weird.

Therefore I think the best solution is one where we don't prioritize one weird situation over the other. On that note...

If we want to keep the original behavior, it seems reasonable to... provide a flag or environment variable that can use getauxval(AT_EXECFN) for users who need it

Yes, that's acceptable.

Alternate solutions that might work but are hacks would be to try to recognize when /proc/self/exe is qemu and taking the correct path. I didn't like this, and I couldn't find a consistent way to detect...

I had a similar idea that I think might be achievable: What if the code looked at both the target of /proc/self/exe and also the result of getauxval(AT_EXECFN) (perhaps doing work to produce a fully resolved absolute path in the case when Blaze was invoked using a relative path, causing getauxval(AT_EXECFN) to return a ~relative path). If these are the same, return "/proc/self/exe", otherwise return whatever getauxval(AT_EXECFN) returned.

The extra system calls entailed by this approach would be acceptable to me. And then we don't have to worry about adding an env var / flag knob.

What do you think?
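
A rough sketch of this comparison idea, under stated assumptions: Canonical, ChooseSelfPath, and GetSelfPathByComparison are hypothetical names, realpath(3) is used to fully resolve both candidates, and the resolved AT_EXECFN path is returned in the mismatch case so that a relative value cannot leak out. Whether the two resolved paths actually differ under emulation is not established here and would need testing.

```cpp
#include <limits.h>    // PATH_MAX
#include <stdlib.h>    // realpath
#include <sys/auxv.h>  // getauxval, AT_EXECFN

#include <string>

// Resolves path to a canonical absolute path, or returns "" on failure.
std::string Canonical(const char* path) {
  if (path == nullptr) {
    return "";
  }
  char buf[PATH_MAX];
  return realpath(path, buf) != nullptr ? std::string(buf) : std::string();
}

// Pure decision step: keep the unresolved "/proc/self/exe" when both
// candidates resolve to the same file (or AT_EXECFN could not be resolved),
// otherwise prefer the resolved AT_EXECFN path.
std::string ChooseSelfPath(const std::string& proc_target,
                           const std::string& execfn_target) {
  if (execfn_target.empty() || proc_target == execfn_target) {
    return "/proc/self/exe";
  }
  return execfn_target;
}

std::string GetSelfPathByComparison() {
  const char* execfn = reinterpret_cast<const char*>(getauxval(AT_EXECFN));
  return ChooseSelfPath(Canonical("/proc/self/exe"), Canonical(execfn));
}
```

In the common native case this costs two realpath calls and still returns the symlink itself, so the existing behavior of not eagerly resolving it is preserved.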

@alexjski
Contributor

alexjski commented Jan 27, 2022

Do you know if getauxval(AT_EXECFD) is 0 in case of running under qemu? I am asking since you mentioned "interpreter" -- not sure if in the same context (sorry for my ignorance). If that was set, then we could use /proc/self/fd/X if AT_EXECFD returns non-0 and /proc/self/exe otherwise.
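
As a sketch of that suggestion, assuming AT_EXECFD is populated with a descriptor for the Bazel binary (which is exactly the open question), the path could be derived as below. SelfPathFromExecFd and SelfPathViaExecFd are hypothetical names. Since getauxval(3) returns 0 for a missing entry, 0 is treated here as "not set".

```cpp
#include <sys/auxv.h>  // getauxval, AT_EXECFD

#include <string>

// Pure helper: maps an AT_EXECFD value to a path. A value of 0 is treated as
// "entry not present", because getauxval() returns 0 for missing entries.
std::string SelfPathFromExecFd(unsigned long execfd) {
  if (execfd == 0) {
    return "/proc/self/exe";
  }
  return "/proc/self/fd/" + std::to_string(execfd);
}

// Reads AT_EXECFD from the auxiliary vector of the current process.
std::string SelfPathViaExecFd() {
  return SelfPathFromExecFd(getauxval(AT_EXECFD));
}
```

Either result is an absolute path under /proc/self, which sidesteps the relative-path concern with AT_EXECFN.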

@gregestren gregestren added the team-OSS Issues for the Bazel OSS team: installation, release process, Bazel packaging, website label Feb 3, 2022
@luyi326

luyi326 commented Feb 5, 2022

Hello! Do we have a timeline on when this can be merged or not? This currently has been a pain point for me as well and it looks like it's the correct fix!

@alexjski
Contributor

alexjski commented Feb 7, 2022

Hello! Do we have a timeline on when this can be merged or not? This currently has been a pain point for me as well and it looks like it's the correct fix!

There were a couple questions and comments around this PR:

  1. As per @haxorz, internally, we want to keep using /proc/self/exe -- we would need an approach which allows us to keep doing that.
  2. getauxval(AT_EXECFN) can produce relative paths, which could theoretically be a problem -- we should either justify why that is impossible in Bazel's case (and validate that) or make the solution resilient to that problem. That could potentially be cleaner if getauxval(AT_EXECFD) is set when running under qemu, which would allow us to construct an absolute path easily (/proc/self/fd/X).

Do you think you could address those?

@mattiahj

This seems to be a blocker for me, and possibly for a lot of people on M1 Macs. Perhaps someone from the core team could adopt it?

@emidln-imc emidln-imc force-pushed the qemu_binfmt_misc_fix branch from 30ec5ce to e4fddd9 Compare March 16, 2022 14:46
@emidln-imc
Author

emidln-imc commented Mar 16, 2022

Hello! Do we have a timeline on when this can be merged or not? This currently has been a pain point for me as well and it looks like it's the correct fix!

There were a couple questions and comments around this PR:

  1. As per @haxorz, internally, we want to keep using /proc/self/exe -- we would need an approach which allows us to keep doing that.

I added this functionality gated behind a new optional startup flag. The original functionality is still the default, but there is now a way to tell bazel to use getauxval. Please suggest a better name for it, and please let me know if it needs "experimental" or something like that.

  2. getauxval(AT_EXECFN) can produce relative paths, which could theoretically be a problem -- we should either justify why that is impossible in Bazel's case (and validate that) or make the solution resilient to that problem. That could potentially be cleaner if getauxval(AT_EXECFD) is set when running under qemu, which would allow us to construct an absolute path easily (/proc/self/fd/X).

I didn't fully resolve this path. From my testing, it seems that under binfmt_misc (int)getauxval(AT_EXECFD) gets the fd of qemu, not of bazel. I did notice that realpath("/proc/self/exe", resolved_path) seems to return the same path as realpath((char*)getauxval(AT_EXECFN), resolved_path). Is it acceptable to call realpath() on the branch that uses getauxval?

@emidln-imc
Author

While figuring out the bootstrap ordering to get a flag into GetSelfPath(), I had to slightly reorder the blaze main, which results in the --version processing code living after the initial options parsing. Is this a problem?

@emidln-imc emidln-imc changed the title Prefer getauxval(AT_EXECFN) for qemu-user-static Add optional startup --linux_bazel_path_from_getauxval Mar 16, 2022
@alexjski
Contributor

I did notice that realpath("/proc/self/exe", resolved_path) seems to return the same path as realpath((char*)getauxval(AT_EXECFN), resolved_path).

Does that resolve to the same path when running under qemu or regular Bazel? If that's the case for qemu, maybe then we just need a flag whether to realpath or not. As a reminder, we do not want to use realpath by default for the internal reasons mentioned by @haxorz before.

Personally, I have nothing against using realpath under the flag.

@susinmotion
Contributor

I'd like to test this change on M1, but I'm not usually a docker user. Could you post a step-by-step repro? Thank you!

@emidln-imc
Author

I did notice that realpath("/proc/self/exe", resolved_path) seems to return the same path as realpath((char*)getauxval(AT_EXECFN), resolved_path).

Does that resolve to the same path when running under qemu or regular Bazel? If that's the case for qemu, maybe then we just need a flag whether to realpath or not. As a reminder, we do not want to use realpath by default for the internal reasons mentioned by @haxorz before.

After checking, realpath("/proc/self/exe") resolves to the absolute path of the bazel binary under qemu-user-static. This would save us a weird getauxval call. I can update the PR to just call realpath if the flag is set. What name should the flag be?
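
A minimal sketch of that flag-gated variant, taking the reported observation about realpath under qemu-user-static at face value. GetSelfPathWithFlag and its resolve_self_path parameter are hypothetical stand-ins for whatever the startup option ends up being called.

```cpp
#include <limits.h>  // PATH_MAX
#include <stdlib.h>  // realpath

#include <string>

// Hypothetical sketch: by default the symlink is returned unresolved (the
// existing behavior). Only when the caller opts in is it resolved eagerly
// with realpath(3); on failure the unresolved symlink is used as a fallback.
std::string GetSelfPathWithFlag(bool resolve_self_path) {
  const char* const kProcSelfExe = "/proc/self/exe";
  if (!resolve_self_path) {
    return kProcSelfExe;
  }
  char resolved[PATH_MAX];
  if (realpath(kProcSelfExe, resolved) == nullptr) {
    return kProcSelfExe;
  }
  return resolved;
}
```

Keeping the default branch untouched means users who rely on the lazily resolved symlink see no change unless they set the flag.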

@@ -1623,6 +1616,12 @@ int Main(int argc, const char *const *argv, WorkspaceLayout *workspace_layout,
ParseOptionsOrDie(cwd, workspace, *option_processor, argc, argv);
StartupOptions *startup_options = option_processor->GetParsedStartupOptions();
startup_options->MaybeLogStartupOptionWarnings();
const string self_path = GetSelfPath(argv[0], *startup_options);

if (argc == 2 && strcmp(argv[1], "--version") == 0) {
Contributor

Unfortunately moving this check here breaks bazel --version, and also unfortunately the test for this is only present internally. The complication is that --version is a magic flag that isn't understood by the options parser. This is by design, so that we traverse as little code (which could fail) as possible, including rc parsing.

I'm not sure what the best way to work around this would be... I think we'd either need to rethink how --version is implemented wrt the options parser or see if there's another way to toggle or fall back to the feature you're proposing.
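
To illustrate the ordering constraint: the check quoted in the diff depends only on argc and argv, so it can run before any option or rc-file parsing. The sketch below isolates it in a hypothetical helper, IsBareVersionRequest; it is not how Bazel structures its Main.

```cpp
#include <cstring>  // std::strcmp

// Returns true only for the exact invocation `<binary> --version`. Nothing
// beyond argc/argv is consulted, so the check cannot fail because of rc
// files or unknown startup options.
bool IsBareVersionRequest(int argc, const char* const* argv) {
  return argc == 2 && std::strcmp(argv[1], "--version") == 0;
}
```

Any code that needs parsed startup options, such as a flag-aware GetSelfPath, would have to run after this early exit rather than before it.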

Contributor

I'm not sure what the best way to work around this would be...

If we're running into issues like this I'd recommend abandoning trying to use a startup flag. Instead, either:

Contributor

+1 to dynamically detecting that if possible. One problem here potentially is that from what I was told, realpath("/proc/self/exe") points to Bazel, in which case I don't know how to distinguish that from realpath(getauxval(AT_EXECFN)). Does readlink offer a way to distinguish those 2?

@alexjski
Contributor

I did notice that realpath("/proc/self/exe", resolved_path) seems to return the same path as realpath((char*)getauxval(AT_EXECFN), resolved_path).

Does that resolve to the same path when running under qemu or regular Bazel? If that's the case for qemu, maybe then we just need a flag whether to realpath or not. As a reminder, we do not want to use realpath by default for the internal reasons mentioned by @haxorz before.

After checking, realpath("/proc/self/exe") resolves to the absolute path of the bazel binary under qemu-user-static. This would save us a weird getauxval call. I can update the PR to just call realpath if the flag is set. What name should the flag be?

That's great! I would say something like --linux_resolve_self_path would work for me. I am not great at naming but this seems to express the gist of the new flag.

I did notice that realpath("/proc/self/exe", resolved_path) seems to return the same path as realpath((char*)getauxval(AT_EXECFN), resolved_path).

Does that resolve to the same path when running under qemu or regular Bazel? If that's the case for qemu, maybe then we just need a flag whether to realpath or not. As a reminder, we do not want to use realpath by default for the internal reasons mentioned by @haxorz before.

After checking, realpath("/proc/self/exe") resolves to the absolute path of the bazel binary under qemu-user-static. This would save us a weird getauxval call. I can update the PR to just call realpath if the flag is set. What name should the flag be?

That would work nicely. Anyway, it looks like we may need to hijack that from the Google side to better facilitate the split between internal and external. Is there a Bazel issue for that? If not, would you mind filing one, ideally with the error you ran into pasted in?

@alexjski
Contributor

I filed #15076 for this. Could you share a short repro for that issue that I can include there? An error message would also be great so others can easily find it.

@alexjski
Contributor

I'm running into the FATAL: Failed to open '/proc/self/exe' as a zip file: (error: 9): Bad file descriptor on my Mac, can we please have this merged? @emidln-imc

I pasted that message into the #15076 description; hope you don't mind.

@alexjski
Contributor

I'd like to test this change on M1, but I'm not usually a docker user. Could you post a step-by-step repro? Thank you!

+1 to this request. Not having a repro makes it very hard for us to make the best recommendation for the fix.

If it is impossible for you to share one, could you at least share what we get in Bazel from the following calls under qemu (e.g. add printfs and copy-paste the output):

  • readlink("/proc/self/exe")
  • realpath("/proc/self/exe")
  • stat("/proc/self/exe")--size may be enough to distinguish qemu and Bazel
  • stat(getauxval(AT_EXECFN))--ditto
  • readlink("/proc/self/fd/AT_EXECFD")

That is the list of the first experiments I would want to run if I had the chance. I know that you already tried some of them, but having that info in one place would be good for posterity. Thank you in advance!
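
The experiments listed above could be gathered into one small diagnostic, sketched here as a standalone helper rather than as a patch to Bazel. ReadLink, RealPath, FileSize, and SelfPathReport are hypothetical names; the calls themselves (readlink, realpath, stat, getauxval) are the standard ones. No output is shown because it depends entirely on the environment.

```cpp
#include <limits.h>    // PATH_MAX
#include <stdlib.h>    // realpath
#include <sys/auxv.h>  // getauxval, AT_EXECFN, AT_EXECFD
#include <sys/stat.h>  // stat
#include <unistd.h>    // readlink

#include <sstream>
#include <string>

std::string ReadLink(const std::string& path) {
  char buf[PATH_MAX];
  const ssize_t n = readlink(path.c_str(), buf, sizeof(buf));
  return n < 0 ? std::string("<error>") : std::string(buf, static_cast<size_t>(n));
}

std::string RealPath(const std::string& path) {
  char buf[PATH_MAX];
  return realpath(path.c_str(), buf) != nullptr ? std::string(buf)
                                                : std::string("<error>");
}

// File size in bytes as text; size alone may distinguish qemu from Bazel.
std::string FileSize(const std::string& path) {
  struct stat st;
  return stat(path.c_str(), &st) == 0 ? std::to_string(st.st_size)
                                      : std::string("<error>");
}

// Builds a report covering the five experiments from the list.
std::string SelfPathReport() {
  const char* execfn = reinterpret_cast<const char*>(getauxval(AT_EXECFN));
  const unsigned long execfd = getauxval(AT_EXECFD);  // 0 if not present
  const std::string exe = "/proc/self/exe";
  std::ostringstream out;
  out << "readlink(/proc/self/exe) = " << ReadLink(exe) << "\n";
  out << "realpath(/proc/self/exe) = " << RealPath(exe) << "\n";
  out << "stat(/proc/self/exe).size = " << FileSize(exe) << "\n";
  if (execfn != nullptr) {
    out << "AT_EXECFN = " << execfn << "\n";
    out << "stat(AT_EXECFN).size = " << FileSize(execfn) << "\n";
  } else {
    out << "AT_EXECFN = <not set>\n";
  }
  if (execfd != 0) {
    const std::string fd_path = "/proc/self/fd/" + std::to_string(execfd);
    out << "readlink(" << fd_path << ") = " << ReadLink(fd_path) << "\n";
  } else {
    out << "AT_EXECFD = <not set>\n";
  }
  return out.str();
}
```

Printing SelfPathReport() once natively and once under qemu-user-static would put all five data points side by side.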

@zmk-punchbowl

#15076

I don't know how helpful this is, since it's not a minimal repro scenario, but I'm running into this issue on an M1 when following the TensorFlow instructions to build from source within a Docker container.

I follow those instructions, but include a --platform linux/amd64 flag with the docker run command. Inside the container, the platform shows as:

...:/tensorflow_src# uname -a
Linux 51d28175cce7 5.10.104-linuxkit #1 SMP PREEMPT Wed Mar 9 19:01:25 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

However, any calls to bazel fail, although it does spit out a version:

:/tensorflow_src# bazel version
Bazelisk version: v1.11.0
Opening zip "/proc/self/exe": lseek(): Bad file descriptor
FATAL: Failed to open '/proc/self/exe' as a zip file: (error: 9): Bad file descriptor

I've also tried (as suggested here), downloading a newer version of bazel, specific to this platform, and copying it to bin, with no luck:

...:/tensorflow_src# wget https://github.com/bazelbuild/bazel/releases/download/4.2.1/bazel-4.2.1-linux-x86_64
--2022-03-24 14:22:15--  https://github.com/bazelbuild/bazel/releases/download/4.2.1/bazel-4.2.1-linux-x86_64
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/20773773/32bcb83c-3234-41f0-9402-25dd8171fb2d?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20220324%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220324T142215Z&X-Amz-Expires=300&X-Amz-Signature=5378c1bd6517db0b3c36307ef0c01e4dfb3551454275f9c09382f3a3ad06f5be&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=20773773&response-content-disposition=attachment%3B%20filename%3Dbazel-4.2.1-linux-x86_64&response-content-type=application%2Foctet-stream [following]
--2022-03-24 14:22:15--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/20773773/32bcb83c-3234-41f0-9402-25dd8171fb2d?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20220324%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220324T142215Z&X-Amz-Expires=300&X-Amz-Signature=5378c1bd6517db0b3c36307ef0c01e4dfb3551454275f9c09382f3a3ad06f5be&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=20773773&response-content-disposition=attachment%3B%20filename%3Dbazel-4.2.1-linux-x86_64&response-content-type=application%2Foctet-stream
Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 50478648 (48M) [application/octet-stream]
Saving to: ‘bazel-4.2.1-linux-x86_64’

bazel-4.2.1-linux-x86_64                             100%[======================================================================================================================>]  48.14M  8.85MB/s    in 5.5s

2022-03-24 14:22:21 (8.81 MB/s) - ‘bazel-4.2.1-linux-x86_64’ saved [50478648/50478648]

...:/tensorflow_src# which bazel
/usr/bin/bazel
...:/tensorflow_src# cp bazel-4.2.1-linux-x86_64 /usr/bin/bazel
...:/tensorflow_src# bazel version
Opening zip "/proc/self/exe": lseek(): Bad file descriptor
FATAL: Failed to open '/proc/self/exe' as a zip file: (error: 9): Bad file descriptor

@libratiger
Contributor

is there any plan to merge this?

@dan-cohn-sabre

I'm also curious when this will be merged. We have some developers receiving new M1-based MacBooks and finding that they can't run Bazel in our developer environment because it's x86-based and runs on Docker. Just today I built these changes into the version of Bazel we use and tested it on one of the new Macs. It worked great.

I noticed that the new startup option doesn't appear in the help (bazel help startup-options). Does it need to be added to "src/main/java/com/google/devtools/build/lib/runtime/BlazeServerStartupOptions.java"? It also seems like it's missing from "src/test/cpp/bazel_startup_options_test.cc". I say this as someone who is not familiar with the code base, so take it with a grain of salt.

@ianbrex

ianbrex commented Mar 31, 2022

Thanks for creating this PR!
I just cannot silently observe how other folks are suffering, while waiting for this PR to be merged.
Here is the steps that I figured out a while ago how to build a patched bazel binary, which does not fail on M1 in Docker x86.

The key CLI commands to run on Linux x86 or in a Docker x86(!) container environment:

curl -O -L https://github.com/bazelbuild/bazel/releases/download/4.2.2/bazel-4.2.2-dist.zip
unzip bazel-4.2.2-dist.zip
curl -O -L https://github.com/bazelbuild/bazel/pull/14391.patch
patch -p1 < 14391.patch
env EXTRA_BAZEL_ARGS="--host_javabase=@local_jdk//:jdk" bash ./compile.sh

You might need to install the gcc, g++, openjdk-8-jdk, python3, unzip, and patch packages, but the crucial steps are listed above.
After the build, you will need to copy the built binary off that Linux box, or out of the Docker container with the docker cp command.
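
The copy-out step mentioned above could look roughly like the following sketch. The container name and the unpack directory are hypothetical placeholders, and it assumes compile.sh leaves the binary at output/bazel under the unpacked source tree, which is where the bootstrap script normally writes it. DRY_RUN is a variable invented for this sketch: when set, the command is printed instead of executed.

```shell
# Copy the bootstrapped Bazel binary out of a build container.
# Arguments (all hypothetical placeholders chosen by the caller):
#   $1  container name or ID, as shown by `docker ps`
#   $2  directory inside the container where the dist zip was unpacked
#   $3  destination path on the host
copy_bazel_out() {
  container=$1
  src_dir=$2
  dest=$3
  # Assumption: compile.sh wrote the binary to <src_dir>/output/bazel.
  # With DRY_RUN set, ${DRY_RUN:+echo} expands to `echo` and only prints.
  ${DRY_RUN:+echo} docker cp "$container:$src_dir/output/bazel" "$dest"
}
```

For example, `DRY_RUN=1 copy_bazel_out bazel-build /work ./bazel` shows the command that would run, and dropping `DRY_RUN=1` performs the copy.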

@michajlo michajlo requested a review from meteorcloudy March 31, 2022 18:54
@michajlo
Contributor

There are still some questions about the effectiveness of doing this with a flag (and what second order effects that has) vs figuring out how to do it in a way that "just works". Unfortunately the folks on this bug so far don't have the facilities to repro or experiment with that, so passing off to a teammate who's in a better position to do so. Should have more feedback next week.

@meteorcloudy
Member

I can now reproduce this issue with the tensorflow docker container on an M1 MacBook Pro:

root@6e1f154c76ce:/tensorflow_src# uname -a
Linux 6e1f154c76ce 5.10.104-linuxkit #1 SMP PREEMPT Wed Mar 9 19:01:25 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
root@6e1f154c76ce:/tensorflow_src# bazel version
2022/04/04 15:06:38 Downloading https://releases.bazel.build/5.1.0/release/bazel-5.1.0-linux-x86_64...
Bazelisk version: v1.11.0
Opening zip "/proc/self/exe": lseek(): Bad file descriptor
FATAL: Failed to open '/proc/self/exe' as a zip file: (error: 9): Bad file descriptor
root@6e1f154c76ce:/tensorflow_src#

@meteorcloudy
Member

I'm building a binary at this branch to verify if it fixes the issue: https://github.com/bazelbuild/bazel/tree/test_14391

@meteorcloudy
Member

meteorcloudy commented Apr 4, 2022

--linux_bazel_path_from_getauxval allows the Bazel binary to extract itself, but it seems the Bazel client isn't able to connect to the server. Can you confirm whether you see a similar issue?

root@6e1f154c76ce:/tensorflow_src# export USE_BAZEL_VERSION=e4fddd9e511f72dfa41b019e3d6dde24a49c56de
root@6e1f154c76ce:/tensorflow_src# bazel version
2022/04/04 15:34:24 Using unreleased version at commit e4fddd9e511f72dfa41b019e3d6dde24a49c56de
2022/04/04 15:34:24 Downloading https://storage.googleapis.com/bazel-builds/artifacts/ubuntu1404/e4fddd9e511f72dfa41b019e3d6dde24a49c56de/bazel...
Bazelisk version: v1.11.0
Opening zip "/proc/self/exe": lseek(): Bad file descriptor
FATAL: Failed to open '/proc/self/exe' as a zip file: (error: 9): Bad file descriptor
root@6e1f154c76ce:/tensorflow_src# bazel --linux_bazel_path_from_getauxval version
2022/04/04 15:36:05 Using unreleased version at commit e4fddd9e511f72dfa41b019e3d6dde24a49c56de
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
... still trying to connect to local Bazel server (75) after 10 seconds ...
... still trying to connect to local Bazel server (75) after 20 seconds ...
... still trying to connect to local Bazel server (75) after 30 seconds ...
... still trying to connect to local Bazel server (75) after 40 seconds ...
... still trying to connect to local Bazel server (75) after 50 seconds ...
... still trying to connect to local Bazel server (75) after 60 seconds ...
... still trying to connect to local Bazel server (75) after 70 seconds ...
... still trying to connect to local Bazel server (75) after 80 seconds ...
... still trying to connect to local Bazel server (75) after 90 seconds ...
... still trying to connect to local Bazel server (75) after 100 seconds ...
... still trying to connect to local Bazel server (75) after 110 seconds ...
FATAL: couldn't connect to server (75) after 120 seconds.

@meteorcloudy
Member

OK, it worked with a rerun.

@meteorcloudy
Member

meteorcloudy commented Apr 4, 2022

@michajlo Is there anything else you want me to test? I can confirm the flag solves the extraction problem: I'm able to start building TensorFlow inside a Linux Docker container on my M1 machine with a Bazel binary built from this PR.

@michajlo
Contributor

michajlo commented Apr 4, 2022

bazel --version and how it interacts with this being in a .bazelrc (which I assume is the intended use case). I think this raises some tricky corner cases that we may be able to just deal with, but that ideally we could side-step by removing the flag and having dynamic detection (see also #14391 (comment)).

@meteorcloudy
Member

Oh, yeah, bazel --version breaks with:

root@eccd9cd4052c:/tensorflow_src# bazel --version
2022/04/04 16:10:19 Using unreleased version at commit e4fddd9e511f72dfa41b019e3d6dde24a49c56de
[FATAL 16:10:19.547 src/main/cpp/blaze.cc:1296] Unknown startup option: '--version'.
  For more info, run 'bazel help startup_options'.

Putting the flag --linux_bazel_path_from_getauxval into the bazelrc file works as expected (of course, bazel --version still breaks).
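
As a rough sketch of the bazelrc setup described here: the helper below appends the flag as a `startup` line, which is how startup options are normally written in a bazelrc. The function name and the choice of rc file are hypothetical, and the flag only exists in a Bazel binary built with this PR's patch.

```shell
# Append the startup flag from this PR to a bazelrc file, once.
# Assumption: the Bazel binary in use was built with this PR applied;
# released Bazel binaries would reject the option as unknown.
add_startup_flag() {
  rc_file=$1
  line='startup --linux_bazel_path_from_getauxval'
  # -x matches whole lines and -F treats the pattern as a fixed string,
  # so calling this twice does not duplicate the entry.
  grep -qxF "$line" "$rc_file" 2>/dev/null || printf '%s\n' "$line" >> "$rc_file"
}
```

Calling `add_startup_flag .bazelrc` in the workspace root would then apply the flag to every invocation without typing it each time.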

@meteorcloudy
Member

I can also confirm this:

root@eccd9cd4052c:/workdir/bazel# bazel-bin/src/main/cpp/client
WARNING: readlink(/proc/self/exe) -> /root/.cache/bazel/_bazel_root/c564d33b2d1d9ff3bd9f69877f355ccc/execroot/io_bazel/bazel-out/k8-fastbuild/bin/src/main/cpp/client
WARNING: realpath(/proc/self/exe) -> /root/.cache/bazel/_bazel_root/c564d33b2d1d9ff3bd9f69877f355ccc/execroot/io_bazel/bazel-out/k8-fastbuild/bin/src/main/cpp/client

I'm running only the client, but it should be the same. Both readlink and realpath resolve to the absolute path for the binary. So maybe no need to call getauxval?
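
The resolution step can be approximated from a shell, as a loose illustration only. The function name is hypothetical, and when run on /proc/self/exe the "self" is the readlink or realpath process, not the Bazel client, so this shows only that both calls turn the symlink into a path; it does not reproduce the client's behavior under emulation.

```shell
# Resolve a symlink the two ways compared above and print both results.
# Defaults to /proc/self/exe; note that "self" is then the readlink or
# realpath process itself, not Bazel.
resolve_exe_link() {
  link=${1:-/proc/self/exe}
  link_target=$(readlink "$link") || return 1   # symlink target as stored
  real_target=$(realpath "$link") || return 1   # canonical absolute path
  printf 'readlink: %s\nrealpath: %s\n' "$link_target" "$real_target"
}

# Only meaningful on systems that provide /proc (for example Linux).
if [ -e /proc/self/exe ]; then
  resolve_exe_link
fi
```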

@ouj

ouj commented Apr 12, 2022

Please do merge this fix..

...or recommend a bazel alternative?

meteorcloudy added a commit that referenced this pull request Apr 12, 2022
In Blaze, we keep using "/proc/self/exe", but in Bazel we resolve this symlink
to the actual Bazel binary.

This is essentially a rollback of
d79ec03
only for Bazel, because it solved an issue that is rare in Bazel,
but it's preventing users from running Bazel inside a Linux docker container on Apple Silicon machines.

Fixes #15076

Closes #14391

RELNOTES: None
PiperOrigin-RevId: 441177762
Change-Id: I48d7283cf7f42f4220e2261f02809c8b5270ef70
@meteorcloudy
Member

Sorry for the silence here. After some internal discussion, we decided to fix this by rolling back d79ec03, but only in Bazel.

I'm implementing the change at 417e116 and have manually verified that it works, so this PR is not necessary.

If you want to double-check, you can run:

export USE_BAZEL_VERSION=417e116faef7616b1ad78b01175e7f9301d875bd
bazelisk build //foo/bar

After that change is merged, we'll cherry-pick the fix for the next Bazel release.

@bazel-io bazel-io closed this in d3435b0 Apr 12, 2022
meteorcloudy added a commit to meteorcloudy/bazel that referenced this pull request Apr 12, 2022
In Blaze, we keep using "/proc/self/exe", but in Bazel we resolve this symlink
to the actual Bazel binary.

This is essentially a rollback of
bazelbuild@d79ec03
only for Bazel, because it solved an issue that is rare in Bazel,
but it's preventing users from running Bazel inside a Linux docker container on Apple Silicon machines.

Fixes bazelbuild#15076

Closes bazelbuild#14391

RELNOTES: None
PiperOrigin-RevId: 441198110
ckolli5 pushed a commit that referenced this pull request Apr 20, 2022
@VinhLoiIT

#15076

I don't know how helpful this is, since it's not a minimal repro scenario, but I'm running into this issue on an M1 when following the TensorFlow instructions to build from source within a Docker container.

I follow those instructions, but include a --platform linux/amd64 flag with the docker run command. Inside the container, the platform shows as:

...:/tensorflow_src# uname -a
Linux 51d28175cce7 5.10.104-linuxkit #1 SMP PREEMPT Wed Mar 9 19:01:25 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

However, any calls to bazel fail, although it does spit out a version:

:/tensorflow_src# bazel version
Bazelisk version: v1.11.0
Opening zip "/proc/self/exe": lseek(): Bad file descriptor
FATAL: Failed to open '/proc/self/exe' as a zip file: (error: 9): Bad file descriptor

I've also tried (as suggested here), downloading a newer version of bazel, specific to this platform, and copying it to bin, with no luck:

...:/tensorflow_src# wget https://github.com/bazelbuild/bazel/releases/download/4.2.1/bazel-4.2.1-linux-x86_64
--2022-03-24 14:22:15--  https://github.com/bazelbuild/bazel/releases/download/4.2.1/bazel-4.2.1-linux-x86_64
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/20773773/32bcb83c-3234-41f0-9402-25dd8171fb2d?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20220324%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220324T142215Z&X-Amz-Expires=300&X-Amz-Signature=5378c1bd6517db0b3c36307ef0c01e4dfb3551454275f9c09382f3a3ad06f5be&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=20773773&response-content-disposition=attachment%3B%20filename%3Dbazel-4.2.1-linux-x86_64&response-content-type=application%2Foctet-stream [following]
--2022-03-24 14:22:15--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/20773773/32bcb83c-3234-41f0-9402-25dd8171fb2d?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20220324%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220324T142215Z&X-Amz-Expires=300&X-Amz-Signature=5378c1bd6517db0b3c36307ef0c01e4dfb3551454275f9c09382f3a3ad06f5be&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=20773773&response-content-disposition=attachment%3B%20filename%3Dbazel-4.2.1-linux-x86_64&response-content-type=application%2Foctet-stream
Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 50478648 (48M) [application/octet-stream]
Saving to: ‘bazel-4.2.1-linux-x86_64’

bazel-4.2.1-linux-x86_64                             100%[======================================================================================================================>]  48.14M  8.85MB/s    in 5.5s

2022-03-24 14:22:21 (8.81 MB/s) - ‘bazel-4.2.1-linux-x86_64’ saved [50478648/50478648]

...:/tensorflow_src# which bazel
/usr/bin/bazel
...:/tensorflow_src# cp bazel-4.2.1-linux-x86_64 /usr/bin/bazel
...:/tensorflow_src# bazel version
Opening zip "/proc/self/exe": lseek(): Bad file descriptor
FATAL: Failed to open '/proc/self/exe' as a zip file: (error: 9): Bad file descriptor

@zmk-punchbowl Did you solve the problem? How did you manage it?

Labels
team-OSS: Issues for the Bazel OSS team: installation, release process, Bazel packaging, website
Development

Successfully merging this pull request may close these issues.