-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
generate-xml.sh fails to execute #12579
Comments
Turns out that this is caused by the process-wrapper binary build with a different (LD_)LIBRARY_PATH than what it is executed in. So either that binary shouldn't be build with the LIBRARY_PATH vars or (better): This action should get passed the local PATH and LD_LIBRARY_PATH, similar to most other actions. See also #4137 |
@Flamefire Thanks for figuring out the root cause! (and sorry for not recalling the issue.. In the meantime, passing --action_env=LD_LIBRARY_PATH to the build could also fix the problem without changing Bazel, but that only works if you don't worry about the potential cache poisoning introduced by doing that. Nevertheless, we should definitely fix #12649 to let Bazel fail with a clear message which can save users a lot of debugging time. |
I just tried that: Passing |
How about --test_env=LD_LIBRARY_PATH? |
If you build with "-s", can you see the correct |
I'm currently double checking but I doubt it will make a difference. I'm working with TensorFlow here and it has a line |
Tests finished, and as expected above no difference even with |
@Flamefire I think I know the problem, it's because Bazel hardcodes the environment variables for the xml generation action.. Looks like this bug change only be fixed from Bazel code, I'll send a patch for this. |
You can apply this patch and build a customised Bazel as a workaround, I'll merge this change once it passes review. |
Thanks. My current workaround is building Bazel with |
@meteorcloudy Is there some way to run some tests after installing bazel via "./compile.sh" to detect this issue and verify it is fixed? On HPC sites it can be quite costly to discover such things only after a 3-4hr build of something else. Usually programs provide a |
@Flamefire To reproduce the exact issue, you basically need a gcc that's not installed in the standard location (and it's libs not in PATH). But I think once you have that setup, you don't need to build TensorFlow to verify it, this problem should exist for all tests. For example, with the Bazel repo, you can do |
Ok, checking now. I do have a reproducible way of setting up such an environment, compilation and test using EasyBuild. I'm testing with Bazel 3.4.1 which I know fails and now apply a backported patch (minor changes to master made the patch fail to apply). Am I right assuming that |
Probably, otherwise only the |
This still sounds wrong. I mean: I'm not running anything from "my" tests/binaries/.... What is being run here is ultimately a bash with a command. This does not require any LD_LIBRARY_PATH modification and hence is good by default, reproducible and so on. Now because of a Bazel internal detail, namely the process-wrapper binary, I need to pass LD_LIBRARY_PATH to everything or I can't use anything at all. So basically Bazel forces me to violate the isolation it should enforce/support. Another datapoint: I tried to run
So again the process-wrapper failed to run due to missing LD_LIBRARY_PATH. So given that the process-wrapper is meant to be run in (pretty much) any environment and should be transparent to the user and its environment, settings, etc. I'd say that the only valid solution is to link it statically. Alternatively one could consider passing e.g. LD_LIBRARY_PATH to the process-wrapper but not to the processes executed by it. But I guess this is harder to implement and enforce. At the very least the bazel build process (especially the bootstrap) should link such binaries static by default with an option to not do that to save space. However this means that at least using |
@Flamefire If we choose dynamic linking, then Bazel expects some basic system libraries are in standard locations. This usually work for most of our users. If we choose static linking, it will definitely help avoid such issues, but it will blow up the Bazel binary size (there are not only process-wrapper, but also ijar, build-runfiles, linux-sandbox, etc.) And many Linux distributions actually prefer dynamic linking a lot. So I don't think that's really an option for us to make static linking by default. To summarize, if you use an official Bazel binary, |
I see. Then the question is whether libstdc++ is considered such a "basic system library".
I tried this and this does not work. The error in my previous message is from a run of the Bazel tests with Bazel built with the mentioned patch and #12651 seems to work on x86, but on another system (using POWER9) it fails with errors like The |
Hmm, I thought --action_env should add the env var for all actions no matter they are using
This looks like another reason when cannot linked everything statically by default.
Glad this way still works. The problem of adding |
This is why I suggested to add support for this to Bazel. I mean isn't one responsibility of a build system to abstract away platform specific things? |
Bazel does provide ways to abstract away platform specific things. But users have to actually implement those for the platforms they care about. However, it's really hard for a project to officially support all the different platforms out there.
Yes, |
Previously, we hardcode the envs of the xml generation action, which caused problem for process-wrapper because it's dynamically linked to some system libraries and the required PATH or LD_LIBRARY_PATH are not set. This change propagate the envs we set for the actual test action to the xml file generation action to make sure the env vars are correctly set and can also be controlled by --action_env and --test_env. Fixes bazelbuild#4137 Fixes bazelbuild#12579 Closes bazelbuild#12659. PiperOrigin-RevId: 347596753
Previously, we hardcode the envs of the xml generation action, which caused problem for process-wrapper because it's dynamically linked to some system libraries and the required PATH or LD_LIBRARY_PATH are not set. This change propagate the envs we set for the actual test action to the xml file generation action to make sure the env vars are correctly set and can also be controlled by --action_env and --test_env. Fixes bazelbuild#4137 Fixes bazelbuild#12579 Closes bazelbuild#12659. PiperOrigin-RevId: 347596753
Previously, we hardcode the envs of the xml generation action, which caused problem for process-wrapper because it's dynamically linked to some system libraries and the required PATH or LD_LIBRARY_PATH are not set. This change propagate the envs we set for the actual test action to the xml file generation action to make sure the env vars are correctly set and can also be controlled by --action_env and --test_env. Fixes bazelbuild#4137 Fixes bazelbuild#12579 Closes bazelbuild#12659. PiperOrigin-RevId: 347596753
Previously, we hardcode the envs of the xml generation action, which caused problem for process-wrapper because it's dynamically linked to some system libraries and the required PATH or LD_LIBRARY_PATH are not set. This change propagate the envs we set for the actual test action to the xml file generation action to make sure the env vars are correctly set and can also be controlled by --action_env and --test_env. Fixes bazelbuild#4137 Fixes bazelbuild#12579 Closes bazelbuild#12659. PiperOrigin-RevId: 347596753
Previously, we hardcode the envs of the xml generation action, which caused problem for process-wrapper because it's dynamically linked to some system libraries and the required PATH or LD_LIBRARY_PATH are not set. This change propagate the envs we set for the actual test action to the xml file generation action to make sure the env vars are correctly set and can also be controlled by --action_env and --test_env. Fixes bazelbuild#4137 Fixes bazelbuild#12579 Closes bazelbuild#12659. PiperOrigin-RevId: 347596753
@meteorcloudy @Flamefire This seems to still be an issue. I'm using Bazel in a custom container build with nix and hitting the For From what I understand we have two options at the moment:
If the current behavior is WAI I guess we have no other choice than to distribute a custom Bazel as part of rules_ll, but it seems to me that the current behavior is not WAI. I've tested this against 6.1.1 and 7.0.0-pre.20230306.4, with and without |
Description of the problem / feature request:
I'm running some test of TensorFlow using bazel but on our multi-core POWER9 system it fails with e.g.
ERROR: /dev/shm/s3248973-EasyBuild/TensorFlow/2.4.0/fosscuda-2019b-Python-3.7.4/TensorFlow/tensorflow-r2.4/tensorflow/core/platform/BUILD:1142:11: failed (Exit 1): generate-xml.sh failed: error executing command
I.e. there is no good error message, it simply failed to execute that script which comes from the Bazel installation. I verified that the executed command (
bazel -s
) runs correctly and the script hence also existsI even modified that script in the Bazel sources to print something at the start but that doesn't show up. So it seems that script is not (yet?) created when Bazel tries to execute it. I hence expect a race condition or something but am unable to verify this.
Any hints, ideas, ...?
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Sorry, only thing I have is the command I use to test TF:
What operating system are you running Bazel on?
RHEL 7.6
What's the output of
bazel info release
?release 3.4.1- (@non-git)
If
bazel info release
returns "development version" or "(@non-git)", tell us how you built Bazel.EXTRA_BAZEL_ARGS="--jobs=176 --host_javabase=@local_jdk//:jdk" ./compile.sh
Have you found anything relevant by searching the web?
No
Any other information, logs, or outputs that you want to share?
The text was updated successfully, but these errors were encountered: