generate-xml.sh fails to execute #12579

Closed
Flamefire opened this issue Nov 27, 2020 · 21 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) team-OSS Issues for the Bazel OSS team: installation, release process, Bazel packaging, website type: bug

Comments

@Flamefire
Contributor

Flamefire commented Nov 27, 2020

Description of the problem / feature request:

I'm running some tests of TensorFlow using Bazel, but on our multi-core POWER9 system they fail with e.g.

ERROR: /dev/shm/s3248973-EasyBuild/TensorFlow/2.4.0/fosscuda-2019b-Python-3.7.4/TensorFlow/tensorflow-r2.4/tensorflow/core/platform/BUILD:1142:11: failed (Exit 1): generate-xml.sh failed: error executing command

I.e. there is no good error message; Bazel simply failed to execute that script, which comes from the Bazel installation. I verified that the command printed by bazel -s runs correctly when executed manually, so the script does exist.

I even modified that script in the Bazel sources to print something at the start but that doesn't show up. So it seems that script is not (yet?) created when Bazel tries to execute it. I hence expect a race condition or something but am unable to verify this.

Any hints, ideas, ...?

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Sorry, only thing I have is the command I use to test TF:

bazel --output_base=/dev/shm/s3248973-EasyBuild/TensorFlow/2.4.0/fosscuda-2019b-Python-3.7.4/tmptspeEg-bazel-tf/output_base --install_base=/dev/shm/s3248973-EasyBuild/TensorFlow/2.4.0/fosscuda-2019b-Python-3.7.4/tmptspeEg-bazel-tf/output_base/inst_base --output_user_root=/dev/shm/s3248973-EasyBuild/TensorFlow/2.4.0/fosscuda-2019b-Python-3.7.4/tmptspeEg-bazel-tf/output_user_root --host_jvm_args=-Xms512m --host_jvm_args=-Xmx4096m test --compilation_mode=opt --config=opt --subcommands --verbose_failures --config=noaws --jobs=64 --copt="-fPIC"  --distinct_host_configuration=false --test_output=errors --local_test_jobs=1 --build_tests_only --test_tag_filters='-no_gpu,-no_oss,-oss_serial,-benchmark-test,-no_oss_py37,-v1only' --build_tag_filters='-no_gpu,-no_oss,-oss_serial,-benchmark-test,-no_oss_py37,-v1only'  -- //tensorflow/core/... //tensorflow/cc/... //tensorflow/c/... -//tensorflow/core:example_java_proto -//tensorflow/core/example:example_protos_closure

What operating system are you running Bazel on?

RHEL 7.6

What's the output of bazel info release?

release 3.4.1- (@non-git)

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

EXTRA_BAZEL_ARGS="--jobs=176 --host_javabase=@local_jdk//:jdk" ./compile.sh

Have you found anything relevant by searching the web?

No

Any other information, logs, or outputs that you want to share?

ERROR: /dev/shm/s3248973-EasyBuild/TensorFlow/2.4.0/fosscuda-2019b-Python-3.7.4/TensorFlow/tensorflow-r2.4/tensorflow/core/platform/BUILD:1142:11:  failed (Exit 1): generate-xml.sh failed: error executing command 
  (cd /dev/shm/s3248973-EasyBuild/TensorFlow/2.4.0/fosscuda-2019b-Python-3.7.4/tmptspeEg-bazel-tf/output_base/execroot/org_tensorflow && \
  exec env - \
    PATH=/usr/bin:/bin \
    TEST_BINARY=tensorflow/core/platform/platform_strings_test \
    TEST_NAME=//tensorflow/core/platform:platform_strings_test \
    TEST_SHARD_INDEX=0 \
    TEST_TOTAL_SHARDS=0 \
  external/bazel_tools/tools/test/generate-xml.sh bazel-out/ppc-opt/testlogs/tensorflow/core/platform/platform_strings_test/test.log bazel-out/ppc-opt/testlogs/tensorflow/core/platform/platform_strings_test/test.xml 0 0)
Execution platform: @local_execution_config_platform//:platform
@aiuto aiuto added area-EngProd Bazel CI, infrastructure, bootstrapping, release, and distribution tooling untriaged labels Nov 30, 2020
@sventiffe sventiffe added team-OSS Issues for the Bazel OSS team: installation, release process, Bazel packaging, website and removed area-EngProd Bazel CI, infrastructure, bootstrapping, release, and distribution tooling labels Dec 4, 2020
@Flamefire
Contributor Author

Turns out that this is caused by the process-wrapper binary being built with a different (LD_)LIBRARY_PATH than the one it is executed with.

So either that binary shouldn't be built with the LIBRARY_PATH vars or (better) this action should be passed the local PATH and LD_LIBRARY_PATH, similar to most other actions. See also #4137

@meteorcloudy
Member

@Flamefire Thanks for figuring out the root cause! (And sorry for not recalling the issue.)
Patching Bazel with #12651 should fix the problem, but I'm not sure it will get merged since we're also trying to shrink the Bazel binary size as much as possible and it does the opposite.

In the meantime, passing --action_env=LD_LIBRARY_PATH to the build could also fix the problem without changing Bazel, but that only works if you don't worry about the potential cache poisoning introduced by doing that.

Nevertheless, we should definitely fix #12649 to let Bazel fail with a clear message which can save users a lot of debugging time.

@Flamefire
Contributor Author

I just tried that: Passing --action_env=LD_LIBRARY_PATH to bazel test does not work for the generate-xml which seems to use an entirely new and very minimal environment.
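
The minimal environment is visible in the failing command from the original report, which starts with `exec env - PATH=/usr/bin:/bin ...`. A small sketch of that effect, assuming a POSIX shell and an `env` that accepts `-i` (the portable spelling of the `-` in the log); the library directory is a made-up placeholder, not a path from this thread:

```shell
#!/bin/sh
# Hypothetical directory standing in for a module-provided compiler runtime.
# It is appended so that an existing LD_LIBRARY_PATH keeps working.
LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}/opt/example-gcc/lib64"
export LD_LIBRARY_PATH

# Start a child with an emptied environment plus PATH only, similar to
# what the generate-xml.sh action in the log does.
seen=$(env -i PATH=/usr/bin:/bin sh -c 'echo "${LD_LIBRARY_PATH:-unset}"')

echo "LD_LIBRARY_PATH in child: $seen"
```

The child reports the variable as unset even though the parent exported it, which matches the observation that a dynamically linked helper started this way cannot find a libstdc++ outside the default search path.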

@meteorcloudy
Member

How about --test_env=LD_LIBRARY_PATH?

@meteorcloudy
Member

If you build with "-s", can you see the correct LD_LIBRARY_PATH being set?

@Flamefire
Contributor Author

I'm currently double checking but I doubt it will make a difference. I'm working with TensorFlow here and it has a line test --test_env=LD_LIBRARY_PATH in the .tf_configure.bazelrc which I assume has the same effect. Output from that is in the OP, so no, not set.
TBH I find it very confusing how and when Bazel decides to set environment variables. I found at least 4 different settings: one with all action_env, one with a limited env but where PATH and LD_LIBRARY_PATH are still passed through, one where even that is not passed through (I think when use_default_shell_env=True is not set [default]) and then the env from the generate-xml action, where other variables are set. Even having the source didn't help here as there are too many layers for someone not involved in the development to understand.

@Flamefire
Contributor Author

Tests finished and, as expected, there is no difference even with --test_env: LD_LIBRARY_PATH does not appear, so the output from the OP is unchanged.

@meteorcloudy
Member

@Flamefire I think I know the problem: it's because Bazel hardcodes the environment variables for the xml generation action.
https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/exec/StandaloneTestStrategy.java;l=409

Looks like this bug can only be fixed in the Bazel code; I'll send a patch for this.

@meteorcloudy
Member

You can apply this patch and build a customised Bazel as a workaround, I'll merge this change once it passes review.

@Flamefire
Contributor Author

Thanks. My current workaround is building Bazel with export BAZEL_LINKOPTS=-static-libstdc++:-static-libgcc BAZEL_LINKLIBS=-l%:libstdc++.a:-lm which also works. Should be the same as #12651
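
As a sketch, that workaround could be scripted as follows. The two variables and their values are the ones quoted above; the guard around `./compile.sh` and the assumption that the script runs from an unpacked Bazel source tree are additions for illustration only:

```shell
#!/bin/sh
# Link the C++ runtime statically into Bazel's embedded tools
# (values taken from the comment above).
export BAZEL_LINKOPTS=-static-libstdc++:-static-libgcc
export BAZEL_LINKLIBS=-l%:libstdc++.a:-lm

# Assumption: the current directory is an unpacked Bazel source tree.
if [ -x ./compile.sh ]; then
    ./compile.sh
else
    echo "compile.sh not found; run this from the Bazel source directory" >&2
fi
```

Whether these link options are accepted depends on the local toolchain, so this is a starting point rather than a recipe that works on every platform.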

@philwo philwo added P1 I'll work on this now. (Assignee required) type: bug and removed untriaged labels Dec 8, 2020
@Flamefire
Contributor Author

@meteorcloudy Is there some way to run some tests after installing bazel via "./compile.sh" to detect this issue and verify it is fixed? On HPC sites it can be quite costly to discover such things only after a 3-4hr build of something else. Usually programs provide a configure && make && make test && make install workflow

@meteorcloudy
Member

meteorcloudy commented Dec 16, 2020

@Flamefire To reproduce the exact issue, you basically need a gcc that's not installed in the standard location (and its libs not in PATH). But I think once you have that setup, you don't need to build TensorFlow to verify it; this problem should exist for all tests. For example, with the Bazel repo, you can do bazel test examples/cpp:hello-success_test -s and check if bazel-out/k8-fastbuild/testlogs/examples/cpp/hello-success_test/test.xml is generated successfully (also verify first that it doesn't work if you build with Bazel 3.7.1).
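
A post-install smoke test along these lines could look like the sketch below. The target and the test.xml path are the ones named above; the helper name `check_test_xml` and the guards are hypothetical, and the `k8-fastbuild` path segment depends on CPU and compilation mode (the original report shows `ppc-opt` on POWER9):

```shell
#!/bin/sh
# Succeeds if the given test.xml exists and is non-empty.
check_test_xml() {
    if [ -s "$1" ]; then
        echo "OK: $1 was generated"
    else
        echo "FAIL: $1 is missing or empty" >&2
        return 1
    fi
}

# Path as quoted above; adjust the configuration segment for your platform.
xml=bazel-out/k8-fastbuild/testlogs/examples/cpp/hello-success_test/test.xml

# Only run when bazel is on PATH and we are inside the Bazel repository.
if command -v bazel >/dev/null 2>&1 && [ -d examples/cpp ]; then
    bazel test examples/cpp:hello-success_test -s
    check_test_xml "$xml"
fi
```

Run it in an environment where the compiler runtime is only reachable through LD_LIBRARY_PATH, since that is the setup in which the failure appears.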

@Flamefire
Contributor Author

Ok, checking now. I do have a reproducible way of setting up such an environment, compilation and test using EasyBuild. I'm testing with Bazel 3.4.1 which I know fails and now apply a backported patch (minor changes to master made the patch fail to apply). Am I right assuming that --test_env=LD_LIBRARY_PATH is still required with your fix? It sounds strange to me having to modify my test environment to allow a wrapper to run which executes a Bazel tool only.

@meteorcloudy
Member

Am I right assuming that --test_env=LD_LIBRARY_PATH is still required with your fix?

Probably, otherwise only the PATH env var is updated, not sure that's enough. LD_LIBRARY_PATH is not passed by default because bazel wants to make the build as reproducible as possible, this benefits remote caching/execution. If users want to change the env vars, they should do it explicitly.

@Flamefire
Contributor Author

This still sounds wrong. I mean: I'm not running anything from "my" tests/binaries/.... What is being run here is ultimately a bash with a command. This does not require any LD_LIBRARY_PATH modification and hence is good by default, reproducible and so on. Now because of a Bazel internal detail, namely the process-wrapper binary, I need to pass LD_LIBRARY_PATH to everything or I can't use anything at all. So basically Bazel forces me to violate the isolation it should enforce/support.
And unfortunately this is even barely tested and frequently breaks as shown by e.g. this issue.

Another datapoint: I tried to run bazel test //src/test (found in one of the CI scripts) and got this:

ERROR: /tmp/s3248973-EasyBuild/Bazel/3.4.1/GCCcore-8.3.0/src/main/java/com/google/devtools/build/lib/runtime/commands/license/BUILD:14:15: error executing shell command: '/bin/bash -c for f in $SRCS; do echo ===== $f ===== && cat $f && echo && echo ; done > $OUT' failed (Exit 1): bash failed: error executing command 
  (cd /tmp/s3248973-EasyBuild/Bazel/3.4.1/GCCcore-8.3.0/tmpXkIhHk-bazel-root/c180037cf43126349216df344a1aca16/sandbox/processwrapper-sandbox/25/execroot/io_bazel && \
  exec env - \
    OUT=bazel-out/k8-fastbuild/bin/src/main/java/com/google/devtools/build/lib/runtime/commands/license/LICENSE \
    SRCS='LICENSE third_party/apache_commons_codec/LICENSE third_party/apache_commons_collections/LICENSE third_party/apache_commons_io/LICENSE third_party/apache_velocity/LICENSE third_party/asm/LICENSE third_party/auto/LICENSE third_party/caffeine/LICENSE third_party/checker_framework_annotations/LICENSE third_party/compile_testing/LICENSE third_party/diffutils/LICENSE third_party/flogger/LICENSE third_party/gson/LICENSE third_party/guava/LICENSE third_party/hamcrest/LICENSE third_party/java-diff-utils/LICENSE third_party/javax_activation/LICENSE third_party/javax_annotations/LICENSE third_party/jimfs/LICENSE third_party/jsr305/LICENSE third_party/junit/LICENSE third_party/mockito/LICENSE third_party/netty/LICENSE.txt third_party/objenesis/LICENSE third_party/opencensus/LICENSE-2.0.txt third_party/tomcat_annotations_api/LICENSE third_party/truth/LICENSE third_party/truth8/LICENSE third_party/xz/LICENSE third_party/allocation_instrumenter/LICENSE third_party/animal_sniffer/LICENSE third_party/antlr/LICENSE third_party/aws-sdk-auth-lite/LICENSE.txt third_party/checker_framework_dataflow/LICENSE.txt third_party/checker_framework_javacutil/LICENSE.txt third_party/css/bootstrap/LICENSE third_party/css/font_awesome/LICENSE.mit third_party/css/font_awesome/LICENSE.ofl third_party/grpc/LICENSE third_party/ijar/LICENSE third_party/jarjar/LICENSE third_party/java/android_databinding/v3_4_0/LICENSE third_party/java/aosp_gradle_core/LICENSE third_party/java/j2objc-annotations/LICENSE third_party/java/jacoco/LICENSE third_party/java/javapoet/LICENSE third_party/java/jcommander/LICENSE third_party/java/jdk/langtools/LICENSE third_party/java/proguard/proguard5.3.3/docs/license.html third_party/javascript/bootstrap/LICENSE third_party/jaxb/LICENSE.txt third_party/jetbrains_annotations/LICENSE third_party/jformatstring/LICENSE third_party/juniversalchardet/LICENSE third_party/kotlin_stdlib/LICENSE third_party/pprof/LICENSE third_party/protobuf/LICENSE third_party/py/concurrent/LICENSE third_party/py/gflags/gflags/third_party/pep257/LICENSE third_party/py/mock/LICENSE.txt third_party/py/six/LICENSE third_party/zlib/LICENSE.txt third_party/zlib/contrib/dotzlib/LICENSE_1_0.txt third_party/nanopb/LICENSE.txt external/googleapis/LICENSE' \
  /bin/bash -c 'for f in $SRCS; do echo ===== $f ===== && cat $f && echo && echo ; done > $OUT')
Execution platform: //:default_host_platform
...
/tmp/s3248973-EasyBuild/Bazel/3.4.1/GCCcore-8.3.0/tmpXkIhHk-bazel-root/install/bc88d2b8527ac2d359d67c2bd5bd1b80/process-wrapper: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /tmp/s3248973-EasyBuild/Bazel/3.4.1/GCCcore-8.3.0/tmpXkIhHk-bazel-root/install/bc88d2b8527ac2d359d67c2bd5bd1b80/process-wrapper)
/tmp/s3248973-EasyBuild/Bazel/3.4.1/GCCcore-8.3.0/tmpXkIhHk-bazel-root/install/bc88d2b8527ac2d359d67c2bd5bd1b80/process-wrapper: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /tmp/s3248973-EasyBuild/Bazel/3.4.1/GCCcore-8.3.0/tmpXkIhHk-bazel-root/install/bc88d2b8527ac2d359d67c2bd5bd1b80/process-wrapper)

So again the process-wrapper failed to run due to missing LD_LIBRARY_PATH.
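
To confirm which symbol versions a given libstdc++ offers, a check like the following sketch can help. The library path and the two version numbers come from the error message above; the helper name `has_glibcxx` is hypothetical, and scanning the file with `grep -a` is a rough stand-in for inspecting it with a proper ELF tool:

```shell
#!/bin/sh
# Succeeds if the library file contains the given GLIBCXX version string.
# grep -a reads the binary as text and -o prints each match on its own line;
# the second grep compares whole lines literally (-x -F).
has_glibcxx() {
    LC_ALL=C grep -ao 'GLIBCXX_[0-9.]*' "$1" | grep -qxF "GLIBCXX_$2"
}

lib=/lib64/libstdc++.so.6   # path from the error message above
for v in 3.4.20 3.4.21; do
    if [ ! -r "$lib" ]; then
        echo "$lib is not readable on this system"
    elif has_glibcxx "$lib" "$v"; then
        echo "$lib provides GLIBCXX_$v"
    else
        echo "$lib lacks GLIBCXX_$v"
    fi
done
```

If the system library lacks the versions while the module-provided one has them, the process-wrapper was linked against the newer runtime and can only start when that runtime is on the search path or linked in statically.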

So given that the process-wrapper is meant to be run in (pretty much) any environment and should be transparent to the user and its environment, settings, etc. I'd say that the only valid solution is to link it statically.

Alternatively one could consider passing e.g. LD_LIBRARY_PATH to the process-wrapper but not to the processes executed by it. But I guess this is harder to implement and enforce.

At the very least the Bazel build process (especially the bootstrap) should link such binaries statically by default, with an option to not do that to save space. However, this means that using --action_env=LD_LIBRARY_PATH later may lead to problems if any of those paths contains another libstdc++.

@meteorcloudy
Member

@Flamefire
So this is basically a question of choosing dynamic linking or static linking for Bazel and its embedded tools.

If we choose dynamic linking, then Bazel expects some basic system libraries to be in standard locations. This usually works for most of our users.

If we choose static linking, it will definitely help avoid such issues, but it will blow up the Bazel binary size (there is not only process-wrapper, but also ijar, build-runfiles, linux-sandbox, etc.). And many Linux distributions strongly prefer dynamic linking. So I don't think making static linking the default is really an option for us.

To summarize, if you use an official Bazel binary, --action_env=LD_LIBRARY_PATH and --test_env=LD_LIBRARY_PATH could help (after #12659). Otherwise, you can build a customized Bazel binary with process-wrapper (and maybe other tools) statically linked using the workaround suggested in #4137 or #12651

@Flamefire
Contributor Author

I see. Then the question is whether libstdc++ is considered such a "basic system library".
On many HPC systems (I know 3 big European ones) it is not: The compiler and its runtimes (excluding very basic ones like libc) are usually provided via modules while the system compiler is never touched. So you get a recent compiler by loading a module (which sets LD_LIBRARY_PATH)
So using the dynamic approach will be problematic on such systems where TensorFlow/Bazel is likely to be used.

To summarize, if you use an official Bazel binary, --action_env=LD_LIBRARY_PATH and --test_env=LD_LIBRARY_PATH could help (after #12659).

I tried this and this does not work. The error in my previous message is from a run of the Bazel tests with Bazel built with the mentioned patch and --action_env=LD_LIBRARY_PATH set. To make it work all rules must use use_default_shell_env = True but some don't

#12651 seems to work on x86, but on another system (using POWER9) it fails with errors like /lib/../lib64/libc.a(reg-printf.o)(.note.stapsdt+0x14): error: relocation refers to local symbol "" [1], which is defined in a discarded section. Inspecting the (deleted) temporary parameter file shows that only -static is added, which might be too much.

The BAZEL_LINKOPTS approach from #4137 seems to work though. However having "first class support" for a statically linked bazel would be great. Something like BAZEL_STATIC=1 ./compile.sh and/or support in Bazel to link to static C++ runtimes, i.e. the equivalent of BAZEL_LINKOPTS=-static-libstdc++:-static-libgcc BAZEL_LINKLIBS=-l%:libstdc++.a:-lm

@meteorcloudy
Member

I tried this and this does not work. The error in my previous message is from a run of the Bazel tests with Bazel built with the mentioned patch and --action_env=LD_LIBRARY_PATH set. To make it work all rules must use use_default_shell_env = True but some don't

Hmm, I thought --action_env should add the env var for all actions no matter whether they are using use_default_shell_env = True or not. Looks like there are still some exceptions.

#12651 seems to work on x86, but on another system (using POWER9) it fails with errors like /lib/../lib64/libc.a(reg-printf.o)(.note.stapsdt+0x14): error: relocation refers to local symbol "" [1], which is defined in a discarded section. Inspecting the (deleted) temporary parameter file shows that only -static is added, which might be too much.

This looks like another reason why we cannot link everything statically by default.

The BAZEL_LINKOPTS approach from #4137 seems to work though. However having "first class support" for a statically linked bazel would be great. Something like BAZEL_STATIC=1 ./compile.sh and/or support in Bazel to link to static C++ runtimes, i.e. the equivalent of BAZEL_LINKOPTS=-static-libstdc++:-static-libgcc BAZEL_LINKLIBS=-l%:libstdc++.a:-lm

Glad this way still works. The problem with adding BAZEL_STATIC=1 is that ./compile.sh has to work across platforms. What does it mean on macOS or Windows, or even on the various Linux distributions? It would take a huge amount of work to make it work correctly and it's really hard to verify. So I think it's best for users to decide which BAZEL_LINKOPTS and BAZEL_LINKLIBS work for them.

@Flamefire
Contributor Author

This is why I suggested to add support for this to Bazel. I mean isn't one responsibility of a build system to abstract away platform specific things?
In #12651 you added linkstatic = 1,, so couldn't Bazel support something like linkstatic_cxx which does link libstdc++, libgcc or the compiler/platform specific variants?

@meteorcloudy
Member

Bazel does provide ways to abstract away platform specific things. But users have to actually implement those for the platforms they care about. However, it's really hard for a project to officially support all the different platforms out there.

In #12651 you added linkstatic = 1,, so couldn't Bazel support something like linkstatic_cxx which does link libstdc++, libgcc or the compiler/platform specific variants?

Yes, features = ["fully_static_link"], is exactly our effort to implement something like that. As you can see, I tried this and failed. The build passed on Ubuntu but failed on CentOS, which means we have to distinguish them. I think it's easy to tweak the build to make static linking work for your own dev environment, but it's hard to find a general solution for all platforms.

coeuvre pushed a commit to coeuvre/bazel that referenced this issue Jul 15, 2021
Previously, we hardcode the envs of the xml generation action, which
caused problem for process-wrapper because it's dynamically linked to
some system libraries and the required PATH or LD_LIBRARY_PATH are not
set.

This change propagate the envs we set for the actual test action to the
xml file generation action to make sure the env vars are correctly set and
can also be controlled by --action_env and --test_env.

Fixes bazelbuild#4137
Fixes bazelbuild#12579

Closes bazelbuild#12659.

PiperOrigin-RevId: 347596753
coeuvre pushed a commit to coeuvre/bazel that referenced this issue Jul 16, 2021
@aaronmondal

@meteorcloudy @Flamefire This seems to still be an issue. I'm using Bazel in a custom container build with nix and hitting the generate-xml.sh failed: error executing command error. Is there a way to get this to work without building a custom Bazel?

For build and run invocations the process-wrapper respects ldconfig /lib, set during container startup, but the test invocation ignores it. --action_env and --test_env cause the correct LD_LIBRARY_PATH to show up in the env for the test action subcommands, but the path seems to be ignored because it doesn't change the error.

From what I understand we have two options at the moment:

  • Use a regular base image with FHS layout in CI. I'd rather avoid this because pure nix-built containers are easier to reason about, have good caching properties, and in our case replicate the environment in which our users run bazel very well.
  • Build a custom Bazel and pass it to the CI container. I'd also rather avoid that because it increases CI complexity, adds another artifact that we'd have to cache, and introduces a skew between CI and what users would use.

If the current behavior is WAI I guess we have no other choice than to distribute a custom Bazel as part of rules_ll, but it seems to me that the current behavior is not WAI.

I've tested this against 6.1.1 and 7.0.0-pre.20230306.4, with and without --incompatible_strict_action_env. Always the same error.
