bazelisk crashes within enroot container running on slurm cluster #665

Open
vsee opened this issue Mar 3, 2025 · 5 comments
Labels
P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee)

Comments

@vsee

vsee commented Mar 3, 2025

I am running bazelisk as part of my GitHub CI, which runs inside an enroot container launched as a Slurm job. Bazelisk crashes immediately after launch. Here is a minimal set of commands to reproduce the crash, followed by the resulting error message:

wget https://github.com/bazelbuild/bazelisk/releases/download/v1.25.0/bazelisk-linux-amd64
chmod +x bazelisk-linux-amd64
./bazelisk-linux-amd64


2025/03/03 05:23:55 Downloading https://releases.bazel.build/8.1.1/release/bazel-8.1.1-linux-x86_64...
Downloading: 58 MB out of 58 MB (100%)
WARNING: Invoking Bazel in batch mode since it is not invoked from within a workspace (below a directory having a MODULE.bazel file).
Extracting Bazel installation...
WARNING: Ignoring JAVA_HOME, because it must point to a JDK, not a JRE.
OpenJDK 64-Bit Server VM warning: Options -Xverify:none and -noverify were deprecated in JDK 13 and will likely be removed in a future release.
FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.ExceptionInInitializerError
	at com.google.devtools.build.lib.skyframe.SkyframeExecutor.<clinit>(SkyframeExecutor.java:449)
	at com.google.devtools.build.lib.skyframe.BazelSkyframeExecutorConstants.newBazelSkyframeExecutorBuilder(BazelSkyframeExecutorConstants.java:45)
	at com.google.devtools.build.lib.skyframe.SequencedSkyframeExecutorFactory.create(SequencedSkyframeExecutorFactory.java:44)
	at com.google.devtools.build.lib.runtime.WorkspaceBuilder.build(WorkspaceBuilder.java:118)
	at com.google.devtools.build.lib.runtime.BlazeRuntime.initWorkspace(BlazeRuntime.java:274)
	at com.google.devtools.build.lib.runtime.BlazeRuntime.newRuntime(BlazeRuntime.java:1440)
	at com.google.devtools.build.lib.runtime.BlazeRuntime.batchMain(BlazeRuntime.java:1031)
	at com.google.devtools.build.lib.runtime.BlazeRuntime.main(BlazeRuntime.java:866)
	at com.google.devtools.build.lib.bazel.Bazel.main(Bazel.java:104)
Caused by: java.lang.NullPointerException
	at java.base/java.util.Objects.requireNonNull(Unknown Source)
	at java.base/sun.nio.fs.UnixFileSystem.getPath(Unknown Source)
	at java.base/java.nio.file.Path.of(Unknown Source)
	at java.base/java.nio.file.Paths.get(Unknown Source)
	at java.base/jdk.internal.platform.CgroupUtil.lambda$readStringValue$1(Unknown Source)
	at java.base/java.security.AccessController.doPrivileged(Unknown Source)
	at java.base/jdk.internal.platform.CgroupUtil.readStringValue(Unknown Source)
	at java.base/jdk.internal.platform.CgroupSubsystemController.getStringValue(Unknown Source)
	at java.base/jdk.internal.platform.CgroupSubsystemController.getLongValue(Unknown Source)
	at java.base/jdk.internal.platform.cgroupv1.CgroupV1Subsystem.getLongValue(Unknown Source)
	at java.base/jdk.internal.platform.cgroupv1.CgroupV1Subsystem.getHierarchical(Unknown Source)
	at java.base/jdk.internal.platform.cgroupv1.CgroupV1Subsystem.initSubSystem(Unknown Source)
	at java.base/jdk.internal.platform.cgroupv1.CgroupV1Subsystem.getInstance(Unknown Source)
	at java.base/jdk.internal.platform.CgroupSubsystemFactory.create(Unknown Source)
	at java.base/jdk.internal.platform.CgroupSubsystemFactory.create(Unknown Source)
	at java.base/jdk.internal.platform.CgroupMetrics.getInstance(Unknown Source)
	at java.base/jdk.internal.platform.SystemMetrics.instance(Unknown Source)
	at java.base/jdk.internal.platform.Metrics.systemMetrics(Unknown Source)
	at java.base/jdk.internal.platform.Container.metrics(Unknown Source)
	at jdk.management/com.sun.management.internal.OperatingSystemImpl.<init>(Unknown Source)
	at jdk.management/com.sun.management.internal.PlatformMBeanProviderImpl.getOperatingSystemMXBean(Unknown Source)
	at jdk.management/com.sun.management.internal.PlatformMBeanProviderImpl$3.nameToMBeanMap(Unknown Source)
	at java.management/sun.management.spi.PlatformMBeanProvider$PlatformComponent.getMBeans(Unknown Source)
	at java.management/java.lang.management.ManagementFactory.getPlatformMXBean(Unknown Source)
	at java.management/java.lang.management.ManagementFactory.getOperatingSystemMXBean(Unknown Source)
	at com.google.devtools.build.lib.util.ResourceUsage.<clinit>(ResourceUsage.java:46)
	... 9 more

I can't figure out what the problem with the environment is. Any help would be appreciated.
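The `Caused by` frames in the trace show the JVM's cgroup v1 detection (`CgroupV1Subsystem.initSubSystem`) passing a null path to `Paths.get` while reading a controller value. One possible cause, not confirmed anywhere in this thread, is a controller that the process is assigned to but whose hierarchy is not mounted inside the container. The sketch below compares the two; the function name `check_cgroup_mounts` is hypothetical, and the file arguments default to the standard Linux `/proc` paths.

```shell
#!/bin/sh
# Hypothetical diagnostic helper (not part of bazelisk, Bazel, or enroot).
# Reports, for each cgroup v1 controller the current process belongs to,
# whether a matching cgroup mount is visible.
# Usage: check_cgroup_mounts [cgroup_file] [mountinfo_file]
check_cgroup_mounts() {
    cgroup_file=${1:-/proc/self/cgroup}
    mountinfo_file=${2:-/proc/self/mountinfo}

    # Lines in /proc/self/cgroup look like "hierarchy-ID:controller-list:path".
    # An empty controller list ("0::/path") is the cgroup v2 entry; skip it,
    # along with named hierarchies such as "name=systemd".
    cut -d: -f2 "$cgroup_file" | tr ',' '\n' |
        grep -v '^$' | grep -v '^name=' | sort -u |
    while read -r controller; do
        # In mountinfo, the filesystem type follows the "-" separator and the
        # super options (which name the controller) are the last field.
        if awk -v c="$controller" '
            {
                sep = 0
                for (i = 1; i <= NF; i++) if ($i == "-") { sep = i; break }
                if (sep && $(sep + 1) == "cgroup") {
                    n = split($NF, opts, ",")
                    for (j = 1; j <= n; j++) if (opts[j] == c) found = 1
                }
            }
            END { exit found ? 0 : 1 }' "$mountinfo_file"
        then
            echo "$controller: mounted"
        else
            echo "$controller: NOT mounted"
        fi
    done
}
```

Calling `check_cgroup_mounts` with no arguments inside the enroot container and again inside a plain Docker container would show whether the two environments expose different cgroup mounts. A `NOT mounted` line is only a hint about where to look, not proof of the cause.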

EDIT: I launched the Slurm job with a command along these lines:

srun  \
--job-name=debug_ci_runner  \
--time=1:00:00 \
--cpus-per-task=16 \
--gres=gpu:1  \
--container-image=/path/to/my/ci/images/gh_runner.sqsh \
--container-remap-root  \
--no-container-mount-home \
--container-mounts=/usr/share/sunk:/usr/share/sunk \
--container-workdir=/actions-runner/  \
--pty /bin/bash

@meteorcloudy
Member

Could you provide the container image so that we can reproduce this?

@vsee
Author

vsee commented Mar 4, 2025

I am using this Docker image, docker://nvcr.io#nvidia/pytorch:24.07-py3 (see here as well), as the base image for enroot.

On top of it I install a few additional packages so that I can download and run the GitHub actions-runner:

apt update
apt install -y --no-install-recommends \
    ca-certificates curl libicu-dev

@meteorcloudy
Member

Running bazelisk inside a container started with docker run -it nvcr.io/nvidia/pytorch:24.07-py3 works for me:

root@d0817b1f63c9:/workspace# ./bazelisk-linux-amd64
WARNING: Invoking Bazel in batch mode since it is not invoked from within a workspace (below a directory having a MODULE.bazel file).
OpenJDK 64-Bit Server VM warning: Options -Xverify:none and -noverify were deprecated in JDK 13 and will likely be removed in a future release.
                                                           [bazel release 8.1.1]
Usage: bazel <command> <options> ...

Available commands:
  analyze-profile     Analyzes build profile data.
  aquery              Analyzes the given targets and queries the action graph.
  build               Builds the specified targets.
  canonicalize-flags  Canonicalizes a list of bazel options.
  clean               Removes output files and optionally stops the server.
  coverage            Generates code coverage report for specified test targets.
  cquery              Loads, analyzes, and queries the specified targets w/ configurations.
  dump                Dumps the internal state of the bazel server process.
  fetch               Fetches external repositories that are prerequisites to the targets.
  help                Prints help for commands, or the index.
  info                Displays runtime info about the bazel server.
  license             Prints the license of this software.
  mobile-install      Installs targets to mobile devices.
  mod                 Queries the Bzlmod external dependency graph
  print_action        Prints the command line args for compiling a file.
  query               Executes a dependency graph query.
  run                 Runs the specified target.
  shutdown            Stops the bazel server.
  sync                Syncs all repositories specified in the workspace file
  test                Builds and runs the specified test targets.
  vendor              Fetches external repositories into a folder specified by the flag --vendor_dir.
  version             Prints version information for bazel.

Getting more help:
  bazel help <command>
                   Prints help and options for <command>.
  bazel help startup_options
                   Options for the JVM hosting bazel.
  bazel help target-syntax
                   Explains the syntax for specifying targets.
  bazel help info-keys
                   Displays a list of keys used by the info command.

@vsee
Author

vsee commented Mar 5, 2025

Thanks for testing. Could you try this with enroot as well?

Installation instructions: https://github.com/NVIDIA/enroot/blob/master/doc/installation.md

/usr/bin/enroot import --output pytorch:24.07-py3.sqsh -- docker://nvcr.io#nvidia/pytorch:24.07-py3
/usr/bin/enroot create --name my_builder -- pytorch:24.07-py3.sqsh
/usr/bin/enroot start -- my_builder wget https://github.com/bazelbuild/bazelisk/releases/download/v1.25.0/bazelisk-linux-amd64
/usr/bin/enroot start -- my_builder chmod +x bazelisk-linux-amd64
/usr/bin/enroot start -- my_builder ./bazelisk-linux-amd64
/usr/bin/enroot remove --force -- my_builder

Expected output:
https://pastebin.com/VuAK3mS3
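The enroot steps above can be wrapped in a small function so that the container is removed even when bazelisk fails. This is only a sketch: the function name `repro_in_enroot` and the `ENROOT`, `IMAGE`, and `NAME` variables are hypothetical, the skip-if-present check on the image file is an added convenience, and the enroot subcommands and flags are the ones quoted in this comment.

```shell
#!/bin/sh
# Hypothetical wrapper around the reproduction steps quoted in this thread.
# ENROOT, IMAGE, and NAME are illustrative overrides, not enroot settings.
repro_in_enroot() {
    enroot_bin=${ENROOT:-/usr/bin/enroot}
    image=${IMAGE:-pytorch:24.07-py3.sqsh}
    name=${NAME:-my_builder}
    url=https://github.com/bazelbuild/bazelisk/releases/download/v1.25.0/bazelisk-linux-amd64

    # Import the image only if the squashfs file is not already present.
    [ -e "$image" ] ||
        "$enroot_bin" import --output "$image" -- 'docker://nvcr.io#nvidia/pytorch:24.07-py3' ||
        return 1
    "$enroot_bin" create --name "$name" -- "$image" || return 1

    # Download, mark executable, and run bazelisk inside the container.
    # The chain stops at the first failing step and records its exit status.
    status=0
    "$enroot_bin" start -- "$name" wget "$url" &&
        "$enroot_bin" start -- "$name" chmod +x bazelisk-linux-amd64 &&
        "$enroot_bin" start -- "$name" ./bazelisk-linux-amd64 || status=$?

    # Always remove the container, then report the recorded status.
    "$enroot_bin" remove --force -- "$name"
    return "$status"
}
```

Running `repro_in_enroot` and checking `$?` gives a single pass/fail result for the reproduction, which is convenient when comparing hosts or enroot versions.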

@meteorcloudy
Member

Sorry, I don't have the capacity to look into this further.

@meteorcloudy added the P4 label (This is either out of scope or we don't have bandwidth to review a PR. No assignee) on Mar 6, 2025