Skip to content

Conversation

@kyleliang-nv
Copy link
Contributor

@kyleliang-nv kyleliang-nv commented Jul 2, 2025

Motivation

The current Dockerfile does not build for aarch64. This PR makes the main Dockerfile platform agnostic
which allows people to build it for both x86 and aarch64 targets.

Modifications

  • Change hard-coded x86_64 path to be platform dependent.
  • For aarch64 build, install fzyzcjy's DeepEP fork with commit 1b14ad661c7640137fcfe93cccb2694ede1220b0
  • In release-docker.yml, add in b200 and gb200 with cuda 12.9.1
  • In release-docker-dev.yml, build both blackwell and blackwell-cu129, with blackwell-cu129 being a multiplatform docker image
  • Change release-docker-gb200.yml to not auto-trigger on version change.
  • NOTE: This PR should make Dockerfile.gb200 and release-docker-gb200.yml obsolete, however, will keep these two files around and remove them in the figure, when we are sure everything works with the current github actions

Accuracy Tests

Checklist

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @kyleliang-nv, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Dockerfile's compatibility by making it platform-agnostic, specifically enabling builds for both x86 and aarch64 architectures. My changes primarily involve replacing hardcoded x86_64 references with dynamic $(uname -m) commands or conditional logic. This ensures that architecture-specific dependencies, symlinks, and package sources are correctly resolved during the Docker build process, greatly improving the portability of the build environment.

Highlights

  • Platform Agnostic Symlinks: I've modified the libmlx5.so symlink creation in the Dockerfile to dynamically use $(uname -m) instead of a hardcoded x86_64 path. This ensures the symlink is correctly created regardless of the underlying architecture.
  • Dynamic SGLang Kernel Wheel Installation: The SGLang kernel wheel download URL has been updated to incorporate $(uname -m) for the architecture suffix. This allows the Docker build to fetch the correct pre-built wheel for either x86 or aarch64 targets.
  • Multi-Architecture NVIDIA Repositories: I've adjusted the apt source list and apt-key commands for NVIDIA developer tools and CUDA repositories. These now conditionally select amd64 or arm64 based on the output of uname -m, ensuring the correct repository is used for package management on different architectures.
  • CMake Installation for Multiple Architectures: The CMake download URL and extraction paths have been changed to dynamically include $(uname -m). This enables the Dockerfile to download and install the appropriate CMake binary for the detected build architecture, enhancing cross-platform compatibility.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the Dockerfile to support aarch64 builds by replacing hardcoded x86_64 values with dynamic architecture detection. A critical issue was identified where the architecture check for the NVIDIA CUDA repository key uses arm64 instead of sbsa for aarch64. Suggestions were provided to improve the maintainability of the Dockerfile by using variables to reduce repetitive code.

Copy link
Collaborator

@ishandhanani ishandhanani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add this into CI and allow users to pull either platform from the lmsys docker repository. WDYT @zhyncs?

@kyleliang-nv kyleliang-nv force-pushed the feature/update_sglang_docker_for_aarch64 branch 2 times, most recently from 9e7dee6 to 4af8600 Compare July 15, 2025 03:31
@kyleliang-nv kyleliang-nv force-pushed the feature/update_sglang_docker_for_aarch64 branch from 4ebda8c to f817e6e Compare July 16, 2025 19:02

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Blackwell" encompasses more than sm_100. (Same comment below for TORCH_CUDA_ARCH_LIST.) Maybe we should be more expansive in the list here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would CMAKE_CUDA_ARCHITECTURES=100;120 be sufficient?
And for TORCH_CUDA_ARCH_LIST, will TORCH_CUDA_ARCH_LIST="10.0 12.0" be sufficient?
Since this is a blackwell build, I'm thinking if I should include sm90.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FP8 or FP4 kernels add complexity to the explanation because of arch and/or family conditional compilation.

For the kernels that aren't fp8 or fp4, 10.0 + 12.0 together covers a lot of ground for the Blackwell families, yes. Thor is the other one; its number had been 10.1 but it will be renumbered as SM 11.0 in CUDA 13 here shortly. (We could perhaps skip Thor for the moment but expect to add it later.)

As to the "what about 9.0", I think my best advice is to have one build that serves all the primary targets, so at least here that's Hopper and Blackwell together. Instead of having the "blackwell" one not be the default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explaination Cliff. I'll work on these changes.
Claiming it to be Hopper and Blackwell is a bit challenging right now, due to the DeepEP library where they explictly check that TORCH_CUDA_ARCH_LIST have to be exactly "9.0", otherwise will disable some features.
https://github.com/deepseek-ai/DeepEP/blob/main/setup.py#L70-L73.
Not sure how much SGLang depends on this specific feature in DeepEP package.

@kyleliang-nv kyleliang-nv force-pushed the feature/update_sglang_docker_for_aarch64 branch from 390bf94 to bc6ffbe Compare July 18, 2025 22:28
@kyleliang-nv kyleliang-nv mentioned this pull request Jul 18, 2025
6 tasks
@kyleliang-nv kyleliang-nv force-pushed the feature/update_sglang_docker_for_aarch64 branch 2 times, most recently from bf6d118 to 8dcbbd1 Compare July 25, 2025 16:17
@kyleliang-nv kyleliang-nv marked this pull request as draft July 29, 2025 05:43
@kyleliang-nv kyleliang-nv force-pushed the feature/update_sglang_docker_for_aarch64 branch 3 times, most recently from 3427207 to 63c339c Compare August 6, 2025 18:21
@kyleliang-nv kyleliang-nv force-pushed the feature/update_sglang_docker_for_aarch64 branch 2 times, most recently from 29f5ee8 to 6c51a23 Compare August 13, 2025 16:38
@kyleliang-nv kyleliang-nv changed the title Make the main docker build file to work on aarch64 Make the main Dockerfile to work on aarch64 Aug 13, 2025
@kyleliang-nv kyleliang-nv marked this pull request as ready for review August 13, 2025 18:55
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we call this blackwell-cu128?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's more consistent, but will render blackwell tag obsolete. Or, we can tag it as blackwell-cu128 and instead add additional tag for blackwell-cu129asblackwell. So that blackwell` is the one that have multi-platform (with cu129).

Copy link
Collaborator

@ishandhanani ishandhanani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comment on tag but LGTM

@kyleliang-nv kyleliang-nv force-pushed the feature/update_sglang_docker_for_aarch64 branch from e3f56bc to 1af3696 Compare August 14, 2025 16:45
@zhyncs zhyncs self-assigned this Aug 14, 2025
@kyleliang-nv kyleliang-nv force-pushed the feature/update_sglang_docker_for_aarch64 branch from b90f009 to c846af6 Compare August 15, 2025 20:35
@kyleliang-nv kyleliang-nv force-pushed the feature/update_sglang_docker_for_aarch64 branch from c846af6 to 63072e8 Compare August 18, 2025 15:53
@kyleliang-nv kyleliang-nv force-pushed the feature/update_sglang_docker_for_aarch64 branch from b18d7ed to ac12dc3 Compare September 3, 2025 23:29
@kyleliang-nv kyleliang-nv mentioned this pull request Sep 4, 2025
4 tasks
@kyleliang-nv kyleliang-nv force-pushed the feature/update_sglang_docker_for_aarch64 branch from 46744b3 to 3e85f4b Compare September 8, 2025 18:03
@kyleliang-nv
Copy link
Contributor Author

Closing this PR since #10705 is already merged in

@kyleliang-nv kyleliang-nv deleted the feature/update_sglang_docker_for_aarch64 branch October 17, 2025 16:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants