feat: multi-arch CUDA Dockerfile and sm_121 (DGX Spark GB10) #840
alvarobartt merged 1 commit into huggingface:main
Conversation
Force-pushed 44f1190 to 8cf4772
alvarobartt
left a comment
Thanks a lot for the PR @nazq, looks really clean!
Could you review and update also the table with the different images at https://github.com/huggingface/text-embeddings-inference/blob/main/docs/source/en/supported_models.md? Then I'll merge and validate that the CI is working as expected, hoping to release v1.9.3 next week.
And thanks for building on top of @z4y4ts PR and keeping them as co-author, much appreciated 🤗
Updated supported_models.md. I updated the CI too, but I haven't run it, so that part was done by inspection only.
Hi @nazq, thanks so much for that PR! I tested it on my Spark and got a build failure. After searching a bit, I found that #842 should fix it. I applied those changes and the build finished without any errors, so I guess only a rebase is needed.
Great, thanks for this. I didn't buy a Spark until I knew we could get this PR in. Happy to rebase it.
Force-pushed a9395f8 to ad55ed2
Hey @stefan-it — rebased onto upstream main, which now includes #842. Should fix the build failure.
Hi @nazq, many thanks! I did a fresh clone of the rebased branch and built it with:
docker build . -f Dockerfile-cuda --no-cache --build-arg CUDA_COMPUTE_CAP=121 --platform linux/arm64 -t text-embeddings-inference:121-1.9-pr
The result was:
[+] Building 895.2s (32/32) FINISHED docker:default
=> [internal] load build definition from Dockerfile-cuda 0.0s
=> => transferring dockerfile: 6.46kB 0.0s
=> [internal] load metadata for docker.io/nvidia/cuda:12.9.1-runtime-ubuntu24.04 0.2s
=> [internal] load metadata for docker.io/nvidia/cuda:12.9.1-devel-ubuntu24.04 0.2s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 53B 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 17.28kB 0.0s
=> CACHED [base-builder 1/6] FROM docker.io/nvidia/cuda:12.9.1-devel-ubuntu24.04@sha256:020bc241a628776338f4d4053fed4c38f6f7f3d7eb5919fecb8de313bb8ba47c 0.0s
=> CACHED [base 1/3] FROM docker.io/nvidia/cuda:12.9.1-runtime-ubuntu24.04@sha256:1287141d283b8f06f45681b56a48a85791398c615888b1f96bfb9fc981392d98 0.0s
=> [base-builder 2/6] RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends curl libssl-dev pkg-config && rm -rf /var/l 22.1s
=> [base 2/3] RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends ca-certificates libssl-dev curl cuda-compat-12-9 19.6s
=> [base 3/3] COPY --chmod=775 cuda-entrypoint.sh entrypoint.sh 0.0s
=> [base-builder 3/6] RUN case "arm64" in "amd64") SCCACHE_ARCH=x86_64-unknown-linux-musl ;; "arm64") SCCACHE_ARCH=aarch64-unknown-linux-musl ;; *) echo "Unsupported 2.9s
=> [base-builder 4/6] COPY rust-toolchain.toml rust-toolchain.toml 0.0s
=> [base-builder 5/6] RUN curl https://sh.rustup.rs -sSf | bash -s -- -y 32.5s
=> [base-builder 6/6] RUN cargo install cargo-chef --version 0.1.73 --locked 49.9s
=> [planner 1/7] WORKDIR /usr/src 0.0s
=> [builder 2/9] RUN --mount=type=secret,id=actions_results_url,env=ACTIONS_RESULTS_URL --mount=type=secret,id=actions_runtime_token,env=ACTIONS_RUNTIME_TOKEN if [ 121 -g 0.6s
=> [planner 2/7] COPY backends backends 0.1s
=> [planner 3/7] COPY core core 0.1s
=> [planner 4/7] COPY router router 0.1s
=> [planner 5/7] COPY Cargo.toml ./ 0.1s
=> [planner 6/7] COPY Cargo.lock ./ 0.1s
=> [planner 7/7] RUN cargo chef prepare --recipe-path recipe.json 0.2s
=> [builder 3/9] COPY --from=planner /usr/src/recipe.json recipe.json 0.1s
=> [builder 4/9] RUN --mount=type=secret,id=actions_results_url,env=ACTIONS_RESULTS_URL --mount=type=secret,id=actions_runtime_token,env=ACTIONS_RUNTIME_TOKEN if [ 121 360.7s
=> [builder 5/9] COPY backends backends 0.1s
=> [builder 6/9] COPY core core 0.1s
=> [builder 7/9] COPY router router 0.1s
=> [builder 8/9] COPY Cargo.toml ./ 0.1s
=> [builder 9/9] COPY Cargo.lock ./ 0.1s
=> [http-builder 1/1] RUN --mount=type=secret,id=actions_results_url,env=ACTIONS_RESULTS_URL --mount=type=secret,id=actions_runtime_token,env=ACTIONS_RUNTIME_TOKEN if [ 423.1s
=> [stage-7 1/1] COPY --from=http-builder /usr/src/target/release/text-embeddings-router /usr/local/bin/text-embeddings-router 0.4s
=> exporting to image 1.5s
=> => exporting layers 1.4s
=> => writing image sha256:2018875deaebfac387abad481f0f2bb7979853ad2b607297aa8bdba5b1d67ef4 0.0s
 => => naming to docker.io/library/text-embeddings-inference:121-1.9-pr
So definitely working on a Spark 🥳
I'll put my order in then ;-)
Independent DGX Spark Validation
Hardware: DGX Spark (spark-97dd), NVIDIA GB10
Build branch: feat/arm64-cuda-blackwell (ad55ed2)
Smoke Tests
test_1 model: BAAI/bge-small-en-v1.5
test_2 model: BAAI/bge-small-en-v1.5
test_3 model: BAAI/bge-reranker-base
Key Finding
Flash attention works on sm_121 (GB10) with this PR. This is an improvement
Tested by: JC (jonathan.corners@voxell.ai) on DGX Spark (spark-97dd)
alvarobartt
left a comment
Thanks again @nazq 🙏🏻
Left some minor comments, happy to merge afterwards!
| Blackwell 12.1 (DGX Spark GB10, ...) | ghcr.io/huggingface/text-embeddings-inference:121-1.9 (experimental) |
| CPU (ARM64 / aarch64) | ghcr.io/huggingface/text-embeddings-inference:cpu-arm64-1.9 |
Could you please align this table with the same table in the README.md, using it as a reference?
| Architecture | Platform | Image |
|--------------|----------|-------|
| CPU | x86_64 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 |
| CPU | aarch64 | ghcr.io/huggingface/text-embeddings-inference:cpu-arm64-1.9 |
Should we add (experimental) here too despite already being validated, at least until we run aarch64 for a couple of releases?
For ARM64 hosts without NVIDIA GPUs, use the CPU Dockerfile. Inference runs on CPU cores only (no Metal/MPS support via Docker).
Suggested change:
For ARM64 hosts without NVIDIA GPUs such as Apple Silicon, use the `Dockerfile` for CPU, where inference will run without any accelerator, as Metal / MPS is not supported via Docker.
For ARM64 hosts with NVIDIA GPUs, build `Dockerfile-cuda` with the appropriate compute capability and `--platform linux/arm64`:
Suggested change:
For ARM64 hosts with NVIDIA GPUs, use / build the `Dockerfile-cuda` with `--platform linux/arm64`, and also with `--build-arg CUDA_COMPUTE_CAP` set to your instance's compute capability (only required when building the image).
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
- Add Dockerfile-cuda supporting both x86_64 and ARM64 (aarch64)
- Add sm_121 compute capability for NVIDIA GB10 (DGX Spark)
- Add cpu-arm64 image variant
- Update supported hardware documentation
Co-Authored-By: z4y4ts <z4y4ts@users.noreply.github.com>
Force-pushed 6ac9bc4 to 96b8bac
Hey @alvarobartt — rebased onto main and aligned the table with the README.
Ready for your re-review when you get a chance.
Summary
Builds on #827 (ARM64 CPU Dockerfile) by extending CUDA support to ARM64 and adding the DGX Spark GB10's sm_121 compute capability. Also adds the CI matrix entries and README updates needed to ship ARM64 images.
Changes
Dockerfile-cuda (multi-arch)
- `TARGETARCH` to select the correct sccache binary (x86_64 or aarch64)
- `TARGETARCH` to select the correct protoc binary (x86_64 or aarch_64)
- `nvprune` section for DGX Spark GB10
compute_cap.rs
- `(120..=121, 120) => true` — sm_121 runtime is compatible with sm_120 compiled binaries
- `(121, 121) => true` — exact match for native sm_121 builds
flash_attn.rs
- Allow `runtime_compute_cap == 121` to use flash attention v2 (same arch family as sm_120)
build.yaml
- `matrix.platforms` with fallback to `linux/amd64` — enables per-variant platform selection without breaking existing entries
matrix.json
- `blackwell-121` entry (`linux/arm64`, `CUDA_COMPUTE_CAP=121`) for DGX Spark GB10
- `cpu-arm64` entry (`linux/arm64`, `Dockerfile-arm64`) for ARM64 CPU-only hosts
README.md
- `Platform` column added to the Docker Images table
- `cpu-arm64-1.9` and `121-1.9` image entries
Motivation
The NVIDIA DGX Spark uses the GB10 SoC with compute capability 12.1 (sm_121). This is a Blackwell-family chip (Grace + Blackwell GPU) on ARM64. Without these changes, TEI cannot run on the DGX Spark with CUDA acceleration.
Testing
- `docker build -f Dockerfile-cuda --build-arg CUDA_COMPUTE_CAP=121 --platform linux/arm64 .`
- `compute_cap_matching` with sm_121
- `121-{version}-grpc` and `cpu-arm64-{version}-grpc` images
Closes #769
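The sm_121 compatibility rules described for compute_cap.rs and flash_attn.rs can be sketched as below. This is a minimal illustration, not the exact TEI source: the function names, the catch-all fallback arm, and the `is_blackwell_family` helper are assumptions for the example; the real flash-attention gate also covers older architectures elsewhere.

```rust
/// Illustrative version of the compute-cap matching arms added in this PR:
/// an sm_121 runtime (GB10) may run binaries compiled for sm_120, since both
/// are Blackwell-family; other architectures fall back to an exact match.
fn compute_cap_matching(runtime_compute_cap: usize, compile_compute_cap: usize) -> bool {
    match (runtime_compute_cap, compile_compute_cap) {
        // sm_121 runtime is compatible with sm_120 compiled binaries
        (120..=121, 120) => true,
        // exact match for native sm_121 builds
        (121, 121) => true,
        // simplified fallback (hypothetical): require an exact match otherwise
        (rt, ct) => rt == ct,
    }
}

/// Hypothetical helper for the flash_attn.rs change: sm_120 and sm_121
/// belong to the same architecture family, so both may use flash attention v2.
fn is_blackwell_family(runtime_compute_cap: usize) -> bool {
    matches!(runtime_compute_cap, 120 | 121)
}

fn main() {
    // GB10 (sm_121) runtime running an sm_120-compiled binary
    assert!(compute_cap_matching(121, 120));
    // Native sm_121 build on sm_121 hardware
    assert!(compute_cap_matching(121, 121));
    // The reverse is not allowed: sm_120 runtime, sm_121 binary
    assert!(!compute_cap_matching(120, 121));
    // Both Blackwell variants pass the flash-attention family gate
    assert!(is_blackwell_family(120) && is_blackwell_family(121));
    println!("sm_121 compatibility checks passed");
}
```

With this gating in place, a binary compiled with `CUDA_COMPUTE_CAP=120` can still run on a DGX Spark, while `CUDA_COMPUTE_CAP=121` produces a native build.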