[ROCm] Uniform docker to support AMD AINIC, BRCM Thor2 IBGDA NIC for MoRI-EP #23263

Merged: HaiShaw merged 4 commits into sgl-project:main from HaiShaw:update/rocm.Docker on Apr 21, 2026
kkHuang-amd (Collaborator) commented Apr 20, 2026

Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>

Motivation

Enable MoRI-EP to run on Broadcom Thor2 NICs.

Modifications

Install the Broadcom Thor2 driver in rocm.Dockerfile.

Accuracy Tests

MoRI-EP micro-benchmark testing

Docker container creation

docker run -it \
  --device=/dev/kfd \
  --device=/dev/dri \
  --network=host \
  --pid=host \
  --ipc=host \
  --ulimit memlock=-1 \
  --cap-add=IPC_LOCK \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --privileged \
  --shm-size=64g \
  --group-add video \
  --group-add rdma \
  -v /:/dockerx \
  -v /dev/infiniband:/dev/infiniband \
  ${IMAGE}
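
The docker run command above maps /dev/kfd, /dev/dri, and /dev/infiniband into the container. As a minimal pre-flight sketch, one could check that those paths exist on the host before starting the container. The helper name missing_paths is hypothetical and is not part of this PR.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not part of the PR): print every path
# from the argument list that does not exist on this host.
missing_paths() {
  local p
  for p in "$@"; do
    [ -e "$p" ] || printf '%s\n' "$p"
  done
}

# Device paths taken from the docker run command above.
missing="$(missing_paths /dev/kfd /dev/dri /dev/infiniband)"
if [ -n "$missing" ]; then
  echo "Missing on this host:" >&2
  echo "$missing" >&2
else
  echo "All device paths present."
fi
```

If any path is reported missing, the corresponding --device or -v option would likely fail or expose nothing useful inside the container.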

Node 1

export PYTHONPATH=/sgl-workspace/mori:$PYTHONPATH
export GLOO_SOCKET_IFNAME=enp196s0 # change to this host's network interface

# master_addr needs to be changed
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
    --master_addr=10.235.58.227 --master_port=29503 \
    examples/ops/dispatch_combine/test_dispatch_combine_internode.py --cmd bench

  
Node 2

export PYTHONPATH=/sgl-workspace/mori:$PYTHONPATH
export GLOO_SOCKET_IFNAME=enp196s0 # change to this host's network interface

# master_addr needs to be changed
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=1 \
    --master_addr=10.235.58.227 --master_port=29503 \
    examples/ops/dispatch_combine/test_dispatch_combine_internode.py --cmd bench
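
The two node commands differ only in --node_rank. As a sketch, a small wrapper can build the same torchrun command line from a rank and a master address so one script serves both nodes. The function name build_bench_cmd and the NODE_RANK and MASTER_ADDR environment variables are assumptions for illustration. Only the torchrun flags and the script path come from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): print the torchrun command line
# for a given node rank and master address. The port defaults to 29503,
# the value used in the commands above.
build_bench_cmd() {
  local node_rank="$1"
  local master_addr="$2"
  local master_port="${3:-29503}"
  printf 'torchrun --nnodes=2 --node_rank=%s --nproc_per_node=1 --master_addr=%s --master_port=%s examples/ops/dispatch_combine/test_dispatch_combine_internode.py --cmd bench\n' \
    "$node_rank" "$master_addr" "$master_port"
}

# Print the command for this node; NODE_RANK and MASTER_ADDR are
# illustrative variable names with the PR's example values as defaults.
build_bench_cmd "${NODE_RANK:-0}" "${MASTER_ADDR:-10.235.58.227}"
```

Printing the command instead of running it keeps the sketch safe to try on any machine. To launch the benchmark, the printed line would be executed from the mori checkout on each node with the appropriate rank.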

  
Result

+----------------------------------------------------------------------------------------------+
|                               Dispatch Performance (bfloat16)                                |
+---------+-----------------------+-----------------------+---------------------+--------------+
| Metrics | RDMA Bandwidth (GB/s) | XGMI Bandwidth (GB/s) | LL Bandwidth (GB/s) | Latency (us) |
+---------+-----------------------+-----------------------+---------------------+--------------+
|   Best  |           69          |          227          |         281         |     1687     |
|  Worst  |           63          |          206          |         255         |     1846     |
| Average |           65          |          215          |         266         |     1771     |
+---------+-----------------------+-----------------------+---------------------+--------------+
+----------------------------------------------------------------------------------------------+
|                                Combine Performance (bfloat16)                                |
+---------+-----------------------+-----------------------+---------------------+--------------+
| Metrics | RDMA Bandwidth (GB/s) | XGMI Bandwidth (GB/s) | LL Bandwidth (GB/s) | Latency (us) |
+---------+-----------------------+-----------------------+---------------------+--------------+
|   Best  |           76          |          249          |         308         |     1544     |
|  Worst  |           69          |          225          |         278         |     1693     |
| Average |           72          |          235          |         291         |     1618     |
+---------+-----------------------+-----------------------+---------------------+--------------+

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request updates the AINIC_VERSION and enhances the rocm.Dockerfile by implementing driver installation logic for AMD and Broadcom NICs. Review feedback suggests removing redundant sed and grep commands for dependency pinning, consolidating multiple apt-get update calls to optimize image layers, and utilizing ldconfig for library path management instead of manual file copying.

HaiShaw changed the title from "Broadcom Thor2 IBGDA NIC support for mori-ep" to "[ROCm] Uniform docker to support AMD AINIC, BRCM Thor2 IBGDA NIC for MoRI-EP" on Apr 21, 2026

HaiShaw (Collaborator) left a comment

LGTM

Comment thread on docker/rocm.Dockerfile (outdated), from a Collaborator:

Please update this comment

Comment thread on docker/rocm.Dockerfile (outdated), from a Collaborator:

Remove --build-arg NIC_BACKEND=ainic

HaiShaw (Collaborator) commented Apr 21, 2026

Also update https://github.com/sgl-project/sglang/tree/main/.github/workflows/release-docker-amd*.yml

HaiShaw merged commit c122d34 into sgl-project:main on Apr 21, 2026
59 of 61 checks passed
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026
…MoRI-EP (sgl-project#23263)

Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: Lzy17 <36555117+Lzy17@users.noreply.github.com>
caitengwei pushed a commit to caitengwei/sglang that referenced this pull request Jun 1, 2026
…MoRI-EP (sgl-project#23263)

Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: Lzy17 <36555117+Lzy17@users.noreply.github.com>