[CI][ROCm] Ship RIXL with vllm/vllm-openai-rocm#41634
tjtanaa merged 5 commits into vllm-project:main
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Code Review
This pull request updates the ROCm Dockerfile to install the RIXL wheel and its RDMA runtime dependencies, and sets the HSA_ENABLE_IPC_MODE_LEGACY environment variable to avoid GPU memory pinning issues. Feedback was provided to optimize the system package installation by adding the --no-install-recommends flag, removing an invalid flag from the update command, and reordering the instructions to improve Docker layer caching.
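To make the review summary concrete, here is a hypothetical sketch of what the final-stage Dockerfile additions might look like after incorporating the feedback. The wheel filename, the RDMA package names, and the environment variable's value are illustrative assumptions, not the PR's exact diff:

```dockerfile
# Sketch only: package names and the wheel path are placeholders.

# Install RDMA runtime dependencies in one cached layer, without
# recommended extras, placed before the frequently-changing wheel
# install so Docker layer caching stays effective.
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        libibverbs1 \
        librdmacm1 \
    && rm -rf /var/lib/apt/lists/*

# Install the RIXL wheel (filename is a placeholder).
RUN pip install /wheels/rixl-<version>-py3-none-any.whl

# Set to avoid GPU memory pinning issues; the actual value is in the PR diff.
# ENV HSA_ENABLE_IPC_MODE_LEGACY=<value-from-PR>
```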
Documentation preview: https://vllm--41634.org.readthedocs.build/en/41634/
divakar-amd left a comment
Added 2 comments. Looks good overall.
divakar-amd left a comment
LGTM. Tested the change with a single-node 1P-1D disaggregated setup on mi300.
@simondanielsson thanks for this PR, which makes RIXL available out of the box (for models where RIXL is better than mori)
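For context, a single-node 1P-1D disaggregated setup like the one tested above launches a prefill instance and a decode instance wired together through the NixlConnector. A rough sketch of such a launch follows; the model name, ports, and connector config values are illustrative assumptions drawn from vLLM's disaggregated-prefill examples, and the commands are echoed rather than executed so the shapes are visible without a GPU host:

```shell
# Sketch only: values are assumptions, and commands are echoed, not run.
MODEL="meta-llama/Llama-3.1-8B-Instruct"   # placeholder model
KV_CFG='{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

# Prefill ("P") instance
echo vllm serve "$MODEL" --port 8100 --kv-transfer-config "$KV_CFG"

# Decode ("D") instance
echo vllm serve "$MODEL" --port 8200 --kv-transfer-config "$KV_CFG"
```

Dropping the `echo` prefixes (on a ROCm host with the image built from this PR) would start the two instances for an actual smoke test.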
Signed-off-by: Libin Tang <libin.tang@intel.com>
Purpose
Fixes #41637.
RIXL is not readily available in the official vLLM ROCm image:
In contrast to NIXL, there are no pre-built wheels for RIXL yet. Hence, you cannot use the NixlConnector for PD disaggregation on AMD platforms without installing manually from source, which limits reproducibility and productivity. My suggestion is to follow the current vLLM documentation and ship RIXL with the ROCm image, at least until RIXL wheels are readily available.
RIXL is already installed in the test stage of the image, but not in the final stage. This PR installs it in the final stage as well.
Note: This aligns the expected behavior with the NV stack:
```
$ docker run --rm --entrypoint python3 vllm/vllm-openai:v0.20.1 -c "from nixl._api import nixl_agent; print('NIXL OK')"
NIXL OK
```
Test Plan
Testing on 8xMI300X node with Thor2 NICs.
1. Build
```
docker build \
  -f docker/Dockerfile.rocm \
  --build-arg BASE_IMAGE=rocm/vllm-dev:base \
  -t vllm/vllm-openai-rocm:local \
  .
```
Depending on the platform you might need RDMA userspace libs. This is for Thor2:
which we can build with
2. Test
```
docker run --rm --entrypoint python3 vllm/vllm-openai-rocm:local -c "from rixl._api import nixl_agent; print('RIXL OK')"
```
Test Result
```
$ docker run --rm --entrypoint python3 vllm/vllm-openai-rocm:local -c "from rixl._api import nixl_agent; print('RIXL OK')"
RIXL OK
```
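The same smoke test can also be run without assuming the import succeeds. A small hypothetical helper (not part of the PR) that degrades gracefully when the wheel is absent, e.g. on non-ROCm hosts:

```python
# Hypothetical helper (not part of this PR): check RIXL availability
# without crashing on hosts where the wheel is not installed.
import importlib.util


def has_rixl() -> bool:
    """Return True if the 'rixl' package is importable."""
    return importlib.util.find_spec("rixl") is not None


status = "RIXL OK" if has_rixl() else "RIXL not installed"
print(status)
```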