[ROCm][CI] Added MI325 mirrors#34923
Conversation
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
There was a problem hiding this comment.
Code Review
This pull request adds CI test mirrors for the new MI325 ROCm hardware, expanding test coverage to this platform. The changes primarily involve modifications to Buildkite YAML configuration files to add new test steps or mirror existing ones. Additionally, a performance fix for ROCm is included in the Dockerfile by setting MIOpen environment variables to mitigate a known performance regression in 3D convolution kernels. My review identifies a couple of areas for improvement in the CI configuration to avoid redundant test execution and improve maintainability. Overall, the changes are a good step towards ensuring stability on the new hardware.
| commands: | ||
| - pytest -v -s v1/e2e | ||
| - pytest -v -s v1/engine |
There was a problem hiding this comment.
| commands: | ||
| - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git | ||
| - pytest -v -s models/multimodal/processing |
There was a problem hiding this comment.
The commands in this mirror lead to redundant test execution. The pytest -v -s models/multimodal/processing command runs all tests in the directory, which includes tests already covered by the mirror for the Multi-Modal Processor Test (CPU) step. This is inefficient.
To fix this, the mirror should run the same command as the parent step (pytest -v -s models/multimodal/processing/test_tensor_schema.py). The simplest way is to remove the commands section from the mirror, allowing it to inherit the commands from the parent step. The pip install is also redundant.
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
|
cc @kenroche |
|
There is another PR that shall be merged before we merge this: |
|
All critical PRs have been merged. I just rebased this branch. Added the |
|
I had identified some mistakes in the |
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
| librdmacm1 \ | ||
| libibverbs1 \ | ||
| ibverbs-providers \ | ||
| ibverbs-utils \ |
There was a problem hiding this comment.
This is the only package that is needed by the test in order to skip, since the ibverbs-utils has a command line utility that is useful. The MORI library is already successfully using libibverbs as it is.
There was a problem hiding this comment.
If tests pass, I'll probably do a follow-up here. Otherwise I'll integrate it here, but let's hope things pass here 😅
There was a problem hiding this comment.
The RIXL wheel bundles UCX built with --with-verbs, but UCX's verbs transport dynamically loads system RDMA libraries at runtime:
- Without
ibverbs-providersplugins (e.g.libmlx5.so),ibv_devinfowould probably return "No IB devices found" even on RDMA-capable hosts, so the skip guard permanently skips. Technically a transitive dep ofibverbs-utilsbut worth being explicit. - UCX's connection manager needs
librdmacm.so.1. Solibrdmacm1is not a transitive dep ofibverbs-utils.test_moriio_handshake_returns_metadatacallsconnector.register_kv_caches()which goes through real MORI/UCX init, I think that without rdmacm, the transport layer would fail. libibverbs1is transitive dep ofibverbs-utils, so this one could be dropped from the explicit list.
Happy to drop libibverbs1 from the explicit install, but librdmacm1 and ibverbs-providers are needed. All in all I think we can keep then there for completeness. But happy to hear your thoughts on that as well.
| @pytest.mark.skipif( | ||
| not aiter_available, reason="Requires aiter package for ROCm FlashAttention backend" | ||
| ) | ||
| @pytest.mark.skipif(not rdma_available, reason="No RDMA devices available") |
There was a problem hiding this comment.
You can just skip at module level instead of skipping each individually:
if not rdma_available:
pytest.skip("No RDMA devices available", allow_module_level = True)
There was a problem hiding this comment.
I'm planning to do a follow-up PR based on your feedback. Regarding this one, right now, only two tests (test_register_kv_caches and test_moriio_handshake_returns_metadata) require RDMA and aiter. So if we put it on top of the file, wouldn't that skip all tests? Even though some are passing.
| - vllm/ | ||
| - tests/entrypoints/openai/responses | ||
| commands: | ||
| - export VLLM_WORKER_MULTIPROC_METHOD=spawn |
There was a problem hiding this comment.
haven't attempted to run it without it honestly. I just know we always use it in ROCm. But maybe it is set during execution and I can delete it later on.
There was a problem hiding this comment.
There is a mechanism to select this based on the need at
vllm/vllm/utils/system_utils.py
Line 114 in 60da0e1
If certain tests fail to utilize it, please make sure to address the tests in a follow up.
There was a problem hiding this comment.
This was actually something that I forgot there after I removed some of the mirrors. I am removing it, sorry for overseeing this.
|
|
||
| wait_for_clean_gpus() { | ||
| echo "--- Waiting for clean GPU state" | ||
| while true; do |
There was a problem hiding this comment.
Is there a timeout on this? Can it get stuck forever?
There was a problem hiding this comment.
Hopefully this version here resolves this concern. It was like that in the original script by the way, and for some reason we never observed that to be getting stuck. However, I did add a generous timeout there, let me know if it looks good now :)
… property in yaml Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
In this PR, we are gating more tests. This PR is dependent on:
I've put this PR up to start the evaluation of these tests and get more gating tests ready for merging as soon as issues are confirmed to be resolved.