[CI][BugFix][AMD] Add check for model_config being None and update conftest.py to load AITER if available to fix Kernels MoE Test % N #33952
Conversation
…o work Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Hi @rasmith, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the hooks will run automatically.
Code Review
This pull request introduces two important fixes. First, it resolves a potential AttributeError in vllm/config/vllm.py by adding a necessary null check for cfg.model_config before accessing its attributes. This is a good defensive programming practice that prevents crashes. Second, it addresses a CI failure for MoE tests on ROCm by centralizing the AITER op loading logic into tests/conftest.py. This ensures that the test environment is set up correctly for all tests in the suite, improving the reliability of the CI pipeline. The corresponding cleanup in test_rocm_aiter_topk.py is also appropriate. The changes are well-implemented and clearly explained. Overall, this is a solid contribution that improves both code robustness and test stability.
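The null check the review describes can be sketched as follows. This is an illustrative example, not the actual vLLM code: the `VllmConfig`/`ModelConfig` stand-ins and the `describe_dtype` helper are hypothetical, but they show why accessing an attribute through an optional `model_config` needs a guard.

```python
# Illustrative sketch (not vLLM's actual config classes) of guarding an
# optional model_config before dereferencing its attributes.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfig:
    dtype: str = "bfloat16"


@dataclass
class VllmConfig:
    # model_config may legitimately be None, e.g. for configs built
    # without a model.
    model_config: Optional[ModelConfig] = None


def describe_dtype(cfg: VllmConfig) -> str:
    # Without this check, cfg.model_config.dtype raises AttributeError
    # ('NoneType' object has no attribute 'dtype') when model_config is None.
    if cfg.model_config is not None:
        return cfg.model_config.dtype
    return "unknown"


print(describe_dtype(VllmConfig()))               # unknown
print(describe_dtype(VllmConfig(ModelConfig())))  # bfloat16
```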
tests/conftest.py
Outdated
```
from torch._inductor.utils import fresh_cache
...

def use_aiter_if_available():
```
Hm, not sure if this is a good idea. I think many tests explicitly set or don't set this env var.
I updated it so that only the function pointers are loaded, which was happening before, but the environment variable is set to 0.
@robertgshaw2-redhat I could also do this in _aiter_ops.py, which also works. Basically, the functions get loaded but the env var will be unset:
```diff
@@ -36,7 +36,7 @@ def is_aiter_found_and_supported() -> bool:
     Checks: platform (ROCm), device arch (gfx9), library existence,
     and VLLM_ROCM_USE_AITER env variable.
     """
-    if current_platform.is_rocm() and IS_AITER_FOUND and envs.VLLM_ROCM_USE_AITER:
+    if current_platform.is_rocm() and IS_AITER_FOUND:
         from vllm.platforms.rocm import on_gfx9
```
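The pattern in the diff — detect the library once, but keep runtime dispatch under the env var's control — can be sketched like this. This is a hedged simplification: `is_aiter_found_and_supported` follows the diff's name, while `use_aiter` and the module-level detection are illustrative, not vLLM's exact implementation (the real check also gates on ROCm and gfx9).

```python
# Hypothetical sketch of decoupling library detection from env-var gating.
# Real vLLM code additionally checks current_platform.is_rocm() and gfx9.
import importlib.util
import os

# Detect the AITER library once at import time.
IS_AITER_FOUND = importlib.util.find_spec("aiter") is not None


def is_aiter_found_and_supported() -> bool:
    # Library presence only -- deliberately independent of the env var,
    # so function pointers can be registered even when AITER is disabled.
    return IS_AITER_FOUND


def use_aiter() -> bool:
    # Runtime dispatch still honors VLLM_ROCM_USE_AITER, so tests that
    # rely on VLLM_ROCM_USE_AITER=0 keep the AITER paths off.
    return (
        is_aiter_found_and_supported()
        and os.environ.get("VLLM_ROCM_USE_AITER", "0") == "1"
    )
```

With this split, loading the function pointers and choosing whether to call them become independent decisions, which is what lets the tests in the group share one loaded state.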
@tjtanaa Yes, it will cause

Closing this PR and opening this one to fix the rest of the issues in the test group.
Purpose
This PR broke many tests (over 30), and this PR fixed one test in the `Kernels MoE Test %N` group. However, when the tests are run as a group using `pytest -sv kernels/moe`, the first test that runs does not load AITER ops, and subsequent tests will also not have AITER ops loaded.

This PR loads the ops in `vllm._aiter_ops` but then ensures that `VLLM_ROCM_USE_AITER=0` when tests run. This makes the function pointers in `vllm._aiter_ops` available to tests that need them, while tests that do not want to use AITER and may depend on `VLLM_ROCM_USE_AITER=0` will still run properly. In the context of testing, it seems reasonable to load AITER ops on ROCm if AITER is available, so I added a function to `conftest.py` to load AITER ops if they are available, which now lets the entire group pass.

This PR introduced a check in `vllm.py` that crashes if the `VllmConfig` `model_config` is None, so I added a `model_config is not None` check to prevent this from happening.

Test Plan
`pytest -sv kernels/moe`

Test Result

1950 passed, 5399 skipped, 8 warnings

Essential Elements of an Effective PR Description Checklist

`supported_models.md` and `examples` for a new model.