[CI][BugFix][AMD] Add check for model_config being None and update conftest.py to load AITER if available to fix Kernels MoE Test % N #33952

Closed
rasmith wants to merge 15 commits into vllm-project:main from rasmith:rasmith_fix_test_triton_moe_ptpc_fp8

Conversation

Contributor

rasmith commented Feb 6, 2026

Purpose

One earlier PR broke many tests (over 30), and another PR fixed one test in the Kernels MoE Test %N group. However, when the tests are run as a group using

pytest -sv kernels/moe

the first test that runs does not load the AITER ops, so the subsequent tests also run without the AITER ops loaded.

This PR loads the ops in vllm._aiter_ops but then ensures that VLLM_ROCM_USE_AITER=0 when the tests run. As a result, the function pointers in vllm._aiter_ops are available to tests that need them, while tests that do not want to use AITER and may depend on VLLM_ROCM_USE_AITER=0 still run properly.

In the context of testing, it does seem reasonable to load AITER ops on ROCm if AITER is available. So, I added a function to conftest.py to load AITER ops if they are available, which now lets the entire group pass.
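The approach described above can be sketched roughly as follows. This is not the code from this PR: the helper `load_ops_if_available`, its `register_ops` callback, and the use of `importlib.util.find_spec` to detect the library are all assumptions made for illustration. The sketch only shows the pattern of registering ops once when the library is importable and then leaving `VLLM_ROCM_USE_AITER` at 0 for the tests.

```python
import importlib.util
import os
from typing import Callable, MutableMapping, Optional

ENV_VAR = "VLLM_ROCM_USE_AITER"


def load_ops_if_available(
    library: str,
    register_ops: Callable[[], None],
    environ: Optional[MutableMapping[str, str]] = None,
) -> bool:
    """Register ops when `library` is importable, then leave the flag at "0".

    Hypothetical helper: `register_ops` stands in for whatever populates the
    function pointers. Returns True if registration ran.
    """
    env = os.environ if environ is None else environ
    if importlib.util.find_spec(library) is None:
        # Library is not installed: nothing to load, leave the env untouched.
        return False
    env[ENV_VAR] = "1"  # enable the flag only while registering
    try:
        register_ops()
    finally:
        env[ENV_VAR] = "0"  # tests then see AITER disabled by default
    return True
```

In a test suite, a helper like this would typically be called once from `conftest.py` so that every test in the group starts from the same state, and a test that wants AITER enabled can still set the variable itself.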

Another PR introduced a check in vllm.py that crashes if the VllmConfig model_config is None, so I added a check that model_config is not None before it is accessed.
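The guard pattern can be sketched as follows. The classes below are simplified stand-ins, not vLLM's real `VllmConfig` or `ModelConfig`, and the `name` field and `describe_model` function are hypothetical. Only the idea of checking `model_config` for `None` before reading its attributes comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfig:  # stand-in, not vLLM's ModelConfig
    name: str


@dataclass
class VllmConfig:  # stand-in, not vLLM's VllmConfig
    model_config: Optional[ModelConfig] = None


def describe_model(cfg: VllmConfig) -> str:
    # A config may be built with no model attached, so check for None
    # before touching any attribute of model_config.
    if cfg.model_config is not None:
        return cfg.model_config.name
    return "<no model>"
```

Without the `is not None` check, the attribute access would raise `AttributeError` for a config that has no model attached.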

Test Plan

pytest -sv kernels/moe

Test Result

1950 passed, 5399 skipped, 8 warnings


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

gshtras and others added 12 commits January 29, 2026 21:50
…o work

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
mergify bot commented Feb 6, 2026

Hi @rasmith, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces two important fixes. First, it resolves a potential AttributeError in vllm/config/vllm.py by adding a necessary null check for cfg.model_config before accessing its attributes. This is a good defensive programming practice that prevents crashes. Second, it addresses a CI failure for MoE tests on ROCm by centralizing the AITER op loading logic into tests/conftest.py. This ensures that the test environment is set up correctly for all tests in the suite, improving the reliability of the CI pipeline. The corresponding cleanup in test_rocm_aiter_topk.py is also appropriate. The changes are well-implemented and clearly explained. Overall, this is a solid contribution that improves both code robustness and test stability.

from torch._inductor.utils import fresh_cache


def use_aiter_if_available():
Collaborator

Hm, not sure if this is a good idea. I think many tests explicitly set or don't set this env var.

Contributor Author

I updated it so that only the function pointers are loaded, which was happening before, but the environment variable is set to 0.

Contributor Author

@robertgshaw2-redhat I could also do this in _aiter_ops.py, which also works. Basically, the functions get loaded but the env var will be unset:

@@ -36,7 +36,7 @@ def is_aiter_found_and_supported() -> bool:
     Checks: platform (ROCm), device arch (gfx9), library existence,
     and VLLM_ROCM_USE_AITER env variable.
     """
-    if current_platform.is_rocm() and IS_AITER_FOUND and envs.VLLM_ROCM_USE_AITER:
+    if current_platform.is_rocm() and IS_AITER_FOUND:
         from vllm.platforms.rocm import on_gfx9
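
The effect of the diff above is to separate "can the ops be loaded" from "is AITER enabled by the environment". A minimal sketch of that separation, using a hypothetical `AiterState` class that is not part of vLLM and that leaves out the gfx9 architecture check the real function performs:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AiterState:  # hypothetical, for illustration only
    on_rocm: bool
    library_found: bool
    env_flag: bool  # stands in for VLLM_ROCM_USE_AITER

    def ops_loadable(self) -> bool:
        # After the change: loading depends only on platform and library.
        return self.on_rocm and self.library_found

    def ops_enabled(self) -> bool:
        # Using the ops at runtime still honours the environment flag.
        return self.ops_loadable() and self.env_flag
```

With this split, a state with `env_flag=False` on ROCm with the library present reports the ops as loadable but not enabled, which matches the behaviour described in the comment above.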

Collaborator

tjtanaa commented Feb 6, 2026

@rasmith can you check if this PR #33749 resolves the issue that this PR is trying to address? We have worked out a way to resolve the log regression without complicating the imports.

Contributor Author

rasmith commented Feb 6, 2026

@rasmith can you check if this PR #33749 resolves the issue that this PR is trying to address? We have worked out a way to resolve the log regression without complicating the imports.

@tjtanaa Yes, it will cause vllm._aiter_ops functions to always get loaded if on ROCm and the aiter library is available, even if VLLM_ROCM_USE_AITER is 0, which is what should happen IMO.

Closing this PR and opening this one to fix the rest of the issues in the test group.

rasmith closed this Feb 6, 2026
github-project-automation bot moved this from Todo to Done in AMD Feb 6, 2026
Labels

bug: Something isn't working
rocm: Related to AMD ROCm

Projects

Status: Done

4 participants