
Conversation

@hl475 (Contributor) commented Nov 6, 2025

Purpose

Recently, evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness_param[Qwen1.5-MoE-W4A16-CT-tp1] from LM Eval Small Models has been failing in the nightly runs:
https://buildkite.com/vllm/ci/builds/37631/steps/canvas?sid=019a5264-3636-4617-87f8-9867066b7a78
https://buildkite.com/vllm/ci/builds/37251/steps/canvas?sid=019a42b9-e1bc-42a6-8605-7900f4330ffd
https://buildkite.com/vllm/ci/builds/37196/steps/canvas?sid=019a3d93-774a-470b-a899-0f34ed601d55
https://buildkite.com/vllm/ci/builds/37041/steps/canvas?sid=019a386d-1b21-41bf-bb23-9d1a53bb4455
https://buildkite.com/vllm/ci/builds/36869/steps/canvas?sid=019a3346-ceb2-4e6b-ac05-46162dea7b7e

Error Msg


[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/sampler.py", line 296, in apply_logits_processors
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]     logits = self.apply_penalties(logits, sampling_metadata, output_token_ids)
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/sampler.py", line 309, in apply_penalties
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]     return apply_all_penalties(
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]            ^^^^^^^^^^^^^^^^^^^^
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/ops/penalties.py", line 32, in apply_all_penalties
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]     return apply_penalties(
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]            ^^^^^^^^^^^^^^^^
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/utils.py", line 92, in apply_penalties
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]     logits -= frequency_penalties.unsqueeze(dim=1) * output_bin_counts
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845] Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.

We were only able to reproduce the CUDA IMA on L4, not on H100/MI300. On L4 the IMA is not deterministic: the same commit can either pass or crash.

This PR adds a small helper function _should_use_cuda_repetition_penalties: on Ada (SM 8.9) we fall back to apply_repetition_penalties_torch to avoid the CUDA IMA. A sketch of the gating is shown below.

One caveat to call out: this PR may introduce a performance regression on Ada, since we switch from apply_repetition_penalties_cuda to apply_repetition_penalties_torch.
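
For illustration, here is a minimal sketch of that gating, using the current_platform.is_device_capability(89) check the review below converges on; the argument names and the dispatch wrapper are illustrative rather than the exact diff in this PR:

```python
# Minimal sketch of the fallback described above, assuming the existing
# repetition-penalty implementations live in vllm.model_executor.layers.utils
# (the module shown in the traceback); argument names are illustrative.
from vllm.platforms import current_platform
from vllm.model_executor.layers.utils import (
    apply_repetition_penalties_cuda,
    apply_repetition_penalties_torch,
)


def _should_use_cuda_repetition_penalties() -> bool:
    # Ada (SM 8.9, e.g. L4) intermittently hits an illegal memory access in the
    # CUDA repetition-penalty path, so prefer the Torch implementation there.
    return not current_platform.is_device_capability(89)


def apply_repetition_penalties(logits, prompt_mask, output_mask, repetition_penalties):
    # Dispatch between the existing CUDA kernel and the pure-Torch fallback.
    if _should_use_cuda_repetition_penalties():
        return apply_repetition_penalties_cuda(
            logits, prompt_mask, output_mask, repetition_penalties
        )
    return apply_repetition_penalties_torch(
        logits, prompt_mask, output_mask, repetition_penalties
    )
```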

Test Plan

CI

Test Result

https://buildkite.com/vllm/ci/builds/37789/steps/canvas?sid=019a5740-b01c-4ecf-9809-226c492b8aa4


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@hl475 hl475 changed the title [WIP] fallback_to_apply_repetition_penalties_torch_on_l4 Avoid CUDA IMA on Ada (SM 8.9) by fallback to apply_repetition_penalties_torch on Ada (SM 8.9) Nov 6, 2025
@hl475 hl475 changed the title Avoid CUDA IMA on Ada (SM 8.9) by fallback to apply_repetition_penalties_torch on Ada (SM 8.9) Avoid CUDA IMA on Ada (SM 8.9) by fallback to apply_repetition_penalties_torch Nov 6, 2025
@hl475 hl475 marked this pull request as ready for review November 6, 2025 05:34
@yeqcharlotte yeqcharlotte changed the title Avoid CUDA IMA on Ada (SM 8.9) by fallback to apply_repetition_penalties_torch [CI/Build] Avoid CUDA IMA on Ada (SM 8.9) by fallback to apply_repetition_penalties_torch Nov 6, 2025
@yeqcharlotte (Collaborator) commented:

Interesting. This means these distributed tests need to run on H100/B200 to be useful. This is causing other entrypoint test failures.

@hl475 (Contributor, Author) commented Nov 6, 2025

For the Entrypoints Integration Test (API Server) failure, I don't think it is due to this PR. Looking at https://app.hex.tech/533fe68e-dcd8-4a52-a101-aefba762f581/app/vLLM-CI-030kdEgDv6lSlh1UPYOkWP/latest , that test has been failing randomly.

@yeqcharlotte (Collaborator) commented:

cc: @houseroad @simon-mo to also take a look. we may need a few force merges to get all these fixes in place

@yeqcharlotte yeqcharlotte requested a review from simon-mo November 6, 2025 07:23
@yeqcharlotte yeqcharlotte added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 6, 2025
@vadiklyutiy (Collaborator) commented Nov 6, 2025

@hl475
Could you share how you reproduced this locally, and rough stats on how many runs failed vs. passed?

no need

@mgoin (Member) left a comment

Please use the current_platform interface to avoid initializing torch.cuda directly. You should be able to do current_platform.is_device_capability(89) as the check
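
For context, a hedged illustration of this suggestion (the torch.cuda lines are only an example of the direct-initialization pattern being discouraged, not code from this PR):

```python
# Sketch of the suggested check: query device capability through vLLM's
# platform abstraction instead of initializing torch.cuda directly.
from vllm.platforms import current_platform

# Pattern to avoid (initializes CUDA in the calling process):
#   major, minor = torch.cuda.get_device_capability()
#   is_ada = (major, minor) == (8, 9)

# Preferred: ask the platform layer whether we are on compute capability 8.9.
is_ada = current_platform.is_device_capability(89)
```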

@hl475 hl475 force-pushed the fallback_to_apply_repetition_penalties_torch_on_l4 branch from 91f7fe8 to eee43cc Compare November 6, 2025 18:06
@hl475 (Contributor, Author) commented Nov 6, 2025

Thanks @mgoin for reviewing! I switched to using current_platform.is_device_capability(89) per your suggestion. Please take another look!

@hl475 hl475 requested a review from mgoin November 6, 2025 18:07
@vadiklyutiy (Collaborator) commented Nov 7, 2025

It seems it is just luck that calling apply_repetition_penalties_torch instead of apply_repetition_penalties_cuda "fixes" the problem.
I added torch.cuda.synchronize() at the beginning of Sampler.forward() and got a failure there:

(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]   File "/home/scratch.vgimpelson_ent/vllm_qwen2/vllm/v1/worker/gpu_model_runner.py", line 2655, in sample_tokens
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]     sampler_output = self._sample(logits, spec_decode_metadata)
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]   File "/home/scratch.vgimpelson_ent/vllm_qwen2/vllm/v1/worker/gpu_model_runner.py", line 2250, in _sample
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]     return self.sampler(
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]            ^^^^^^^^^^^^^
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]   File "/home/scratch.vgimpelson_ent/venv_l4/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]   File "/home/scratch.vgimpelson_ent/venv_l4/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]   File "/home/scratch.vgimpelson_ent/vllm_qwen2/vllm/v1/sample/sampler.py", line 74, in forward
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]     torch.cuda.synchronize()
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]   File "/home/scratch.vgimpelson_ent/venv_l4/lib/python3.12/site-packages/torch/cuda/__init__.py", line 1083, in synchronize
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]     return torch._C._cuda_synchronize()
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845] Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]
(APIServer pid=657034) ERROR 11-07 04:41:14 [core_client.py:598] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.

The failure also isn't stable: it shows up roughly once in every 5 to 20 runs.
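
For readers unfamiliar with this bisection technique, a generic sketch (not vLLM code, names are illustrative): inserting torch.cuda.synchronize() after a suspect region forces asynchronously reported CUDA errors to surface at that point rather than at a later, unrelated call.

```python
# Generic debugging sketch: use synchronization points to narrow down which
# kernel launch triggered an asynchronous CUDA illegal memory access.
import torch


def cuda_checkpoint(tag: str) -> None:
    # synchronize() raises here if any previously launched kernel has already
    # hit an illegal memory access, pinning the failure to the preceding region.
    torch.cuda.synchronize()
    print(f"[ok] no pending CUDA error after: {tag}")


# Illustrative usage: wrap the suspect stages one by one, e.g.
#   hidden_states = run_model_forward(...)
#   cuda_checkpoint("model forward")
#   sampler_output = run_sampler(...)
#   cuda_checkpoint("sampler")
#
# Running with CUDA_LAUNCH_BLOCKING=1 has a similar effect for every launch,
# at the cost of serializing all kernel execution.
```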

@vadiklyutiy (Collaborator) commented:

I set torch.cuda.synchronize() after self.model_executor.execute_model and caught the bug there:

(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846] Traceback (most recent call last):
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]   File "/home/scratch.vgimpelson_ent/vllm_qwen2/vllm/v1/engine/core.py", line 837, in run_engine_core
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]   File "/home/scratch.vgimpelson_ent/vllm_qwen2/vllm/v1/engine/core.py", line 864, in run_busy_loop
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]     self._process_engine_step()
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]   File "/home/scratch.vgimpelson_ent/vllm_qwen2/vllm/v1/engine/core.py", line 893, in _process_engine_step
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]   File "/home/scratch.vgimpelson_ent/vllm_qwen2/vllm/v1/engine/core.py", line 338, in step
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]     torch.cuda.synchronize()
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]   File "/home/scratch.vgimpelson_ent/venv_l4/lib/python3.12/site-packages/torch/cuda/__init__.py", line 1083, in synchronize
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]     return torch._C._cuda_synchronize()
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846] Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

@vadiklyutiy (Collaborator) commented:

Additional finding #28220 (comment)

In short: the problem seems to be in the Marlin kernel.

@hl475 (Contributor, Author) commented Nov 10, 2025

Closing this PR as we merged #28324.

@hl475 hl475 closed this Nov 10, 2025

Labels

ready ONLY add when PR is ready to merge/full CI is needed
