
Conversation

@hl475 (Contributor) commented Nov 6, 2025

Purpose

Recently, evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness_param[Qwen1.5-MoE-W4A16-CT-tp1] from LM Eval Small Models has been failing in the nightly runs:
https://buildkite.com/vllm/ci/builds/37631/steps/canvas?sid=019a5264-3636-4617-87f8-9867066b7a78
https://buildkite.com/vllm/ci/builds/37251/steps/canvas?sid=019a42b9-e1bc-42a6-8605-7900f4330ffd
https://buildkite.com/vllm/ci/builds/37196/steps/canvas?sid=019a3d93-774a-470b-a899-0f34ed601d55
https://buildkite.com/vllm/ci/builds/37041/steps/canvas?sid=019a386d-1b21-41bf-bb23-9d1a53bb4455
https://buildkite.com/vllm/ci/builds/36869/steps/canvas?sid=019a3346-ceb2-4e6b-ac05-46162dea7b7e

Error Msg


[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/sampler.py", line 296, in apply_logits_processors
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]     logits = self.apply_penalties(logits, sampling_metadata, output_token_ids)
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/sampler.py", line 309, in apply_penalties
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]     return apply_all_penalties(
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]            ^^^^^^^^^^^^^^^^^^^^
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/sample/ops/penalties.py", line 32, in apply_all_penalties
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]     return apply_penalties(
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]            ^^^^^^^^^^^^^^^^
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/utils.py", line 92, in apply_penalties
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]     logits -= frequency_penalties.unsqueeze(dim=1) * output_bin_counts
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845]               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
[2025-11-05T06:10:04Z] (EngineCore_DP0 pid=1261) ERROR 11-04 22:10:04 [core.py:845] Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.

We were only able to reproduce the CUDA IMA on L4, not on H100/MI300. On L4 the IMA is not deterministic: the same commit can either pass or crash.

This PR adds a small helper function _should_use_cuda_repetition_penalties: on Ada (SM 8.9) we fall back to apply_repetition_penalties_torch to avoid the CUDA IMA. A sketch of the gating is shown below.

One caveat to call out: this PR may introduce a performance regression on Ada, since we switch from apply_repetition_penalties_cuda to apply_repetition_penalties_torch.
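
For illustration, here is a minimal sketch of that gating, using the current_platform.is_device_capability(89) check the review below converges on; the argument names and the dispatch wrapper are illustrative rather than the exact diff in this PR:

```python
# Minimal sketch of the fallback described above, assuming the existing
# repetition-penalty implementations live in vllm.model_executor.layers.utils
# (the module shown in the traceback); argument names are illustrative.
from vllm.platforms import current_platform
from vllm.model_executor.layers.utils import (
    apply_repetition_penalties_cuda,
    apply_repetition_penalties_torch,
)


def _should_use_cuda_repetition_penalties() -> bool:
    # Ada (SM 8.9, e.g. L4) intermittently hits an illegal memory access in the
    # CUDA repetition-penalty path, so prefer the Torch implementation there.
    return not current_platform.is_device_capability(89)


def apply_repetition_penalties(logits, prompt_mask, output_mask, repetition_penalties):
    # Dispatch between the existing CUDA kernel and the pure-Torch fallback.
    if _should_use_cuda_repetition_penalties():
        return apply_repetition_penalties_cuda(
            logits, prompt_mask, output_mask, repetition_penalties
        )
    return apply_repetition_penalties_torch(
        logits, prompt_mask, output_mask, repetition_penalties
    )
```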

Test Plan

CI

Test Result

https://buildkite.com/vllm/ci/builds/37789/steps/canvas?sid=019a5740-b01c-4ecf-9809-226c492b8aa4


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@hl475 hl475 changed the title [WIP] fallback_to_apply_repetition_penalties_torch_on_l4 Avoid CUDA IMA on Ada (SM 8.9) by fallback to apply_repetition_penalties_torch on Ada (SM 8.9) Nov 6, 2025
@hl475 hl475 changed the title Avoid CUDA IMA on Ada (SM 8.9) by fallback to apply_repetition_penalties_torch on Ada (SM 8.9) Avoid CUDA IMA on Ada (SM 8.9) by fallback to apply_repetition_penalties_torch Nov 6, 2025
@hl475 hl475 marked this pull request as ready for review November 6, 2025 05:34
@yeqcharlotte yeqcharlotte changed the title Avoid CUDA IMA on Ada (SM 8.9) by fallback to apply_repetition_penalties_torch [CI/Build] Avoid CUDA IMA on Ada (SM 8.9) by fallback to apply_repetition_penalties_torch Nov 6, 2025
@yeqcharlotte (Collaborator) commented:

Interesting. This means these distributed tests need to run on H100/B200 to be useful. This is causing other entrypoint test failures.

@hl475 (Contributor, Author) commented Nov 6, 2025

For the Entrypoints Integration Test (API Server) failure, I don't think it is due to this PR. Looking at https://app.hex.tech/533fe68e-dcd8-4a52-a101-aefba762f581/app/vLLM-CI-030kdEgDv6lSlh1UPYOkWP/latest , that test has been failing randomly.

@yeqcharlotte (Collaborator) commented:

cc: @houseroad @simon-mo to also take a look. we may need a few force merges to get all these fixes in place

@yeqcharlotte yeqcharlotte requested a review from simon-mo November 6, 2025 07:23
@yeqcharlotte yeqcharlotte added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 6, 2025
@vadiklyutiy (Collaborator) commented Nov 6, 2025

@hl475
Could you share how you reproduced this locally, and rough stats on how many runs failed vs. passed?

no need

@mgoin (Member) left a comment

Please use the current_platform interface to avoid initializing torch.cuda directly. You should be able to do current_platform.is_device_capability(89) as the check
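
For context, a hedged illustration of this suggestion (the torch.cuda lines are only an example of the direct-initialization pattern being discouraged, not code from this PR):

```python
# Sketch of the suggested check: query device capability through vLLM's
# platform abstraction instead of initializing torch.cuda directly.
from vllm.platforms import current_platform

# Pattern to avoid (initializes CUDA in the calling process):
#   major, minor = torch.cuda.get_device_capability()
#   is_ada = (major, minor) == (8, 9)

# Preferred: ask the platform layer whether we are on compute capability 8.9.
is_ada = current_platform.is_device_capability(89)
```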

@hl475 hl475 force-pushed the fallback_to_apply_repetition_penalties_torch_on_l4 branch from 91f7fe8 to eee43cc Compare November 6, 2025 18:06
@hl475 (Contributor, Author) commented Nov 6, 2025

Thanks @mgoin for reviewing! I switched to using current_platform.is_device_capability(89) per your suggestion. Please take another look!

@hl475 hl475 requested a review from mgoin November 6, 2025 18:07
@vadiklyutiy (Collaborator) commented Nov 7, 2025

It seems it is just luck that calling apply_repetition_penalties_torch instead of apply_repetition_penalties_cuda "fixes" the problem.
I added torch.cuda.synchronize() at the beginning of Sampler.forward() and got a failure there:

(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]   File "/home/scratch.vgimpelson_ent/vllm_qwen2/vllm/v1/worker/gpu_model_runner.py", line 2655, in sample_tokens
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]     sampler_output = self._sample(logits, spec_decode_metadata)
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]   File "/home/scratch.vgimpelson_ent/vllm_qwen2/vllm/v1/worker/gpu_model_runner.py", line 2250, in _sample
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]     return self.sampler(
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]            ^^^^^^^^^^^^^
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]   File "/home/scratch.vgimpelson_ent/venv_l4/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]   File "/home/scratch.vgimpelson_ent/venv_l4/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]   File "/home/scratch.vgimpelson_ent/vllm_qwen2/vllm/v1/sample/sampler.py", line 74, in forward
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]     torch.cuda.synchronize()
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]   File "/home/scratch.vgimpelson_ent/venv_l4/lib/python3.12/site-packages/torch/cuda/__init__.py", line 1083, in synchronize
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]     return torch._C._cuda_synchronize()
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845] Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=657399) ERROR 11-07 04:41:14 [core.py:845]
(APIServer pid=657034) ERROR 11-07 04:41:14 [core_client.py:598] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.

The failure also isn't stable: it shows up roughly once in every 5 to 20 runs.
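
For readers unfamiliar with this bisection technique, a generic sketch (not vLLM code, names are illustrative): inserting torch.cuda.synchronize() after a suspect region forces asynchronously reported CUDA errors to surface at that point rather than at a later, unrelated call.

```python
# Generic debugging sketch: use synchronization points to narrow down which
# kernel launch triggered an asynchronous CUDA illegal memory access.
import torch


def cuda_checkpoint(tag: str) -> None:
    # synchronize() raises here if any previously launched kernel has already
    # hit an illegal memory access, pinning the failure to the preceding region.
    torch.cuda.synchronize()
    print(f"[ok] no pending CUDA error after: {tag}")


# Illustrative usage: wrap the suspect stages one by one, e.g.
#   hidden_states = run_model_forward(...)
#   cuda_checkpoint("model forward")
#   sampler_output = run_sampler(...)
#   cuda_checkpoint("sampler")
#
# Running with CUDA_LAUNCH_BLOCKING=1 has a similar effect for every launch,
# at the cost of serializing all kernel execution.
```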

@vadiklyutiy (Collaborator) commented:

I set torch.cuda.synchronize() after self.model_executor.execute_model and caught the bug there:

(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846] Traceback (most recent call last):
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]   File "/home/scratch.vgimpelson_ent/vllm_qwen2/vllm/v1/engine/core.py", line 837, in run_engine_core
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]   File "/home/scratch.vgimpelson_ent/vllm_qwen2/vllm/v1/engine/core.py", line 864, in run_busy_loop
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]     self._process_engine_step()
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]   File "/home/scratch.vgimpelson_ent/vllm_qwen2/vllm/v1/engine/core.py", line 893, in _process_engine_step
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]   File "/home/scratch.vgimpelson_ent/vllm_qwen2/vllm/v1/engine/core.py", line 338, in step
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]     torch.cuda.synchronize()
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]   File "/home/scratch.vgimpelson_ent/venv_l4/lib/python3.12/site-packages/torch/cuda/__init__.py", line 1083, in synchronize
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]     return torch._C._cuda_synchronize()
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846] Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=673114) ERROR 11-07 05:05:01 [core.py:846] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

@vadiklyutiy (Collaborator) commented:

Additional finding #28220 (comment)

In short: the problem seems to be in the Marlin kernel.

@hl475 (Contributor, Author) commented Nov 10, 2025

Closing this PR as we merged #28324.

@hl475 hl475 closed this Nov 10, 2025

Labels

ready ONLY add when PR is ready to merge/full CI is needed
