Skip to content

[CI] Reject out-of-vocabulary before they reach the GPU logprob path#44042

Merged
AndreasKaratzas merged 8 commits into
vllm-project:mainfrom
ROCm:akaratza_stabilize_sampler
Jun 3, 2026
Merged

[CI] Reject out-of-vocabulary before they reach the GPU logprob path#44042
AndreasKaratzas merged 8 commits into
vllm-project:mainfrom
ROCm:akaratza_stabilize_sampler

Conversation

@AndreasKaratzas
Copy link
Copy Markdown
Member

This stabilizes a flakiness in ROCm CI seen at "AMD: Entrypoints Integration (API Server openai - Part 3) (mi325_1)":

The first real failure came from the Schemathesis-generated /inference/v1/generate request:

{
  "sampling_params": {
    "logprob_token_ids": [-35, -11378, 3392, 1873042417, 126]
  },
  "token_ids": [0]
}

After that request, the engine hit a ROCm/HSA hardware exception and died. The later failures in the same test were cascade failures from the module-scoped API server being dead. SamplingParams.verify() validated the length of logprob_token_ids, but did not validate that every requested token ID was in the model vocabulary. This was flaky because Schemathesis only sometimes generated the invalid logprob_token_ids field during the full OpenAI schema test group. When it did, the first bad request killed the server and unrelated endpoint checks failed afterward.

cc @kenroche

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@AndreasKaratzas AndreasKaratzas marked this pull request as ready for review May 30, 2026 01:50
@AndreasKaratzas AndreasKaratzas added the ready ONLY add when PR is ready to merge/full CI is needed label May 30, 2026
@mergify mergify Bot added the v1 label May 30, 2026
@hclsys
Copy link
Copy Markdown

hclsys commented May 30, 2026

right shape — admission-time rejection on < 0 or >= vocab_size cuts the crash before it hits the kernel, and _validate_logprobs is the consistent place for it (same pattern as the recent max_logprobs ge=-1 fix).

one edge worth a thought: vocab_size = model_config.get_vocab_size() assumes the model config is fully populated by the time .verify() runs. is there a path where verify() is called before the model loads (or with a model that doesn't expose get_vocab_size — your test uses SimpleNamespace, real ModelConfig always has it?) so this'd raise AttributeError instead of returning a clean VLLMValidationError? probably fine in practice since the existing max_logprobs check above already uses model_config, just flagging.

@AndreasKaratzas
Copy link
Copy Markdown
Member Author

@hclsys I think this is okay as-is because verify is typed around ModelConfig, and real ModelConfig always exposes get_vocab_size() from model_arch_config; this does not depend on weights being loaded. We already use get_vocab_size() for logprobs=-1 / prompt_logprobs=-1, so this should not introduce a new model_config assumption.

hclsys added a commit to hclsys/vllm that referenced this pull request May 30, 2026
…l_token_threshold

Both fields were declared as bare `int` with no Field constraint, and the
downstream validation chain only handled specific values:

- `max_logprobs`: only `-1` is rewritten to vocab_size; other negatives
  flow through and either land in a confusing "max allowed: -5" error or
  silently no-op on the cap check.
- `long_prefill_token_threshold`: the clamp is guarded by
  `0 < threshold < num_new_tokens` and the cap by `> max_model_len`, so a
  negative value matches neither and silently passes through unvalidated.

Add field_validators (mode="after"), matching the pattern landed in vllm-project#43794
and the recent vllm-project#44002 / vllm-project#44042 / vllm-project#44057. `max_logprobs` keeps the `-1`
sentinel for auto-derive; `long_prefill_token_threshold` requires `>= 0`
(0 = off, > 0 = clamp).

Fixes vllm-project#43985.

Signed-off-by: Chenglun Hu <chenglunhu@gmail.com>
@AndreasKaratzas
Copy link
Copy Markdown
Member Author

Update: Hardened the model runner shutdown path on ROCm by explicitly releasing captured CUDA/HIP graph state before the worker clears static forward context, drops model references, resets workspace state, and tears down distributed state. The cause was an intermittent ROCm CI failure in:

tests/v1/sample/test_logprobs.py::test_spec_decode_logprobs[eagle0-raw_logits]

The condition is intentionally ROCm-only. CUDA and other platforms keep the previous shutdown behavior. The failing test has a very specific lifetime pattern:

ref_llm = LLM(...)
ref_llm.generate(...)
del ref_llm
torch.accelerator.empty_cache()
cleanup_dist_env_and_memory()

spec_llm = LLM(..., speculative_config=...)

The Buildkite failure occurs after the reference engine logs Shutdown complete, then the EAGLE speculative engine starts. The HSA fault appears right after distributed initialization and before model loading:

Initializing a V1 LLM engine ...
world_size=1 rank=0 ...
rank 0 in world size 1 ...
Memory access fault by GPU node-2 ... Reason: Write access to a read-only page.

That timing points at a delayed GPU resource lifetime problem from the previous engine, not a synchronous error in the new engine's Python config parsing or model loading.

@AndreasKaratzas
Copy link
Copy Markdown
Member Author

cc @njhill Lmk if you are still OK with this PR. Don't want to squash it with new changes, even though they are all ROCm specific.

@AndreasKaratzas
Copy link
Copy Markdown
Member Author

Another small patch to keep the existing index_put_ logits update path for non-ROCm platforms, while routing ROCm through a flattened index_fill_ write. The previous CI run of this PR showed a GPU memory access fault during the thinking-budget forced-token write. My take was that ROCm graph/runtime changes exposed a latent ROCm/PyTorch advanced-indexing issue rather than introducing a thinking-budget logic regression. Flattening row/token pairs into one-dimensional indices avoids the fragile 2-D advanced-indexing kernel while preserving the same logits result.

@njhill
Copy link
Copy Markdown
Member

njhill commented Jun 2, 2026

Thanks @AndreasKaratzas ... do you think we could split these other rocm fixes into a separate PR?

The OOV validation I think makes sense generally, the other things are partially hack workaround for rocm bug IIUC.

@AndreasKaratzas
Copy link
Copy Markdown
Member Author

@njhill I added them here so that CI is green, cause without some of these, I would get errors, that were always there, but due to the refactoring were exposed only now.

@AndreasKaratzas AndreasKaratzas merged commit 53b88d1 into vllm-project:main Jun 3, 2026
67 checks passed
@AndreasKaratzas AndreasKaratzas deleted the akaratza_stabilize_sampler branch June 3, 2026 03:27
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026
…llm-project#44042)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
andakai pushed a commit to andakai/vllm that referenced this pull request Jun 4, 2026
JisoLya pushed a commit to JisoLya/vllm that referenced this pull request Jun 5, 2026
…llm-project#44042)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: JisoLya <523420504@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants