[CI] Reject out-of-vocabulary before they reach the GPU logprob path #44042
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
|
right shape — admission-time rejection on one edge worth a thought: |
|
@hclsys I think this is okay as-is because |
…l_token_threshold

Both fields were declared as bare `int` with no Field constraint, and the downstream validation chain only handled specific values:

- `max_logprobs`: only `-1` is rewritten to vocab_size; other negatives flow through and either land in a confusing "max allowed: -5" error or silently no-op on the cap check.
- `long_prefill_token_threshold`: the clamp is guarded by `0 < threshold < num_new_tokens` and the cap by `> max_model_len`, so a negative value matches neither and silently passes through unvalidated.

Add field_validators (mode="after"), matching the pattern landed in vllm-project#43794 and the recent vllm-project#44002 / vllm-project#44042 / vllm-project#44057. `max_logprobs` keeps the `-1` sentinel for auto-derive; `long_prefill_token_threshold` requires `>= 0` (0 = off, > 0 = clamp).

Fixes vllm-project#43985.

Signed-off-by: Chenglun Hu <chenglunhu@gmail.com>
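As a rough illustration of the two rules in that commit message, here is a minimal sketch. It uses a standard-library dataclass with `__post_init__` as a stand-in for the Pydantic `field_validator(mode="after")` pattern the commit refers to. The class name `SketchConfig`, the default values, and the error wording are assumptions made for this example, not the actual vLLM definitions.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in for the real config class (illustration only)."""

    # Defaults are arbitrary placeholders chosen for this sketch.
    max_logprobs: int = 20
    long_prefill_token_threshold: int = 0

    def __post_init__(self) -> None:
        # -1 stays valid as the "derive from vocab size" sentinel;
        # every other negative value is rejected up front.
        if self.max_logprobs < -1:
            raise ValueError(
                f"max_logprobs must be -1 or >= 0, got {self.max_logprobs}"
            )
        # 0 means "off" and > 0 means "clamp"; a negative threshold
        # has no meaning, so it is rejected instead of passing through.
        if self.long_prefill_token_threshold < 0:
            raise ValueError(
                "long_prefill_token_threshold must be >= 0, "
                f"got {self.long_prefill_token_threshold}"
            )
```

With this shape, `SketchConfig(max_logprobs=-1)` is accepted while `SketchConfig(max_logprobs=-5)` fails at construction time, so the bad value never reaches the downstream clamp and cap checks.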
|
Update: Hardened the model runner shutdown path on ROCm by explicitly releasing captured CUDA/HIP graph state before the worker clears static forward context, drops model references, resets workspace state, and tears down distributed state.

The cause was an intermittent ROCm CI failure in:

The condition is intentionally ROCm-only. CUDA and other platforms keep the previous shutdown behavior.

The failing test has a very specific lifetime pattern:

    ref_llm = LLM(...)
    ref_llm.generate(...)
    del ref_llm
    torch.accelerator.empty_cache()
    cleanup_dist_env_and_memory()
    spec_llm = LLM(..., speculative_config=...)

The Buildkite failure occurs after the reference engine logs

That timing points at a delayed GPU resource lifetime problem from the previous engine, not a synchronous error in the new engine's Python config parsing or model loading. |
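The ordering described in this update can be sketched as follows. This is only an illustration of the sequence; the function name `shutdown_steps` and the step labels are hypothetical and do not correspond to real vLLM methods.

```python
def shutdown_steps(is_rocm: bool) -> list[str]:
    """Return the teardown order as labels (hypothetical names, sketch only)."""
    steps: list[str] = []
    if is_rocm:
        # ROCm-only: release captured CUDA/HIP graph state first, so it is
        # gone before anything it may still reference is torn down.
        steps.append("release_captured_graphs")
    # Every platform then runs the pre-existing shutdown sequence unchanged.
    steps += [
        "clear_static_forward_context",
        "drop_model_references",
        "reset_workspace_state",
        "teardown_distributed_state",
    ]
    return steps
```

Gating the extra step on the platform keeps the non-ROCm path identical to the previous behavior, which matches the stated intent of the change.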
|
cc @njhill Lmk if you are still OK with this PR. Don't want to squash it with new changes, even though they are all ROCm specific. |
|
Another small patch to keep the existing |
|
Thanks @AndreasKaratzas ... do you think we could split these other rocm fixes into a separate PR? The OOV validation I think makes sense generally, the other things are partially hack workaround for rocm bug IIUC. |
|
@njhill I added them here so that CI is green, cause without some of these, I would get errors, that were always there, but due to the refactoring were exposed only now. |
…llm-project#44042) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
…llm-project#44042) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…llm-project#44042) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: JisoLya <523420504@qq.com>
This stabilizes a flakiness in ROCm CI seen at "AMD: Entrypoints Integration (API Server openai - Part 3) (mi325_1)":
The first real failure came from the Schemathesis-generated `/inference/v1/generate` request:

    {
      "sampling_params": {
        "logprob_token_ids": [-35, -11378, 3392, 1873042417, 126]
      },
      "token_ids": [0]
    }

After that request, the engine hit a ROCm/HSA hardware exception and died. The later failures in the same test were cascade failures from the module-scoped API server being dead.

`SamplingParams.verify()` validated the length of `logprob_token_ids`, but did not validate that every requested token ID was in the model vocabulary.

This was flaky because Schemathesis only sometimes generated the invalid `logprob_token_ids` field during the full OpenAI schema test group. When it did, the first bad request killed the server and unrelated endpoint checks failed afterward.

cc @kenroche
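For illustration, the missing range check could look roughly like the sketch below. The helper name `check_logprob_token_ids`, the error wording, and the vocabulary size used in the usage note are assumptions for this example. The actual change lives in vLLM's own validation code and may differ.

```python
from collections.abc import Sequence


def check_logprob_token_ids(token_ids: Sequence[int], vocab_size: int) -> None:
    """Raise ValueError if any ID falls outside [0, vocab_size).

    Hypothetical helper: a sketch of admission-time rejection, not vLLM's API.
    """
    out_of_vocab = [t for t in token_ids if not 0 <= t < vocab_size]
    if out_of_vocab:
        raise ValueError(
            f"logprob_token_ids contains out-of-vocabulary IDs {out_of_vocab}; "
            f"valid range is [0, {vocab_size})"
        )
```

Assuming a vocabulary of 128256 entries (an arbitrary figure picked for this example), the request above would be rejected because `-35`, `-11378`, and `1873042417` are out of range, while `3392` and `126` pass. Rejecting the request at admission time turns a hardware exception into an ordinary client error.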