
[Spec Decode] Update extract_hidden_states to use deferred kv_connector clear#37013

Merged
mgoin merged 6 commits into vllm-project:main from fynnsu:extract_hidden_states_update
Mar 16, 2026

Conversation

@fynnsu (Contributor) commented Mar 13, 2026

Purpose

#35158 defers kv_connector finalization until after the draft models have run. extract_hidden_states previously re-initialized the kv_connector metadata as a workaround, but that is no longer needed.
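As a rough illustration of the deferred-clear pattern this relies on (class and attribute names here are hypothetical, not vLLM's actual API): the connector metadata must stay alive through the draft-model step, so the clear moves to the end of the scheduler step instead of happening between the target and draft passes.

```python
# Sketch only: illustrates why re-initializing metadata before the draft
# pass is no longer needed once the clear is deferred.
class Scheduler:
    def __init__(self):
        self.kv_connector_metadata = {"blocks": [1, 2, 3]}

    def run_target_model(self):
        # The target forward pass reads the connector metadata.
        return list(self.kv_connector_metadata["blocks"])

    def run_draft_model(self):
        # The draft model also reads it, so it must not be cleared yet.
        return list(self.kv_connector_metadata["blocks"])

    def step(self):
        target_out = self.run_target_model()
        draft_out = self.run_draft_model()
        # Deferred clear: only after both models have consumed the metadata.
        self.kv_connector_metadata = {"blocks": []}
        return target_out, draft_out
```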

Also patches a small bug in ExampleHiddenStatesConnector that caused occasional crashes.
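The crash is the familiar optional-field pattern: a per-request field can occasionally be None and must be skipped rather than indexed. A minimal sketch (the helper name and data shape are made up for illustration):

```python
# Hypothetical helper illustrating the guard added in this PR: skip
# requests whose new_block_ids is None instead of iterating over None.
def collect_block_ids(scheduled):
    collected = []
    for req_id, new_block_ids in scheduled:
        if new_block_ids is None:
            continue  # no new blocks allocated for this request this step
        collected.extend(new_block_ids)
    return collected
```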

Test Plan

Ran and updated the extract_hidden_states tests. Ran examples/offline_inference/extract_hidden_states.py.

Tested generating a few thousand samples and confirmed that output shapes and token ids always matched expectations, even when new_block_ids is None.

Test Result

See above



fynnsu added 2 commits March 13, 2026 21:21
vllm-project#35158 defers kv_connector finalization until after draft models have run. `extract_hidden_states` 
previously re-initialized the kv_connector metadata as a workaround but that is no longer needed.

Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
@gemini-code-assist (bot) left a comment

Code Review

This pull request refactors the extract_hidden_states speculative decoding method to align with recent changes in kv_connector finalization, removing a previous workaround. This simplifies the code in extract_hidden_states.py and gpu_model_runner.py by streamlining the kv_connector handling. Additionally, a bug in ExampleHiddenStatesConnector is fixed, preventing crashes when new_block_ids is None. The changes are well-implemented and improve code clarity and robustness. I have not identified any issues.

@tlrmchlsmth (Member) left a comment

One question, but otherwise LGTM and nice cleanup!

Comment on lines +289 to +290:

```python
if new_block_ids is None:
    continue
```
Member:

Why can new_block_ids be None?

@fynnsu (Contributor, Author):

I'm having difficulty tracking down the root cause of this. It happens when running a lot of async requests at once, and only infrequently (e.g. with 32 simultaneous requests I hit this after processing 700 requests successfully). It is a rare case in a rarely exercised section of the code.

I have verified that a real request triggers it (it isn't empty or malformed), and the request still produces the hidden states file output as expected, with the correct token ids and hidden state shapes.

Collaborator:

@fynnsu input_fits_in_drafter failing due to long sequences is not a concern for this style of mock-drafting, right?

@fynnsu (Contributor, Author):

> @fynnsu input_fits_in_drafter failing due to long sequences is not a concern for this style of mock-drafting, right?

I think that is actually a risk with the current setup. I believe it would only be triggered when max_seq_len == max_model_len (since num_speculative_tokens=1), but with Eagle3 we just fall back to not drafting, which doesn't work for the extract_hidden_states method.

Claude's suggestion is to just always set this to True for this draft method:

> There's a more direct fix to the original problem: since extract_hidden_states doesn't actually speculate (the "draft" tokens are always the sampled tokens, always verify), the input_fits_in_drafter guard is protecting against a scenario that doesn't apply. The check exists because real drafters (Eagle, DraftModel) would write KV entries beyond max_model_len. But the extract_hidden_states drafter's KV cache is a separate cache-only layer whose writes are indexed by the target model's slot_mapping — it writes at the same positions the target already wrote at, so the max_model_len concern doesn't arise.
>
> A targeted fix would be to make input_fits_in_drafter always True for extract_hidden_states:

```python
input_fits_in_drafter = spec_decode_common_attn_metadata is not None and (
    spec_config.uses_extract_hidden_states()  # always fits
    or spec_decode_common_attn_metadata.max_seq_len + self.num_spec_tokens
    <= self.effective_drafter_max_model_len
)
```

That logic makes sense to me: we shouldn't actually need the extra position for the "drafted" token; we just can't set num_speculative_tokens=0 without updating a whole bunch of guards. What do you think?

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 13, 2026
fynnsu added 2 commits March 13, 2026 21:48
Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
@fynnsu fynnsu force-pushed the extract_hidden_states_update branch from af10200 to c22855b Compare March 13, 2026 21:49
fynnsu added 2 commits March 14, 2026 03:20
Previously we needed to patch the `has_kv_transfer_group` import but that's no longer needed.

Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
@mgoin mgoin merged commit 04bf5a3 into vllm-project:main Mar 16, 2026
49 of 50 checks passed
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
…or clear (vllm-project#37013)

Signed-off-by: wendyliu235 <wenjun.liu@intel.com>
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026

Labels

kv-connector ready ONLY add when PR is ready to merge/full CI is needed speculative-decoding v1
