[Spec Decode] Update `extract_hidden_states` to use deferred kv_connector clear (#37013)
vllm-project#35158 defers kv_connector finalization until after draft models have run. `extract_hidden_states` previously re-initialized the kv_connector metadata as a workaround, but that is no longer needed.

Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
Code Review
This pull request refactors the extract_hidden_states speculative decoding method to align with recent changes in kv_connector finalization, removing a previous workaround. This simplifies the code in extract_hidden_states.py and gpu_model_runner.py by streamlining the kv_connector handling. Additionally, a bug in ExampleHiddenStatesConnector is fixed, preventing crashes when new_block_ids is None. The changes are well-implemented and improve code clarity and robustness. I have not identified any issues.
tlrmchlsmth left a comment:
One question, but otherwise LGTM and nice cleanup!
```python
if new_block_ids is None:
    continue
```
Why can new_block_ids be None?
Having difficulty tracking down the root cause for this. It happens when running a lot of async requests at once, and only infrequently (e.g. with 32 simultaneous requests I hit this after processing 700 requests successfully). It is a rare case on a rare section of the code.
I have verified that it is a real request that triggers it (isn't empty or something), but the request still produces the hidden states file output as expected with the correct token ids and hidden state shapes.
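The guard under discussion can be sketched as a small standalone function (a sketch only: the loop shape and the `build_connector_meta` name are assumptions, not the actual `ExampleHiddenStatesConnector` method signature):

```python
def build_connector_meta(requests):
    """Collect (request_id, new_block_ids) pairs, skipping unset entries.

    Hypothetical illustration of the guard added in this PR: under heavy
    async load, new_block_ids can occasionally be None, and we skip such
    requests rather than crash -- the request still completes normally.
    """
    metadata = []
    for req_id, new_block_ids in requests:
        if new_block_ids is None:
            continue
        metadata.append((req_id, new_block_ids))
    return metadata

# usage with toy data: request "b" has no block ids yet and is skipped
reqs = [("a", [1, 2]), ("b", None), ("c", [3])]
print(build_connector_meta(reqs))  # [('a', [1, 2]), ('c', [3])]
```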
@fynnsu input_fits_in_drafter failing due to long sequences is not a concern for this style of mock-drafting, right?
> @fynnsu `input_fits_in_drafter` failing due to long sequences is not a concern for this style of mock-drafting, right?
I think that is actually a risk with the current setup. I believe it would only be triggered if max_seq_len == max_model_len, since num_speculative_tokens=1, but with Eagle3 we just fall back to not drafting, which doesn't work for the extract_hidden_states method.
Claude's suggestion is to just always set this to True for this draft method:
> There's a more direct fix to the original problem: since extract_hidden_states doesn't actually speculate (the "draft" tokens are always the sampled tokens, always verify), the input_fits_in_drafter guard is protecting against a scenario that doesn't apply. The check exists because real drafters (Eagle, DraftModel) would write KV entries beyond max_model_len. But the extract_hidden_states drafter's KV cache is a separate cache-only layer whose writes are indexed by the target model's slot_mapping: it writes at the same positions the target already wrote at, so the max_model_len concern doesn't arise. A targeted fix would be to make input_fits_in_drafter always True for extract_hidden_states:
```python
input_fits_in_drafter = spec_decode_common_attn_metadata is not None and (
    spec_config.uses_extract_hidden_states()  # always fits
    or spec_decode_common_attn_metadata.max_seq_len + self.num_spec_tokens
    <= self.effective_drafter_max_model_len
)
```
That logic does make sense to me and we shouldn't actually need the extra position of the "drafted" token, we just can't set num_speculative_tokens=0 without updating a whole bunch of guards. What do you think?
Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
Force-pushed af10200 to c22855b.
Previously we needed to patch the `has_kv_transfer_group` import but that's no longer needed. Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
…or clear (vllm-project#37013) Signed-off-by: wendyliu235 <wenjun.liu@intel.com>
Purpose
#35158 defers kv_connector finalization until after draft models have run.
`extract_hidden_states` previously re-initialized the kv_connector metadata as a workaround, but that is no longer needed. This PR also patches a small bug in `ExampleHiddenStatesConnector` that caused occasional crashes.
Test Plan
Ran/updated the `extract_hidden_states` tests. Ran `examples/offline_inference/extract_hidden_states.py`. Tested generating a few thousand samples and confirmed that output shapes and token ids always matched expectations, even when `new_block_ids` is None.

Test Result
See above
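The shape/token-id check described in the test plan could be sketched as follows (the `verify_sample` helper and the sample layout are assumptions for illustration, not the actual output format of `extract_hidden_states.py`):

```python
def verify_sample(token_ids, hidden_states, hidden_size):
    """Check one extracted sample: the hidden-states matrix should have
    one row per token id, each row of the expected hidden size."""
    assert len(hidden_states) == len(token_ids), (
        f"expected {len(token_ids)} rows, got {len(hidden_states)}"
    )
    assert all(len(row) == hidden_size for row in hidden_states)
    return True

# usage with toy data: 3 tokens, hidden size 4
sample = {
    "token_ids": [101, 42, 7],
    "hidden_states": [[0.0] * 4 for _ in range(3)],
}
verify_sample(sample["token_ids"], sample["hidden_states"], hidden_size=4)
```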
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.