Support cross attention kv cache #187

larryliu0820 · 2025-11-18T08:30:08Z

To avoid redundant computation, we want to support a KV cache for cross attention in Whisper.

Fundamentally, k_proj and v_proj only need to run once on the encoder output hidden states, when the first token is generated. After that we should keep key_states and value_states and reuse them for every subsequent token.
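A minimal eager-mode sketch of this idea, assuming plain weight tensors in place of the real k_proj and v_proj modules. The class name CrossAttentionKVCache and its num_projection_calls counter are hypothetical and only illustrate that the projections run once; this is not the code in the PR.

```python
import torch
import torch.nn.functional as F


class CrossAttentionKVCache:
    """Hypothetical helper, not part of the PR: project once, then reuse."""

    def __init__(self, k_weight, v_weight, v_bias):
        self.k_weight = k_weight  # stands in for k_proj.weight (bias=False)
        self.v_weight = v_weight  # stands in for v_proj.weight
        self.v_bias = v_bias      # stands in for v_proj.bias
        self.key_states = None
        self.value_states = None
        self.num_projection_calls = 0

    def get(self, encoder_hidden_states):
        # Only the first decoding step pays for the two projections.
        if self.key_states is None:
            self.key_states = F.linear(encoder_hidden_states, self.k_weight)
            self.value_states = F.linear(encoder_hidden_states, self.v_weight, self.v_bias)
            self.num_projection_calls += 2
        return self.key_states, self.value_states


torch.manual_seed(0)
d_model, src_len = 8, 6  # tiny stand-ins for 1280 and the encoder length
cache = CrossAttentionKVCache(
    torch.randn(d_model, d_model), torch.randn(d_model, d_model), torch.randn(d_model)
)
encoder_out = torch.randn(1, src_len, d_model)

for _ in range(5):  # five generated tokens
    keys, values = cache.get(encoder_out)

print(cache.num_projection_calls)  # 2, not 10
```

A Python-level branch like this is typically specialized to a single path when the model is traced for export, which is presumably why the PR moves the decision into torch.cond.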

For whisper-large-v3-turbo, which has 4 decoder layers:

WhisperDecoder(
  (embed_tokens): Embedding(51866, 1280, padding_idx=50257)
  (embed_positions): WhisperPositionalEmbedding(448, 1280)
  (layers): ModuleList(
    (0-3): 4 x WhisperDecoderLayer(
      (self_attn): WhisperAttention(
        (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
        (v_proj): Linear(in_features=1280, out_features=1280, bias=True)
        (q_proj): Linear(in_features=1280, out_features=1280, bias=True)
        (out_proj): Linear(in_features=1280, out_features=1280, bias=True)
      )
      (activation_fn): GELUActivation()
      (self_attn_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (encoder_attn): WhisperAttention(
        (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
        (v_proj): Linear(in_features=1280, out_features=1280, bias=True)
        (q_proj): Linear(in_features=1280, out_features=1280, bias=True)
        (out_proj): Linear(in_features=1280, out_features=1280, bias=True)
      )
      (encoder_attn_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
      (fc1): Linear(in_features=1280, out_features=5120, bias=True)
      (fc2): Linear(in_features=5120, out_features=1280, bias=True)
      (final_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
    )
  )
  (layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
)

Without a KV cache in encoder_attn, we run two 1280x1280 matmuls (k_proj and v_proj) in each layer, so eight 1280x1280 matmuls in total for every generated token. This significantly hurts the tokens/sec number.
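A rough back-of-the-envelope count of that redundant work. It assumes batch size 1 and an encoder output of 1500 positions (the max_source_positions value quoted in the review thread), and it counts only the k_proj/v_proj matmuls, ignoring biases and everything else in the layer.

```python
D_MODEL = 1280             # in_features / out_features of k_proj and v_proj
NUM_DECODER_LAYERS = 4     # whisper-large-v3-turbo
PROJECTIONS_PER_LAYER = 2  # k_proj and v_proj in encoder_attn
ENCODER_POSITIONS = 1500   # assumption: full-length encoder output, batch size 1

# One projection multiplies a (1500 x 1280) activation by a (1280 x 1280) weight.
macs_per_projection = ENCODER_POSITIONS * D_MODEL * D_MODEL
macs_per_token = macs_per_projection * PROJECTIONS_PER_LAYER * NUM_DECODER_LAYERS

print(f"{macs_per_token / 1e9:.2f} GMACs recomputed per generated token")  # 19.66
```

With the cache in place this cost is paid once per audio segment instead of once per token.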

This PR replaces encoder_attn with a WhisperCrossAttention class, in which the Python if condition is replaced with torch.cond. The logic becomes:

If KV cache values are all zero:
- Compute KV projections
Otherwise:
- Clone from the KV cache. Note that we can't return the KV cache directly, because torch.cond branches must not return tensors that alias their inputs.
After torch.cond:
- Write the values returned by whichever branch ran back to the KV cache

Note that the cached path still costs one extra read and one extra write, but that should be much faster than the matmuls.

jackzhxng · 2025-11-19T17:12:43Z

optimum/exporters/executorch/integrations.py

+        self.cross_attention_cache = StaticCache(
+            config=self.config,
+            max_batch_size=batch_size,
+            max_cache_len=getattr(self.config, "max_source_positions", max_static_cache_length), # This is fixed in whisper


Pull this outside into a var like the other arguments
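One way to read this suggestion, sketched with stand-in config objects rather than the real model config. The helper name, the variable name max_cross_attention_cache_len, and the fallback value 448 are hypothetical; 1500 is the max_source_positions value quoted elsewhere in this thread.

```python
from types import SimpleNamespace

max_static_cache_length = 448  # hypothetical fallback value for this sketch

whisper_like_config = SimpleNamespace(max_source_positions=1500)
other_config = SimpleNamespace()  # a config without max_source_positions


def cross_attention_cache_len(config, fallback):
    # Resolved once up front, then passed as max_cache_len=... like the other arguments.
    return getattr(config, "max_source_positions", fallback)


max_cross_attention_cache_len = cross_attention_cache_len(whisper_like_config, max_static_cache_length)
print(max_cross_attention_cache_len)  # 1500
print(cross_attention_cache_len(other_config, max_static_cache_length))  # 448
```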

jackzhxng · 2025-11-19T17:13:16Z

optimum/exporters/executorch/integrations.py

+        self.cross_attention_cache = StaticCache(
+            config=self.config,
+            max_batch_size=batch_size,
+            max_cache_len=getattr(self.config, "max_source_positions", max_static_cache_length), # This is fixed in whisper


Also, what do you mean by "this is fixed in whisper"? Will this work for T5?

larryliu0820 · 2025-11-19

Whisper models always have max_source_positions = 1500, which corresponds to 30 seconds of audio, so we should use that as the cache length. I don't know about T5, which is why I named this class WhisperCrossAttention.
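The 1500-position / 30-second correspondence can be checked with a little arithmetic. The constants below are assumptions taken from the standard Whisper front end (16 kHz audio, a 160-sample hop between mel frames, and a stride-2 convolution in the encoder); they are not stated in this PR.

```python
SAMPLING_RATE_HZ = 16_000  # assumption: Whisper feature extractor sampling rate
HOP_LENGTH = 160           # assumption: samples between consecutive mel frames (10 ms)
ENCODER_CONV_STRIDE = 2    # assumption: the encoder halves the number of mel frames
MAX_SOURCE_POSITIONS = 1500


def audio_seconds(num_encoder_positions):
    """Seconds of audio covered by a given number of encoder positions."""
    mel_frames = num_encoder_positions * ENCODER_CONV_STRIDE
    samples = mel_frames * HOP_LENGTH
    return samples / SAMPLING_RATE_HZ


print(audio_seconds(MAX_SOURCE_POSITIONS))  # 30.0
```

Each encoder position therefore covers 20 ms of audio, so a cross attention cache of length 1500 covers one full 30-second window.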

jackzhxng · 2025-11-19T17:14:56Z

optimum/exporters/executorch/integrations.py

+                f"cross_attention_value_cache_{i}", self.cross_attention_cache.layers[i].values, persistent=False
+            )
+
+        # Massage decoder to use cross attention.


Suggested change

-        # Massage decoder to use cross attention.
+        # Use custom cross attention for Whisper.

jackzhxng · 2025-11-19T17:16:35Z

optimum/exporters/executorch/whisper_attention.py

+# limitations under the License.
+
+# Export friendly cross attention implementation for Whisper. Adopted
+# from https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/modeling_whisper.py#L241


Suggested change

-# from https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/modeling_whisper.py#L241
+# from https://github.com/huggingface/transformers/blob/454c0a7ccf33f7fc13e3e2eb9b188a5c09ab708b/src/transformers/models/whisper/modeling_whisper.py#L241

A permalink is better in case the code changes.

jackzhxng · 2025-11-19T17:19:27Z

optimum/exporters/executorch/whisper_attention.py

+                {"cache_position": None},
+            )
+
+        else:


Should we remove this else branch if we aren't expecting to use it?

jackzhxng · 2025-11-19T17:21:55Z

optimum/exporters/executorch/whisper_attention.py

+            )
+
+            # Update the KV cache outside of torch.cond.
+            past_key_values.update(


Why not put this inside the recompute_kv branch?

jackzhxng

Oh also run make style for formatting

Support cross attention kv cache

956d964

larryliu0820 requested a review from jackzhxng November 18, 2025 08:30

jackzhxng reviewed Nov 19, 2025


