
GQA models have not supported prefix caching #2873

Closed
toslunar wants to merge 1 commit into vllm-project:main from toslunar:prefix-gqa-not-yet

Conversation

@toslunar
Contributor

I found that a model using GQA returns wrong results with prefix_pos. After some investigation, I traced it to the code that supports MQA/GQA:

if self.num_kv_heads != self.num_heads:
    # As of Nov 2023, xformers only supports MHA. For MQA/GQA,
    # project the key and value tensors to the desired number of
    # heads.
    # TODO(woosuk): Use MQA/GQA kernels for higher performance.
    query = query.view(query.shape[0], self.num_kv_heads,
                       self.num_queries_per_kv, query.shape[-1])
    key = key[:, :, None, :].expand(key.shape[0], self.num_kv_heads,
                                    self.num_queries_per_kv,
                                    key.shape[-1])
    value = value[:, :, None, :].expand(value.shape[0],
                                        self.num_kv_heads,
                                        self.num_queries_per_kv,
                                        value.shape[-1])

This code, which repeats the key and value tensors across the query heads, is not compatible with the current implementation of prefix caching (context_attention_fwd).
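For intuition, here is a runnable sketch of that shape bookkeeping with made-up dimensions (num_tokens, num_heads, etc. are stand-ins for the module attributes, not vLLM code):

import torch

# Illustrative sizes only: num_heads = num_kv_heads * num_queries_per_kv.
num_tokens, num_heads, num_kv_heads, head_size = 4, 8, 2, 16
num_queries_per_kv = num_heads // num_kv_heads

query = torch.randn(num_tokens, num_heads, head_size)
key = torch.randn(num_tokens, num_kv_heads, head_size)

# Group the query heads by the KV head they share ...
query = query.view(num_tokens, num_kv_heads, num_queries_per_kv, head_size)
# ... and broadcast each KV head across its query group.
key = key[:, :, None, :].expand(num_tokens, num_kv_heads,
                                num_queries_per_kv, head_size)
assert query.shape == key.shape == (4, 2, 4, 16)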

To support MQA/GQA, something like

if self.num_kv_heads != self.num_heads:
    query = query.view(batch_size * seq_len, self.num_heads, self.head_size)
    key = key.reshape(batch_size * seq_len, self.num_heads, self.head_size)
    value = value.reshape(batch_size * seq_len, self.num_heads, self.head_size)

is closer, but the prefix KV should also be expanded (after it is read from key_cache and value_cache).
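A minimal sketch of what that expansion could look like. expand_kv_for_prefix is a hypothetical helper for illustration, not vLLM API; the real fix is the one referenced below:

import torch

def expand_kv_for_prefix(cached_kv: torch.Tensor,
                         num_queries_per_kv: int) -> torch.Tensor:
    # cached_kv: [num_prefix_tokens, num_kv_heads, head_size], as read
    # back from key_cache / value_cache.
    num_tokens, num_kv_heads, head_size = cached_kv.shape
    # Broadcast each KV head across its query group, then flatten the
    # group dimension so the result has num_heads attention heads.
    return (cached_kv[:, :, None, :]
            .expand(num_tokens, num_kv_heads, num_queries_per_kv, head_size)
            .reshape(num_tokens, num_kv_heads * num_queries_per_kv, head_size))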

@sighingnow
Collaborator

The issue was addressed by #3007

@WoosukKwon WoosukKwon closed this Mar 2, 2024