Implement Gemma 2 models #486

Closed
3 tasks done
EricLBuehler opened this issue Jun 28, 2024 · 1 comment
Labels
models Additions to model or architectures

Comments

@EricLBuehler
Owner

EricLBuehler commented Jun 28, 2024

Need to modify the Gemma model implementation for Gemma 2.

Changelist over the original Gemma, and status:

  • Sliding window attention - layers with idx % 2 != 0 (every other layer) use a sliding window (see the mask sketch after this list)
    • Affects KV cache retrieval
    • Affects sliding window mask generation
  • Logit soft capping
    • In attention, between computing Q*K^T * s and the matmul with V:
        if self.config.attn_logit_softcapping is not None:
            attn_weights = attn_weights / self.config.attn_logit_softcapping
            attn_weights = torch.tanh(attn_weights)
            attn_weights = attn_weights * self.config.attn_logit_softcapping
    • After the LM head:
        if self.config.final_logit_softcapping is not None:
            logits = logits / self.config.final_logit_softcapping
            logits = torch.tanh(logits)
            logits = logits * self.config.final_logit_softcapping
  • Use query_pre_attn_scalar**-0.5 as the attention scale instead of 1/sqrt(head_dim) (see the scaling sketch below)
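
A minimal PyTorch sketch of the alternating sliding-window masking described above, assuming an additive attention mask in the style of the Hugging Face transformers Gemma 2 code; the make_causal_mask helper and the concrete num_layers/window/seq_len values are illustrative assumptions, not the project's actual implementation:

    import torch

    def make_causal_mask(seq_len, sliding_window=None, dtype=torch.float32):
        # Standard causal mask: query i may attend to keys j <= i.
        mask = torch.full((seq_len, seq_len), torch.finfo(dtype).min, dtype=dtype)
        mask = torch.triu(mask, diagonal=1)
        if sliding_window is not None:
            # Additionally mask keys more than `sliding_window` tokens behind
            # the query, i.e. keep only j > i - sliding_window.
            too_old = torch.tril(
                torch.ones(seq_len, seq_len, dtype=torch.bool),
                diagonal=-sliding_window,
            )
            mask = mask.masked_fill(too_old, torch.finfo(dtype).min)
        return mask

    # Layers with idx % 2 != 0 get the sliding-window mask; the rest use full
    # causal attention. Illustrative values only.
    num_layers, window, seq_len = 4, 4, 8
    masks = [
        make_causal_mask(seq_len, sliding_window=window if idx % 2 != 0 else None)
        for idx in range(num_layers)
    ]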
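
And a small sketch of the changed attention scale, assuming query_pre_attn_scalar is read from the model config as in the Hugging Face implementation; the shapes and the value 256 are placeholder assumptions:

    import math
    import torch

    head_dim, query_pre_attn_scalar = 256, 256   # placeholder config values
    q = torch.randn(1, 8, 16, head_dim)          # (batch, heads, seq, head_dim)
    k = torch.randn(1, 8, 16, head_dim)

    # Original Gemma scaled attention scores by 1/sqrt(head_dim).
    scale_gemma1 = 1.0 / math.sqrt(head_dim)
    # Gemma 2 scales by query_pre_attn_scalar ** -0.5 instead.
    scale_gemma2 = query_pre_attn_scalar ** -0.5

    attn_weights = torch.matmul(q, k.transpose(-2, -1)) * scale_gemma2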

Links

@EricLBuehler EricLBuehler added the models Additions to model or architectures label Jun 28, 2024
@EricLBuehler
Owner Author

Implemented in #490.
