Implement rotary embeddings #7
```diff
@@ -14,7 +14,7 @@
 # limitations under the License.

 import torch
-from megatron.model.enums import AttnMaskType
+from megatron.enums import AttnMaskType
```
Collaborator:
This move surprises me. Did you move …?

Member (Author):
Basically this caused a circular dependency when I added my enum in that file and imported it in …. In order to remove this dependency I've moved the enums outside of `model`, as it's safe to say that importing enums should not be linked to the model. Please see the section ….
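To make the dependency argument concrete, here is a minimal sketch of the kind of standalone enums module being described; the file path and enum members are illustrative assumptions, not necessarily what this PR adds.

```python
# Hypothetical sketch of a standalone megatron/enums.py: it depends only on the
# standard library, so both megatron.model and non-model code can import it
# without creating an import cycle. The members shown here are assumptions.
import enum


class AttnMaskType(enum.Enum):
    padding = 1
    causal = 2


class PositionEmbeddingType(enum.Enum):
    absolute = 1
    rotary = 2
```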
```diff
 class ScaledUpperTriangMaskedSoftmax(torch.autograd.Function):
```
@@ -0,0 +1,51 @@ (new file)

```python
# Extracted from: https://github.com/EleutherAI/gpt-neox
import torch


class RotaryEmbedding(torch.nn.Module):

    def __init__(self, dim, base=10000, precision=torch.half):
        super().__init__()
        inv_freq = 1. / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)
        self.max_seq_len_cached = None
        self.cos_cached = None
        self.sin_cached = None
        self.precision = precision

    def forward(self, x, seq_dim=1, seq_len=None):
        if seq_len is None:
            seq_len = x.shape[seq_dim]
        if self.max_seq_len_cached is None or (seq_len > self.max_seq_len_cached):
            self.max_seq_len_cached = seq_len
            t = torch.arange(self.max_seq_len_cached, device=x.device, dtype=self.inv_freq.dtype)
            freqs = torch.einsum('i,j->ij', t, self.inv_freq)
            # Different from paper, but it uses a different permutation in order to obtain the same calculation
            emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
            if self.precision == torch.bfloat16:
                emb = emb.float()
            # [sx, 1 (b * np), hn]
            self.cos_cached = emb.cos()[:, None, :]
            self.sin_cached = emb.sin()[:, None, :]
            if self.precision == torch.bfloat16:
                self.cos_cached = self.cos_cached.bfloat16()
                self.sin_cached = self.sin_cached.bfloat16()
        return self.cos_cached[:seq_len, ...], self.sin_cached[:seq_len, ...]


# rotary pos emb helpers:

def rotate_half(x):
    x1, x2 = x[..., :x.shape[-1] // 2], x[..., x.shape[-1] // 2:]
    return torch.cat((-x2, x1), dim=x1.ndim - 1)  # dim=-1 triggers a bug in earlier torch versions


@torch.jit.script
def apply_rotary_pos_emb(q, k, cos, sin, offset: int = 0):
```
Collaborator:
I like having helper functions, but is there a reason why we can't run them from the forward?

Member (Author):
So I'm not familiar with ….
```python
    cos, sin = cos[offset:q.shape[0] + offset, ...], sin[offset:q.shape[0] + offset, ...]
    return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)


def apply_rotary_pos_emb_torch(q, k, cos, sin, offset: int = 0):  # jitting fails with bf16
    cos, sin = cos[offset:q.shape[0] + offset, ...], sin[offset:q.shape[0] + offset, ...]
    return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)
```
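For orientation, here is a minimal sketch of how these pieces might be wired together at call time. The flattened `[seq_len, batch * heads, head_dim]` layout is inferred from the `# [sx, 1 (b * np), hn]` comment above, and picking the non-jitted helper for bf16 follows the `# jitting fails with bf16` note; treat the shapes and names as assumptions rather than the exact call sites in this PR.

```python
import torch

# Assumed shapes following the "[sx, 1 (b * np), hn]" comment above:
# queries/keys flattened to [seq_len, batch * heads, head_dim].
seq_len, batch_times_heads, head_dim = 16, 8, 32
q = torch.randn(seq_len, batch_times_heads, head_dim)
k = torch.randn(seq_len, batch_times_heads, head_dim)

rotary = RotaryEmbedding(head_dim)   # assumes the class above is importable
cos, sin = rotary(q, seq_dim=0)      # each of shape [seq_len, 1, head_dim], broadcast over b * np

# The plain-torch variant exists because scripting fails with bf16 (per the comment above),
# so a caller would presumably select the helper based on dtype.
apply_fn = apply_rotary_pos_emb_torch if q.dtype == torch.bfloat16 else apply_rotary_pos_emb
q_rot, k_rot = apply_fn(q, k, cos, sin)
print(q_rot.shape, k_rot.shape)      # torch.Size([16, 8, 32]) for both
```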