Bump transformers to 4.25.1 #151
Conversation
huggingface-hub==0.11.1
transformers==4.25.1
protobuf>=3.20.3,<4.0dev
hivemind==1.1.3
Also going to bump it, but that's a separate PR.
@@ -0,0 +1,74 @@
"""
This file is not new: it was renamed from model.py, but git does not recognize the rename in the diff.
borzunov left a comment
We've found some bugs, so this is pending their resolution.
for i in range(0, num_embeddings, self.chunk_size):
    chunk = word_embeddings[i : i + self.chunk_size].float()
    output[..., i : i + self.chunk_size] = F.linear(hidden_states, chunk)
Not sure if this is worth doing, but maybe you can do torch.matmul(hidden_states, chunk, out=output[..., i : i + self.chunk_size]) to avoid allocating memory for the intermediate result?
Tried to do the same thing, but to no avail (see the sketch below):
- On GPU, F.linear appears to have better support for some optimizations like TF32 (enabled by default).
- On CPU, this has no effect.
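For reference, a minimal runnable sketch of the two variants being compared. Only the loop body and the F.linear-vs-matmul trade-off come from the diff and comments above; the chunked_lm_head wrapper, tensor sizes, and chunk_size value are made up for illustration.

```python
import torch
import torch.nn.functional as F

def chunked_lm_head(hidden_states: torch.Tensor, word_embeddings: torch.Tensor, chunk_size: int) -> torch.Tensor:
    """Compute LM-head logits chunk by chunk to bound peak memory (illustrative wrapper)."""
    num_embeddings = word_embeddings.shape[0]
    output = torch.empty(*hidden_states.shape[:-1], num_embeddings, dtype=torch.float32)
    for i in range(0, num_embeddings, chunk_size):
        chunk = word_embeddings[i : i + chunk_size].float()
        # Variant kept in the PR: F.linear benefits from GPU optimizations such as TF32.
        output[..., i : i + chunk_size] = F.linear(hidden_states, chunk)
        # Suggested alternative, writing into the output slice directly to skip the
        # intermediate allocation; per the discussion above, it gave no speedup in practice:
        # torch.matmul(hidden_states, chunk.t(), out=output[..., i : i + chunk_size])
    return output

hidden = torch.randn(2, 8, 512)        # [batch, seq, hidden_size] (made-up sizes)
embeddings = torch.randn(50_000, 512)  # [vocab_size, hidden_size]
logits = chunked_lm_head(hidden, embeddings, chunk_size=4096)
print(logits.shape)                    # torch.Size([2, 8, 50000])
```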
key_past = key_cache.flatten(0, 1)[:, :, :prefix_length]  # [batch * num_heads, head_dim, kv_length]
value_past = value_cache.flatten(0, 1)[:, :prefix_length, :]  # [batch * num_heads, kv_length, head_dim]
Can't you just directly reshape the past tensors to these shapes like you've done in src/petals/server/handler.py?
Nope, we can't:
- hypo_ids need the cache to have shape [2, batch_size, ...]
- training needs the key as [batch_size * heads, ..., length] and the value as [..., length, :], which makes them non-concatenable
- the handler needs them to be concatenable into a single tensor
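A shape-only toy version of the two cache views from the diff above may make the layouts clearer. The names key_cache/value_cache and the commented output shapes come from the diff; the concrete sizes and the zero-filled tensors are placeholders.

```python
import torch

batch, num_heads, head_dim, max_length, prefix_length = 2, 4, 8, 16, 5

# Keys are cached transposed ([..., head_dim, max_length]) while values keep
# [..., max_length, head_dim], matching the shapes in the comments above.
key_cache = torch.zeros(batch, num_heads, head_dim, max_length)
value_cache = torch.zeros(batch, num_heads, max_length, head_dim)

key_past = key_cache.flatten(0, 1)[:, :, :prefix_length]      # [batch * num_heads, head_dim, kv_length]
value_past = value_cache.flatten(0, 1)[:, :prefix_length, :]  # [batch * num_heads, kv_length, head_dim]

print(key_past.shape, value_past.shape)
# torch.Size([8, 8, 5]) torch.Size([8, 5, 8])
```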
Co-authored-by: Max Ryabinin <[email protected]>