Speed up loading blocks using init with meta weights #285
Merged
Conversation
borzunov reviewed Mar 12, 2023
borzunov reviewed Mar 12, 2023
borzunov (Collaborator)

Awesome results! The total block loading time (including the time to move it to GPU) went down from 57 sec to 25 sec on my machines. This means that, given that the blocks are already downloaded, the servers will spend 2x less time restarting after being preempted or during rebalancing.
Co-authored-by: Alexander Borzunov <[email protected]>
borzunov approved these changes Mar 12, 2023
justheuristic approved these changes Mar 12, 2023
borzunov added a commit that referenced this pull request Apr 25, 2023
- After #285, `load_pretrained_block()` uses `accelerate.utils.set_module_tensor_to_device()`
- In accelerate>=0.16.0, it saves the tensor in the dtype previously used by the model instead of the dtype of the weights (huggingface/accelerate#920)
- Because of that, blocks and attention caches used float32, which caused OOMs
- This PR makes `load_pretrained_block()` respect `torch_dtype` (default: `"auto"`, which means reading `torch_dtype` from `config.json`)
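The `torch_dtype="auto"` behavior described above can be sketched as follows. This is a minimal illustration, not Petals' actual code: `resolve_dtype` is a hypothetical helper, and the config dtype is assumed to be the string stored in `config.json` (e.g. `"float16"`).

```python
# Hypothetical sketch of resolving torch_dtype="auto" from a model config.
# resolve_dtype is an illustrative helper, not part of Petals or accelerate.
import torch

def resolve_dtype(torch_dtype, config_torch_dtype: str) -> torch.dtype:
    if torch_dtype == "auto":
        # "auto" means: use the dtype recorded in the model's config.json,
        # so weights keep their original precision instead of defaulting
        # to the module's current dtype (float32 in the OOM case above).
        return getattr(torch, config_torch_dtype)
    return torch_dtype

# An explicit dtype overrides the config; "auto" defers to it.
assert resolve_dtype("auto", "float16") is torch.float16
assert resolve_dtype(torch.bfloat16, "float16") is torch.bfloat16
```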
This PR changes the block initialization logic to use PyTorch meta tensors before assigning values from the state_dict. This avoids unnecessary memory allocation and default parameter initialization, which can take a lot of time (and RAM) for large models.
Before this PR: 1 block loaded in ~16 seconds
After: 1 block loaded in ~2 seconds
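The meta-tensor technique can be sketched in plain PyTorch. This is an illustrative example, not Petals' actual `load_pretrained_block()`: a small `nn.Linear` stands in for a transformer block, and `load_block_with_meta_init` is a hypothetical helper.

```python
# Hypothetical sketch of meta-tensor loading: build the module without
# allocating or initializing memory, then copy in pretrained weights.
import torch
import torch.nn as nn

def load_block_with_meta_init(state_dict: dict) -> nn.Module:
    # 1. Construct the module on the "meta" device: shapes and dtypes are
    #    recorded, but no storage is allocated and no random init runs.
    with torch.device("meta"):
        block = nn.Linear(4, 4)  # stands in for a large transformer block

    # 2. Materialize empty (uninitialized) storage on the target device.
    block = block.to_empty(device="cpu")

    # 3. Copy the pretrained weights in, skipping default init entirely.
    block.load_state_dict(state_dict)
    return block

# Usage: the weights come from a normally initialized reference module.
reference = nn.Linear(4, 4)
loaded = load_block_with_meta_init(reference.state_dict())
assert torch.equal(loaded.weight, reference.weight)
```

The speedup comes from step 1: a conventional constructor both allocates the full parameter tensors and runs their random initializers, and both costs are wasted when a state_dict immediately overwrites everything.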