[Don't merge][snapshot for XPU customer] Support accelerator abstraction in DeepSpeed #2471
Commits:
* … to get_accelerator() call
* … abstraction design
* don't gather partitioned activations for mp size 1 (deepspeedai#2454); add inline comment for the change (Co-authored-by: Olatunji Ruwase <[email protected]>)
* stage_1_and_2.py: no allreduce needed when mp size is 1 (deepspeedai#2494)
Collaborator:
Hi @delock - as a part of clearing through some PRs, it looks like this was a snapshot, but this PR hasn't been pushed to in almost a year. But since closing this won't modify the branch, I'm going to close this for now.
This is a snapshot of PR #2221. The purpose of this PR is to provide XPU customers with a stable DeepSpeed code base that includes accelerator abstraction. To get the latest accelerator abstraction code, which may still be under development, use #2221.

The following is a snapshot of the description of PR #2221:
This is a proposal to add device abstraction into DeepSpeed. Currently DeepSpeed has CUDA hard coded, which makes it work only for devices with a CUDA abstraction. In order to make more devices work with DeepSpeed, we need to make DeepSpeed depend not on CUDA but on a device abstraction layer that can support different device types. In this proposal, we support both CUDA devices and Intel GPU devices through the PyTorch XPU extension. In addition, we also support building SYCL kernels through SYCLOpBuilder for Intel GPU devices.
This proposal has the following design goals:
* Avoid `if...else...` branches as much as possible when a piece of code needs to support both CUDA devices and Intel GPU devices.

High level design of accelerator abstraction
The high level design and implementation of the accelerator abstraction is based on and extended from #2320:
* A `DeepSpeedAccelerator` abstract class defines all accelerator interfaces.
* A `DeepSpeedAccelerator` object can be actively or lazily initiated and can be used throughout DeepSpeed code and models to access accelerator functionalities. This object can be accessed through `get_accelerator()` and set with `set_accelerator()`.

DeepSpeedAccelerator abstract class
The DeepSpeedAccelerator abstract class defines the interface a concrete accelerator needs to implement. It has the following interface types:
* Device name interfaces, which return device names such as `'cuda'`, `'cuda:0'`, etc. The interface names in this category are `device_name()` and `current_device_name()`.
* Device runtime interfaces, which mirror `torch.cuda.<interface_name>`, such as `is_available()`, `synchronize()`, etc.
* Tensor operation interfaces: `pin_memory()` and `on_accelerator()`.
* Communication backend name, e.g. `'nccl'` for CUDA devices and `'ccl'` for XPU devices. The interface name in this category is `communication_backend_name()`.
* Op builder interface: `create_op_builder()`.
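As a rough illustration only (not the actual DeepSpeed source), a concrete accelerator would implement these interface categories roughly as sketched below. The class name `MyXPUAccelerator` is hypothetical, and the `torch.xpu` calls assume the Intel PyTorch XPU extension is installed:

```python
import torch

class MyXPUAccelerator:  # would derive from DeepSpeedAccelerator in practice
    # Device name interfaces
    def device_name(self, device_index=None):
        return 'xpu' if device_index is None else f'xpu:{device_index}'

    def current_device_name(self):
        return f'xpu:{torch.xpu.current_device()}'  # assumes the XPU extension

    # Device runtime interfaces mirroring torch.cuda.<interface_name>
    def is_available(self):
        return torch.xpu.is_available()

    def synchronize(self, device_index=None):
        torch.xpu.synchronize(device_index)

    # Tensor operation interfaces
    def pin_memory(self, tensor):
        return tensor.pin_memory()

    def on_accelerator(self, tensor):
        return tensor.device.type == 'xpu'

    # Communication backend
    def communication_backend_name(self):
        return 'ccl'

    # Op builder interface
    def create_op_builder(self, op_name):
        return None  # return an OpBuilder subclass instance if implemented
```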
Concrete accelerator selection

Concrete accelerator selection is done through `deepspeed.accelerator.real_accelerator`. There are two interfaces to set/get the concrete accelerator:
* `set_accelerator(accel_obj)` -- set the global accelerator to the given object. This interface can be used at the beginning of a model, before DeepSpeed initialization.
* `get_accelerator()` -- get the global accelerator. If the global accelerator has not been set, detect whether XPU or CUDA support is present in the system and set the global accelerator object accordingly; if no accelerator support is detected, return a CUDA accelerator object by default.

Implement concrete accelerator in external module
A concrete accelerator can be implemented in an external module. The implementation should provide an accelerator class that derives from DeepSpeedAccelerator; an example implementation can be found in cuda_accelerator.py. A model can import this external module, initiate an accelerator object, and use set_accelerator to set DeepSpeed to use this accelerator:
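The code block that originally accompanied this paragraph is not preserved here; the following is a minimal sketch of the intended usage, assuming `set_accelerator`/`get_accelerator` are exposed from `deepspeed.accelerator.real_accelerator` as described above, and using a hypothetical external module `my_xpu_accelerator`:

```python
from deepspeed.accelerator.real_accelerator import get_accelerator, set_accelerator
from my_xpu_accelerator import MyXPUAccelerator  # hypothetical external module

# Set the global accelerator before deepspeed.initialize() is called.
set_accelerator(MyXPUAccelerator())

# DeepSpeed and model code can now query the selected accelerator.
print(get_accelerator().device_name())
```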
Write accelerator specific code in DeepSpeed and model
Accelerator runtime
The accelerator abstraction provides a single entry point for accelerator specific features, which takes the following form:
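The snippet that originally followed is not preserved; schematically, every accelerator specific feature is reached as `get_accelerator().<interface name>(<arguments>)`. A short sketch using only interfaces described in this PR:

```python
from deepspeed.accelerator.real_accelerator import get_accelerator

# All accelerator specific features go through the single entry point.
if get_accelerator().is_available():
    get_accelerator().synchronize()
    print(get_accelerator().device_name(),
          get_accelerator().communication_backend_name())
```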
For an existing `torch.cuda.<interface name>` runtime call, we convert it like the following example:
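The original before/after snippet is not preserved (only the `-->` arrow survived); a representative conversion, using an interface named elsewhere in this description, would be:

```python
from deepspeed.accelerator.real_accelerator import get_accelerator

# Before (CUDA hard coded):
#   torch.cuda.synchronize()

# After (device agnostic):
get_accelerator().synchronize()
```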
For CUDA specific device names such as `'cuda'`, `'cuda:0'`, or `'cuda:1'`, we convert them to `get_accelerator().device_name()`, `get_accelerator().device_name(0)`, and `get_accelerator().device_name(1)`.

It is a little bit tricky when we convert places where `torch.cuda.current_device()` is called. The current device is returned as a device index, but if we supply a device index in PyTorch code where a device is needed, PyTorch will interpret it as a CUDA device. To get the current device in a form that can be used as a device name, we need to call `get_accelerator().current_device_name()`. Only when an integer number is expected do we use `get_accelerator().current_device()`.
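A short sketch of the two cases, using only the interfaces described above:

```python
import torch
from deepspeed.accelerator.real_accelerator import get_accelerator

# Where a device (name) is expected, use current_device_name():
buffer = torch.empty(1024, device=get_accelerator().current_device_name())

# Where an integer device index is expected, use current_device():
device_index = get_accelerator().current_device()
```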
Tensor operations
When we convert a torch tensor to the accelerator device, such as `my_tensor.cuda()`, we use `my_tensor.to(get_accelerator().device_name())`.

When we check whether a torch tensor is on the accelerator device, such as `my_tensor.is_cuda`, we use `get_accelerator().on_accelerator(my_tensor)`.

When we pin a tensor to GPU memory, such as `my_tensor.pin_memory()`, we use `get_accelerator().pin_memory(my_tensor)`.
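Putting the three tensor conversions together in one sketch:

```python
import torch
from deepspeed.accelerator.real_accelerator import get_accelerator

my_tensor = torch.randn(4)

# my_tensor.cuda() becomes:
my_tensor = my_tensor.to(get_accelerator().device_name())

# my_tensor.is_cuda becomes:
on_device = get_accelerator().on_accelerator(my_tensor)

# cpu_tensor.pin_memory() becomes:
pinned = get_accelerator().pin_memory(torch.randn(4))
```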
Communication backend
When a communication backend string is used, the interface `get_accelerator().communication_backend_name()` is used to get the communication backend name. So instead of `torch.distributed.init_process_group('nccl')`, we use `torch.distributed.init_process_group(get_accelerator().communication_backend_name())`.
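In context, a sketch of the converted call (process-group rendezvous settings such as the usual environment variables are assumed to be configured elsewhere):

```python
import torch.distributed as dist
from deepspeed.accelerator.real_accelerator import get_accelerator

# Instead of dist.init_process_group('nccl'):
dist.init_process_group(backend=get_accelerator().communication_backend_name())
```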
Op builder abstraction

Op builders are abstracted through `get_accelerator().create_op_builder(<op builder name>)`. If the op builder is implemented in the accelerator, an object of an `OpBuilder` subclass will be returned; if the op builder is not implemented, `None` will be returned. A typical implementation can be referred to from the CUDA implementation, or from an XPU implementation which will be released later. A typical call such as `CPUAdamBuilder().load()` can be converted to `get_accelerator().create_op_builder("CPUAdamBuilder").load()`.
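A guarded usage sketch based on the description above (the error handling is illustrative, not part of the PR):

```python
from deepspeed.accelerator.real_accelerator import get_accelerator

# create_op_builder() returns an OpBuilder subclass instance,
# or None if the accelerator does not implement this op builder.
builder = get_accelerator().create_op_builder("CPUAdamBuilder")
if builder is not None:
    cpu_adam_op = builder.load()  # JIT-build and load the op for this accelerator
else:
    raise RuntimeError("CPUAdamBuilder is not available for this accelerator")
```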