@delock delock commented Nov 3, 2022

This is a snapshot of PR #2221. The purpose of this PR is to provide XPU customers with a stable DeepSpeed code base that includes the accelerator abstraction. To get the latest accelerator abstraction code, which may still be under development, use #2221.

The following is a snapshot of the description of PR #2221.

This is a proposal to add a device abstraction into DeepSpeed. Currently DeepSpeed has CUDA hard coded, which makes it work only on devices with a CUDA abstraction. In order to make more devices work with DeepSpeed, we need DeepSpeed to depend not on CUDA but on a device abstraction layer that can support different device types. With this proposal, we can support both CUDA devices and Intel GPU devices through the PyTorch XPU extension. In addition, we also support building SYCL kernels through SYCLOpBuilder for Intel GPU devices.

This proposal has the following design goals:

  1. Make DeepSpeed work on both CUDA devices and Intel GPU devices.
  2. Be friendly to extension to other parties' accelerator devices.
  3. Have minimal impact on current DeepSpeed models. Current models still work with DeepSpeed on CUDA devices without modification. Models with CUDA hard coded will need modification to work on both CUDA devices and Intel GPUs.
  4. Use as few if...else... branches as possible when a piece of code needs to support both CUDA devices and Intel GPU devices.

High level design of accelerator abstraction

The high level design and implementation of the accelerator abstraction is based on and extends #2320:

  1. Use the DeepSpeedAccelerator abstract class to define all accelerator interfaces.
  2. A single global DeepSpeedAccelerator object can be initialized eagerly or lazily and can be used throughout DeepSpeed code and models to access accelerator functionality. This object is accessed through get_accelerator() and set with set_accelerator().
  3. Concrete accelerator implementations such as CUDA or XPU can live in an external module and be imported by DeepSpeed during initialization.

DeepSpeedAccelerator abstract class

The DeepSpeedAccelerator abstract class defines the interface a concrete accelerator needs to implement. It has the following interface categories (see the sketch after this list):

  1. Relates to the accelerator device name. This mainly covers usage such as 'cuda', 'cuda:0', etc. The interface names in this category are device_name() and current_device_name().
  2. Relates to the accelerator runtime. This mainly mirrors torch.cuda.<interface_name> calls such as is_available(), synchronize(), etc.
  3. Relates to tensor operations. This mainly covers tensor operations that rely on the device type. The interface names in this category are pin_memory() and on_accelerator().
  4. Relates to the communication backend. This is used to select the accelerator-specific communication backend, such as 'nccl' for CUDA devices and 'ccl' for XPU devices. The interface name in this category is communication_backend_name().
  5. Relates to op builders. This is used to select op builders for building accelerator kernels. The interface name in this category is create_op_builder().
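As a concrete reference, here is a minimal sketch of what the abstract class could look like, assuming Python's abc module; the method names come from the five categories above, while the signatures and parameters are illustrative assumptions rather than the exact API:

import abc

class DeepSpeedAccelerator(abc.ABC):
    # 1. Device name, e.g. 'cuda' or 'xpu' (signatures are assumptions)
    @abc.abstractmethod
    def device_name(self, device_index=None):
        ...

    @abc.abstractmethod
    def current_device_name(self):
        ...

    # 2. Runtime interfaces mirroring torch.cuda.<interface_name>
    @abc.abstractmethod
    def is_available(self):
        ...

    @abc.abstractmethod
    def synchronize(self, device_index=None):
        ...

    # 3. Tensor operations that rely on the device type
    @abc.abstractmethod
    def pin_memory(self, tensor):
        ...

    @abc.abstractmethod
    def on_accelerator(self, tensor):
        ...

    # 4. Communication backend: 'nccl' for CUDA, 'ccl' for XPU
    @abc.abstractmethod
    def communication_backend_name(self):
        ...

    # 5. Op builder selection; returns None if the op is not implemented
    @abc.abstractmethod
    def create_op_builder(self, op_builder_name):
        ...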

Concrete accelerator selection

Concrete accelerator selection is done through deepspeed.accelerator.real_accelerator. There are two interfaces to set/get the concrete accelerator:
set_accelerator(accel_obj) -- set the global accelerator to the given object; this interface can be used at the beginning of a model, before DeepSpeed initialization.
get_accelerator() -- get the global accelerator. If the global accelerator has not been set, detect whether XPU or CUDA support is present in the system and set the global accelerator object accordingly; if no accelerator support is detected, return a CUDA accelerator object by default. A minimal usage sketch is shown below.
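A minimal usage sketch of the lazy path (the printed device name is illustrative):

from deepspeed.accelerator import get_accelerator

# First call detects XPU or CUDA support and caches a global
# accelerator object; falls back to CUDA if nothing is detected.
accel = get_accelerator()
print(accel.device_name())  # e.g. 'cuda' or 'xpu'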

Implement concrete accelerator in external module

A concrete accelerator can be implemented in an external module. The implementation should provide an accelerator class that derives from DeepSpeedAccelerator; an example implementation can be found in cuda_accelerator.py. A model can import this external module, instantiate an accelerator object, and use set_accelerator() to make DeepSpeed use this accelerator:

from <external-module-for-accelerator> import <EXTERNAL>_Accelerator
accel = <EXTERNAL>_Accelerator()
from deepspeed.accelerator import set_accelerator
set_accelerator(accel)

Write accelerator-specific code in DeepSpeed and models

Accelerator runtime

The accelerator abstraction provides a single entry point for accelerator-specific features, which takes the form:

from deepspeed.accelerator import get_accelerator

get_accelerator().<interface name>(...)

For an existing torch.cuda.<interface name> runtime call, we convert it as in the following example:

if torch.cuda.is_available():
    ...

-->

if get_accelerator().is_available():
    ...

For CUDA-specific device names such as 'cuda', 'cuda:0', or 'cuda:1', we convert them to get_accelerator().device_name(), get_accelerator().device_name(0), and get_accelerator().device_name(1), respectively.
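For example (the tensor creation itself is illustrative):

import torch
from deepspeed.accelerator import get_accelerator

# before: t = torch.empty(3, 4, device='cuda:0')
t = torch.empty(3, 4, device=get_accelerator().device_name(0))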

It is a little bit tricky when we convert places where torch.cuda.current_device() is called. current_device() returns a device index, but if we supply a device index in PyTorch code where a device is expected, PyTorch will interpret it as a CUDA device. To get the current device in a form that can be used as a device name, we need to call get_accelerator().current_device_name():

my_tensor = torch.empty(3, 4, device=get_accelerator().current_device_name())

Only when an integer index is expected do we use get_accelerator().current_device():

idx = get_accelerator().current_device()
default_generator = get_accelerator().default_generator(idx)

Tensor operations

When we convert a torch tensor to the accelerator device, such as my_tensor.cuda(), we use my_tensor.to(get_accelerator().device_name()).

When we check whether a torch tensor is on the accelerator device, such as my_tensor.is_cuda, we use get_accelerator().on_accelerator(my_tensor).

When pinning a tensor to GPU memory, such as my_tensor.pin_memory(), we use get_accelerator().pin_memory(my_tensor). The three conversions are sketched together below.
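An illustrative sketch of the three conversions (the tensors themselves are placeholders):

import torch
from deepspeed.accelerator import get_accelerator

my_tensor = torch.empty(3, 4)

# my_tensor = my_tensor.cuda() becomes:
my_tensor = my_tensor.to(get_accelerator().device_name())

# my_tensor.is_cuda becomes:
on_device = get_accelerator().on_accelerator(my_tensor)

# cpu_tensor.pin_memory() becomes:
pinned = get_accelerator().pin_memory(torch.empty(3, 4))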

Communication backend

When a communication backend string is needed, the interface get_accelerator().communication_backend_name() is used to get the communication backend name. So instead of torch.distributed.init_process_group('nccl'), we use torch.distributed.init_process_group(get_accelerator().communication_backend_name()).
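A sketch of the converted call (the usual process-group environment, such as rank and world size, is assumed to be configured elsewhere):

import torch.distributed as dist
from deepspeed.accelerator import get_accelerator

# 'nccl' on CUDA devices, 'ccl' on XPU devices
dist.init_process_group(backend=get_accelerator().communication_backend_name())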

Op builder abstraction

Op builders are abstracted through get_accelerator().create_op_builder(<op builder name>). If the op builder is implemented in the accelerator, an object of an OpBuilder subclass will be returned; if the op builder is not implemented, None will be returned.

A typical implementation can be found in the CUDA implementation, or in an XPU implementation which will be released later. A typical call such as CPUAdamBuilder().load() can be converted to get_accelerator().create_op_builder("CPUAdamBuilder").load().
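A sketch of the converted call, including the None check for accelerators that do not implement the op:

from deepspeed.accelerator import get_accelerator

# before: ds_opt_adam = CPUAdamBuilder().load()
builder = get_accelerator().create_op_builder("CPUAdamBuilder")
if builder is None:
    raise RuntimeError("CPUAdam op is not implemented for this accelerator")
ds_opt_adam = builder.load()  # builds/loads the kernel module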

@loadams loadams commented Aug 18, 2023

Hi @delock - as a part of clearing through some PRs, it looks like this was a snapshot, but this PR hasn't been pushed to in almost a year. But since closing this won't modify the branch, I'm going to close this for now.

@loadams loadams closed this Aug 18, 2023