Creates MultiDevice #2819

Merged
merged 1 commit into deepjavalibrary:master from multiDevice on Oct 25, 2023
Conversation

@zachgk (Contributor) commented Oct 24, 2023

This creates an abstraction for combining devices into a single device. The main use case for now is TP_parallel in DJL Serving: it will allow us to create a WorkerGroup and a PyPredictor for a set of devices and then track the usage of devices properly. It could also be used later for multi-GPU training or other multi-device cases.

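As a rough illustration of the idea (the nested MultiDevice constructor shown here is an assumption based on this discussion, not necessarily the exact merged API), combining two GPUs could look like:

    import ai.djl.Device;

    public class MultiDeviceExample {
        public static void main(String[] args) {
            // Two physical GPUs, using the existing Device.gpu(int) factory method.
            Device gpu0 = Device.gpu(0);
            Device gpu1 = Device.gpu(1);

            // Assumed constructor: combine them into one logical device that a
            // WorkerGroup or PyPredictor could be bound to while both GPUs are tracked.
            Device combined = new Device.MultiDevice(gpu0, gpu1);
            System.out.println(combined);
        }
    }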
@zachgk requested review from frankfliu and a team as code owners on October 24, 2023 22:40
@codecov-commenter commented Oct 24, 2023

Codecov Report

Attention: 1365 lines in your changes are missing coverage. Please review.

Comparison is base (bb5073f) 72.08% compared to head (6ef8895) 72.29%.
Report is 899 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2819      +/-   ##
============================================
+ Coverage     72.08%   72.29%   +0.20%     
- Complexity     5126     7145    +2019     
============================================
  Files           473      707     +234     
  Lines         21970    31849    +9879     
  Branches       2351     3305     +954     
============================================
+ Hits          15838    23026    +7188     
- Misses         4925     7249    +2324     
- Partials       1207     1574     +367     
Files Coverage Δ
...ava/ai/djl/inference/streaming/StreamingBlock.java 100.00% <100.00%> (ø)
api/src/main/java/ai/djl/metric/Dimension.java 100.00% <100.00%> (ø)
api/src/main/java/ai/djl/metric/Unit.java 100.00% <100.00%> (ø)
api/src/main/java/ai/djl/modality/audio/Audio.java 100.00% <100.00%> (ø)
api/src/main/java/ai/djl/modality/cv/Image.java 69.23% <ø> (-4.11%) ⬇️
...rc/main/java/ai/djl/modality/cv/MultiBoxPrior.java 76.00% <ø> (ø)
...ava/ai/djl/modality/cv/output/DetectedObjects.java 96.29% <100.00%> (+1.29%) ⬆️
...rc/main/java/ai/djl/modality/cv/output/Joints.java 71.42% <100.00%> (ø)
.../main/java/ai/djl/modality/cv/output/Landmark.java 100.00% <ø> (ø)
...i/djl/modality/cv/transform/RandomResizedCrop.java 94.11% <100.00%> (+5.22%) ⬆️
... and 225 more

... and 378 files with indirect coverage changes


@zachgk merged commit 185981b into deepjavalibrary:master on Oct 25, 2023
5 checks passed
@zachgk deleted the multiDevice branch on October 25, 2023 20:53
@@ -101,6 +106,13 @@ public static Device fromName(String deviceName, Engine engine) {
            return engine.defaultDevice();
        }

        if (deviceName.contains("+")) {
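A hedged sketch of what such a branch might do (illustrative only, not the merged implementation; fromCombinedName is a hypothetical helper, the MultiDevice constructor shape is assumed, and ai.djl.Device / ai.djl.engine.Engine imports are implied):

    // Sketch only: resolve a combined name such as "gpu0+gpu1" by splitting on '+',
    // resolving each part with the existing fromName, and wrapping the results.
    static Device fromCombinedName(String deviceName, Engine engine) {
        String[] parts = deviceName.split("\\+");
        Device[] devices = new Device[parts.length];
        for (int i = 0; i < parts.length; i++) {
            devices[i] = Device.fromName(parts[i], engine);
        }
        return new Device.MultiDevice(devices); // constructor shape assumed
    }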
A reviewer (Contributor) commented:

We need to think of the following use cases:

  1. A specific device id (the existing Device implementation)
  2. A contiguous range of devices: GPU[1-3]
  3. An arbitrary device list: GPU1;GPU3
  4. A given number of devices at any free device ids, held exclusively: GPU{2}
  5. All available devices, held exclusively: GPU+
  6. All devices, sharable: GPU*

@zachgk (Contributor Author) replied:

We actually have two device naming systems. One is the base system used in DJL Device.fromName(); the other is the system used in Serving getLoadOnDevices(). For example, * exists in Serving but not in DJL. The main idea seems to be that the names in DJL are absolute descriptions of a device, while the ones in Serving also include relative ones. In that case, from your list DJL would cover 1, 2, 3, and 5, and Serving would cover 4 and 6.

First, I want to talk about the structure of Device. Here, I changed it to represent anything "Device-like": real, virtual, a combination of devices, or parts of devices. Device is now open for interpretation. I think this works very well because it opens possibilities throughout the API, even if many of them would not be supported for now. It helps a lot with multi-device usage, tensor parallel, device sharing, and distributed training. I would support a clearer recognition of physical devices, though. Would it help to either add a function device.isPhysicalDevice() or a class PhysicalDevice extends Device?
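For illustration, the two options might look roughly like this (assumed shapes, not code from this PR):

    // Option A: a query method added to Device; MultiDevice would override it to return false.
    public boolean isPhysicalDevice() {
        return true;
    }

    // Option B: a marker subclass instead (assumes Device exposes a constructor to subclasses).
    public class PhysicalDevice extends Device {
        public PhysicalDevice(String deviceType, int deviceId) {
            super(deviceType, deviceId);
        }
    }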

Also, for your list, you need to deal with both levels of device lists when considering tensor parallel. That is, you need something equivalent to "gpu0+gpu1;gpu2+gpu3", which is two workers of TP 2. I could also see {gpu0;gpu1};{gpu2;gpu3}. We also don't want to use "," because it is used elsewhere. Then, would we want ranges like gpu[0-3/2], which would allow for TP? Also, we could still use a "+" without anything else, similarly to how we are using "*"; both of these infer the device.
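To make the two-level form concrete, a hypothetical helper (parseWorkerGroups is not part of DJL or DJL Serving) could split on ";" for workers and let fromName handle "+" within each group:

    import java.util.ArrayList;
    import java.util.List;

    import ai.djl.Device;
    import ai.djl.engine.Engine;

    final class DeviceSpecSketch {

        // Hypothetical: "gpu0+gpu1;gpu2+gpu3" -> two workers, each combining two GPUs (TP 2).
        static List<Device> parseWorkerGroups(String spec, Engine engine) {
            List<Device> workers = new ArrayList<>();
            for (String group : spec.split(";")) {
                workers.add(Device.fromName(group, engine));
            }
            return workers;
        }

        private DeviceSpecSketch() {}
    }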

zachgk added a commit to zachgk/djl that referenced this pull request Oct 26, 2023
This improves upon the creation of MultiDevice in deepjavalibrary#2819 by moving the getDevices
function to the main Device class. This can simplify the usage of something
which is potentially a MultiDevice and make it easier to check for the presence
of a MultiDevice.
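A rough sketch of what that follow-up describes (return type, field, and override are assumptions, not the actual commit):

    // On the base Device class: a plain device is its own only member.
    public List<Device> getDevices() {
        return Collections.singletonList(this); // java.util.Collections / java.util.List
    }

    // On MultiDevice: return the combined devices, so callers can iterate uniformly
    // and detect a MultiDevice without scattering instanceof checks.
    @Override
    public List<Device> getDevices() {
        return devices; // assumed field holding the underlying devices
    }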
zachgk added a commit that referenced this pull request Jan 9, 2024
This improves upon the creation of MultiDevice in #2819 by moving the getDevices
function to the main Device class. This can simplify the usage of something
which is potentially a MultiDevice and make it easier to check for the presence
of a MultiDevice.
frankfliu pushed a commit that referenced this pull request Apr 26, 2024
This creates an abstraction for combining devices into a single device. The main
use case for now is in DJL Serving TP_parallel. It will allow us to create a
WorkerGroup and a PyPredictor for a set of devices and then track the usage of
devices properly. It could also be used later for multi-gpu training or other
multi-device cases.
frankfliu pushed a commit that referenced this pull request Apr 26, 2024
This improves upon the creation of MultiDevice in #2819 by moving the getDevices
function to the main Device class. This can simplify the usage of something
which is potentially a MultiDevice and make it easier to check for the presence
of a MultiDevice.