Creates MultiDevice #2819
Conversation
This creates an abstraction for combining devices into a single device. The main use case for now is tensor parallelism (TP_parallel) in DJL Serving. It will allow us to create a WorkerGroup and a PyPredictor for a set of devices and then track device usage properly. It could also be used later for multi-GPU training or other multi-device cases.
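As a rough sketch (not the PR's actual API; the `MultiDevice` constructor and accessor names below are assumptions), combining devices and iterating over them might look like this:

```java
import ai.djl.Device;

public class MultiDeviceUsageSketch {

    public static void main(String[] args) {
        // Hypothetical usage: wrap two GPUs into one Device-like handle so a
        // WorkerGroup or PyPredictor can be created against the pair while
        // per-device usage remains trackable.
        Device combined = new Device.MultiDevice(Device.gpu(0), Device.gpu(1));

        // Iterate the underlying devices (accessor name is an assumption).
        for (Device device : combined.getDevices()) {
            System.out.println(device);
        }
    }
}
```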
Codecov Report

@@              Coverage Diff               @@
##              master     #2819      +/-  ##
==============================================
+ Coverage      72.08%    72.29%    +0.20%
- Complexity      5126      7145     +2019
==============================================
  Files            473       707      +234
  Lines          21970     31849     +9879
  Branches        2351      3305      +954
==============================================
+ Hits           15838     23026     +7188
- Misses          4925      7249     +2324
- Partials        1207      1574      +367

☔ View full report in Codecov by Sentry.
@@ -101,6 +106,13 @@ public static Device fromName(String deviceName, Engine engine) {
            return engine.defaultDevice();
        }

        if (deviceName.contains("+")) {
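For context, one plausible shape of that new branch (a sketch under the assumption that `+` joins sub-device names and that a `MultiDevice` can wrap the resolved parts; the real diff may differ):

```java
// Sketch only, not the PR's exact code; continues the fromName(...) method shown above.
if (deviceName.contains("+")) {
    String[] parts = deviceName.split("\\+");
    Device[] subDevices = new Device[parts.length];
    for (int i = 0; i < parts.length; i++) {
        // Resolve each sub-name with the existing single-device logic.
        subDevices[i] = fromName(parts[i], engine);
    }
    return new Device.MultiDevice(subDevices);
}
```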
We need to think of the following use cases (a runnable sketch of expanding the range form follows this list):
- A specific device id (the existing Device implementation)
- A continuous range of devices: GPU[1-3]
- An arbitrary device list: GPU1;GPU3
- A number of devices at any free device ids, exclusively: GPU{2}
- All available devices, exclusively: GPU+
- All devices, sharable: GPU*
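As referenced above, a minimal runnable sketch of how just the range form might be expanded (the syntax is only a proposal at this point, and the regex and helper are hypothetical):

```java
import ai.djl.Device;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: expand a range form such as "GPU[1-3]" into concrete devices. */
public final class DeviceRangeSketch {

    private static final Pattern RANGE =
            Pattern.compile("gpu\\[(\\d+)-(\\d+)]", Pattern.CASE_INSENSITIVE);

    static List<Device> expandRange(String name) {
        Matcher m = RANGE.matcher(name);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a range form: " + name);
        }
        int start = Integer.parseInt(m.group(1));
        int end = Integer.parseInt(m.group(2));
        List<Device> devices = new ArrayList<>();
        for (int i = start; i <= end; i++) {
            devices.add(Device.gpu(i));
        }
        return devices;
    }

    public static void main(String[] args) {
        // Prints the three expanded GPU devices.
        System.out.println(expandRange("GPU[1-3]"));
    }
}
```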
We actually have two device naming systems. One is the base system used in DJL `Device.fromName()`; the other is the system used in Serving `getLoadOnDevices()`. For example, `*` exists in Serving but not in DJL. The main idea seems to be that everything in DJL is an absolute description of a device, while Serving also contains relative ones. In that case, with your list, DJL would cover items 1, 2, 3, and 5, and Serving would cover 4 and 6.

First, I want to talk about the structure of Device. Here, I changed it to represent anything "Device-like": real, virtual, a combination of devices, or parts of devices. Device is now open for interpretation. I think this works very well in how it opens possibilities throughout the API, even if many would not be supported for now. It helps a lot with multi-device usage, tensor parallel, device sharing, and distributed training. I would support a clearer recognition of physical devices, though. Would it help to add either a function `device.isPhysicalDevice()` or a class `PhysicalDevice extends Device`?

Also, for your list, you need to deal with both levels of device lists when considering tensor parallel. That is, you need something equivalent to "gpu0+gpu1;gpu2+gpu3", which is two workers of TP 2. I could also see `{gpu0;gpu1};{gpu2;gpu3}`. We also don't want to use `,` because it is used elsewhere. Then, would we want ranges like `gpu[0-3/2]` which would allow for TP? Also, we could still use a `+` without anything else, similarly to how we are using `*`; both of these infer the device.
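A minimal sketch of parsing that two-level form, assuming `;` separates workers and `+` combines devices within a worker (the grouping syntax is still under discussion; `parseWorkerGroups` is hypothetical):

```java
import ai.djl.Device;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of the two-level syntax discussed above:
 * ';' separates workers, '+' combines the devices inside one worker,
 * e.g. "gpu0+gpu1;gpu2+gpu3" -> two workers of tensor-parallel degree 2.
 * None of this is the actual PR code.
 */
public final class WorkerDeviceSketch {

    static List<List<Device>> parseWorkerGroups(String spec) {
        List<List<Device>> groups = new ArrayList<>();
        for (String worker : spec.split(";")) {
            List<Device> devices = new ArrayList<>();
            for (String name : worker.split("\\+")) {
                // Resolve each single-device name with the existing DJL parser.
                devices.add(Device.fromName(name));
            }
            groups.add(devices);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Two workers, each spanning two GPUs.
        System.out.println(parseWorkerGroups("gpu0+gpu1;gpu2+gpu3"));
    }
}
```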
This improves upon the creation of MultiDevice in #2819 by moving the getDevices function to the main Device class. This can simplify the usage of something which is potentially a MultiDevice and make it easier to check for the presence of a MultiDevice.
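A minimal sketch of what that follow-up enables, assuming `getDevices()` on the base `Device` returns just the device itself while a `MultiDevice` returns its members (exact signatures are assumptions):

```java
import ai.djl.Device;

public class GetDevicesSketch {

    public static void main(String[] args) {
        Device single = Device.gpu(0);
        Device multi = new Device.MultiDevice(Device.gpu(0), Device.gpu(1));

        // No instanceof check needed: a plain Device yields itself,
        // a MultiDevice yields its members.
        for (Device d : single.getDevices()) {
            System.out.println("single -> " + d);
        }
        for (Device d : multi.getDevices()) {
            System.out.println("multi  -> " + d);
        }
    }
}
```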