Enable external CUDA allocator in ORTModule. #6745
Conversation
    # CPP extension to get torch CUDA allocator's alloc and free function addresses
    self._use_external_cuda_allocator = True
    if self._use_external_cuda_allocator:
Is it allowed to do something like

    model1 = ORTModule(model1)
    model2 = ORTModule(model2)

in the same process (Python interpreter)? Just checking that `load_inline` or `self._torch_cuda_allocator.cuda_caching_allocator_raw_delete_address()` can handle this.
I haven't personally tried it, but I don't see why it wouldn't work. It will just recompile and recreate the binary file.
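For reference, a minimal sketch of what such an inline extension could look like (the extension name and the alloc-side function name are assumptions; only `cuda_caching_allocator_raw_delete_address` and the use of `load_inline` appear in the discussion above):

```python
# Sketch only: expose the addresses of PyTorch's CUDA caching allocator
# functions so they can be handed to ONNX Runtime as provider options.
from torch.utils.cpp_extension import load_inline

_cpp_source = """
#include <cstddef>
#include <c10/cuda/CUDACachingAllocator.h>

size_t cuda_caching_allocator_raw_alloc_address() {
    // Address of c10::cuda::CUDACachingAllocator::raw_alloc (cudaMalloc replacement).
    return reinterpret_cast<size_t>(&c10::cuda::CUDACachingAllocator::raw_alloc);
}

size_t cuda_caching_allocator_raw_delete_address() {
    // Address of c10::cuda::CUDACachingAllocator::raw_delete (cudaFree replacement).
    return reinterpret_cast<size_t>(&c10::cuda::CUDACachingAllocator::raw_delete);
}
"""

# load_inline compiles and caches the extension; building a second ORTModule in
# the same process would simply recompile/reuse the binary, as noted above.
_torch_cuda_allocator = load_inline(
    name="torch_cuda_allocator_addresses",  # hypothetical extension name
    cpp_sources=[_cpp_source],
    functions=[
        "cuda_caching_allocator_raw_alloc_address",
        "cuda_caching_allocator_raw_delete_address",
    ],
)

torch_alloc = _torch_cuda_allocator.cuda_caching_allocator_raw_alloc_address()
torch_free = _torch_cuda_allocator.cuda_caching_allocator_raw_delete_address()
```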
    providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    provider_options = [{"device_id": str(self._device.index)}, {}]
    if self._use_external_cuda_allocator:
        provider_options = [{"device_id": str(self._device.index), "cuda_external_alloc": str(self._torch_alloc), "cuda_external_free": str(self._torch_free)}, {}]
Suggested change:

    - provider_options = [{"device_id": str(self._device.index), "cuda_external_alloc": str(self._torch_alloc), "cuda_external_free": str(self._torch_free)}, {}]
    + provider_options = [{"device_id": str(_utils.get_device_index(self._device)), "cuda_external_alloc": str(self._torch_alloc), "cuda_external_free": str(self._torch_free)}, {}]
Thanks, @thiagocrepaldi. This seems like a good suggestion, but standard software engineering practice would be to make this change in a separate PR, since I'm not touching this particular piece of code (i.e. `"device_id": str(self._device.index)`). Your suggestion is a cleanup, and I do not want to mix it with enabling an external allocator.
    if self._use_external_cuda_allocator:
        provider_options = [{"device_id": str(self._device.index), "cuda_external_alloc": str(self._torch_alloc), "cuda_external_free": str(self._torch_free)}, {}]
    else:
        provider_options = [{"device_id": str(self._device.index)}]
Suggested change:

    - provider_options = [{"device_id": str(self._device.index)}]
    + provider_options = [{"device_id": str(_utils.get_device_index(self._device))}]
Same as above.
Thanks, @SherlockNoMad and @thiagocrepaldi, for the review. Thanks, @baijumeswani, for answering the questions regarding the torch no_grad memory test.
Enables an external CUDA allocator (i.e. the PyTorch CUDA caching allocator) and also removes all references to torch.cuda.empty_cache(): we should not be clearing the cache, since doing so can hurt throughput, and the PyTorch allocator already releases memory on an as-needed basis.
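A rough sketch of how the allocator addresses end up in the CUDA execution provider's options when a session is created (the model path, device index, and the `_torch_cuda_allocator` handle from the sketch above are placeholders/assumptions, not the exact ORTModule wiring):

```python
import onnxruntime

# Addresses obtained from the inline extension sketched earlier; ORT will call
# these instead of cudaMalloc/cudaFree, so both frameworks share one caching
# allocator and torch.cuda.empty_cache() is no longer needed to hand memory back.
torch_alloc = _torch_cuda_allocator.cuda_caching_allocator_raw_alloc_address()
torch_free = _torch_cuda_allocator.cuda_caching_allocator_raw_delete_address()

providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
provider_options = [
    {
        "device_id": "0",                         # placeholder device index
        "cuda_external_alloc": str(torch_alloc),  # address of raw_alloc
        "cuda_external_free": str(torch_free),    # address of raw_delete
    },
    {},  # no extra options for the CPU provider
]

session = onnxruntime.InferenceSession(
    "model.onnx",                                 # placeholder model path
    providers=providers,
    provider_options=provider_options,
)
```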