
gpu: nvidia: amd: Get native context through device #1765

Merged
merged 14 commits into from
Jan 26, 2024

Conversation

hdelan
Contributor

@hdelan hdelan commented Dec 5, 2023

Description

Since UR now supports multi-device contexts in the HIP adapter (oneapi-src/unified-runtime#999), with the same work soon to follow in the CUDA adapter, it is no longer sensible to use the get_native method for contexts.

The mapping of native contexts to a sycl::context is now many-to-one, so when get_native_context is called the plugin no longer knows which native context to return. The multi-device context UR PR makes the get-native query return the native context of the first device in the context (see https://github.com/oneapi-src/unified-runtime/pull/999/files#diff-259fb15eb14976a3bc1939b9bb8197f51d129a111309343bb84677a655758b54R125), but this will break on multi-GPU systems.

This change instead uses the sycl::device to get the native context, since there is a one-to-one mapping of sycl::device to native contexts.

This change means that old versions of oneDNN will be compatible with newer plugins only if a single-GPU context is used. For a multi-GPU system to work with the new plugin, a oneDNN version that includes this patch will be necessary.
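To make the ambiguity concrete, here is a toy model (plain Python, not the actual SYCL/UR/oneDNN API; all class and method names are invented for illustration) of why a get-native query on a multi-device context is ill-defined, while the same query on a device is not:

```python
class NativeContext:
    """Stand-in for a backend-native context (e.g. a CUDA primary context)."""
    def __init__(self, device_id):
        self.device_id = device_id

class Device:
    """Each device owns exactly one native context: a one-to-one mapping."""
    def __init__(self, device_id):
        self.id = device_id
        self._native = NativeContext(device_id)

    def get_native_context(self):
        return self._native  # always well defined

class Context:
    """A SYCL-style context that may span several devices."""
    def __init__(self, devices):
        self.devices = devices

    def get_native_context(self):
        # Many native contexts map to this one context; which should we
        # return? Falling back to the first device (as the UR change does)
        # is wrong whenever work actually runs on another device.
        return self.devices[0].get_native_context()

gpus = [Device(0), Device(1)]
ctx = Context(gpus)

# The context-based query silently returns device 0's native context ...
assert ctx.get_native_context().device_id == 0
# ... even if the caller cares about device 1. The device-based query
# used by this patch is unambiguous:
assert gpus[1].get_native_context().device_id == 1
```

In the toy model, as in the patch, routing the query through the device removes the ambiguity entirely rather than papering over it with a first-device default.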

Test results to follow

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?

No. The following tests fail on the master branch as well as on my PR branch:

On Nvidia A100:

$ ninja test
...
The following tests FAILED:
4 - gpu-cnn-training-bf16-cpp (Failed)
11 - gpu-primitives-batch-normalization-cpp (Failed)
28 - test_batch_normalization_gpu (Failed)
29 - test_batch_normalization_buffer_gpu (Failed)
30 - test_binary_gpu (SEGFAULT)
31 - test_binary_buffer_gpu (SEGFAULT)
40 - test_convolution_eltwise_forward_f32_gpu (Failed)
41 - test_convolution_eltwise_forward_f32_buffer_gpu (Failed)
59 - test_iface_attr_quantization_gpu (Failed)
60 - test_iface_attr_quantization_buffer_gpu (Failed)

All HIP tests pass with ninja test on gfx1031.

@mgouicem mgouicem requested a review from densamoilov December 6, 2023 08:21
@hdelan hdelan changed the title [draft] Get native context through device [HIP][CUDA] Get native context through device Dec 6, 2023
@hdelan hdelan changed the title [HIP][CUDA] Get native context through device gpu: nvidia: amd: Get native context through device Dec 6, 2023
@vpirogov
Member

vpirogov commented Jan 9, 2024

@hdelan, could you please address questions and feedback on this PR?

@hdelan
Contributor Author

hdelan commented Jan 10, 2024

@vpirogov thanks for the ping. Changes made.

@hdelan hdelan force-pushed the deprecated-get-native-context branch from 58a625d to 0c643e9 Compare January 10, 2024 11:52
@hdelan hdelan force-pushed the deprecated-get-native-context branch from a0fa401 to aaced73 Compare January 12, 2024 11:30
@vpirogov
Member

Thanks, @hdelan! @densamoilov is currently out; we'll have this promoted after the long weekend in the US.

@vpirogov vpirogov added this to the v3.4 milestone Jan 12, 2024
A lot of code removed that was necessary for primary contexts
All native contexts in DPC++ are primary contexts for CUDA
and HIP, so we don't need a lot of the checking in oneDNN
any more.
@hdelan hdelan force-pushed the deprecated-get-native-context branch from c0969cd to 460d15f Compare January 22, 2024 17:36
@hdelan hdelan force-pushed the deprecated-get-native-context branch from 460d15f to 824ea18 Compare January 22, 2024 18:22
@densamoilov
Contributor

densamoilov commented Jan 23, 2024

@hdelan, please let me know once the PR is ready for review.

@hdelan
Contributor Author

hdelan commented Jan 23, 2024

@densamoilov I am running tests but am seeing a lot of failures on AMD on the master branch of oneDNN 443600a30fc3427d6ad1522bceefb52100180a17 with DPC++ build e0f74157c87a2a6eb3438936eaad861e501cc418. Tests are failing with PI_ERROR_INVALID_MEM_OBJECT from piextUSMFree, along with other errors.

As a result it is hard to test the correctness of these changes. Can you recommend a particular setup where tests pass on the master branch of oneDNN, i.e. a certain DPC++ commit or release version, so that I can test just the contents of this PR?

@hdelan
Contributor Author

hdelan commented Jan 24, 2024

@densamoilov, using release version 2024.0.2 I see the same failures with this patch as on the master branch.

AMD MI210 fail list (same for the master branch and my patch):

The following tests FAILED:
	 17 - test_concat_gpu (Failed)
	 18 - test_concat_buffer_gpu (Failed)
	 35 - test_cross_engine_reorder_buffer (Failed)
	 38 - test_eltwise_gpu (Failed)
	 39 - test_eltwise_buffer_gpu (Failed)
	 44 - test_iface_attr_quantization_gpu (Failed)
	 45 - test_iface_attr_quantization_buffer_gpu (Failed)
	 50 - test_iface_pd_gpu (Failed)
	 51 - test_iface_pd_buffer_gpu (Failed)
	 52 - test_iface_pd_iter_gpu (Failed)
	 53 - test_iface_pd_iter_buffer_gpu (Failed)
	 56 - test_iface_runtime_dims_gpu (Failed)
	 57 - test_iface_runtime_dims_buffer_gpu (Failed)
	 62 - test_inner_product_backward_data_gpu (Failed)
	 63 - test_inner_product_backward_data_buffer_gpu (Failed)
	 64 - test_inner_product_backward_weights_gpu (Failed)
	 65 - test_inner_product_backward_weights_buffer_gpu (Failed)
	 66 - test_inner_product_forward_gpu (Failed)
	 67 - test_inner_product_forward_buffer_gpu (Failed)
	 68 - test_layer_normalization_gpu (Failed)
	 69 - test_layer_normalization_buffer_gpu (Failed)
	 72 - test_matmul_gpu (Failed)
	 73 - test_matmul_buffer_gpu (Failed)
	 80 - test_prelu_gpu (Failed)
	 81 - test_prelu_buffer_gpu (Failed)
	 82 - test_primitive_cache_mt_gpu (Failed)
	 83 - test_primitive_cache_mt_buffer_gpu (Failed)
	 86 - test_reorder_gpu (Failed)
	 87 - test_reorder_buffer_gpu (Failed)
	 88 - test_resampling_gpu (Failed)
	 89 - test_resampling_buffer_gpu (Failed)
	 92 - test_shuffle_gpu (Failed)
	 93 - test_shuffle_buffer_gpu (Failed)
	 96 - test_sum_gpu (Failed)
	 97 - test_sum_buffer_gpu (Failed)

NVIDIA A100 fail list (same for the master branch and my patch):

	  4 - gpu-cnn-training-bf16-cpp (Failed)
	 11 - gpu-primitives-batch-normalization-cpp (Failed)
	 28 - test_batch_normalization_gpu (Failed)
	 29 - test_batch_normalization_buffer_gpu (Failed)
	 30 - test_binary_gpu (SEGFAULT)
	 31 - test_binary_buffer_gpu (SEGFAULT)
	 40 - test_convolution_eltwise_forward_f32_gpu (Failed)
	 41 - test_convolution_eltwise_forward_f32_buffer_gpu (Failed)
	 59 - test_iface_attr_quantization_gpu (Failed)
	 60 - test_iface_attr_quantization_buffer_gpu (Failed)

So I suppose this PR is ready for review.

@densamoilov densamoilov merged commit ba51695 into oneapi-src:main Jan 26, 2024
10 checks passed