Increasing utilization - gdrcopy_copybw #288
Comments
@pakmarkthub @drossetti can one of you please take a look at this question?
This question has been asked multiple times over the years; see for example Pak's comments in #286. Regarding the _mm512 intrinsics: in the past I tried to use AVX2 without any improvement. That is architecturally understandable, as the cache-line bandwidth between a CPU core and its L1 is already saturated by finer-grain loads. That being said, there might be architectural innovations in the new CPUs, so a quick check would not hurt.
Thank you for your response! We will try it and let you know if we see any improvement. Are any of the previously attempted implementations accessible anywhere for reference?
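For anyone who wants to run the quick check mentioned above, a 512-bit variant of the uncached-store copy loop could look roughly like the sketch below. This is not code from the gdrcopy repository: the function name, the alignment assumptions, and the missing head/tail handling are simplifications, and it has to be built with AVX-512 support (e.g. -mavx512f).

```c
/* Hypothetical AVX-512 variant for a quick check; NOT from the gdrcopy
 * sources. Assumes dest and src are 64-byte aligned and n_bytes is a
 * multiple of 64; a real implementation needs head/tail handling. */
#include <immintrin.h>
#include <stddef.h>

static void copybw_avx512_sketch(void *dest, const void *src, size_t n_bytes)
{
    double *d = (double *)dest;
    const double *s = (const double *)src;
    size_t n = n_bytes / sizeof(double);     /* 8 doubles per 512-bit vector */

    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m512d v = _mm512_load_pd(s + i);   /* aligned 64-byte load        */
        _mm512_stream_pd(d + i, v);          /* non-temporal 64-byte store  */
    }
    _mm_sfence();  /* make the streaming stores globally visible */
}
```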
Hi,
I am running the gdrcopy_copybw benchmark on an NVIDIA A100 80GB PCIe GPU over PCIe Gen 4.
The utilization does not reach its maximum possible value: I get about 20 GB/s out of a possible 32 GB/s for buffer sizes from 32 KB to 8 MB.
Looking into the code, your implementation of memcpy_uncached_store_avx uses the 256-bit intrinsics _mm256_load_pd and _mm256_stream_pd.
Is there a reason for not using the 512-bit intrinsics _mm512_load_pd and _mm512_stream_pd instead? Could using them increase the utilization?
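For reference, the 256-bit path asked about is conceptually a loop of the following shape. This is a simplified sketch of that kind of streaming-store copy, not the verbatim gdrcopy implementation, which also deals with unaligned head and tail bytes.

```c
/* Simplified sketch of a 256-bit streaming-store copy loop in the spirit of
 * memcpy_uncached_store_avx; not the actual gdrcopy code. Assumes 32-byte
 * alignment and n_bytes % 32 == 0. */
#include <immintrin.h>
#include <stddef.h>

static void copy_avx_stream_sketch(void *dest, const void *src, size_t n_bytes)
{
    double *d = (double *)dest;
    const double *s = (const double *)src;
    size_t n = n_bytes / sizeof(double);     /* 4 doubles per 256-bit vector */

    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m256d v = _mm256_load_pd(s + i);   /* aligned 32-byte load        */
        _mm256_stream_pd(d + i, v);          /* non-temporal 32-byte store  */
    }
    _mm_sfence();  /* order the streaming stores before returning */
}
```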
Thanks!