Increasing utilization - gdrcopy_copybw #288
Comments
@pakmarkthub @drossetti can one of you please take a look at this question?
This question has been asked multiple times over the years; see for example Pak's comments in #286. Regarding the _mm512 intrinsics: in the past I tried to use AVX2 without any improvement. That is architecturally understandable, as the cache-line bandwidth between a CPU core and its L1 is already saturated by finer-grain loads. That being said, there might be architectural innovations in the new CPUs, so a quick check would not hurt.
Thank you for your response! We will try it and let you know if we see any improvement. Are any of the previously attempted implementations accessible anywhere for reference?
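For anyone who wants to run the quick check mentioned above, a 512-bit variant of the uncached-store copy loop could look roughly like the sketch below. This is not code from the gdrcopy repository: the function name, the alignment assumptions, and the missing head/tail handling are simplifications, and it has to be built with AVX-512 support (e.g. -mavx512f).

```c
/* Hypothetical AVX-512 variant for a quick check; NOT from the gdrcopy
 * sources. Assumes dest and src are 64-byte aligned and n_bytes is a
 * multiple of 64; a real implementation needs head/tail handling. */
#include <immintrin.h>
#include <stddef.h>

static void copybw_avx512_sketch(void *dest, const void *src, size_t n_bytes)
{
    double *d = (double *)dest;
    const double *s = (const double *)src;
    size_t n = n_bytes / sizeof(double);     /* 8 doubles per 512-bit vector */

    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m512d v = _mm512_load_pd(s + i);   /* aligned 64-byte load        */
        _mm512_stream_pd(d + i, v);          /* non-temporal 64-byte store  */
    }
    _mm_sfence();  /* make the streaming stores globally visible */
}
```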
Hi,
I am running the gdrcopy_copybw benchmark on an NVIDIA A100 80GB PCIe GPU over PCIe Gen 4.
The utilization does not reach its maximum possible value: I get about 20 GB/s out of a possible 32 GB/s for buffer sizes from 32 KB to 8 MB.
Looking into the code, your implementation of memcpy_uncached_store_avx uses the 256-bit intrinsics _mm256_load_pd and _mm256_stream_pd.
Is there a reason for not using the 512-bit intrinsics _mm512_load_pd and _mm512_stream_pd instead? Could using them increase the utilization?
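For reference, the 256-bit path asked about is conceptually a loop of the following shape. This is a simplified sketch of that kind of streaming-store copy, not the verbatim gdrcopy implementation, which also deals with unaligned head and tail bytes.

```c
/* Simplified sketch of a 256-bit streaming-store copy loop in the spirit of
 * memcpy_uncached_store_avx; not the actual gdrcopy code. Assumes 32-byte
 * alignment and n_bytes % 32 == 0. */
#include <immintrin.h>
#include <stddef.h>

static void copy_avx_stream_sketch(void *dest, const void *src, size_t n_bytes)
{
    double *d = (double *)dest;
    const double *s = (const double *)src;
    size_t n = n_bytes / sizeof(double);     /* 4 doubles per 256-bit vector */

    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m256d v = _mm256_load_pd(s + i);   /* aligned 32-byte load        */
        _mm256_stream_pd(d + i, v);          /* non-temporal 32-byte store  */
    }
    _mm_sfence();  /* order the streaming stores before returning */
}
```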
Thanks!