Thank you for posting the question. As you noticed, there are differences between the all_gather_perf and all_reduce_perf tests, especially at smaller message sizes. The main reason lies in the nature of these operations:
Reduction Operation: This allows for the overlap of computation (e.g., “sum” in your log) and communication. As data is received from other processes, it can be immediately computed with local data, and the result can be sent to the next process. This pipelining effect can lead to more efficient use of the communication bus.
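To make the pipelining point concrete, here is a toy simulation of a ring reduce-scatter (the first phase of a ring all-reduce). This is an illustrative sketch only, not RCCL's actual implementation: each step, every rank forwards the chunk it most recently accumulated and reduces the incoming chunk into its local data as it arrives, which is where compute overlaps communication.

```python
# Toy ring reduce-scatter, illustrative only (not the RCCL implementation).
# buffers[r] is rank r's data, split into n chunks (n = number of ranks).
def ring_reduce_scatter(buffers):
    """After n-1 steps, rank r holds the fully reduced chunk (r+1) % n."""
    n = len(buffers)
    for step in range(n - 1):
        # Collect all sends for this step first (simulates simultaneous
        # exchange on a ring: rank r sends to rank r+1).
        sends = []
        for r in range(n):
            idx = (r - step) % n  # chunk rank r accumulated most recently
            sends.append((r, idx, buffers[r][idx]))
        for r, idx, data in sends:
            dst = (r + 1) % n
            # The receiver reduces (here: element-wise sum) the incoming
            # chunk with its local chunk as soon as it arrives -- this is
            # the compute/communication overlap.
            buffers[dst][idx] = [a + b for a, b in zip(buffers[dst][idx], data)]
    return buffers
```

An all-gather, by contrast, runs the same ring pattern but only copies chunks forward, with no reduction step to overlap.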
Gather Operation: This involves collecting data from all processes and distributing the combined data to all processes. Each process ends up with a complete set of data from all other processes. This operation primarily involves data movement with minimal computation.
For both operations, it is also noted that the larger the data, the higher the efficiency. This is due to the overhead of initiating transfers being amortized over more data, leading to better utilization of the communication pipeline.
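The amortization effect above can be sketched with a simple alpha-beta cost model, t = alpha + bytes/B, where alpha is the per-transfer startup latency and B the link bandwidth. The numbers below are illustrative assumptions, not measured Slingshot-11 values:

```python
# Alpha-beta cost model: t = alpha + nbytes / B.
# alpha and B are illustrative placeholders, not measured values.
alpha = 10e-6   # assumed 10 us startup overhead per transfer
B = 25e9        # assumed 25 GB/s sustained link bandwidth

def effective_bw(nbytes):
    """Achieved bandwidth once startup overhead is included."""
    return nbytes / (alpha + nbytes / B)

for size in (8192, 1 << 23, 1 << 24, 1 << 27):
    print(f"{size:>10} B -> {effective_bw(size) / 1e9:6.2f} GB/s")
```

With these assumed constants, small messages are dominated by alpha and achieve only a fraction of B, while messages in the tens of megabytes approach the link rate, matching the general trend seen in the logs.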
Problem Description
I use the aws-ofi-rccl plugin with libfabric 1.23 on a Cray Slingshot 11 (SS11) interconnect.
I run on 4 nodes, each with 4 MI250X GPUs. See the CPU/GPU details below.
I launch with Slurm:
There is no cgroup getting in DMA's way.
All-gather performance seems low up to 16777216 bytes.
The low performance up to 8388608 bytes seems unexpected. Running all_reduce_perf on the same node pool gives:
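One caveat when comparing the two tests directly: the nccl-tests/rccl-tests benchmarks report a "bus bandwidth" derived from the algorithmic bandwidth (message size / time) via an operation-specific factor, as documented in the nccl-tests performance notes, so all_gather_perf and all_reduce_perf numbers are not on the same scale:

```python
# Bus-bandwidth correction factors from the nccl-tests performance
# documentation: busBw = algBw * factor, algBw = size / time.
def bus_bw_factor(op, nranks):
    if op == "all_reduce":
        return 2 * (nranks - 1) / nranks
    if op == "all_gather":
        return (nranks - 1) / nranks
    raise ValueError(f"unknown op: {op}")

# Example with 16 ranks (the GCD count depends on how the 4x4 MI250X
# setup is mapped; 16 here is purely illustrative):
for op in ("all_reduce", "all_gather"):
    print(op, bus_bw_factor(op, 16))
```

At the same wall-clock time per byte, all-reduce thus reports twice the bus bandwidth of all-gather, which should be factored out before concluding that one operation underperforms the other.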
Operating System
NAME="Red Hat Enterprise Linux" VERSION="8.10 (Ootpa)"
CPU
AMD EPYC 7A53 64-Core Processor
GPU
rocminfo reports 4 CPU agents and 8 GPU agents per node:
- CPU agents (x4): Name: AMD EPYC 7A53 64-Core Processor, Marketing Name: AMD EPYC 7A53 64-Core Processor
- GPU agents (x8): Name: gfx90a, Marketing Name: AMD Instinct MI250X (amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-)
ROCm Version
ROCm 6.2.1
ROCm Component
rccl
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response