[Issue]: Unexpected Behavior LL Protocol #1234

Open
arkhadem opened this issue Jun 28, 2024 · 1 comment

arkhadem commented Jun 28, 2024

Problem Description

Hi Everyone,

I have found a very strange behavior in rccl-rocm-6.1.2 that I cannot explain with my limited knowledge of the LL implementation. It shows up in the AllGather test with the RING algorithm and the LL protocol.

In the LL implementation, each channel has 256 threads.
On each trip, each thread sends/receives 8B of data.
So each trip of any LL primitive transfers 256 threads x 8B = 2KB of data.
I found that if the data size is not divisible by 128B (16 threads x 8B), the latency is very high.
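
The arithmetic above can be written out as a small sketch. It uses only the figures quoted in this description (256 threads per channel, 8B per thread per trip, 128B granularity). The helper names are hypothetical, and it assumes for illustration that a single channel handles the bytes contiguously, which may not match how the kernel actually splits work.

```python
# Figures quoted in the issue text; not verified against the RCCL source.
THREADS_PER_CHANNEL = 256
BYTES_PER_THREAD = 8
ALIGN_BYTES = 128  # 16 threads x 8B


def bytes_per_trip():
    """Bytes moved by one channel in one trip of an LL primitive."""
    return THREADS_PER_CHANNEL * BYTES_PER_THREAD


def tail_threads(nbytes):
    """Threads carrying data in the last, partially filled trip.

    Returns 0 when the size is a whole number of trips.
    Assumes (hypothetically) that one channel moves nbytes contiguously.
    """
    tail = nbytes % bytes_per_trip()
    return -(-tail // BYTES_PER_THREAD)  # ceiling division


def is_128b_aligned(nbytes):
    """True when the size is a multiple of 128B."""
    return nbytes % ALIGN_BYTES == 0


if __name__ == "__main__":
    for n in (2048, 2048 + 16, 2048 + 128):
        print(n, tail_threads(n), is_128b_aligned(n))
```

With these assumptions, adding 16B to a full trip leaves 2 threads active in the tail, and adding 128B leaves 16.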

In the following experiment, I increase the data size by 16B in each step, meaning that 2 more threads transfer data.
Every 8 steps (x 2 threads = 16 threads, or 1/4 of the warp size), the latency is low.
Otherwise, the latency is much higher (~150us difference).
Does anyone understand why this is happening?

[Image: 3MI300X]

Sincerely,
Alireza

Operating System

Ubuntu 22.04.3 LTS (Jammy Jellyfish)

CPU

Intel(R) Xeon(R) Platinum 8480C

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.1.0

ROCm Component

No response

Steps to Reproduce

Using 3 fully-connected GPUs:

RCCL_MSCCL_ENABLE=0 NCCL_PROTO=LL NCCL_ALGO=RING NCCL_MIN_NRINGS=16 NCCL_MAX_NRINGS=16 LD_LIBRARY_PATH=rccl-rocm-6.1.2/build/release/:$LD_LIBRARY_PATH ./build/all_gather_perf -g 3 -b 50331648 -e 50334720 -i 48 -s 1
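
To relate the flags above to the 16B steps described in the problem description, here is a rough sketch. It assumes that -b, -e and -i are the minimum size, maximum size and step in bytes, and that all_gather_perf splits the total size evenly across the 3 ranks; both are assumptions about the test harness, not something confirmed in this thread. The function names are hypothetical.

```python
# Values copied from the command line above.
NRANKS = 3             # -g 3
MIN_BYTES = 50331648   # -b
MAX_BYTES = 50334720   # -e
STEP_BYTES = 48        # -i
ALIGN_BYTES = 128      # granularity quoted in the problem description


def per_rank_sizes():
    """Per-rank byte counts for every step of the sweep (end inclusive).

    Assumes the total size is divided evenly among the ranks.
    """
    return [total // NRANKS
            for total in range(MIN_BYTES, MAX_BYTES + 1, STEP_BYTES)]


def aligned_steps(sizes):
    """Indices of the steps whose per-rank size is a multiple of 128B."""
    return [i for i, size in enumerate(sizes) if size % ALIGN_BYTES == 0]


if __name__ == "__main__":
    sizes = per_rank_sizes()
    print("steps:", len(sizes))
    print("per-rank step:", sizes[1] - sizes[0], "bytes")
    print("128B-aligned steps:", aligned_steps(sizes))
```

Under these assumptions the 48B total step corresponds to 16B per rank, and every 8th step lands on a 128B multiple, which matches the pattern described above.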

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@gilbertlee-amd
Collaborator

Hi @arkhadem,

Thanks for reporting this - I've created an internal ticket to look into this and will update this ticket when we have some information about it.
