
fix: set device before hipStreamWaitEvent in multi-GPU pipeline parallel #31

Merged
spiritbuun merged 1 commit into spiritbuun:master from williamtwomey:fix/rocm-event-wait-set-device
Apr 30, 2026
Conversation

@williamtwomey

Overview

ROCm 7.x requires the current device to match a stream's device before calling hipStreamWaitEvent. In pipeline-parallel configs (two GPUs with -ts 1,1), the ggml scheduler alternates splits across devices. After a split runs on GPU 1, the current device is 1 -- but the scheduler may then call ggml_backend_cuda_event_wait for GPU 0's backend without switching devices first. This causes hipStreamWaitEvent to fail with "illegal memory access".

Fix: add ggml_cuda_set_device(cuda_ctx->device) at the top of both ggml_backend_cuda_event_record and ggml_backend_cuda_event_wait to ensure the current device always matches the stream and event.

Also fix dflash_cross_ring_gpu_write and dflash_cross_ring_gpu_interleave in cross-ring-interleave.cu: these used cudaStreamPerThread without setting the device first. The GPU ring buffer is allocated on one device, but the target model decode may leave a different device current when ring_write is called. Added a device field to the ring struct (captured at alloc time with cudaGetDevice) and call cudaSetDevice before each cudaStreamPerThread operation so the H2D memcpy and interleave kernel land on the right device.

Tested on ROCm 7.2.1 with two R9700 GPUs, Qwen3.6-27B + DFlash drafter using -ts 1,1. Before this patch: crash on the second draft cycle. After: stable.

spiritbuun merged commit 0fb3fdc into spiritbuun:master on Apr 30, 2026.
