fix: set device before hipStreamWaitEvent in multi-GPU pipeline parallel #31
Merged
spiritbuun merged 1 commit on Apr 30, 2026
Conversation
Overview
ROCm 7.x requires the current device to match a stream's device before calling hipStreamWaitEvent. In pipeline-parallel configs (two GPUs with -ts 1,1), the ggml scheduler alternates splits across devices. After a split runs on GPU 1, the current device is 1 -- but the scheduler may then call ggml_backend_cuda_event_wait for GPU 0's backend without switching devices first. This causes hipStreamWaitEvent to fail with "illegal memory access".
Fix: add ggml_cuda_set_device(cuda_ctx->device) at the top of both ggml_backend_cuda_event_record and ggml_backend_cuda_event_wait to ensure the current device always matches the stream and event.
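A minimal sketch of the guard pattern the fix adds. The device API is mocked here so the logic runs without a GPU; in the real patch the mock corresponds to ggml_cuda_set_device followed by hipStreamWaitEvent inside ggml_backend_cuda_event_wait, and the struct names below are stand-ins, not the actual ggml types.

```cpp
#include <cassert>

// Mocked runtime state: which device is "current" on this thread.
// In the real code this is tracked by the HIP/CUDA runtime.
static int g_current_device = 0;
static void mock_set_device(int dev) { g_current_device = dev; }

struct backend_ctx { int device; };   // stand-in for the CUDA backend context
struct gpu_event   { int device; };   // event created on ctx->device

// The fixed pattern: pin the current device to the backend's device
// at the top, before touching its stream or event. ROCm 7.x requires
// the current device to match the stream's device here.
static void event_wait(backend_ctx * ctx, gpu_event * ev) {
    mock_set_device(ctx->device);     // the added ggml_cuda_set_device(...)
    (void) ev;
    // hipStreamWaitEvent(stream_of(ctx), ev, 0) would be called here;
    // without the line above, a prior split on another GPU leaves the
    // wrong device current and the call fails.
    assert(g_current_device == ctx->device);
}
```

The same one-line guard goes at the top of the record path, so both ends of the event always run with a matching current device.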
Also fix dflash_cross_ring_gpu_write and dflash_cross_ring_gpu_interleave in cross-ring-interleave.cu: these used cudaStreamPerThread without setting the device first. The GPU ring buffer is allocated on one device, but the target model decode may leave a different device current when ring_write is called. Added a device field to the ring struct (captured at alloc time with cudaGetDevice) and call cudaSetDevice before each cudaStreamPerThread operation so the H2D memcpy and interleave kernel land on the right device.
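The ring-buffer half of the fix follows a capture-at-alloc, set-before-use pattern. Again sketched with a mocked device API so it runs without a GPU; the helper names (ring_alloc, ring_write) and struct layout are illustrative, not the actual symbols in cross-ring-interleave.cu.

```cpp
#include <cassert>

// Mocked stand-ins for cudaGetDevice / cudaSetDevice.
static int g_cur_dev = 0;
static void mock_get_device(int * dev) { *dev = g_cur_dev; }
static void mock_set_device(int dev)   { g_cur_dev = dev; }

struct gpu_ring {
    int device;   // NEW field: the device the ring buffer was allocated on
    // ... buffer pointer, head/tail, capacity elided ...
};

// Capture the current device at allocation time, alongside the cudaMalloc.
static void ring_alloc(gpu_ring * r) {
    mock_get_device(&r->device);       // cudaGetDevice(&r->device)
}

// Pin the device before any cudaStreamPerThread operation, so the
// H2D memcpy and the interleave kernel land on the ring's device even
// if the target model decode left a different device current.
static void ring_write(gpu_ring * r) {
    mock_set_device(r->device);        // cudaSetDevice(r->device)
    // cudaMemcpyAsync(..., cudaStreamPerThread) and the kernel launch
    // would go here.
    assert(g_cur_dev == r->device);
}
```

cudaStreamPerThread is an implicit per-thread stream on whichever device is current, which is why the explicit cudaSetDevice is needed before every use rather than once at setup.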
Tested on ROCm 7.2.1 with two R9700 GPUs, Qwen3.6-27B + DFlash drafter using -ts 1,1. Before this patch: crash on the second draft cycle. After: stable.
Requirements