fix: set device before hipStreamWaitEvent in multi-GPU pipeline parallel #31
Merged
spiritbuun merged 1 commit on Apr 30, 2026
Conversation
Overview
ROCm 7.x requires the current device to match a stream's device before calling hipStreamWaitEvent. In pipeline-parallel configs (two GPUs with -ts 1,1), the ggml scheduler alternates splits across devices. After a split runs on GPU 1, the current device is 1 -- but the scheduler may then call ggml_backend_cuda_event_wait for GPU 0's backend without switching devices first. This causes hipStreamWaitEvent to fail with "illegal memory access".
Fix: add ggml_cuda_set_device(cuda_ctx->device) at the top of both ggml_backend_cuda_event_record and ggml_backend_cuda_event_wait to ensure the current device always matches the stream and event.
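A minimal sketch of the guard pattern the fix adds. The device API is mocked here so the logic runs without a GPU; in the real patch the mock corresponds to ggml_cuda_set_device followed by hipStreamWaitEvent inside ggml_backend_cuda_event_wait, and the struct names below are stand-ins, not the actual ggml types.

```cpp
#include <cassert>

// Mocked runtime state: which device is "current" on this thread.
// In the real code this is tracked by the HIP/CUDA runtime.
static int g_current_device = 0;
static void mock_set_device(int dev) { g_current_device = dev; }

struct backend_ctx { int device; };   // stand-in for the CUDA backend context
struct gpu_event   { int device; };   // event created on ctx->device

// The fixed pattern: pin the current device to the backend's device
// at the top, before touching its stream or event. ROCm 7.x requires
// the current device to match the stream's device here.
static void event_wait(backend_ctx * ctx, gpu_event * ev) {
    mock_set_device(ctx->device);     // the added ggml_cuda_set_device(...)
    (void) ev;
    // hipStreamWaitEvent(stream_of(ctx), ev, 0) would be called here;
    // without the line above, a prior split on another GPU leaves the
    // wrong device current and the call fails.
    assert(g_current_device == ctx->device);
}
```

The same one-line guard goes at the top of the record path, so both ends of the event always run with a matching current device.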
Also fix dflash_cross_ring_gpu_write and dflash_cross_ring_gpu_interleave in cross-ring-interleave.cu: these used cudaStreamPerThread without setting the device first. The GPU ring buffer is allocated on one device, but the target model decode may leave a different device current when ring_write is called. Added a device field to the ring struct (captured at alloc time with cudaGetDevice) and call cudaSetDevice before each cudaStreamPerThread operation so the H2D memcpy and interleave kernel land on the right device.
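The ring-buffer half of the fix follows a capture-at-alloc, set-before-use pattern. Again sketched with a mocked device API so it runs without a GPU; the helper names (ring_alloc, ring_write) and struct layout are illustrative, not the actual symbols in cross-ring-interleave.cu.

```cpp
#include <cassert>

// Mocked stand-ins for cudaGetDevice / cudaSetDevice.
static int g_cur_dev = 0;
static void mock_get_device(int * dev) { *dev = g_cur_dev; }
static void mock_set_device(int dev)   { g_cur_dev = dev; }

struct gpu_ring {
    int device;   // NEW field: the device the ring buffer was allocated on
    // ... buffer pointer, head/tail, capacity elided ...
};

// Capture the current device at allocation time, alongside the cudaMalloc.
static void ring_alloc(gpu_ring * r) {
    mock_get_device(&r->device);       // cudaGetDevice(&r->device)
}

// Pin the device before any cudaStreamPerThread operation, so the
// H2D memcpy and the interleave kernel land on the ring's device even
// if the target model decode left a different device current.
static void ring_write(gpu_ring * r) {
    mock_set_device(r->device);        // cudaSetDevice(r->device)
    // cudaMemcpyAsync(..., cudaStreamPerThread) and the kernel launch
    // would go here.
    assert(g_cur_dev == r->device);
}
```

cudaStreamPerThread is an implicit per-thread stream on whichever device is current, which is why the explicit cudaSetDevice is needed before every use rather than once at setup.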
Tested on ROCm 7.2.1 with two R9700 GPUs, Qwen3.6-27B + DFlash drafter using -ts 1,1. Before this patch: crash on the second draft cycle. After: stable.
Requirements