CUDA: add stream-based concurrency #16991
Conversation
Force-pushed from 1e97a91 to 1c4d8f3
Sorry, I wanted to tell you this but I forgot: a long time ago I tried something similar, see #4719. There the performance did not improve; I think the reason was the lack of CUDA graphs to reduce the overhead.

Yeah, I think CUDA graphs are essential for this to work (hence this PR only looks at batch_size=1).
Force-pushed from 1c4d8f3 to 70a5a01
Minimal changes to make this work on HIP. If used for real, the cudaStreamWaitEvent error needs to be handled, of course.
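To make the discussion concrete, here is a minimal, self-contained sketch (not ggml's actual code) of the fork/join pattern this PR builds on: record an event on the main stream, have side streams wait on it, run independent work concurrently, then join back with events. The `scale` kernel, stream count, and buffer names are made up for illustration; every call, including `cudaStreamWaitEvent`, is checked as suggested above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(expr)                                                  \
    do {                                                                  \
        cudaError_t err_ = (expr);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
                    cudaGetErrorString(err_), __FILE__, __LINE__);        \
            exit(1);                                                      \
        }                                                                 \
    } while (0)

// Placeholder kernel standing in for the per-branch work (norm, rope, etc.).
__global__ void scale(float * x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float *q, *k, *v;
    CUDA_CHECK(cudaMalloc(&q, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&k, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&v, n * sizeof(float)));

    cudaStream_t main_stream, side[2];
    cudaEvent_t fork_ev, join_ev[2];
    CUDA_CHECK(cudaStreamCreate(&main_stream));
    CUDA_CHECK(cudaEventCreateWithFlags(&fork_ev, cudaEventDisableTiming));
    for (int i = 0; i < 2; ++i) {
        CUDA_CHECK(cudaStreamCreate(&side[i]));
        CUDA_CHECK(cudaEventCreateWithFlags(&join_ev[i], cudaEventDisableTiming));
    }

    // Fork: side streams must not run ahead of prior work on the main stream.
    CUDA_CHECK(cudaEventRecord(fork_ev, main_stream));
    for (int i = 0; i < 2; ++i) {
        CUDA_CHECK(cudaStreamWaitEvent(side[i], fork_ev, 0));
    }

    // The three "branches" (stand-ins for the Q/K/V work) run concurrently.
    scale<<<(n + 255) / 256, 256, 0, main_stream>>>(q, n, 2.0f);
    scale<<<(n + 255) / 256, 256, 0, side[0]>>>(k, n, 2.0f);
    scale<<<(n + 255) / 256, 256, 0, side[1]>>>(v, n, 2.0f);

    // Join: the main stream waits for both side streams before continuing.
    for (int i = 0; i < 2; ++i) {
        CUDA_CHECK(cudaEventRecord(join_ev[i], side[i]));
        CUDA_CHECK(cudaStreamWaitEvent(main_stream, join_ev[i], 0));
    }
    CUDA_CHECK(cudaStreamSynchronize(main_stream));

    CUDA_CHECK(cudaFree(q));
    CUDA_CHECK(cudaFree(k));
    CUDA_CHECK(cudaFree(v));
    return 0;
}
```

The fork/join itself adds event-record and wait launches on every pass, which is why the CUDA-graphs point above matters: capturing the whole region once with cudaStreamBeginCapture/cudaStreamEndCapture and replaying it via cudaGraphLaunch amortizes that launch overhead.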
The almost exactly identical numbers make me think that this change is not launching the streams. I would expect a shift in performance, either for the worse or the better.

Yeah, I'll run a trace on it later.
Force-pushed from 70a5a01 to 2c3cfa9
Force-pushed from 12d5f82 to d3a8d93
@ggerganov would you mind testing this on your DGX Spark? I want to see if GPUs with low memory bandwidth benefit from this change.

I'm not really clear on what the problem is that you're trying to solve here. If the order is MUL_MAT+ADD+MUL_MAT+ADD+MUL_MAT+ADD, then the nodes are conveniently consecutive (for fusion), the intermediate MUL_MAT outputs aren't needed, and the ADDs will all have different outputs. This is the order ggml-vulkan uses, and it gets both fusion and concurrency.
My problem is that the buffer gets reused in this case. The graph assumes serial execution and thinks the first MUL_MAT's buffer is no longer required. (Assume the no-fusion case for now.)
If you're not doing fusion, then you'd want graph_optimize to reorder these to MUL_MAT+MUL_MAT+MUL_MAT+ADD+ADD+ADD. Then the MUL_MAT results will stay live until the ADDs.
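The reorder being described can be illustrated with a toy stand-in for graph_optimize (the function name and the string-based op list are made up for illustration; real ggml nodes carry dependency information, which this sketch ignores):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy reorder: hoist the independent MUL_MATs ahead of the ADDs that consume
// them, keeping relative order stable. After this, the MUL_MAT outputs are
// all live at the same time, so they can run concurrently on separate streams.
std::vector<std::string> hoist_mul_mats(const std::vector<std::string> & ops) {
    std::vector<std::string> out(ops);
    std::stable_partition(out.begin(), out.end(),
                          [](const std::string & op) { return op == "MUL_MAT"; });
    return out;
}
```

Applied to the sequence above, `hoist_mul_mats({"MUL_MAT","ADD","MUL_MAT","ADD","MUL_MAT","ADD"})` yields `{"MUL_MAT","MUL_MAT","MUL_MAT","ADD","ADD","ADD"}`.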
Yeah, that's what the current re-order does, but it doesn't allow for fusion. I don't want these two things to be intertwined; ideally I want something that just lets me extend the lifetime of a particular output until a certain node.
IMO they are fundamentally intertwined. The code that detects fusions looks for specific sequences of operations, and graph_optimize should generate or preserve those sequences. If the backend supports fusing something, then graph_optimize should make those nodes consecutive, both to simplify the fusion logic and to shorten the lifetime of transient allocations.
I think it would be good to isolate these two behaviours. If you look at the graph above, it can launch a concurrent graph from the MUL_MAT until SET_ROWS. We don't have fusion for that entire sequence, and reasoning about which output stays alive would involve inspecting the graph in any case. Secondly, fusion is a common source of bugs in the CUDA backend; I don't want to add another layer of complexity on top of it.
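The point that fusion depends on consecutive sequences can be made concrete with a toy matcher (illustrative only; ggml's real fusion checks also inspect tensor shapes, source pointers, and use counts):

```cpp
#include <string>
#include <vector>

// Toy fusion check: a candidate exists only where the pattern's ops appear
// back-to-back in graph order. Any reorder that interleaves other nodes
// between them makes the match fail, which is why reordering for concurrency
// and detecting fusions interact.
bool can_fuse_at(const std::vector<std::string> & ops, size_t i,
                 const std::vector<std::string> & pattern) {
    if (i + pattern.size() > ops.size()) {
        return false;
    }
    for (size_t j = 0; j < pattern.size(); ++j) {
        if (ops[i + j] != pattern[j]) {
            return false;
        }
    }
    return true;
}
```

For example, the pattern {RMS_NORM, MUL} matches at position 1 of {MUL_MAT, RMS_NORM, MUL}, but not once another node is moved between the two.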
This comment was marked as outdated.
Did you pass the env flag?

Why, of course not 🫠

Updated numbers
If my current understanding of ggml is correct, we should be able to get the same behavior (fusion + concurrency) on both Vulkan and CUDA, since Q's lifetime should end after flash attention (and K + V's after being inserted into the KV cache). Have we root-caused this?
Force-pushed from e1b6274 to 2c9c3c2
My git just tweaked in the middle of auto-compaction
Not sure if you have a specific logic in mind for assigning the stream ids. Atm, I don't have a good feeling for how difficult it would be to modify the allocator to respect the stream ids.
IMO ggml_visit_parents should just assign everything to stream 0, and then graph_optimize can reassign to streams based on what the backend wants.

We need to ensure that nodes are only recycled after they've been used for the last time and after their results have "become visible", meaning that we must not recycle nodes whose use may still lie in the future. I would suggest allocating an array like

I would also be fine with determining stream ids in the graph optimization step.
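The recycle-safety rule under discussion could be modeled like this (a sketch under assumed names, not the actual allocator): track, per stream, the last graph node known to have completed, and allow a buffer to be reused only once every stream that touches it has progressed past the buffer's last use.

```cpp
#include <cstdint>
#include <vector>

// Streams are modeled as monotonically increasing progress counters over
// graph-node indices; names and layout are hypothetical.
struct stream_progress {
    std::vector<int64_t> completed; // last graph-node index completed per stream
};

struct buffer_use {
    int     stream_id; // stream the last reader/writer runs on
    int64_t last_node; // graph-node index of that last use
};

// A buffer is safe to recycle only if no stream can still be executing
// (or be about to execute) one of its uses.
bool can_recycle(const stream_progress & sp, const std::vector<buffer_use> & uses) {
    for (const buffer_use & u : uses) {
        if (sp.completed[u.stream_id] < u.last_node) {
            return false; // that stream may still touch the buffer
        }
    }
    return true;
}
```

With serial execution there is a single counter and this degenerates to the existing "last use has passed" check; the multi-stream case is what forces extending lifetimes across the join point.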
When I tested performance: as expected, for typical model sizes there is more speedup for fast GPUs running small models on an empty context. For very small models, however, this trend does not seem to continue monotonically.
For my understanding, why do we gate the functionality behind

My understanding is the former. We currently still have issues with BS > 1, for example, where it's disabled: #16991 (comment).
Yes, this is the initial experimental support. Wherever we use CUDA graphs it should be possible to default this to true (without CUDA graphs there is a performance drop). So at the moment it can be true universally for BS=1 for fully offloaded models; however, I would like to give it some time before making this flag something like

I don't think it should become a CLI argument. Ideally, it should be an env variable that is enabled by default and can optionally be disabled (mainly for debugging purposes). Similar to how we do it with
This PR enables the concurrent streams introduced in ggml-org#16991 by default. To disable them, a new env flag `GGML_CUDA_DISABLE_GRAPH_OPT` is introduced.
* CUDA: add stream-based concurrency
* HIP: fix hipStreamWaitEvent define and nodiscard warnings
* ggml-cuda: fix fusion inside stream
* ggml-cuda: fix bug w.r.t first stream launch
* ggml-cuda: format
* ggml-cuda: improve assert message
* ggml-cuda: use lambda instead of duplicating code
* ggml-cuda: add some more comments
* ggml-cuda: add more detailed comments about concurrency
* ggml-cuda: rename + remove unused var
* ggml-cuda: fix condition for stream launch
* ggml-cuda: address review comments, add destructor
* common.cuh: add is_valid for concurrent events
* common.cuh: make comment better
* update comment (Co-authored-by: Johannes Gäßler <johannesg@5d6.de>)
* update comment (Co-authored-by: Johannes Gäßler <johannesg@5d6.de>)
* common.cuh: fix lower_bound condition + remove join_node data from write_ranges
* ggml-cuda: fix overlap condition + shadowing parameter

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@am17an, thanks for this PR, impressive speedup. I am wondering if this graph optimization can be adopted for image-generation models. Both UNet and DiT have attention (self and cross), so they should benefit from the concurrency. Could you provide some guidance? Thanks.

@bssrdf happy to help. Perhaps you can email me and we can discuss; my email is on my profile page.


Possibly supersedes #16813.
This PR adds support for running concurrent CUDA streams on single-GPU setups.
At the moment this only targets the Q, K, V branch. I feel this is the "correct" approach in case the Q, K, V tensors are of different types or not in the same place in memory. The downside is that this approach doesn't come for free and there's some complexity involved; I'm not an expert on the ggml graph, and I feel it could be simplified.
Currently this is hidden behind an env variable flag. To run, you can use GGML_CUDA_GRAPH_OPT=1.
TG performance gain is larger than in the previous PR (1-9% depending on the model/GPU), probably because we parallelize MUL_MAT + NORM + ROPE rather than just MUL_MAT. At the moment we leave some performance on the table where we don't fuse operations in the parallel streams themselves (e.g. MUL_MAT + BIAS, RMS_NORM + MUL, etc.); I couldn't find a simple enough way to enable fusion there.
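For reference, opt-in env toggles like this are usually read straight from the process environment; a minimal sketch (the helper name is made up, this is not ggml's actual code):

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: treat FLAG=1 as enabled, anything else (or unset)
// as disabled, mirroring how GGML_CUDA_GRAPH_OPT=1 gates the feature.
bool env_flag_enabled(const char * name) {
    const char * v = std::getenv(name);
    return v != nullptr && std::strcmp(v, "1") == 0;
}
```

The flag can then be set per run from the shell, e.g. `GGML_CUDA_GRAPH_OPT=1 ./llama-bench -m model.gguf` (the model path here is a placeholder).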
Performance details:
(5090 and 4090 benchmark tables, collapsed in the original page)
And just for comparison, this is without fusing ops inside a stream:
(5090 and 4090 benchmark tables, collapsed in the original page)
TODO: