ggml: backend-agnostic tensor parallelism#19378
JohannesGaessler wants to merge 34 commits into ggml-org:master from …
Conversation
I'm not seeing a deadlock, just a crash in the driver with an invalid descriptor. I ran … It seems like `tensor->data` (and `tensor->view_src->data`) are too large. I haven't debugged further.
Works for me on 2x 3090 for LLaMA 3 8B and Mistral Nemo 12B. On Devstral I get an OOM (expected because of the model size?), but on Ministral 14B too. Qwen 4B has a different issue.
Interesting. If the CPU backend can be virtualized into multiple devices as described here, would it be possible to parallelize across multiple NUMA nodes?
Will this PR help when using multiple RPC backends connected to GPUs together with the CPU, and when the CPU MoE flag is used?
@jacekpoplawski the combination of …

@DocShotgun longer-term I intend to also enable this code for better NUMA support, though I'm not yet sure what to do in terms of hardware for development. Originally I had intended to buy 1.5 TiB of DDR5 RAM and 2 EPYC CPUs, but at current prices that would be financially irresponsible of me.

@gopinath87607 I don't understand what you mean.
If you're looking for a DDR4 platform, I might be able to help with that, but unfortunately not so much with RAM for the system. I'm also in Germany. Wouldn't mind giving you access to my systems if you want. I have two dual-Xeon systems, one with P40s and the other with MI50s. Would be very happy to help either way.
Started doing some initial tests to get familiar with the changes. I'm using virtual Metal devices and things appear to be mostly working - e.g. I'm seeing graph execution on both devices. However, the following command does not produce identical results on each run:

GGML_METAL_DEVICES=2 ./bin/llama-completion -m ~/models/llama-3.1-8b/ggml-model-f16.gguf -no-cnv -p "I believe the meaning of life is" -n 32 --top-k 1 -sm tensor

I think this either means that there is a problem with the fallback for the missing backend API, or that I made an error in the implementation of the events and the async copy in Metal. Will investigate more.

In the meantime, @JohannesGaessler, can you confirm that the command above has deterministic results on your end? Also, is it deterministic with the fallback calls?
Generally speaking you cannot expect bit-for-bit identical results if you split the computation across multiple virtual devices. The order in which floats are summed up will be different, which will in turn change the rounding error. If I run … If you use …
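The effect of summation order is easy to demonstrate in isolation; a minimal example in plain Python (values chosen to make the rounding obvious):

```python
# Summing the same floats in two different orders gives different
# results because IEEE 754 addition is not associative.
xs = [1e100, 1.0, -1e100, 1.0]  # extreme magnitudes make the effect visible
ys = [1.0, 1.0, 1e100, -1e100]  # same multiset of values, different order

left_to_right = 0.0
for x in xs:
    left_to_right += x  # 1e100 absorbs the 1.0 added before the cancellation

reordered = 0.0
for y in ys:
    reordered += y      # the two 1.0s are added before the huge terms cancel

print(left_to_right, reordered)  # 1.0 0.0
```

The partial sums a tensor-parallel AllReduce combines behave the same way: a different reduction order shifts the rounding error, so 1 GPU vs. 2 GPUs cannot be compared bit-for-bit.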
Sorry, I think I misread your post. If you are saying that the results are not deterministic with 2 virtual GPUs but they are with 1 GPU, then I think that is indicative of a bug w.r.t. the synchronization.
Yes, I understand that 1 GPU vs 2 GPUs will not be bit-for-bit identical. What I mean is that in my test, running the command with 2 GPUs several times produces non-deterministic results from one run to the other:

GGML_METAL_DEVICES=2 ./bin/llama-completion -m ~/models/llama-3.1-8b/ggml-model-f16.gguf -no-cnv -p "I believe the meaning of life is" -n 32 --top-k 1 -sm tensor
# run 1
I believe the meaning of life is to find your gift. The purpose of life is to give it away.
I believe that the meaning of life is to find your gift. The purpose of life
# run 2
I believe the meaning of life is to find your gift. The purpose of life is to give it away. To give it away, you have to find it. [end of text]
# run 3
I believe the meaning of life is to be happy. I believe that happiness is the only thing that matters. I believe that happiness is the only thing that matters. I believe that happiness is the
Yes, seems like a synchronization issue. Was wondering if you observe it on your end with and without the fallback backend API. This will give me an indication of where to look for the issue.
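As an aside, this kind of run-to-run comparison can be automated with a small POSIX-shell helper that reruns a command and compares stdout (the function name and run count below are illustrative, not part of the repo):

```shell
# Run a command several times and report whether its stdout is identical
# across runs - a quick determinism check.
check_determinism() {
    n="$1"; shift
    ref="$("$@")"
    i=1
    while [ "$i" -lt "$n" ]; do
        out="$("$@")"
        if [ "$out" != "$ref" ]; then
            echo "non-deterministic: run $((i + 1)) differs from run 1"
            return 1
        fi
        i=$((i + 1))
    done
    echo "deterministic across $n runs"
}

# e.g. (hypothetical invocation, mirroring the command from this thread):
# GGML_METAL_DEVICES=2 check_determinism 5 ./bin/llama-completion -m model.gguf \
#     -no-cnv -p "I believe the meaning of life is" -n 32 --top-k 1 -sm tensor
```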
With the command you posted (minus the …)
…expose the same type. This allows using pinned memory instead of pageable memory for CUDA. Fix compilation errors.
* Fix crash with Qwen-30B-A3B Q4_0: it has an intermediate dimension of 768; using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
Force-pushed from 68d8043 to 82786ff
I rebased on top of the end-to-end tests added via …

Model list
ggml/src/ggml-backend-meta.cpp (Outdated)

    const size_t j_src = j_src_0 == 0 ? j_dst : (j_src_0 - (j_src_0 <= j_dst ? 1 : 0));
    auto & bcj_src = backend_ctx->backend_configs[j_src];

    ggml_backend_tensor_copy_async(bcj_dst.backend, bcj_src.backend, dst_views[j_dst*n_backends], src_views[j_dst*n_backends + j_src]);
I think this has a bug due to which there is a device mismatch between `bcj_src.backend` and `src_views[j_dst*n_backends + j_src]`. This is because the tensors are pushed into `src_views` in a different order at line 1014. Either we should fix the index at which we write the value, or read from the right index.

PR JohannesGaessler#11 fixes this by modifying the read index.
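For reference, the enumeration that the quoted `j_src` expression implements can be checked with a few lines of Python (a direct transcription of the expression, not the fix from JohannesGaessler#11):

```python
def src_index(j_dst: int, j_src_0: int) -> int:
    # Transcription of the C++ expression: j_src_0 == 0 selects the
    # destination backend itself, and the remaining values of j_src_0
    # enumerate every other backend exactly once, skipping j_dst.
    return j_dst if j_src_0 == 0 else j_src_0 - (1 if j_src_0 <= j_dst else 0)

n_backends = 4
for j_dst in range(n_backends):
    order = [src_index(j_dst, j) for j in range(n_backends)]
    assert order[0] == j_dst                          # self comes first
    assert sorted(order) == list(range(n_backends))   # a permutation
```

So the read side visits every backend exactly once per destination; the mismatch reported above comes from `src_views` having been filled in a different order than this enumeration assumes.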
Perf improvement: RTX 6000 Pro Blackwell, Qwen3-4b-Q4_K_M:
| | Before (t/s) | After (t/s) | Speed-up |
|---|---|---|---|
| pp512 | 12573.87 | 14227.6 | 1.13 |
| tg128 | 123.82 | 164 | 1.32 |
There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies.
…rf and stability (#12)
I agree with @gaugarg-nv's assessment that the generic fallback for AllReduce should be reverted to the previous version. I had implemented it in a way that was coupled with other changes, so I pushed it, but there are correctness issues and the performance in those cases where it works correctly is not good.
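For context, the generic AllReduce between two backends boils down to exchanging partial results and adding them locally; a toy model in plain Python (values are arbitrary):

```python
# Toy 2-backend generic AllReduce: each backend holds a partial result,
# the backends exchange partials (the role of shfl_tensor_async), and
# each then runs an add-only step (the role of the GGML_ADD aux graph).
part0 = [1.0, 2.0, 3.0]   # partial result on backend 0
part1 = [0.5, 0.5, 0.5]   # partial result on backend 1

recv0, recv1 = part1[:], part0[:]             # exchange step
out0 = [a + b for a, b in zip(part0, recv0)]  # add on backend 0
out1 = [a + b for a, b in zip(part1, recv1)]  # add on backend 1

assert out0 == out1 == [1.5, 2.5, 3.5]        # both hold the full sum
```

With more than two backends the exchange/add step has to be repeated or generalized, which is where the correctness and performance issues of the generic fallback come in.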
I pushed a MoE optimization that delays the AllReduce after the FFN until after the results per expert have been summed up on each GPU, effectively reducing I/O by …

Performance
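A numeric sketch of why delaying the reduction helps (hypothetical sizes: `n_expert_used` partial outputs of length `d` per GPU): summing the per-expert partials locally first means a single AllReduce over one tensor instead of one per expert, while producing the same result.

```python
# Toy model of the delayed MoE AllReduce: 2 "GPUs" each hold a partial
# FFN output per active expert (all sizes are illustrative).
import random
random.seed(0)
n_gpus, n_expert_used, d = 2, 4, 8
partial = [[[random.random() for _ in range(d)]
            for _ in range(n_expert_used)] for _ in range(n_gpus)]

# Naive: reduce each expert's output across GPUs, then sum the experts
# -> n_expert_used tensors of size d cross the GPU link.
naive = [sum(partial[g][e][i]
             for g in range(n_gpus) for e in range(n_expert_used))
         for i in range(d)]

# Optimized: each GPU sums its experts locally first ...
local = [[sum(partial[g][e][i] for e in range(n_expert_used))
          for i in range(d)] for g in range(n_gpus)]
# ... then a single AllReduce -> 1 tensor of size d crosses the link.
optimized = [sum(local[g][i] for g in range(n_gpus)) for i in range(d)]

assert all(abs(a - b) < 1e-12 for a, b in zip(naive, optimized))
```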
I can confirm that this improves perf for Qwen3-235B-A22B Q4_0 as well on 2x RTX 6000 Pro Blackwell.
@JohannesGaessler I added a change that improves scaling of FFN-down GEMVs: JohannesGaessler#13. Please see if there is anything you would like me to test. Also, this change is unrelated to TP itself, but improves kernel performance for small K values. Let me know if you would like me to file this against the master branch.
I would suggest you make a PR to master since these changes are independent and this PR is already getting quite large.










This PR adds support for backend-agnostic tensor parallelism, enabled by specifying `--split-mode tensor`. This is done by adding a new "meta" backend that internally wraps multiple "simple" backends but can be used in the same way as a regular ggml backend.

ggml Backend Interface Changes
This PR extends the ggml backend interface with some new functions for tensor copying:

- `set_tensor_2d_async`/`get_tensor_2d_async`, which are equivalent to `cudaMemcpy2DAsync` in CUDA. This is not needed for the computation of the meta backend itself but rather for setting/getting weights or the output. Currently not implemented; as a workaround the one-dimensional version is used in a loop.
- `shfl_tensor_async` to allow two ggml backends to exchange two tensors and to synchronize on the completion of the exchange. As a fallback `cpy_tensor_async` can be used, but this has a higher latency because the copy in one direction can only start once the one in the other direction has finished. Needed for a generic AllReduce between ggml backends. Implemented.
- `allreduce_tensor_async` to allow ggml backends to specify a backend-specific way to do an AllReduce operation. Intended to be used for NCCL support in cooperation with @gaugarg-nv. Not yet implemented.

@slaren please provide feedback regarding whether you agree that these operations should be in the ggml backend interface. For context, all of them are optional and can use existing operations as a fallback.
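The loop-based workaround for the missing 2D copies can be sketched in a few lines; this is a Python model over bytearrays (pitches and sizes are illustrative), not the ggml implementation:

```python
def memcpy_2d(dst, dst_pitch, src, src_pitch, width, height):
    # Copy a width x height byte region row by row, mirroring how a
    # missing 2D copy can be emulated with the 1D version in a loop:
    # one contiguous copy per row, each row offset by its pitch.
    for row in range(height):
        d0, s0 = row * dst_pitch, row * src_pitch
        dst[d0:d0 + width] = src[s0:s0 + width]

src = bytearray(range(16))          # rows of pitch 4 in the source
dst = bytearray(16)                 # rows of pitch 8 in the destination
memcpy_2d(dst, 8, src, 4, width=3, height=2)
assert dst[0:3] == bytearray([0, 1, 2])
assert dst[8:11] == bytearray([4, 5, 6])
```

The cost of the workaround is one copy call per row instead of a single call, which is why a native 2D async copy is still desirable.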
Meta Backend Implementation Details
The meta backend implements an entire ggml backend stack starting from a meta device. The meta device is created from multiple simple backend devices as well as a function that determines how the data should be split across devices ("split states"). Backend buffer types, buffers, and backends are created as usual. When calling `ggml_backend_graph_compute`, the code infers the split states of the nodes in the compute graph from the split states assigned to the weights/KV cache. The basic pattern is to make all tensors mirrored by default. For the weight matrices, do a split in dimension 1, then a split in dimension 0, then an AllReduce. For a transformer this means two AllReduce operations, one after the attention and one after the FFN. The attention is effectively split by dimension 0, which equates to a split by attention head.

A generic AllReduce operation is performed in the meta backend by splitting the graph into subgraphs. After a subgraph is executed, `shfl_tensor_async` is called to make the backends exchange partial results; they then execute auxiliary graphs that contain only a `GGML_ADD` operation to combine the results.

The memory allocation for the compute graph is rather tricky - the way I solved it is to allocate the memory for the meta backend as usual and to then transplant the calculated addresses, relative to the backend buffer base pointer, to the underlying simple backends. Because the simple tensors only require a fraction of the full memory this yields correct results, though it does result in overallocation for the compute graphs. For the weights/KV cache the memory allocation for the meta backend is done via a new function `ggml_backend_meta_alloc_ctx_tensors_from_buft` to prevent duplicated weights (which are much larger in size). I'm not yet sure what the best approach will be long-term; I think the graph allocation code in `ggml-alloc.c` will need to be adjusted.

Current Issues/Limitations
- `--tensor-split` has no effect.
- `llama_params_fit` is not implemented so the context size has to be set manually.
- … `llama_params_fit`.
- … `ggml_tensor::data` is set to dummy values since that is what is checked in `ggml-alloc.c` to determine whether or not a tensor is considered allocated. This dummy value should never actually be used for any computations, but I don't consider this a good solution.

Performance
LLaMA 3 on 2x RTX 4090
Generally speaking it can be observed that parallelizing larger models has better performance than parallelizing smaller models. Similarly, parallelizing the model becomes more worthwhile as the context depth increases. This makes sense as both of these result in a larger workload per GPU relative to the overhead from parallelization. Token generation benefits more from parallelization than prompt processing because the amount of data that needs to be transferred between GPUs is proportional to batch size - long-term it may make sense to implement support for FP16/BF16 compute types, which would halve the I/O vs. FP32. For pp512 pipeline parallelism is effectively disabled while for pp2048 it's enabled. With pipeline parallelism `-sm layer` is still faster than `-sm tensor` even at high context depths.
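To put rough numbers on the batch-size dependence (model dimensions are illustrative; 4096 is the hidden size of e.g. LLaMA 3 8B): the tensor exchanged per AllReduce is on the order of n_embd × n_batch values, so FP16/BF16 would halve the traffic relative to FP32.

```python
# Back-of-the-envelope AllReduce traffic per exchanged activation tensor.
n_embd = 4096  # hidden size (illustrative)
for n_batch in (1, 512, 2048):
    fp32_mib = n_embd * n_batch * 4 / 2**20  # 4 bytes per FP32 value
    fp16_mib = n_embd * n_batch * 2 / 2**20  # 2 bytes per FP16 value
    print(f"n_batch={n_batch}: FP32 {fp32_mib:.2f} MiB, FP16 {fp16_mib:.2f} MiB")
```

This is why token generation (small batches, frequent exchanges) is more sensitive to the interconnect than prompt processing, and why a half-precision compute type would help most at large batch sizes.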