UPSTREAM PR #17004: sampling : add support for GPU sampling (wip)#102
Conversation
Explore the complete analysis inside the Version Insights

**Performance Analysis Summary - PR #102: GPU Sampling Support**

**Overview**

This PR introduces GPU-accelerated sampling infrastructure across 44 files with 4,098 additions and 307 deletions. The implementation adds backend sampling capabilities, enabling token selection to execute on GPU. Analysis reveals measurable overhead in the sampler subsystem with no impact on core inference performance.

**Key Findings**

Performance-Critical Areas Impact - Sampler Subsystem (libllama.so): The absolute changes are minimal in nanoseconds. The primary contributor is the new

Core Inference Functions - Tokens Per Second Impact: No inference performance degradation. The core tokenization and inference pipeline remains unchanged. Functions responsible for token generation (

**Power Consumption Analysis**

libllama.so: +2.83% power consumption increase (+5,269 nJ per execution cycle). The increase stems from:

The power consumption increase is isolated to the sampling subsystem. The feature is opt-in via

**Implementation Characteristics**

The PR implements a two-phase sampler initialization pattern and extends the

Memory allocation changes include expanded output buffers (5x increase when backend sampling is enabled) and per-sequence sampler configuration support. Graph construction now includes
Explore the complete analysis inside the Version Insights

Based on the analysis of the llama.cpp project comparing version

**Key Findings**

Performance-Critical Areas Impact: The token sampling subsystem shows structural changes with the addition of backend sampling support. The

Tokens Per Second Impact: Core inference functions

Power Consumption: Binary-level analysis shows minimal power consumption changes. The

Code Changes: The modifications add backend sampling capabilities through new API structures (
By default, we perform a warm-up step in which the ggml_cgraph is computed once. For backend sampling, this graph contains the sampler, and thus the RNG state of the backend's dist sampler is advanced once. The solution is to reset the samplers after the warm-up has finished.
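The effect described above can be sketched in plain C++, with `std::mt19937` standing in for the backend dist sampler's RNG state (the `dist_sampler` struct here is a hypothetical stand-in, not the PR's actual type): one warm-up draw advances the state, so the first real sample differs from a fresh run unless the state is reset afterwards.

```cpp
#include <random>

// Hypothetical stand-in for the backend dist sampler's RNG state.
struct dist_sampler {
    std::mt19937 rng;
    explicit dist_sampler(unsigned seed) : rng(seed) {}
    unsigned sample() { return rng(); }           // one graph compute = one draw
    void reset(unsigned seed) { rng.seed(seed); } // what the post-warm-up fix does
};
```

Without the reset, a warmed-up sampler's first "real" draw is the fresh sampler's *second* draw; re-seeding after warm-up restores the expected sequence.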
We sample in double precision and cast to float to match the random numbers of llama_sampler_dist, which uses double precision (sampling from std::uniform_real_distribution&lt;double&gt; and std::uniform_real_distribution&lt;float&gt; with the same RNG will produce different sequences).
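A small sketch of the divergence mentioned above: with a 32-bit engine such as `std::mt19937`, the `double` and `float` distributions typically consume a different number of engine words per draw (53 vs 24 bits of randomness in common standard-library implementations), so identically seeded engines produce different value sequences. The helper name `draw` is illustrative, not from the PR.

```cpp
#include <random>
#include <vector>

// Draw n uniform [0,1) floats, either natively in float or via a
// double draw cast to float, from an identically seeded engine.
std::vector<float> draw(int n, bool via_double) {
    std::mt19937 rng(1234);  // same seed for both variants
    std::vector<float> out;
    if (via_double) {
        std::uniform_real_distribution<double> d(0.0, 1.0);
        for (int i = 0; i < n; ++i) out.push_back((float)d(rng)); // double, then cast
    } else {
        std::uniform_real_distribution<float> d(0.0f, 1.0f);
        for (int i = 0; i < n; ++i) out.push_back(d(rng));        // native float
    }
    return out;
}
```

Sampling in double and casting, as the PR does, keeps the backend sampler bit-compatible with the CPU `llama_sampler_dist` path.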
This gives the best performance for backend sampling on CUDA. The flag can be removed once CCCL 3.2 is bundled within the CTK and that CTK version is used in llama.cpp.
This commit updates the include directive in cumsum.cu to use
cub/cub.cuh instead of cub/block/block_scan.cuh.
The motivation for this change is that without it, compilation fails
with the following error:
```console
/llama.cpp/ggml/src/ggml-cuda/cumsum.cu(196): error: name followed by "::" must be a class or namespace name
cub::DeviceScan::InclusiveSum(nullptr,
^
/llama.cpp/ggml/src/ggml-cuda/cumsum.cu(207): error: name followed by "::" must be a class or namespace name
cub::DeviceScan::InclusiveSum((void *) tmp_alloc.get(), tmp_size, src, dst, ne, stream);
^
2 errors detected in the compilation of "/llama.cpp/ggml/src/ggml-cuda/cumsum.cu".
gmake[2]: *** [ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/build.make:317: ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/cumsum.cu.o] Error 2
```
Commit 83b3b1c ("cuda: optimize
cumsum cub path (#18362)") updated the include directive, replacing
device_scan.cuh, which is what causes this issue.
This commit uses cub/cub.cuh umbrella header which is consistent with
other files in the ggml-cuda directory like mean.cu, sum.cu, etc.
DeviceTopK::MaxPairs is an iterative algorithm in which `d_keys_out` is written after every iteration. As a consequence, it must not overlap with `d_keys_in`; otherwise undefined behavior occurs (keys are no longer unique in `d_keys_in` and may map to different values between iterations).
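The hazard can be illustrated in plain C++ with a hypothetical two-pass "max pair" routine (not the cub implementation): pass 1 finds the max key and writes it to the output, pass 2 re-scans the keys to map it back to its value. If the output aliases the input, the second pass reads a clobbered key and returns the wrong value; in the real DeviceTopK::MaxPairs this overlap is outright undefined behavior.

```cpp
// Pass 1 writes an intermediate result to *key_out; pass 2 re-reads keys.
// If key_out aliases keys, pass 2 sees the clobbered element.
void max_pair(const int* keys, const int* vals, int n,
              int* key_out, int* val_out) {
    int best = keys[0];
    for (int i = 1; i < n; ++i)
        if (keys[i] > best) best = keys[i];
    *key_out = best;                 // intermediate write to the output buffer
    for (int i = 0; i < n; ++i)      // second pass re-reads the input keys
        if (keys[i] == *key_out) { *val_out = vals[i]; return; }
}
```

With keys {1, 9, 3} and vals {10, 90, 30}, a separate output yields the pair (9, 90); passing `&keys[0]` as `key_out` overwrites keys[0] with 9 before the second pass, which then wrongly returns 10.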
By using the fancy [`counting_iterator`](https://nvidia.github.io/cccl/thrust/api/classthrust_1_1counting__iterator.html#classthrust_1_1counting__iterator) exposed by CCCL, we can avoid materializing the index array in GPU memory, saving VRAM and one kernel invocation.
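The idea behind `thrust::counting_iterator` can be sketched in a few lines of plain C++ (this `counting_iter` is a minimal stand-in, not the CCCL type): dereferencing yields the index itself, so the index sequence never has to exist in memory.

```cpp
#include <vector>

// Minimal counting iterator: *it is the current index; nothing is stored
// beyond a single integer, so no index buffer is ever materialized.
struct counting_iter {
    int v;
    int operator*() const { return v; }
    counting_iter& operator++() { ++v; return *this; }
    bool operator!=(const counting_iter& o) const { return v != o.v; }
};

// Example consumer: argmax over values, pairing each value with the
// "key" produced by the iterator instead of an index array.
int argmax(const std::vector<float>& vals) {
    int best_key = 0;
    counting_iter it{0}, end{(int)vals.size()};
    for (; it != end; ++it)
        if (vals[*it] > vals[best_key]) best_key = *it;
    return best_key;
}
```

On the GPU the saving is the same in spirit: the key sequence is generated on the fly inside the consuming kernel instead of being written out by a separate fill kernel first.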
Since we use cuda::discard_iterator to avoid writing out the keys, we can directly pass in src instead of copying it to `temp_keys`
Mirrored from ggml-org/llama.cpp#17004
This is a work in progress to add support for GPU sampling.
The motivation for this feature is to enable some or all of the sampling to be performed directly on the GPU, as part of the computation graph being executed.
For example, the GPU sampler chain might select/sample a token directly in which case only the sampled token needs to be transferred from device memory to host memory.
It is also possible for the GPU samplers to perform filtering of the logits, or to compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers.
Currently, GPU sampling works in a similar manner to pooling: it is a function that is called by build_graph:
GPU samplers can be configured by creating sampler chains, where each sampler chain is associated with a specific sequence id:
The struct is defined as:
These sampler configs are then passed as context params:
```c++
llama_context_params cparams = llama_context_default_params();
cparams.samplers   = sampler_configs.data();
cparams.n_samplers = sampler_configs.size();
```

When the graph is built, the configured sampler's _apply function is called, which allows them to add operations/nodes to the computation graph.
This enables the sampling to happen fully or partially on the GPU. The samplers could sample a single token, in which case that is all that will be transferred from device memory to host memory after llama_decode has been called. The sampled token can then be retrieved using:
It is also possible to run a GPU sampler that only filters the logits; then only the filtered logits are transferred back to the host, and sampling can proceed on the CPU with the normal (CPU) sampler chain. In this case the CPU samplers are configured as usual, but they will now operate on already filtered logits.
Similar to the above handling of logits, it is possible for a GPU sampler to compute the full probability distribution and transfer that to the host. The CPU samplers can then operate on those probabilities.
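The "filter on device, finish on host" split described above can be sketched in plain C++ (no GPU involved; the function names and the choice of top-k plus greedy pick are illustrative, not the PR's API): a top-k filter stands in for the GPU sampler stage, and a softmax followed by picking the mode stands in for the remaining CPU sampler chain.

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// "GPU" stage: keep only the k largest logits, paired with their token ids.
std::vector<std::pair<int, float>> top_k_filter(const std::vector<float>& logits, int k) {
    std::vector<std::pair<int, float>> out;
    for (int i = 0; i < (int)logits.size(); ++i) out.push_back({i, logits[i]});
    std::partial_sort(out.begin(), out.begin() + k, out.end(),
                      [](const auto& a, const auto& b) { return a.second > b.second; });
    out.resize(k);   // only these k pairs would cross the device->host boundary
    return out;
}

// "CPU" stage: softmax over the filtered logits, then pick the mode.
int sample_greedy(const std::vector<std::pair<int, float>>& filtered) {
    float max_l = filtered[0].second;         // subtract max for stability
    double denom = 0.0;
    for (const auto& p : filtered) denom += std::exp(p.second - max_l);
    int best = filtered[0].first;
    double best_p = 0.0;
    for (const auto& p : filtered) {
        double prob = std::exp(p.second - max_l) / denom;
        if (prob > best_p) { best_p = prob; best = p.first; }
    }
    return best;
}
```

The point of the split is the transfer size: only the k filtered (id, logit) pairs, rather than the full vocabulary-sized logit buffer, need to move from device to host before the CPU chain runs.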
Building and running the tests
Download a model for testing:
```console
$ cd models && wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf
```

Building the test:

```console
$ cmake --build build --target test-gpu-sampling -j8
```

Running all tests:
The following individual tests are available:
These can be run individually, for example:
TODO