
UPSTREAM PR #17004: sampling : add support for GPU sampling (wip)#102

Open
DajanaV wants to merge 179 commits into main from
upstream-PR17004-branch_danbev-gpu-sampling

Conversation


@DajanaV DajanaV commented Nov 6, 2025

Mirrored from ggml-org/llama.cpp#17004

This is a work in progress to add support for GPU sampling.

The motivation for this feature is to enable sampling to be performed directly on the GPU as part of the computation graph being executed, allowing for some or all of the sampling to be done on the GPU.

For example, the GPU sampler chain might sample a token directly, in which case only the sampled token needs to be transferred from device memory to host memory.

It is also possible for the GPU samplers to filter the logits, or to compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers.

Currently, GPU sampling works in a manner similar to pooling: it is a function that is called by build_graph:

    // add GPU sampling layers (if any)
    llm->build_sampling(*this, params);

GPU samplers can be configured by creating sampler chains, where each sampler chain is associated with a specific sequence id:

    struct llama_sampler_chain_params params = llama_sampler_chain_default_params();
    struct llama_sampler * gpu_sampler_chain = llama_sampler_chain_init(params);
    llama_sampler_chain_add(gpu_sampler_chain, llama_sampler_gpu_init_greedy());
    std::vector<llama_sampler_seq_config> sampler_configs = {
        { 0, gpu_sampler_chain }
    };

The struct is defined as:

    struct llama_sampler_seq_config {
        llama_seq_id           seq_id;
        struct llama_sampler * sampler;
    };

These sampler configs are then passed as context params:

        llama_context_params cparams = llama_context_default_params();
        cparams.samplers = sampler_configs.data();
        cparams.n_samplers = sampler_configs.size();

When the graph is built, each configured sampler's _apply function is called, which allows it to add operations/nodes to the computation graph.

This enables sampling to happen fully or partially on the GPU. A sampler chain may sample a single token, in which case only that token is transferred from device memory to host memory after llama_decode has been called. The sampled token can then be retrieved using:

    llama_token id = llama_get_sampled_token_ith(test_ctx.ctx, index);

It is also possible to run a GPU sampler that only filters the logits; then only the filtered logits are transferred back to the host, and sampling can proceed on the CPU with the normal (CPU) sampler chain. In this case the CPU samplers are configured as usual, but they will now operate on already filtered logits.

Similar to the handling of logits above, it is possible for a GPU sampler to compute the full probability distribution and transfer it to the host; the CPU samplers can then operate on those probabilities.

Building and running the tests

Download a model for testing:

$ cd models && wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf

Building the test:

$ cmake --build build --target test-gpu-sampling -j8

Running all tests:

$ env LLAMACPP_TEST_MODELFILE=../models/stories15M-q4_0.gguf \
    ctest --test-dir build -R '^test-gpu-sampling$' -V

The following individual tests are available:

$ ctest --test-dir build -N -R test-gpu-sampling-
  Test 35: test-gpu-sampling-greedy
  Test 36: test-gpu-sampling-temp
  Test 37: test-gpu-sampling-softmax
  Test 38: test-gpu-sampling-top_k
  Test 39: test-gpu-sampling-top_p
  Test 40: test-gpu-sampling-mul_seq

Total Tests: 6

These can be run individually, for example:

$ env LLAMACPP_TEST_MODELFILE=../models/stories15M-q4_0.gguf \
    ctest --test-dir build -R 'test-gpu-sampling-temp' -V

TODO

  • Implement GPU dist sampler
  • Allow GPU samplers to pre-allocate state tensors
  • Integrate GPU samplers with llama-server
  • Implement true top-p sampler on GPU
  • Add missing GPU samplers (e.g. typical, mirostat, etc)

@DajanaV force-pushed the main branch 3 times, most recently from b16251e to 95f6e9b on November 6, 2025 13:17
@DajanaV force-pushed the main branch 22 times, most recently from aa2fc28 to 0ad40ce on November 9, 2025 17:06
@DajanaV force-pushed the upstream-PR17004-branch_danbev-gpu-sampling branch from 5d18032 to a4d9044 on November 9, 2025 17:33

loci-review bot commented Dec 18, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #102: GPU Sampling Support

Overview

This PR introduces GPU-accelerated sampling infrastructure across 44 files with 4,098 additions and 307 deletions. The implementation adds backend sampling capabilities, enabling token selection to execute on GPU. Analysis reveals measurable overhead in the sampler subsystem with no impact on core inference performance.

Key Findings

Performance-Critical Areas Impact

Sampler Subsystem (libllama.so):

  • llama_sampler_init_greedy: +116 ns per call (a 734% relative change)
  • llama_sampler_chain_add: +54 ns per call (a 306% relative change)
  • Sampler name functions: +14 ns each (a 218–221% relative change)

The absolute changes are minimal in nanoseconds. The primary contributor is the new llama_sampler_backend infrastructure which adds:

  • Backend capability checking during initialization (~200-400 ns per sampler)
  • Dynamic name generation with string concatenation
  • Wrapper struct for chain management (changed from vector<llama_sampler*> to vector<info> containing {bool is_backend, llama_sampler* ptr})

Core Inference Functions:
No changes detected in llama_decode, llama_encode, or llama_tokenize response time or throughput. The sampling infrastructure changes occur after token generation, not during the decode/encode phase.

Tokens Per Second Impact

No inference performance degradation. The core tokenization and inference pipeline remains unchanged. Functions responsible for token generation (llama_decode, llama_encode, llama_tokenize) show no measurable performance changes. The sampler overhead occurs post-inference during token selection, which represents a small fraction of total inference time.

Power Consumption Analysis

libllama.so: +2.83% power consumption increase (+5,269 nJ per execution cycle)

The increase stems from:

  • Sampler initialization overhead (backend support checking, input tensor allocation)
  • Chain management complexity (wrapper struct construction)
  • Output buffer management (conditional allocation based on batch content)
  • Graph construction additions (~820 ns per decode with 5 samplers)

The power consumption increase is isolated to the sampling subsystem. The feature is opt-in via --backend-sampling flag and disabled by default, ensuring no impact on existing workloads.

Implementation Characteristics

The PR implements a two-phase sampler initialization pattern and extends the llama_sampler_i interface from 6 to 10 methods. New backend methods (backend_init, backend_accept, backend_apply, backend_set_input) enable GPU execution. The implementation includes comprehensive fallback logic to CPU when backend support is unavailable or when using grammar-based sampling.

Memory allocation changes include expanded output buffers (5x increase when backend sampling is enabled) and per-sequence sampler configuration support. Graph construction now includes build_sampling() which adds sampling operations to the computation graph.


loci-review bot commented Dec 19, 2025

Explore the complete analysis inside the Version Insights

Based on the analysis of the llama.cpp project comparing version bc5195c585e2aeaf987c400a0b31bd6d4766d5b5 against base e02e9be6d0da81980bca74728bf5bd980f9dee08, the changes introduce backend sampling infrastructure with minimal performance impact on core inference paths.

Key Findings

Performance-Critical Areas Impact:

The token sampling subsystem shows structural changes with the addition of backend sampling support. The llama_sampler_sample function in llama-sampling.cpp gained 11 ns per call (from 1906 ns to 1917 ns), a change of about 1%. The common_sampler_sample function in common/sampling.cpp shows an 8 ns increase (from 1906 ns to 1914 ns). These changes are within measurement noise and do not materially affect inference performance.

Tokens Per Second Impact:

Core inference functions llama_decode and llama_encode show no measurable throughput changes. The tokenization path via llama_tokenize remains unaffected. Given the reference that 2 ms degradation in llama_decode correlates to 7% reduction in tokens per second on the test configuration, the observed sub-microsecond changes in sampling functions translate to negligible impact on overall throughput.

Power Consumption:

Binary-level analysis shows minimal power consumption changes. The llama-cli binary exhibits negligible variation in predicted execution time. The sampling infrastructure additions do not introduce measurable energy overhead in the default execution path.

Code Changes:

The modifications add backend sampling capabilities through new API structures (llama_sampler_data, llama_sampler_seq_config) and extend sampler interfaces with backend-specific methods. The implementation maintains backward compatibility while enabling GPU-accelerated sampling for supported backends. Argument parsing in common/arg.cpp adds --backend-sampling flag handling without affecting existing code paths.

ORippler and others added 26 commits December 19, 2025 11:42

By default, we perform a warm-up step where the ggml_cgraph is computed once. For backend sampling, this graph contains the sampler, and thus the RNG state of the backend's dist sampler is advanced once.

The solution is to reset the samplers after the warm-up has finished.

We sample in double precision and cast to float to match the random numbers of llama_sampler_dist, which uses double precision (sampling from std::uniform_real_distribution<double> and std::uniform_real_distribution<float> with the same rng will produce different sequences).

Gives the best performance for backend sampling on CUDA. The flag can be removed once CCCL 3.2 is bundled within the CTK and that CTK version is used in llama.cpp.

This commit updates the include directive in cumsum.cu to use
cub/cub.cuh instead of cub/block/block_scan.cuh.

The motivation of this change is that without it compilation fails
with the following error:
```console
/llama.cpp/ggml/src/ggml-cuda/cumsum.cu(196): error: name followed by "::" must be a class or namespace name
      cub::DeviceScan::InclusiveSum(nullptr,
           ^

/llama.cpp/ggml/src/ggml-cuda/cumsum.cu(207): error: name followed by "::" must be a class or namespace name
      cub::DeviceScan::InclusiveSum((void *) tmp_alloc.get(), tmp_size, src, dst, ne, stream);
           ^

2 errors detected in the compilation of "/llama.cpp/ggml/src/ggml-cuda/cumsum.cu".
gmake[2]: *** [ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/build.make:317: ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/cumsum.cu.o] Error 2
```
Commit 83b3b1c ("cuda: optimize
cumsum cub path (#18362)") updated the include directive, replacing
device_scan.cuh, which caused this issue.

This commit uses the cub/cub.cuh umbrella header, which is consistent with
other files in the ggml-cuda directory such as mean.cu and sum.cu.

DeviceTopK::MaxPairs is an iterative algorithm in which `d_keys_out` is
written after every iteration. As a consequence, it must not overlap
with `d_keys_in`, or otherwise undefined behavior occurs (keys are no
longer unique in `d_keys_in` and may map to different values between
iterations).

By using the
[`counting_iterator`](https://nvidia.github.io/cccl/thrust/api/classthrust_1_1counting__iterator.html#classthrust_1_1counting__iterator)
exposed by CCCL, we can avoid materializing the index to GPU memory,
saving VRAM and one kernel invocation.

Since we use cuda::discard_iterator to avoid writing out the keys, we
can pass in `src` directly instead of copying it to `temp_keys`.