
UPSTREAM PR #17004: sampling : add support for GPU sampling (wip)#102

Open
DajanaV wants to merge 179 commits into main from
upstream-PR17004-branch_danbev-gpu-sampling

Conversation


@DajanaV DajanaV commented Nov 6, 2025

Mirrored from ggml-org/llama.cpp#17004

This is a work in progress to add support for GPU sampling.

The motivation for this feature is to enable sampling to be performed directly on the GPU as part of the computation graph being executed, allowing for some or all of the sampling to be done on the GPU.

For example, the GPU sampler chain might sample a token directly, in which case only the sampled token needs to be transferred from device memory to host memory.

It is also possible for the GPU samplers to filter the logits, or to compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers.

Currently, GPU sampling works in a manner similar to pooling: it is a function that is called by build_graph:

    // add GPU sampling layers (if any)
    llm->build_sampling(*this, params);

GPU samplers can be configured by creating sampler chains, where each sampler chain is associated with a specific sequence id:

    struct llama_sampler_chain_params params = llama_sampler_chain_default_params();
    struct llama_sampler * gpu_sampler_chain = llama_sampler_chain_init(params);
    llama_sampler_chain_add(gpu_sampler_chain, llama_sampler_gpu_init_greedy());
    std::vector<llama_sampler_seq_config> sampler_configs = {
        { 0, gpu_sampler_chain }
    };

The struct is defined as:

    struct llama_sampler_seq_config {
        llama_seq_id           seq_id;
        struct llama_sampler * sampler;
    };

These sampler configs are then passed as context params:

        llama_context_params cparams = llama_context_default_params();
        cparams.samplers = sampler_configs.data();
        cparams.n_samplers = sampler_configs.size();

When the graph is built, each configured sampler's _apply function is called, which allows it to add operations/nodes to the computation graph.

This enables sampling to happen fully or partially on the GPU. A sampler chain may sample a single token, in which case only that token is transferred from device memory to host memory after llama_decode has been called. The sampled token can then be retrieved using:

    llama_token id = llama_get_sampled_token_ith(test_ctx.ctx, index);

It is also possible to run a GPU sampler that only filters the logits; then only the filtered logits are transferred back to the host, and sampling can proceed on the CPU with the normal (CPU) sampler chain. In this case the CPU samplers are configured as usual, but they will now operate on already filtered logits.

Similar to the handling of logits above, it is possible for a GPU sampler to compute the full probability distribution and transfer it to the host; the CPU samplers can then operate on those probabilities.

Building and running the tests

Download a model for testing:

$ cd models && wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf

Building the test:

$ cmake --build build --target test-gpu-sampling -j8

Running all tests:

$ env LLAMACPP_TEST_MODELFILE=../models/stories15M-q4_0.gguf \
    ctest --test-dir build -R '^test-gpu-sampling$' -V

The following individual tests are available:

$ ctest --test-dir build -N -R test-gpu-sampling-
  Test 35: test-gpu-sampling-greedy
  Test 36: test-gpu-sampling-temp
  Test 37: test-gpu-sampling-softmax
  Test 38: test-gpu-sampling-top_k
  Test 39: test-gpu-sampling-top_p
  Test 40: test-gpu-sampling-mul_seq

Total Tests: 6

These can be run individually, for example:

$ env LLAMACPP_TEST_MODELFILE=../models/stories15M-q4_0.gguf \
    ctest --test-dir build -R 'test-gpu-sampling-temp' -V

TODO

  • Implement GPU dist sampler
  • Allow GPU samplers to pre-allocate state tensors
  • Integrate GPU samplers with llama-server
  • Implement true top-p sampler on GPU
  • Add missing GPU samplers (e.g. typical, mirostat, etc)

@DajanaV force-pushed the main branch 3 times, most recently from b16251e to 95f6e9b on November 6, 2025 13:17
@DajanaV force-pushed the main branch 22 times, most recently from aa2fc28 to 0ad40ce on November 9, 2025 17:06
@DajanaV force-pushed the upstream-PR17004-branch_danbev-gpu-sampling branch from 5d18032 to a4d9044 on November 9, 2025 17:33

loci-review bot commented Dec 18, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #102: GPU Sampling Support

Overview

This PR introduces GPU-accelerated sampling infrastructure across 44 files with 4,098 additions and 307 deletions. The implementation adds backend sampling capabilities, enabling token selection to execute on GPU. Analysis reveals measurable overhead in the sampler subsystem with no impact on core inference performance.

Key Findings

Performance-Critical Areas Impact

Sampler Subsystem (libllama.so):

  • llama_sampler_init_greedy: +116 ns per call (a 734% relative change)
  • llama_sampler_chain_add: +54 ns per call (a 306% relative change)
  • Sampler name functions: +14 ns each (a 218–221% relative change)

The absolute changes are minimal in nanoseconds. The primary contributor is the new llama_sampler_backend infrastructure which adds:

  • Backend capability checking during initialization (~200-400 ns per sampler)
  • Dynamic name generation with string concatenation
  • Wrapper struct for chain management (changed from vector<llama_sampler*> to vector<info> containing {bool is_backend, llama_sampler* ptr})

Core Inference Functions:
No changes detected in llama_decode, llama_encode, or llama_tokenize response time or throughput. The sampling infrastructure changes occur after token generation, not during the decode/encode phase.

Tokens Per Second Impact

No inference performance degradation. The core tokenization and inference pipeline remains unchanged. Functions responsible for token generation (llama_decode, llama_encode, llama_tokenize) show no measurable performance changes. The sampler overhead occurs post-inference during token selection, which represents a small fraction of total inference time.

Power Consumption Analysis

libllama.so: +2.83% power consumption increase (+5,269 nJ per execution cycle)

The increase stems from:

  • Sampler initialization overhead (backend support checking, input tensor allocation)
  • Chain management complexity (wrapper struct construction)
  • Output buffer management (conditional allocation based on batch content)
  • Graph construction additions (~820 ns per decode with 5 samplers)

The power consumption increase is isolated to the sampling subsystem. The feature is opt-in via --backend-sampling flag and disabled by default, ensuring no impact on existing workloads.

Implementation Characteristics

The PR implements a two-phase sampler initialization pattern and extends the llama_sampler_i interface from 6 to 10 methods. New backend methods (backend_init, backend_accept, backend_apply, backend_set_input) enable GPU execution. The implementation includes comprehensive fallback logic to CPU when backend support is unavailable or when using grammar-based sampling.

Memory allocation changes include expanded output buffers (5x increase when backend sampling is enabled) and per-sequence sampler configuration support. Graph construction now includes build_sampling() which adds sampling operations to the computation graph.


loci-review bot commented Dec 19, 2025

Explore the complete analysis inside the Version Insights

Based on the analysis of the llama.cpp project comparing version bc5195c585e2aeaf987c400a0b31bd6d4766d5b5 against base e02e9be6d0da81980bca74728bf5bd980f9dee08, the changes introduce backend sampling infrastructure with minimal performance impact on core inference paths.

Key Findings

Performance-Critical Areas Impact:

The token sampling subsystem shows structural changes with the addition of backend sampling support. The llama_sampler_sample function in llama-sampling.cpp gained 11 ns per call (from 1906 ns to 1917 ns), a change of about 1%. The common_sampler_sample function in common/sampling.cpp shows an 8 ns increase (from 1906 ns to 1914 ns). These changes are within measurement noise and do not materially affect inference performance.

Tokens Per Second Impact:

Core inference functions llama_decode and llama_encode show no measurable throughput changes. The tokenization path via llama_tokenize remains unaffected. Given the reference that 2 ms degradation in llama_decode correlates to 7% reduction in tokens per second on the test configuration, the observed sub-microsecond changes in sampling functions translate to negligible impact on overall throughput.

Power Consumption:

Binary-level analysis shows minimal power consumption changes. The llama-cli binary exhibits negligible variation in predicted execution time. The sampling infrastructure additions do not introduce measurable energy overhead in the default execution path.

Code Changes:

The modifications add backend sampling capabilities through new API structures (llama_sampler_data, llama_sampler_seq_config) and extend sampler interfaces with backend-specific methods. The implementation maintains backward compatibility while enabling GPU-accelerated sampling for supported backends. Argument parsing in common/arg.cpp adds --backend-sampling flag handling without affecting existing code paths.

ORippler and others added 26 commits December 19, 2025 11:42

By default, we perform a warm-up step where the ggml_cgraph is computed once. For backend sampling, this graph contains the sampler, and thus the RNG state of the backend's dist sampler is advanced once.

The solution is to reset the samplers after the warm-up has finished.

We sample in double precision and cast to float to match the random numbers of llama_sampler_dist, which uses double precision (sampling from std::uniform_real_distribution<double> and std::uniform_real_distribution<float> with the same rng will produce different sequences).

Gives the best performance for backend sampling on CUDA. The flag can be removed once CCCL 3.2 is bundled within the CTK and that CTK version is used in llama.cpp.

This commit updates the include directive in cumsum.cu to use
cub/cub.cuh instead of cub/block/block_scan.cuh.

The motivation of this change is that without it compilation fails
with the following error:
```console
/llama.cpp/ggml/src/ggml-cuda/cumsum.cu(196): error: name followed by "::" must be a class or namespace name
      cub::DeviceScan::InclusiveSum(nullptr,
           ^

/llama.cpp/ggml/src/ggml-cuda/cumsum.cu(207): error: name followed by "::" must be a class or namespace name
      cub::DeviceScan::InclusiveSum((void *) tmp_alloc.get(), tmp_size, src, dst, ne, stream);
           ^

2 errors detected in the compilation of "/llama.cpp/ggml/src/ggml-cuda/cumsum.cu".
gmake[2]: *** [ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/build.make:317: ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/cumsum.cu.o] Error 2
```
Commit 83b3b1c ("cuda: optimize
cumsum cub path (#18362)") updated the include directive, replacing
device_scan.cuh, which caused this issue.

This commit uses the cub/cub.cuh umbrella header, which is consistent with
other files in the ggml-cuda directory such as mean.cu and sum.cu.

DeviceTopK::MaxPairs is an iterative algorithm in which `d_keys_out` is
written after every iteration. As a consequence, it must not overlap
with `d_keys_in`, or otherwise undefined behavior occurs (keys are no
longer unique in `d_keys_in` and may map to different values between
iterations).

By using the
[`counting_iterator`](https://nvidia.github.io/cccl/thrust/api/classthrust_1_1counting__iterator.html#classthrust_1_1counting__iterator)
exposed by CCCL, we can avoid materializing the index to GPU memory,
saving VRAM and one kernel invocation.

Since we use cuda::discard_iterator to avoid writing out the keys, we
can pass in `src` directly instead of copying it to `temp_keys`.