server: expose prompt token counts in /slots endpoint by ScrewTSW · Pull Request #23454 · ggml-org/llama.cpp

ScrewTSW · 2026-05-21T01:48:00Z

Overview

server_slot already tracks n_prompt_tokens_processed and n_prompt_tokens_cache, but to_json() doesn't include them. This adds those fields plus n_prompt_tokens (total) to the /slots response so clients can monitor prompt evaluation progress.
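As a rough illustration of how a client might turn these three counters into a progress figure, here is a minimal sketch. It assumes the fields appear at the top level of each slot object, and that `n_prompt_tokens_processed` counts newly evaluated tokens while `n_prompt_tokens_cache` counts tokens reused from the cache; the PR does not spell out either point, so check them against a real response. The helper name `prompt_progress` and the sample values are hypothetical.

```python
from typing import Optional


def prompt_progress(slot: dict) -> Optional[float]:
    """Return the fraction of the prompt accounted for so far, or None if unknown."""
    total = slot.get("n_prompt_tokens", 0)
    if total <= 0:
        # No prompt assigned to this slot (or the server predates these fields).
        return None
    # Assumption: "processed" excludes cached tokens, so the two are summed.
    done = slot.get("n_prompt_tokens_processed", 0) + slot.get("n_prompt_tokens_cache", 0)
    # Clamp in case the assumption above is wrong and the counters overlap.
    return min(done / total, 1.0)


# Hypothetical slot object; real /slots entries carry many more fields.
sample = {
    "id": 0,
    "n_prompt_tokens": 2000,
    "n_prompt_tokens_processed": 512,
    "n_prompt_tokens_cache": 488,
}
print(prompt_progress(sample))  # 0.5
```

Returning `None` rather than `0.0` lets a caller distinguish "no prompt in flight" from "prompt just started".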

Additional information

Relates to #14685.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes - code analysis, patch suggestion.

Add n_prompt_tokens, n_prompt_tokens_processed, and n_prompt_tokens_cache to the /slots JSON response. These fields are already tracked internally but were not exposed, making it impossible for clients to monitor prompt evaluation progress during processing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ScrewTSW · 2026-05-21T01:51:48Z

It's been really frustrating that this feature is missing, as any extension using llamacpp-server as an OpenAI-compatible endpoint is flying blind. This allows extensions like Continue to poll the progress.

In addition, I'm heavily utilizing this endpoint in my implementation of a llama.cpp orchestrator.

ggml-gh-bot · 2026-05-21T01:52:21Z

Hi @ScrewTSW, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

ScrewTSW · 2026-05-21T01:53:32Z

> Hi @ScrewTSW, thanks for your contribution!
>
> Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
>
> AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.
>
> Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

It's a 3-line change that uses a pre-existing struct and just exposes it properly in the resulting JSON output.

ggerganov · 2026-05-21T08:15:31Z

I thought we already have a way to track the prompt processing progress? How does it work in the WebUI?

allozaur · 2026-05-21T10:01:28Z

> I thought we already have a way to track the prompt processing progress? How does it work in the WebUI?

Yes, llama-ui already tracks it via the chat completions endpoint, which streams prompt_progress updates in real-time. This PR just exposes the same data on the /slots polling endpoint, useful for clients that don't use streaming and need to poll for progress instead.
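For a non-streaming client, the polling approach described here could look roughly like the sketch below. It is only an illustration: `fetch_slots` and `poll_prompt_tokens` are hypothetical helper names, the base URL is whatever address the server listens on, and the code assumes each slot object exposes `id` and `is_processing` alongside the three new counters at the top level.

```python
import json
import time
import urllib.request
from typing import Callable, Iterator, Tuple


def fetch_slots(base_url: str) -> list:
    """GET <base_url>/slots and decode the JSON array of slot objects."""
    with urllib.request.urlopen(f"{base_url}/slots", timeout=5) as resp:
        return json.load(resp)


def poll_prompt_tokens(
    fetch: Callable[[], list],
    slot_id: int,
    interval_s: float = 0.5,
    max_polls: int = 100,
) -> Iterator[Tuple[int, int, int]]:
    """Yield (processed, cached, total) for one slot until it stops processing."""
    for _ in range(max_polls):
        # Find the slot of interest in the latest snapshot.
        slot = next((s for s in fetch() if s.get("id") == slot_id), None)
        if slot is None or not slot.get("is_processing", False):
            return  # slot is idle or missing: nothing left to report
        yield (
            slot.get("n_prompt_tokens_processed", 0),
            slot.get("n_prompt_tokens_cache", 0),
            slot.get("n_prompt_tokens", 0),
        )
        time.sleep(interval_s)
```

A caller would pass something like `lambda: fetch_slots("http://localhost:8080")` as `fetch` (the address is an assumption) and iterate over the generator. Taking `fetch` as a parameter keeps the loop testable without a running server.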

ScrewTSW requested a review from a team as a code owner May 21, 2026 01:48

github-actions Bot added examples server labels May 21, 2026

allozaur approved these changes May 21, 2026

ggerganov approved these changes May 21, 2026

ggerganov added the merge ready label May 21, 2026

ngxson approved these changes May 21, 2026

ngxson merged commit b65bb4b into ggml-org:master May 21, 2026
40 checks passed

ngxson merged 1 commit into ggml-org:master from ScrewTSW:feat/slots-prompt-tokens
