server: expose prompt token counts in /slots endpoint #23454

Merged
ngxson merged 1 commit into ggml-org:master from ScrewTSW:feat/slots-prompt-tokens
May 21, 2026

@ScrewTSW ScrewTSW commented May 21, 2026

Overview

server_slot already tracks n_prompt_tokens_processed and n_prompt_tokens_cache, but to_json() doesn't include them. This adds those fields plus n_prompt_tokens (total) to the /slots response so clients can monitor prompt evaluation progress.
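As a rough illustration of how a client might turn these three fields into a progress figure, here is a minimal Python sketch. It assumes that `n_prompt_tokens_processed` counts only tokens evaluated for the current request and does not include the tokens reused from cache (`n_prompt_tokens_cache`); the PR text does not spell this out, so check it against the server before relying on it. The helper name `prompt_fraction` and the sample values are hypothetical.

```python
from typing import Any, Mapping, Optional


def prompt_fraction(slot: Mapping[str, Any]) -> Optional[float]:
    """Estimate how much of the prompt is done for one /slots entry.

    ASSUMPTION: processed tokens exclude cached tokens, so the two are summed.
    Returns None when the slot reports no prompt tokens at all.
    """
    total = slot.get("n_prompt_tokens", 0)
    if not total:
        return None
    done = slot.get("n_prompt_tokens_processed", 0) + slot.get("n_prompt_tokens_cache", 0)
    # Clamp in case the assumption above is wrong and the counts overlap.
    return min(done / total, 1.0)


# Hypothetical slot entry: 500 tokens reused from cache, 250 evaluated so far.
sample_slot = {
    "n_prompt_tokens": 1000,
    "n_prompt_tokens_processed": 250,
    "n_prompt_tokens_cache": 500,
}
fraction = prompt_fraction(sample_slot)
```

The clamp keeps the result within 0 to 1 even if the counters turn out to overlap.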

Additional information

Relates to #14685.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes - code analysis, patch suggestion.

Add n_prompt_tokens, n_prompt_tokens_processed, and n_prompt_tokens_cache
to the /slots JSON response. These fields are already tracked internally
but were not exposed, making it impossible for clients to monitor prompt
evaluation progress during processing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ScrewTSW ScrewTSW requested a review from a team as a code owner May 21, 2026 01:48

ScrewTSW commented May 21, 2026

It's been really frustrating that this feature is missing, as any extension using llamacpp-server as an OpenAI-compatible endpoint is flying blind. This allows extensions like Continue to poll the progress.

In addition, I'm heavily utilizing this endpoint in my implementation of a llama.cpp orchestrator.

ggml-gh-bot Bot commented May 21, 2026

Hi @ScrewTSW, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@ScrewTSW
Contributor Author

> Hi @ScrewTSW, thanks for your contribution!
>
> Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
>
>   • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.
>
> Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

It's a 3-line change that uses a pre-existing struct and just exposes it properly in the resulting JSON output.

@ggerganov
Member

I thought we already have a way to track the prompt processing progress? How does it work in the WebUI?

@allozaur
Contributor

> I thought we already have a way to track the prompt processing progress? How does it work in the WebUI?

Yes, llama-ui already tracks it via the chat completions endpoint, which streams prompt_progress updates in real time. This PR just exposes the same data on the /slots polling endpoint, which is useful for clients that don't use streaming and need to poll for progress instead.
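For a client that polls rather than streams, the loop could look roughly like the Python sketch below. It is an illustration, not code from this PR: it assumes `/slots` is enabled and returns a JSON array of slot objects, that each object has `id` and `is_processing` fields alongside the counters added here, and that the server is reachable at a base URL such as `http://localhost:8080` (a placeholder). The function names are hypothetical.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, Iterator, List, Tuple

Slot = Dict[str, Any]


def fetch_slots(base_url: str) -> List[Slot]:
    """GET <base_url>/slots and decode the JSON body (assumed to be a list)."""
    with urllib.request.urlopen(f"{base_url}/slots") as resp:
        return json.load(resp)


def poll_prompt_tokens(
    fetch: Callable[[], List[Slot]],
    interval_s: float = 0.5,
    max_polls: int = 20,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[Any, int, int]]:
    """Yield (slot id, tokens processed, total prompt tokens) for busy slots.

    Stops when no slot reports is_processing, or after max_polls requests.
    """
    for _ in range(max_polls):
        busy = [s for s in fetch() if s.get("is_processing")]
        if not busy:
            return
        for s in busy:
            yield (
                s.get("id"),
                s.get("n_prompt_tokens_processed", 0),
                s.get("n_prompt_tokens", 0),
            )
        sleep(interval_s)
```

A caller would pass something like `lambda: fetch_slots("http://localhost:8080")` as `fetch`. Keeping the fetch function injectable lets the loop be exercised without a running server.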

@ggerganov ggerganov added the merge ready label May 21, 2026
@ngxson ngxson merged commit b65bb4b into ggml-org:master May 21, 2026
40 checks passed
ProTekk pushed a commit to ProTekk/buun-llama-cpp that referenced this pull request May 21, 2026
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 21, 2026
baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026
srossitto79 pushed a commit to srossitto79/llama.cpp that referenced this pull request May 23, 2026
carlosfundora pushed a commit to carlosfundora/llama.cpp-1-bit-turbo that referenced this pull request May 24, 2026

(cherry picked from commit b65bb4b)
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026
Labels

examples, merge ready, server

4 participants