[DRAFT] RFC: Native /embeddings in llama-server + CI E2E baseline for embedding CLI #16957
SamMalayek started this conversation in Ideas
IN PROGRESS. I'll ping a few folks when ready for review.
IN PROGRESS
Currently exploring:
0) TL;DR
Ship a minimal, opt-in native embeddings endpoint in `llama-server` that calls the same function-level embedding path as the CLI. Land a fast, deterministic CI E2E suite that proves shape/dtype/determinism and basic parallel safety. No default behavior changes in phase 1.

Phases & Milestones (incremental, mergeable)
Phase 0 — Groundwork (no code paths changed)
- Decide file layout (`shared/embed.*` vs `server/lib/*`), route name (`/v2/embeddings` vs `/embeddings-native`), and flag names (`--embd-*`).
- Pin determinism knobs: `seed`, `threads=1`, `ctx=1024`, JSON float32 serialization rules.

Phase 1 — CI E2E for CLI (no behavior change) → #16940
Phase 2 — Shared Embedding Helper (function-level, zero behavior change)
- Extract `run_embedding(...)` in a minimal TU; no broad header churn.

Phase 3 — Native Server Endpoint (flagged off by default)
- `/v2/embeddings` (or `/embeddings-native`) guarded by `--embd-enabled`.
- Flags: `--embd-model | --embd-hfr | --embd-hff | --embd-threads | --embd-ctx-size | --embd-seed`.
- Error codes `MODEL_NOT_FOUND | BAD_ARGUMENT | EMBEDDING_FAILED | TIMEOUT | NOT_ENABLED`, mapped with existing server HTTP conventions.
- `/v1/embeddings` untouched.

Phase 3a — Concurrency Smoke & Perf Guardrails
Phase 4 — Cross-Platform & Packaging Hardening (still opt-in)
Phase 5
TODO
Phase 6 — Optional Productization Follow-ups (post-baseline)
1) Problem Statement
`llama-server` currently exposes an OpenAI-compatible `/embeddings` endpoint but does not execute first-party llama.cpp embedding kernels.

Goal: Provide a native, opt-in server route that uses the same function-level embedding implementation as the CLI, validated by a fast, deterministic CI suite.
2) Non-Goals (Phase 1)
3) Motivation
- Exercise basic parallel safety (`--parallel N`) through smoke checks.

4) Proposal (Incremental & Mergeable)
4.1 CLI E2E Baseline (no behavior change)
Add a CI job using tiny GGUF models that asserts output shape, dtype, and determinism (see §7 Checks). A sketch follows.
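A minimal sketch of what such a job could look like, assuming a pytest-style harness; the binary path, model file, and output-format flag below are assumptions, not settled interfaces:

```python
# Hedged sketch: shells out to the llama-embedding CLI and asserts
# shape/dtype/determinism. Paths and flags are placeholders.
import json
import subprocess

EMBD_BIN = "./build/bin/llama-embedding"  # assumed build location
MODEL = "models/tiny-embedding.gguf"      # assumed pre-cached tiny GGUF

def run_cli(prompt: str, seed: int = 42, threads: int = 1) -> list[float]:
    # "--embd-output-format json" is assumed to emit a JSON array of floats.
    out = subprocess.run(
        [EMBD_BIN, "-m", MODEL, "-p", prompt,
         "--seed", str(seed), "-t", str(threads),
         "--embd-output-format", "json"],
        capture_output=True, text=True, check=True, timeout=120,
    ).stdout
    return json.loads(out)

def test_shape_dtype_determinism():
    a = run_cli("hello world")
    b = run_cli("hello world")
    assert len(a) > 0                            # non-empty vector
    assert all(isinstance(x, float) for x in a)  # JSON floats (float32 policy)
    assert a == b                                # bit-exact: fixed seed, threads=1
```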
4.2 Shared Embedding Path
Extract a minimal function-level helper callable from both:
- `examples/llama-embedding`
- `llama-server` (new route)

Constraints:
4.3 Native Server Endpoint (opt-in)
- `POST /v2/embeddings` (or `/embeddings-native`) behind a feature flag.
- No changes to `/v1/embeddings`.

4.4 Concurrency & Perf Guardrails
- Parallel smoke runs with `N ∈ {2, 4}` that assert: per-request latency inflation ≤ 2.5× single-thread median (a sketch follows).
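A hedged sketch of this smoke check, folding in the §6 cosine-similarity policy; the route, port, and response field names are assumptions:

```python
# Hedged sketch: fires N parallel requests at the (assumed) native route and
# asserts the §4.4 latency cap plus the §6 cosine-similarity floor.
import math
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # third-party HTTP client, common in CI harnesses

URL = "http://127.0.0.1:8080/v2/embeddings"  # assumed route and port

def embed(text: str) -> tuple[list[float], float]:
    t0 = time.perf_counter()
    r = requests.post(URL, json={"input": text}, timeout=60)
    r.raise_for_status()
    return r.json()["embedding"], time.perf_counter() - t0  # field name assumed

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Single-thread baseline: median latency over a few warm runs.
baseline_vec, _ = embed("same prompt")
base_lat = statistics.median(embed("same prompt")[1] for _ in range(5))

for n in (2, 4):
    with ThreadPoolExecutor(max_workers=n) as ex:
        results = list(ex.map(embed, ["same prompt"] * n))
    for vec, lat in results:
        assert cosine(vec, baseline_vec) >= 0.999  # §6: threads>1 tolerance
        assert lat <= 2.5 * base_lat               # §4.4: latency inflation cap
```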
5) API Sketch (minimal, explicit)
Route (opt-in for phase 1):
`POST /v2/embeddings`

Request
Response
Error Shape (consistent with server)
Flag/Env parity
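A hedged sketch of plausible request/response/error shapes, written as Python literals; every field name here is an assumption derived from the flags and error codes above, not settled API:

```python
# Illustrative payload shapes only; nothing below is settled API.
request = {
    "input": "some text",  # single string; batching is an open question
    "seed": 42,            # mirrors --embd-seed
    "threads": 1,          # mirrors --embd-threads
    "ctx_size": 1024,      # mirrors --embd-ctx-size
}

response = {
    "embedding": [0.1, -0.2, 0.3],  # float32 values serialized as JSON numbers
    "dim": 3,
    "seed": 42,                     # echoed for reproducibility
}

error = {
    "error": {
        "code": "MODEL_NOT_FOUND",  # one of the Phase 3 codes
        "message": "no embedding model configured; pass --embd-model",
    }
}
```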
6) Determinism Policy (Phase 1)
- Bit-exact outputs required when `seed != -1`, `threads = 1`.
- `threads > 1`: accept tiny numeric deltas; assert cosine similarity ≥ 0.999 vs the single-thread baseline (see the §4.4 sketch).

7) CI / E2E Plan
Models:
Prefer small:
`EmbeddingGemma-small`, `TinyLlama` — pre-cached artifacts.

Checks:
Budget:
Wall time < 2 min per job; strict timeouts.
Retries only on network/model fetch, never on core assertions.
OS / Toolchains (smoke):
8) Performance & Resource Budgets (Phase 1)
(These are guardrails; calibrate to repo norms during review.)
9) Rollout & Compatibility
- `/v1/embeddings` (OpenAI-compat) stays as-is.
- `/v1/embeddings` vs `/v2/embeddings`

10) Minimal Implementation Sketch
11) Observability & Telemetry (Phase 1)
- `embd_requests_total`, `embd_latency_ms{route="/v2/embeddings"}`
- `embd_error_total{code}`
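For illustration only, the proposed metric names and labels expressed with Python's `prometheus_client`; the real counters would live in llama-server's existing metrics path, and the instrumentation points are assumptions:

```python
# Sketch of the metric set named above; placement of inc()/observe() is assumed.
import time
from prometheus_client import Counter, Histogram

EMBD_REQUESTS = Counter("embd_requests_total", "Embedding requests served")
EMBD_LATENCY = Histogram("embd_latency_ms", "Embedding latency in ms", ["route"])
EMBD_ERRORS = Counter("embd_error_total", "Embedding errors by code", ["code"])

def handle_request():
    EMBD_REQUESTS.inc()
    t0 = time.perf_counter()
    try:
        pass  # ... run the embedding ...
    except Exception:
        EMBD_ERRORS.labels(code="EMBEDDING_FAILED").inc()
        raise
    finally:
        EMBD_LATENCY.labels(route="/v2/embeddings").observe(
            (time.perf_counter() - t0) * 1e3)
```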
12) Error Semantics (Concrete)

Return HTTP statuses consistent with existing server conventions (e.g., 400/404/500).
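One plausible mapping from the Phase 3 error codes to HTTP statuses, sketched under the assumption that the final table is still an open design point:

```python
# Hedged sketch: error-code -> HTTP status mapping; not settled in this RFC.
HTTP_STATUS = {
    "MODEL_NOT_FOUND": 404,
    "BAD_ARGUMENT": 400,
    "EMBEDDING_FAILED": 500,
    "TIMEOUT": 504,       # assumption; could also be 408
    "NOT_ENABLED": 501,   # assumption; could also be 403 or 404
}

def status_for(code: str) -> int:
    return HTTP_STATUS.get(code, 500)  # unknown codes default to 500
```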
13) Test Matrix (Phase 1)
Functional
Platforms
Build flavors
14) PR Slicing Plan (Reviewer-Friendly)
1. Extract the shared helper into `shared/embed.{h,cc}` (or `server/lib/*` per maintainer preference).
2. Add `--embd-enabled` and the `--embd-*` flags.
3. Add `/v2/embeddings` (or `/embeddings-native`).

(Each PR is green, reviewable, and revertable.)
15) Risks & Mitigations
16) Ownership & Maintenance
17) Open Questions
- `/v2/embeddings` vs `/embeddings-native`?
- `common/embedding_utils.*` vs `server/lib/*`?
18.1 Config Precedence (server)
request body > env > flags > hardcoded defaults (a sketch appears at the end of this appendix)

18.2 Model Resolution Rules
- `--embd-model` given: use it unless the request overrides.
- `--embd-hfr`/`--embd-hff` given: resolve via HF; cache path recorded in logs.
- Otherwise: `MODEL_NOT_FOUND` with a clear message.

18.3 Reproducibility Notes
- `seed`, `threads=1`, `ctx=1024`, `LLAMA_CACHE` path.
- `float32` enforced in the helper.
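A combined sketch of the 18.1 precedence chain and the 18.2 resolution order; function names, the env-var scheme, and the `hf://` spelling are illustrative assumptions:

```python
# Hedged sketch of appendix rules 18.1 and 18.2; nothing here is settled.
import os

DEFAULTS = {"threads": 1, "ctx_size": 1024, "seed": 42}

def resolve(key: str, body: dict, flags: dict):
    # 18.1: request body > env > flags > hardcoded defaults
    if key in body:
        return body[key]
    env = os.environ.get(f"LLAMA_EMBD_{key.upper()}")  # env-var scheme assumed
    if env is not None:
        return env
    if key in flags:
        return flags[key]
    return DEFAULTS[key]

def resolve_model(body: dict, flags: dict) -> str:
    # 18.2: request override, then --embd-model, then --embd-hfr/--embd-hff
    if body.get("model"):
        return body["model"]
    if flags.get("embd_model"):
        return flags["embd_model"]
    if flags.get("embd_hfr") and flags.get("embd_hff"):
        # resolve via HF; a real implementation would record the cache path
        return f"hf://{flags['embd_hfr']}/{flags['embd_hff']}"
    raise LookupError("MODEL_NOT_FOUND: no embedding model configured")
```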