[DRAFT] RFC: Native /embeddings in llama-server + CI E2E baseline for embedding CLI #16957
SamMalayek started this conversation in Ideas
IN PROGRESS. I'll ping a few folks when ready for review.
IN PROGRESS
Currently exploring:
0) TL;DR
Ship a minimal, opt-in native embeddings endpoint in `llama-server` that calls the same function-level embedding path as the CLI. Land a fast, deterministic CI E2E suite that proves shape/dtype/determinism and basic parallel safety. No default behavior changes in phase 1.

Phases & Milestones (incremental, mergeable)
Phase 0 — Groundwork (no code paths changed)
- Decide file layout (`shared/embed.*` vs `server/lib/*`), route name (`/v2/embeddings` vs `/embeddings-native`), and flag names (`--embd-*`).
- Pin determinism knobs: `seed`, `threads=1`, `ctx=1024`, JSON float32 serialization rules.

Phase 1 — CI E2E for CLI (no behavior change) → #16940
Phase 2 — Shared Embedding Helper (function-level, zero behavior change)
- Extract `run_embedding(...)` in a minimal TU; no broad header churn.

Phase 3 — Native Server Endpoint (flagged off by default)
- `/v2/embeddings` (or `/embeddings-native`) guarded by `--embd-enabled`.
- Flags: `--embd-model | --embd-hfr | --embd-hff | --embd-threads | --embd-ctx-size | --embd-seed`.
- Error codes `MODEL_NOT_FOUND | BAD_ARGUMENT | EMBEDDING_FAILED | TIMEOUT | NOT_ENABLED`, mapped with existing server HTTP conventions.
- `/v1/embeddings` untouched.

Phase 3a — Concurrency Smoke & Perf Guardrails
Phase 4 — Cross-Platform & Packaging Hardening (still opt-in)
Phase 5
TODO
Phase 6 — Optional Productization Follow-ups (post-baseline)
1) Problem Statement
`llama-server` currently exposes an OpenAI-compatible `/embeddings` endpoint but does not execute first-party llama.cpp embedding kernels.

Goal: Provide a native, opt-in server route that uses the same function-level embedding implementation as the CLI, validated by a fast, deterministic CI suite.
2) Non-Goals (Phase 1)
3) Motivation
- Exercise basic parallel safety (`--parallel N`) through smoke checks.

4) Proposal (Incremental & Mergeable)
4.1 CLI E2E Baseline (no behavior change)
Add a CI job using tiny GGUF models that asserts output shape, dtype, and determinism (see §7 Checks). A sketch follows.
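A minimal sketch of what such a job could look like, assuming a pytest-style harness; the binary path, model file, and output-format flag below are assumptions, not settled interfaces:

```python
# Hedged sketch: shells out to the llama-embedding CLI and asserts
# shape/dtype/determinism. Paths and flags are placeholders.
import json
import subprocess

EMBD_BIN = "./build/bin/llama-embedding"  # assumed build location
MODEL = "models/tiny-embedding.gguf"      # assumed pre-cached tiny GGUF

def run_cli(prompt: str, seed: int = 42, threads: int = 1) -> list[float]:
    # "--embd-output-format json" is assumed to emit a JSON array of floats.
    out = subprocess.run(
        [EMBD_BIN, "-m", MODEL, "-p", prompt,
         "--seed", str(seed), "-t", str(threads),
         "--embd-output-format", "json"],
        capture_output=True, text=True, check=True, timeout=120,
    ).stdout
    return json.loads(out)

def test_shape_dtype_determinism():
    a = run_cli("hello world")
    b = run_cli("hello world")
    assert len(a) > 0                            # non-empty vector
    assert all(isinstance(x, float) for x in a)  # JSON floats (float32 policy)
    assert a == b                                # bit-exact: fixed seed, threads=1
```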
4.2 Shared Embedding Path
Extract a minimal function-level helper callable from both:
- `examples/llama-embedding`
- `llama-server` (new route)

Constraints:
4.3 Native Server Endpoint (opt-in)
- `POST /v2/embeddings` (or `/embeddings-native`) behind a feature flag.
- No changes to `/v1/embeddings`.

4.4 Concurrency & Perf Guardrails
- Parallel smoke runs with `N ∈ {2, 4}` that assert: per-request latency inflation ≤ 2.5× single-thread median (a sketch follows).
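A hedged sketch of this smoke check, folding in the §6 cosine-similarity policy; the route, port, and response field names are assumptions:

```python
# Hedged sketch: fires N parallel requests at the (assumed) native route and
# asserts the §4.4 latency cap plus the §6 cosine-similarity floor.
import math
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # third-party HTTP client, common in CI harnesses

URL = "http://127.0.0.1:8080/v2/embeddings"  # assumed route and port

def embed(text: str) -> tuple[list[float], float]:
    t0 = time.perf_counter()
    r = requests.post(URL, json={"input": text}, timeout=60)
    r.raise_for_status()
    return r.json()["embedding"], time.perf_counter() - t0  # field name assumed

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Single-thread baseline: median latency over a few warm runs.
baseline_vec, _ = embed("same prompt")
base_lat = statistics.median(embed("same prompt")[1] for _ in range(5))

for n in (2, 4):
    with ThreadPoolExecutor(max_workers=n) as ex:
        results = list(ex.map(embed, ["same prompt"] * n))
    for vec, lat in results:
        assert cosine(vec, baseline_vec) >= 0.999  # §6: threads>1 tolerance
        assert lat <= 2.5 * base_lat               # §4.4: latency inflation cap
```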
5) API Sketch (minimal, explicit)
Route (opt-in for phase 1):
`POST /v2/embeddings`

Request
Response
Error Shape (consistent with server)
Flag/Env parity
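A hedged sketch of plausible request/response/error shapes, written as Python literals; every field name here is an assumption derived from the flags and error codes above, not settled API:

```python
# Illustrative payload shapes only; nothing below is settled API.
request = {
    "input": "some text",  # single string; batching is an open question
    "seed": 42,            # mirrors --embd-seed
    "threads": 1,          # mirrors --embd-threads
    "ctx_size": 1024,      # mirrors --embd-ctx-size
}

response = {
    "embedding": [0.1, -0.2, 0.3],  # float32 values serialized as JSON numbers
    "dim": 3,
    "seed": 42,                     # echoed for reproducibility
}

error = {
    "error": {
        "code": "MODEL_NOT_FOUND",  # one of the Phase 3 codes
        "message": "no embedding model configured; pass --embd-model",
    }
}
```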
6) Determinism Policy (Phase 1)
- Bit-exact outputs required when `seed != -1`, `threads = 1`.
- `threads > 1`: accept tiny numeric deltas; assert cosine similarity ≥ 0.999 vs the single-thread baseline (see the §4.4 sketch).

7) CI / E2E Plan
Models:
Prefer small:
`EmbeddingGemma-small`, `TinyLlama` — pre-cached artifacts.

Checks:
Budget:
Wall time < 2 min per job; strict timeouts.
Retries only on network/model fetch, never on core assertions.
OS / Toolchains (smoke):
8) Performance & Resource Budgets (Phase 1)
(These are guardrails; calibrate to repo norms during review.)
9) Rollout & Compatibility
- `/v1/embeddings` (OpenAI-compat) stays as-is.
- `/v1/embeddings` vs `/v2/embeddings`

10) Minimal Implementation Sketch
11) Observability & Telemetry (Phase 1)
- `embd_requests_total`, `embd_latency_ms{route="/v2/embeddings"}`
- `embd_error_total{code}`
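For illustration only, the proposed metric names and labels expressed with Python's `prometheus_client`; the real counters would live in llama-server's existing metrics path, and the instrumentation points are assumptions:

```python
# Sketch of the metric set named above; placement of inc()/observe() is assumed.
import time
from prometheus_client import Counter, Histogram

EMBD_REQUESTS = Counter("embd_requests_total", "Embedding requests served")
EMBD_LATENCY = Histogram("embd_latency_ms", "Embedding latency in ms", ["route"])
EMBD_ERRORS = Counter("embd_error_total", "Embedding errors by code", ["code"])

def handle_request():
    EMBD_REQUESTS.inc()
    t0 = time.perf_counter()
    try:
        pass  # ... run the embedding ...
    except Exception:
        EMBD_ERRORS.labels(code="EMBEDDING_FAILED").inc()
        raise
    finally:
        EMBD_LATENCY.labels(route="/v2/embeddings").observe(
            (time.perf_counter() - t0) * 1e3)
```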
12) Error Semantics (Concrete)

Return HTTP statuses consistent with existing server conventions (e.g., 400/404/500).
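One plausible mapping from the Phase 3 error codes to HTTP statuses, sketched under the assumption that the final table is still an open design point:

```python
# Hedged sketch: error-code -> HTTP status mapping; not settled in this RFC.
HTTP_STATUS = {
    "MODEL_NOT_FOUND": 404,
    "BAD_ARGUMENT": 400,
    "EMBEDDING_FAILED": 500,
    "TIMEOUT": 504,       # assumption; could also be 408
    "NOT_ENABLED": 501,   # assumption; could also be 403 or 404
}

def status_for(code: str) -> int:
    return HTTP_STATUS.get(code, 500)  # unknown codes default to 500
```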
13) Test Matrix (Phase 1)
Functional
Platforms
Build flavors
14) PR Slicing Plan (Reviewer-Friendly)
1. Extract the shared helper into `shared/embed.{h,cc}` (or `server/lib/*` per maintainer preference).
2. Add `--embd-enabled` and the `--embd-*` flags.
3. Add `/v2/embeddings` (or `/embeddings-native`).

(Each PR is green, reviewable, and revertable.)
15) Risks & Mitigations
16) Ownership & Maintenance
17) Open Questions
- `/v2/embeddings` vs `/embeddings-native`?
- `common/embedding_utils.*` vs `server/lib/*`?
18.1 Config Precedence (server)
request body > env > flags > hardcoded defaults (a sketch appears at the end of this appendix)

18.2 Model Resolution Rules
- `--embd-model` given: use it unless the request overrides.
- `--embd-hfr`/`--embd-hff` given: resolve via HF; cache path recorded in logs.
- Otherwise: `MODEL_NOT_FOUND` with a clear message.

18.3 Reproducibility Notes
- `seed`, `threads=1`, `ctx=1024`, `LLAMA_CACHE` path.
- `float32` enforced in the helper.
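A combined sketch of the 18.1 precedence chain and the 18.2 resolution order; function names, the env-var scheme, and the `hf://` spelling are illustrative assumptions:

```python
# Hedged sketch of appendix rules 18.1 and 18.2; nothing here is settled.
import os

DEFAULTS = {"threads": 1, "ctx_size": 1024, "seed": 42}

def resolve(key: str, body: dict, flags: dict):
    # 18.1: request body > env > flags > hardcoded defaults
    if key in body:
        return body[key]
    env = os.environ.get(f"LLAMA_EMBD_{key.upper()}")  # env-var scheme assumed
    if env is not None:
        return env
    if key in flags:
        return flags[key]
    return DEFAULTS[key]

def resolve_model(body: dict, flags: dict) -> str:
    # 18.2: request override, then --embd-model, then --embd-hfr/--embd-hff
    if body.get("model"):
        return body["model"]
    if flags.get("embd_model"):
        return flags["embd_model"]
    if flags.get("embd_hfr") and flags.get("embd_hff"):
        # resolve via HF; a real implementation would record the cache path
        return f"hf://{flags['embd_hfr']}/{flags['embd_hff']}"
    raise LookupError("MODEL_NOT_FOUND: no embedding model configured")
```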