fix: Load the tokenizer JSON once for chat and completions. #2910
Conversation
Loading the tokenizer JSON from disk can be expensive (it can be a ~10MiB file). Previously we loaded it once when building the chat handler and again when building the completions handler. That also meant the completions handler was not ready until slightly after the chat handler.
Signed-off-by: Graham King <[email protected]>
Walkthrough
Refactors the LLM stack from async to sync construction across backend, model-card loading, and preprocessing. Threads a shared HuggingFace tokenizer handle through pipeline builders and entrypoints (HTTP/GRPC/watcher). Updates Python bindings accordingly, adjusts logging, and revises tests to new sync APIs and borrowed parameters.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Client
    participant Entry as EntryPoint (HTTP/GRPC)
    participant Card as ModelDeploymentCard
    participant Tok as tokenizer_hf
    participant Prep as OpenAIPreprocessor
    participant BE as Backend
    participant Pipe as Pipeline
    Client->>Entry: initialize pipeline
    Entry->>Card: load() [sync]
    Entry->>Tok: tokenizer_hf() [sync]
    Note right of Tok: Single load, reused across paths
    Entry->>Prep: new_with_parts(card, formatter, Tok.clone()) [sync]
    Entry->>BE: from_tokenizer(Tok) [sync]
    BE->>Pipe: into_operator()
    Prep->>Pipe: into_operator()
    Entry->>Pipe: assemble routed pipeline
    Pipe-->>Client: ready
```
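In code, the single-load pattern the diagram describes looks roughly like the sketch below. It follows the calls named in the diagram and in the diffs later in this review (ModelDeploymentCard::load, tokenizer_hf, PromptFormatter::no_op, OpenAIPreprocessor::new_with_parts, Backend::from_tokenizer); exact signatures may differ, and the wrapping function and omitted use-statements are illustrative only.

```rust
// Sketch only: the surrounding function is hypothetical and imports for the
// crate types (ModelDeploymentCard, PromptFormatter, OpenAIPreprocessor,
// Backend) are elided.
fn assemble_handlers(model_path: &std::path::Path) -> anyhow::Result<()> {
    // One synchronous load of the model deployment card.
    let card = ModelDeploymentCard::load(model_path, None)?;

    // One read of the ~10 MiB tokenizer.json; the resulting handle is cheap to clone.
    let hf_tokenizer = card.tokenizer_hf()?;

    // Prompt formatter, as in the watcher diff further down.
    let PromptFormatter::OAI(formatter) = PromptFormatter::no_op();

    // Chat and completions both reuse the same tokenizer handle.
    let preprocessor =
        OpenAIPreprocessor::new_with_parts(card.clone(), formatter, hf_tokenizer.clone())?
            .into_operator();
    let backend = Backend::from_tokenizer(hf_tokenizer).into_operator();

    // ...assemble the routed chat and completions pipelines from these operators.
    let _ = (preprocessor, backend);
    Ok(())
}
```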
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
lib/llm/src/model_card.rs (2)
224-234: Tokenizer JSON still loaded multiple times; add in-card caching to truly “load once”
`tokenizer_hf()` re-reads the ~10 MiB `tokenizer.json` on each call when `TokenizerKind::HfTokenizerJson`, so chat and completions still cause two disk loads. Cache the constructed `HfTokenizer` inside the MDC and share it across clones. Apply the following changes:
- Add a cached handle (shared across MDC clones):
```diff
@@
-use std::sync::Arc;
+use std::sync::Arc;
+use once_cell::sync::OnceCell;
@@
 #[derive(Serialize, Deserialize, Clone, Debug, Builder, Default)]
 pub struct ModelDeploymentCard {
@@
     #[serde(default)]
     pub runtime_config: ModelRuntimeConfig,
+    /// Lazily initialized, process-shared HF tokenizer handle
+    #[serde(skip)]
+    #[builder(default)]
+    hf_tokenizer_cached: Arc<OnceCell<Arc<HfTokenizer>>>,
 }
```
- Use the cache in `tokenizer_hf()`:

```diff
@@
-    pub fn tokenizer_hf(&self) -> anyhow::Result<HfTokenizer> {
-        match &self.tokenizer {
-            Some(TokenizerKind::HfTokenizerJson(file)) => {
-                HfTokenizer::from_file(file).map_err(anyhow::Error::msg)
-            }
-            Some(TokenizerKind::GGUF(t)) => Ok(*t.clone()),
-            None => {
-                anyhow::bail!("Blank ModelDeploymentCard does not have a tokenizer");
-            }
-        }
-    }
+    pub fn tokenizer_hf(&self) -> anyhow::Result<HfTokenizer> {
+        if let Some(tok_arc) = self.hf_tokenizer_cached.get() {
+            return Ok((**tok_arc).clone());
+        }
+        let loaded: HfTokenizer = match &self.tokenizer {
+            Some(TokenizerKind::HfTokenizerJson(file)) => {
+                HfTokenizer::from_file(file).map_err(anyhow::Error::msg)?
+            }
+            Some(TokenizerKind::GGUF(t)) => (*t.clone()),
+            None => anyhow::bail!("Blank ModelDeploymentCard does not have a tokenizer"),
+        };
+        let arc = Arc::new(loaded);
+        // Ignore race: if another thread set it first, keep that one.
+        let _ = self.hf_tokenizer_cached.set(arc.clone());
+        Ok((*arc).clone())
+    }
```

Note: add `once_cell = "1"` to Cargo.toml if not already present. This guarantees a single disk load per-process and shares the handle across cloned MDCs and both chat/completions paths.
224-234: Cache HfTokenizer on ModelDeploymentCard to prevent repeated deserialization
`tokenizer_hf()` is called from the watcher, backend, preprocessor, and entrypoints, and currently re-loads the ~10 MiB JSON on every call. Add a cached field (e.g. `OnceCell<Arc<HfTokenizer>>`) to `ModelDeploymentCard` so the JSON file is deserialized only once.
lib/bindings/python/rust/llm/preprocessor.rs (1)
32-39: ModelDeploymentCard lacks in-card tokenizer cache
`ModelDeploymentCard` in lib/llm/src/model_card.rs does not include an `Arc<OnceCell<…>>` (or equivalent) to cache the tokenizer, so calling `mdc.inner.clone()` will reinitialize the tokenizer on each clone. You must add a shared cache field inside `ModelDeploymentCard` to ensure all clones reuse the same tokenizer handle.
lib/llm/src/backend.rs (2)
156-160: Avoid panic: unwrap() on decoder error.
`process_token_ids()` can error; `unwrap()` will crash the pipeline. Handle and pass-through/log instead.
Apply this patch:
```diff
-            let result = state.decoder.process_token_ids(&data.token_ids).unwrap();
+            let result = match state.decoder.process_token_ids(&data.token_ids) {
+                Ok(r) => r,
+                Err(err) => {
+                    log::warn!("decoder failed; passing through upstream data: {err:?}");
+                    // keep upstream output untouched
+                    return Some((output, state));
+                }
+            };
```
448-456: String slicing by byte offsets can panic on non-UTF-8 boundaries.
`offset` comes from a byte search; slicing with `[pre_append..offset]` may split a UTF-8 codepoint and panic. Use `get()` to guard.

```diff
-        let partial_token = if offset >= pre_append {
-            self.jail[pre_append..offset].to_string()
-        } else {
-            "".to_string()
-        };
+        let partial_token = if offset >= pre_append {
+            match self.jail.get(pre_append..offset) {
+                Some(s) => s.to_string(),
+                None => String::new(),
+            }
+        } else {
+            String::new()
+        };
```

lib/llm/src/preprocessor/prompt/template.rs (1)
36-43: Same Path formatting issue for chat_template_file.
Use `.display()` for the context string.

```diff
-        let chat_template = std::fs::read_to_string(chat_template_file)
-            .with_context(|| format!("fs:read_to_string '{}'", chat_template_file))?;
+        let chat_template = std::fs::read_to_string(chat_template_file)
+            .with_context(|| {
+                format!("fs:read_to_string '{}'", chat_template_file.display())
+            })?;
```
🧹 Nitpick comments (24)
lib/llm/src/migration.rs (1)
119-131: Avoid stringly-typed error detection
Matching on a hardcoded error prefix is brittle. Prefer a typed error (e.g., a dedicated variant like EngineStreamError::Disconnected) surfaced by the downstream layer, or attach a machine-checkable code via `thiserror` context and match on that.
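As a rough illustration of the typed-error approach, a sketch with a hypothetical EngineStreamError enum (the type, variants, and helper below are placeholders, not existing items in this crate):

```rust
use thiserror::Error;

// Hypothetical error type: the point is matching on a variant, not a string prefix.
#[derive(Debug, Error)]
pub enum EngineStreamError {
    #[error("engine disconnected before the stream completed")]
    Disconnected,
    #[error("engine error: {0}")]
    Other(String),
}

// The migration layer can then check the variant directly.
fn should_migrate(err: &EngineStreamError) -> bool {
    matches!(err, EngineStreamError::Disconnected)
}
```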
lib/llm/src/model_card.rs (2)
464-525: Repo loading path: minor guardrails
Consider logging which source provided `context_length` and the chosen value, to aid debugging heterogeneous repos. Optional.
865-872: Path validation: consider canonicalization
`check_for_file` could optionally canonicalize and return absolute paths to avoid surprises with relative paths and working-directory changes. Low impact.
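For illustration, canonicalization here could be as small as the following sketch (the helper name mirrors the comment and is hypothetical, not the crate's existing signature):

```rust
use std::path::{Path, PathBuf};

// Hypothetical variant of check_for_file that returns an absolute path.
// std::fs::canonicalize also fails for missing files, so it doubles as the existence check.
fn check_for_file_canonical(repo_dir: &Path, file_name: &str) -> anyhow::Result<PathBuf> {
    let candidate = repo_dir.join(file_name);
    std::fs::canonicalize(&candidate)
        .map_err(|e| anyhow::anyhow!("{}: {e}", candidate.display()))
}
```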
lib/bindings/python/rust/llm/backend.rs (1)
10-10: Remove unused imports to keep CI clean.
`Source` (and likely `Operator`) aren't used in this file. Apply:

```diff
-use dynamo_runtime::pipeline::{Operator, ServiceBackend, ServiceFrontend, Source};
+use dynamo_runtime::pipeline::{ServiceBackend, ServiceFrontend};
```

lib/llm/src/local_model.rs (2)
199-206: Avoid blocking the async runtime if build() is ever called under load.
If `build()` may run concurrently on a busy runtime, consider offloading the sync load to a blocking task. Otherwise, ignore. For example:

```diff
-        let mut card =
-            ModelDeploymentCard::load(model_config_path, self.custom_template_path.as_deref())?;
+        let cfg = model_config_path.to_path_buf();
+        let tmpl = self.custom_template_path.clone();
+        let mut card = tokio::task::spawn_blocking(move || {
+            ModelDeploymentCard::load(&cfg, tmpl.as_deref())
+        }).await??;
```

Also applies to: 253-258
190-197: Doc nit: fix "it's" → "its".
Minor grammar in user-facing docs.

```diff
-    /// - Load it's ModelDeploymentCard card
+    /// - Load its ModelDeploymentCard
```

lib/llm/tests/backend.rs (2)
11-11: Pass by reference for clarity.
`from_mdc` takes `&ModelDeploymentCard`; pass `&mdc` to make intent explicit.

```diff
-    let operator = Backend::from_mdc(mdc);
+    let operator = Backend::from_mdc(&mdc);
```
7-8: Drop Tokio runtime annotation since the test is now sync.
Removes unnecessary async scaffolding.

```diff
-#[tokio::test]
-async fn test_sequence_factory() {
+#[test]
+fn test_sequence_factory() {
```
lib/bindings/python/rust/llm/model_card.rs (1)
19-22: Sync load looks good; keep Python API backward-compatible
Return type and error mapping are clean. Consider adding a `from_local_path` alias to avoid breaking existing Python users. Add alongside the existing method:
```diff
 #[pymethods]
 impl ModelDeploymentCard {
     // Previously called "from_local_path"
     #[staticmethod]
     fn load(path: String, model_name: String) -> PyResult<ModelDeploymentCard> {
         let mut card = RsModelDeploymentCard::load(&path, None).map_err(to_pyerr)?;
         card.set_name(&model_name);
         Ok(ModelDeploymentCard { inner: card })
     }
+
+    /// Back-compat alias for older callers.
+    #[staticmethod]
+    #[pyo3(name = "from_local_path")]
+    fn from_local_path(path: String, model_name: String) -> PyResult<ModelDeploymentCard> {
+        Self::load(path, model_name)
+    }
 }
```

lib/llm/src/entrypoint/input/endpoint.rs (1)
76-76: Confirm intentional silent degrade when tokenizer load fails
`Backend::from_mdc` now warns and returns a backend without a tokenizer on failure (see backend.rs lines 78-89), removing the previous fallible path. If decode validation or token-based routing relies on a tokenizer, this could mask configuration errors at startup.
Option: provide a strict path that bubbles the error (e.g., `Backend::try_from_mdc(...) -> Result<Arc<Self>>`) and use it here when `engine_config` expects tokenization. Otherwise, keep current behavior.

```diff
-    let backend = Backend::from_mdc(model.card()).into_operator();
+    let backend = Backend::from_mdc(model.card()).into_operator(); // OK if silent degrade is desired
+    // Alternatively (if available):
+    // let backend = Backend::try_from_mdc(model.card())?.into_operator();
```
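A sketch of what the strict constructor could look like, assuming `from_tokenizer` behaves as described elsewhere in this review; `try_from_mdc` is the proposed name, not an existing method:

```rust
use anyhow::Context as _;
use std::sync::Arc;

impl Backend {
    /// Hypothetical strict variant of from_mdc: bubble the tokenizer error up
    /// instead of warning and constructing a tokenizer-less backend.
    pub fn try_from_mdc(mdc: &ModelDeploymentCard) -> anyhow::Result<Arc<Self>> {
        let tokenizer = mdc
            .tokenizer_hf()
            .context("loading HF tokenizer from ModelDeploymentCard")?;
        Ok(Backend::from_tokenizer(tokenizer))
    }
}
```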
lib/llm/src/backend.rs (2)
73-76: Expose a way to enable validate_engine_decode.
validate_engine_decode is hardcoded false. Add a builder/flag so tests or debug deployments can enable it.
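The builder-flag pattern could look roughly like this; everything below is a schematic stand-in (struct, field, and method names included), not the crate's actual Backend type:

```rust
// Schematic stand-in showing a debug-only validation toggle.
pub struct DecoderStage {
    validate_engine_decode: bool,
}

impl DecoderStage {
    pub fn new() -> Self {
        // Off by default so production paths pay no extra cost.
        Self { validate_engine_decode: false }
    }

    /// Builder-style toggle, handy in tests and debug deployments.
    pub fn with_validate_engine_decode(mut self, enabled: bool) -> Self {
        self.validate_engine_decode = enabled;
        self
    }
}

// e.g. in a test: let stage = DecoderStage::new().with_validate_engine_decode(true);
```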
429-441: Reduce noisy debug logs in the inner loop.
Multiple log::debug! calls per token will flood logs under load. Gate behind a verbose feature flag or remove.
lib/llm/src/entrypoint/input/grpc.rs (1)
28-62: Inject shared hf_tokenizer into watcher-driven pipelines
In lib/llm/src/discovery/watcher.rs (around line 316) the call to `entrypoint::build_routed_pipeline` omits the shared `hf_tokenizer`, so models discovered via etcd will still each load their own tokenizer. Pass the cloned `hf_tokenizer` into these builder calls to reuse a single tokenizer instance and prevent redundant loads.
lib/llm/tests/model_card.rs (1)
51-56: Make the error-string assertion less brittle.
"unable to extract" can vary across environments. Consider accepting either that phrase or explicit missing-file hints.
Apply this diff:
```diff
-    assert!(err.contains("unable to extract"));
+    assert!(
+        err.contains("unable to extract") || err.contains("config.json"),
+        "expected missing required files error, got: {err}"
+    );
```
lib/llm/tests/preprocessor.rs (2)
347-347: Fix test name typo.
Rename "mulit" → "multi" for clarity.
Apply this diff:
```diff
-async fn test_mulit_turn_without_system() {
+async fn test_multi_turn_without_system() {
```

379-379: Fix test name typo.
Same typo in the second test.
Apply this diff:
```diff
-async fn test_mulit_turn_with_system() {
+async fn test_multi_turn_with_system() {
```
lib/llm/src/discovery/watcher.rs (4)
275-276: Elevate and enrich the MDC-load failure log.
This represents a model-not-ready condition; warn (not info) and include the model name for correlation.
Apply this diff:
```diff
-    tracing::info!(error = %err, "load_mdc did not complete");
+    tracing::warn!(error = %err, model_name = %model_entry.name, "load_mdc did not complete");
```

331-331: Add context to readiness log.
Include model name for easier tracing in multi-model setups.
Apply this diff:
- tracing::info!("Chat completions is ready"); + tracing::info!(model_name = %model_entry.name, "Chat completions is ready");
357-357: Add context to readiness log.Mirror the chat readiness improvement.
Apply this diff:
- tracing::info!("Completions is ready"); + tracing::info!(model_name = %model_entry.name, "Completions is ready");
404-406: Optional: apply the same “single tokenizer load” to embeddings.Embeddings still load the tokenizer twice (preprocessor + backend). You can reuse the already loaded tokenizer for parity and startup latency wins.
Apply this diff:
```diff
-    let preprocessor = OpenAIPreprocessor::new(card.clone())?.into_operator();
-    let backend = Backend::from_mdc(&card).into_operator();
+    // Reuse a single tokenizer instance for both preprocessor and backend
+    let formatter = PromptFormatter::no_op();
+    let PromptFormatter::OAI(formatter) = formatter;
+    let hf_tok = card.tokenizer_hf()?;
+    let preprocessor =
+        OpenAIPreprocessor::new_with_parts(card.clone(), formatter, hf_tok.clone())?
+            .into_operator();
+    let backend = Backend::from_tokenizer(hf_tok).into_operator();
```

lib/llm/src/preprocessor.rs (3)
109-116: Constructor looks good; optional API tweak to avoid cloning the MDC.
`new_with_parts` consumes `ModelDeploymentCard`, forcing upstream `clone()`. Consider taking `&ModelDeploymentCard` and cloning inside if needed to reduce copies on hot paths.
366-374: Avoid extra clone in spawn_blocking for embeddings batch.
You clone `input_strs` only to move it into the closure. Move it directly and build refs inside.

```diff
-    let input_strs: Vec<String> = arr.to_vec();
-    let encodings = tokio::task::spawn_blocking({
-        let tokenizer = self.tokenizer.clone();
-        let strs = input_strs.clone();
-        move || {
-            tokenizer.encode_batch(&strs.iter().map(|s| s.as_str()).collect::<Vec<_>>())
-        }
-    })
+    let input_strs: Vec<String> = arr.to_vec();
+    let encodings = tokio::task::spawn_blocking({
+        let tokenizer = self.tokenizer.clone();
+        move || {
+            let refs: Vec<&str> = input_strs.iter().map(|s| s.as_str()).collect();
+            tokenizer.encode_batch(&refs)
+        }
+    })
     .await??;
```
101-106: Deprecate `OpenAIPreprocessor::new` or document its extra tokenizer load
`OpenAIPreprocessor::new(mdc)` remains in three places:
- lib/bindings/python/rust/llm/preprocessor.rs:33
- lib/llm/src/entrypoint/input/batch.rs:70
- lib/llm/src/discovery/watcher.rs:404
- Add `#[deprecated(note = "reloads tokenizer JSON; prefer new_with_parts for cache reuse")]` on `new` and update these call sites to use `new_with_parts` (or keep `new` only with a clear doc-warning).
lib/llm/src/entrypoint/input/common.rs (1)
211-214: Avoid cloning the entire ModelDeploymentCard if not necessary.
Both `new_with_parts(card.clone(), ..)` and `Migration::from_mdc(card.clone())` force clones. If those constructors can take `&ModelDeploymentCard`, switch to borrowing to reduce copies.
Also applies to: 280-282
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (1)
lib/runtime/examples/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (18)
- lib/bindings/python/rust/llm/backend.rs (1 hunks)
- lib/bindings/python/rust/llm/model_card.rs (1 hunks)
- lib/bindings/python/rust/llm/preprocessor.rs (1 hunks)
- lib/llm/src/backend.rs (2 hunks)
- lib/llm/src/discovery/watcher.rs (6 hunks)
- lib/llm/src/entrypoint/input/batch.rs (1 hunks)
- lib/llm/src/entrypoint/input/common.rs (11 hunks)
- lib/llm/src/entrypoint/input/endpoint.rs (1 hunks)
- lib/llm/src/entrypoint/input/grpc.rs (2 hunks)
- lib/llm/src/entrypoint/input/http.rs (2 hunks)
- lib/llm/src/local_model.rs (1 hunks)
- lib/llm/src/migration.rs (1 hunks)
- lib/llm/src/model_card.rs (9 hunks)
- lib/llm/src/preprocessor.rs (2 hunks)
- lib/llm/src/preprocessor/prompt/template.rs (3 hunks)
- lib/llm/tests/backend.rs (1 hunks)
- lib/llm/tests/model_card.rs (4 hunks)
- lib/llm/tests/preprocessor.rs (7 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
PR: ai-dynamo/dynamo#2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.
Applied to files:
- lib/bindings/python/rust/llm/preprocessor.rs
- lib/bindings/python/rust/llm/backend.rs
🧬 Code graph analysis (18)
lib/llm/src/migration.rs (2)
- lib/llm/src/backend.rs (2): from_mdc (79-90), new (359-396)
- lib/llm/src/preprocessor.rs (1): new (101-107)
lib/bindings/python/rust/llm/preprocessor.rs (2)
- lib/llm/src/preprocessor.rs (1): new (101-107)
- lib/bindings/python/rust/lib.rs (3): new (320-345), new (999-1003), to_pyerr (130-135)
lib/bindings/python/rust/llm/backend.rs (1)
- lib/llm/src/backend.rs (1): from_mdc (79-90)
lib/bindings/python/rust/llm/model_card.rs (2)
- lib/llm/src/model_card.rs (1): load (375-389)
- lib/bindings/python/rust/lib.rs (1): to_pyerr (130-135)
lib/llm/src/discovery/watcher.rs (4)
- lib/llm/src/model_card.rs (1): tokenizer_hf (224-234)
- lib/llm/src/preprocessor/prompt.rs (1): no_op (125-127)
- lib/llm/src/preprocessor.rs (2): new_with_parts (109-129), new (101-107)
- lib/llm/src/backend.rs (1): from_mdc (79-90)
lib/llm/src/backend.rs (4)
- lib/llm/src/tokenizers/hf.rs (2): from_tokenizer (35-37), from (81-83)
- lib/llm/src/tokenizers.rs (6): tokenizer (352-354), from (118-120), from (127-129), new (188-203), new (280-287), new (507-515)
- lib/bindings/python/rust/llm/backend.rs (1): new (23-29)
- lib/llm/src/preprocessor.rs (1): new (101-107)
lib/llm/tests/model_card.rs (2)
- lib/bindings/python/rust/llm/model_card.rs (1): load (19-23)
- lib/llm/src/model_card.rs (1): load (375-389)
lib/llm/src/preprocessor/prompt/template.rs (3)
- lib/llm/src/preprocessor.rs (1): new (101-107)
- lib/llm/src/model_card.rs (3): from_gguf (425-462), from_gguf (704-741), from_gguf (771-776)
- lib/llm/src/preprocessor/prompt/template/tokcfg.rs (1): from_gguf (97-159)
lib/llm/src/entrypoint/input/batch.rs (1)
- lib/llm/src/preprocessor.rs (1): new (101-107)
lib/llm/src/preprocessor.rs (4)
- lib/llm/src/backend.rs (3): new (359-396), from_mdc (79-90), from_tokenizer (69-77)
- lib/llm/src/preprocessor/prompt/template.rs (1): from_mdc (20-59)
- lib/llm/src/model_card.rs (1): mdcsum (204-207)
- lib/llm/src/tokenizers/hf.rs (1): from_tokenizer (35-37)
lib/llm/src/entrypoint/input/endpoint.rs (1)
- lib/llm/src/backend.rs (1): from_mdc (79-90)
lib/llm/src/entrypoint/input/grpc.rs (3)
- lib/llm/src/model_card.rs (1): tokenizer_hf (224-234)
- lib/llm/src/local_model.rs (1): card (331-333)
- lib/llm/src/entrypoint/input/common.rs (5): build_routed_pipeline (226-257), build_pipeline (175-178), build_pipeline (194-224), build_pipeline (334-337), build_pipeline (353-353)
lib/llm/src/model_card.rs (1)
- lib/bindings/python/rust/llm/model_card.rs (1): load (19-23)
lib/llm/tests/preprocessor.rs (2)
- lib/llm/src/model_card.rs (1): load (375-389)
- lib/llm/src/preprocessor/prompt/template.rs (1): from_mdc (20-59)
lib/llm/src/entrypoint/input/http.rs (3)
- lib/llm/src/model_card.rs (1): tokenizer_hf (224-234)
- lib/llm/src/local_model.rs (1): card (331-333)
- lib/llm/src/entrypoint/input/common.rs (5): build_routed_pipeline (226-257), build_pipeline (175-178), build_pipeline (194-224), build_pipeline (334-337), build_pipeline (353-353)
lib/llm/src/local_model.rs (2)
- lib/bindings/python/rust/llm/model_card.rs (1): load (19-23)
- lib/llm/src/model_card.rs (1): load (375-389)
lib/llm/tests/backend.rs (2)
- lib/llm/src/model_card.rs (1): load (375-389)
- lib/llm/src/backend.rs (1): from_mdc (79-90)
lib/llm/src/entrypoint/input/common.rs (4)
- lib/llm/src/backend.rs (2): from_mdc (79-90), from_tokenizer (69-77)
- lib/llm/src/preprocessor/prompt/template.rs (1): from_mdc (20-59)
- lib/llm/src/preprocessor.rs (1): new_with_parts (109-129)
- lib/llm/src/model_card.rs (1): load (375-389)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (31)
lib/llm/src/model_card.rs (2)
371-389: Sync `load` path looks good
Synchronous dispatch between local dir and GGUF is clear and preserves prior semantics.
904-921: Targeted unit tests 👍
Good coverage that exercises HF config parsing on real samples.
lib/bindings/python/rust/llm/preprocessor.rs (1)
32-39: Nice: removed ad-hoc runtime, now using sync constructor
Constructor is simpler and avoids a Tokio runtime just for init.
lib/bindings/python/rust/llm/backend.rs (2)
23-27: Borrowing MDC for Backend::from_mdc is correct and matches the new sync API.
This aligns with the borrowed, synchronous constructor and avoids unnecessary moves. Looks good.
24-28: Python binding uses Rust `Backend::from_mdc` without reloading tokenizer JSON
Verified no direct `Tokenizer::from_file`, `HfTokenizer::from_file`, or `tokenizer_hf()` calls in lib/bindings/python/rust/llm; the Python layer delegates to the Rust `Backend`, which handles the tokenizer exactly once.
lib/llm/src/local_model.rs (1)
256-256: Sync ModelDeploymentCard::load call is appropriate here.
Startup path doing a single blocking load is fine and simplifies flow.
lib/llm/tests/backend.rs (1)
9-9: LGTM on switching to sync load.
Test reads cleaner with the synchronous `ModelDeploymentCard::load`.
lib/llm/src/entrypoint/input/grpc.rs (2)
93-114: Good: single tokenizer load shared by chat and completions.
Loads tokenizer once, clones the handle, and threads it through both pipelines. Matches PR goal.
134-147: Good: StaticCore path also reuses the same tokenizer handle.
Consistent with StaticRemote; avoids double JSON load.
lib/llm/src/preprocessor/prompt/template.rs (1)
55-57: GGUF path LGTM.
Borrowing gguf_path and using ChatTemplate::from_gguf keeps this sync API clean.
lib/llm/tests/model_card.rs (4)
11-12: Sync API migration LGTM.
Switch to synchronous load and get_model_info is correct and matches API changes.
23-23: Error-path test LGTM.
Direct Result use without await is correct.
29-34: Tokenizer-kind assertion LGTM.
Covers both JSON and GGUF.
39-45: Prompt-formatter assertion LGTM.
Matches expected artifact type.
lib/llm/tests/preprocessor.rs (7)
60-64: Sync MDC load LGTM.
Constructing MDC synchronously and setting name/context is correct.
286-287: Borrowed MDC usage LGTM.
Passing &ModelDeploymentCard to from_mdc avoids cloning and IO.
318-319: Repeat pattern LGTM.
Same efficient borrowing for the tools test.
355-356: Repeat pattern LGTM.
Consistent with sync APIs.
387-388: Repeat pattern LGTM.
Consistent usage for system+multi-turn case.
425-426: Repeat pattern LGTM.
Consistent usage for system+tools case.
468-469: Explicit borrow LGTM.
Using &mdc here is correct and consistent.
lib/llm/src/discovery/watcher.rs (6)
179-180: Deletion log LGTM.
Structured field is helpful for ops.
185-186: Error log LGTM.
Using error = %e yields better formatting.
311-313: Great: load tokenizer once and reuse.
Meets the PR objective to avoid duplicate 10 MiB JSON loads.
325-331: Pass-through of tokenizer to chat builder LGTM.
Prevents a second load on the chat path.
337-341: Preprocessor constructed with shared tokenizer LGTM.
Avoids redundant tokenizer re-loading in the completions path.
352-353: Forwarding tokenizer to completions pipeline LGTM.
Ensures a single tokenizer instance for both chat and completions.
lib/llm/src/entrypoint/input/common.rs (2)
134-146: Nice: single tokenizer load for chat pipeline (meets PR objective).
You obtain `hf_tokenizer` once and share it downstream. This avoids the second 10 MiB JSON read and aligns chat/completions readiness.
194-199: Verify downstream callers updated: no internal call sites for `build_pipeline` or `build_routed_pipeline` were found; ensure all entrypoints, tests, and plugins pass the new `hf_tokenizer` parameter.
lib/llm/src/entrypoint/input/http.rs (2)
122-134: LGTM: tokenizer loaded once and shared across chat and completions.
Static Remote and Static Core paths now pass a single `tokenizer_hf` handle (cloned) into both pipelines. This removes the redundant disk IO and aligns readiness timing.
Also applies to: 137-143, 172-185
123-134: Minor: propagate formatter mismatch as early error if needed.
If `PromptFormatter::from_mdc` can return non-OAI variants, ensure downstream builders handle that gracefully (they currently will via `anyhow` at construction sites in common.rs after you apply the let-else fix).
Signed-off-by: Graham King <[email protected]>
…o#2910) Signed-off-by: Graham King <[email protected]> Signed-off-by: Gavin.Zhu <[email protected]>
Signed-off-by: Graham King <[email protected]> Signed-off-by: nnshah1 <[email protected]>
Signed-off-by: Graham King <[email protected]> Signed-off-by: nnshah1 <[email protected]>
Signed-off-by: Graham King <[email protected]> Signed-off-by: Indrajit Bhosale <[email protected]>
Loading the tokenizer JSON from disk can be expensive (it can be a ~10MiB file).
Previously we loaded it once when building the chat handler and again when building the completions handler. That also meant the completions handler was not ready until slightly after the chat handler.
Note some Python bindings may have gone from async to sync, tests should catch if so.
Summary by CodeRabbit
New Features
Refactor
Chores
Tests