
Conversation


@grahamking grahamking commented Sep 5, 2025

Loading the tokenizer JSON from disk can be expensive (it can be a ~10MiB file).

Previously we loaded it once when building the chat handler and again when building the completions handler. That also meant the completions handler was not ready until slightly after the chat handler.

Note that some Python bindings may have gone from async to sync; the tests should catch it if so.
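
A rough sketch of the approach, with hypothetical builder functions standing in for the real pipeline builders (only Tokenizer::from_file is the actual tokenizers crate API):

use tokenizers::Tokenizer;

// Stand-ins for the real chat/completions pipeline builders.
fn build_chat_handler(_tokenizer: Tokenizer) { /* ... */ }
fn build_completions_handler(_tokenizer: Tokenizer) { /* ... */ }

fn main() -> anyhow::Result<()> {
    // One ~10 MiB JSON parse instead of two.
    let tokenizer_hf = Tokenizer::from_file("tokenizer.json").map_err(anyhow::Error::msg)?;

    // Cloning the in-memory handle is cheap compared with re-reading the file,
    // and both handlers become ready at the same time.
    build_chat_handler(tokenizer_hf.clone());
    build_completions_handler(tokenizer_hf);
    Ok(())
}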

Summary by CodeRabbit

  • New Features

    • Reuse of a single tokenizer across chat, completions, and embeddings to reduce load times and memory.
    • Simplified Python model loading API returning a concrete object (no async context required).
  • Refactor

    • Converted model loading, preprocessing, and backend initialization to synchronous flows, cutting startup latency and overhead.
    • Backend creation is more resilient if tokenization cannot be initialized.
  • Chores

    • More structured, informative logging around model add/remove and readiness.
  • Tests

    • Updated to reflect synchronous APIs and added assertions for model metadata.

@grahamking grahamking requested a review from a team as a code owner September 5, 2025 19:44

copy-pr-bot bot commented Sep 5, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Sep 5, 2025

Walkthrough

Refactors the LLM stack from async to sync construction across backend, model-card loading, and preprocessing. Threads a shared HuggingFace tokenizer handle through pipeline builders and entrypoints (HTTP/GRPC/watcher). Updates Python bindings accordingly, adjusts logging, and revises tests to new sync APIs and borrowed parameters.

Changes

Cohort / File(s) | Summary
Core sync refactor: Backend, Model Card, Preprocessor, Migration, Prompt Template
lib/llm/src/backend.rs, lib/llm/src/model_card.rs, lib/llm/src/preprocessor.rs, lib/llm/src/preprocessor/prompt/template.rs, lib/llm/src/migration.rs
Converted constructors and loaders from async Result-returning to synchronous APIs; Backend::from_mdc now borrows and falls back to no-tokenizer on failure; introduced from_tokenizer; ModelDeploymentCard::load and related repo/formatter/tokenizer helpers are sync; OpenAIPreprocessor::new replaced by sync new/new_with_parts; PromptFormatter::from_mdc now sync and by-ref; Migration::from_mdc sync.
Entrypoint pipeline threading of tokenizer
lib/llm/src/entrypoint/input/common.rs, lib/llm/src/entrypoint/input/http.rs, lib/llm/src/entrypoint/input/grpc.rs, lib/llm/src/entrypoint/input/endpoint.rs, lib/llm/src/entrypoint/input/batch.rs
Pipeline builders now accept and propagate tokenizer_hf; backend/preprocessor creation uses from_tokenizer and new_with_parts; removed awaits and error propagation tied to async constructors; static HTTP/GRPC paths pass tokenizer clones as needed; endpoint backend creation simplified to direct from_mdc(...).into_operator().
Discovery watcher integration and logging
lib/llm/src/discovery/watcher.rs
Computes tokenizer_hf once per PUT, passes through chat/completions/embeddings; switches to structured logging; replaces async preprocessor/backend construction with sync calls; adds readiness logs.
Python bindings sync updates
lib/bindings/python/rust/llm/backend.rs, lib/bindings/python/rust/llm/model_card.rs, lib/bindings/python/rust/llm/preprocessor.rs
Removed runtime/block_on paths; Backend::from_mdc and OpenAIPreprocessor::new now called synchronously; ModelDeploymentCard::load binding changed from async PyAny to returning the PyClass directly; minor error-mapping adjustments.
Local model loader
lib/llm/src/local_model.rs
Uses synchronous ModelDeploymentCard::load (no await), passing path directly; surrounding logic unchanged.
Tests updated to sync APIs
lib/llm/tests/backend.rs, lib/llm/tests/model_card.rs, lib/llm/tests/preprocessor.rs
Remove awaits, adjust to sync load, get_model_info, and by-ref PromptFormatter::from_mdc; add assertions for model info; backend creation now non-fallible in tests.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant Entry as EntryPoint (HTTP/GRPC)
  participant Card as ModelDeploymentCard
  participant Tok as tokenizer_hf
  participant Prep as OpenAIPreprocessor
  participant BE as Backend
  participant Pipe as Pipeline

  Client->>Entry: initialize pipeline
  Entry->>Card: load() [sync]
  Entry->>Tok: tokenizer_hf() [sync]
  Note right of Tok: Single load, reused across paths

  Entry->>Prep: new_with_parts(card, formatter, Tok.clone()) [sync]
  Entry->>BE: from_tokenizer(Tok) [sync]
  BE->>Pipe: into_operator()
  Prep->>Pipe: into_operator()

  Entry->>Pipe: assemble routed pipeline
  Pipe-->>Client: ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Poem

I nibbled the threads of async’s vine,
Swapped futures for pathways clean and fine.
One tokenizer, shared like clover,
Pipelines click—hop, it’s over.
Logs now chirp, models align—
Rabbit approves: sync in time. 🐇✨




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
lib/llm/src/model_card.rs (2)

224-234: Tokenizer JSON still loaded multiple times; add in-card caching to truly “load once”

tokenizer_hf() re-reads ~10 MiB tokenizer.json on each call when TokenizerKind::HfTokenizerJson, so chat and completions still cause two disk loads. Cache the constructed HfTokenizer inside the MDC and share it across clones.

Apply the following changes:

  1. Add a cached handle (shared across MDC clones):
@@
-use std::sync::Arc;
+use std::sync::Arc;
+use once_cell::sync::OnceCell;
@@
 #[derive(Serialize, Deserialize, Clone, Debug, Builder, Default)]
 pub struct ModelDeploymentCard {
@@
     #[serde(default)]
     pub runtime_config: ModelRuntimeConfig,
+    /// Lazily initialized, process-shared HF tokenizer handle
+    #[serde(skip)]
+    #[builder(default)]
+    hf_tokenizer_cached: Arc<OnceCell<Arc<HfTokenizer>>>,
 }
  2. Use the cache in tokenizer_hf():
@@
-    pub fn tokenizer_hf(&self) -> anyhow::Result<HfTokenizer> {
-        match &self.tokenizer {
-            Some(TokenizerKind::HfTokenizerJson(file)) => {
-                HfTokenizer::from_file(file).map_err(anyhow::Error::msg)
-            }
-            Some(TokenizerKind::GGUF(t)) => Ok(*t.clone()),
-            None => {
-                anyhow::bail!("Blank ModelDeploymentCard does not have a tokenizer");
-            }
-        }
-    }
+    pub fn tokenizer_hf(&self) -> anyhow::Result<HfTokenizer> {
+        if let Some(tok_arc) = self.hf_tokenizer_cached.get() {
+            return Ok((**tok_arc).clone());
+        }
+        let loaded: HfTokenizer = match &self.tokenizer {
+            Some(TokenizerKind::HfTokenizerJson(file)) => {
+                HfTokenizer::from_file(file).map_err(anyhow::Error::msg)?
+            }
+            Some(TokenizerKind::GGUF(t)) => (*t.clone()),
+            None => anyhow::bail!("Blank ModelDeploymentCard does not have a tokenizer"),
+        };
+        let arc = Arc::new(loaded);
+        // Ignore race: if another thread set it first, keep that one.
+        let _ = self.hf_tokenizer_cached.set(arc.clone());
+        Ok((*arc).clone())
+    }

Note: add once_cell = "1" to Cargo.toml if not already present. This guarantees a single disk load per-process and shares the handle across cloned MDCs and both chat/completions paths.


224-234: Cache HfTokenizer on ModelDeploymentCard to prevent repeated deserialization
tokenizer_hf() is called from watcher, backend, preprocessor, and entrypoints and currently re-loads the ~10 MiB JSON on every call. Add a cached field (e.g. OnceCell<Arc<HfTokenizer>>) to ModelDeploymentCard so the JSON file is deserialized only once.

lib/bindings/python/rust/llm/preprocessor.rs (1)

32-39: ModelDeploymentCard lacks in-card tokenizer cache
ModelDeploymentCard in lib/llm/src/model_card.rs does not include an Arc<OnceCell<…>> (or equivalent) to cache the tokenizer, so every clone of mdc.inner re-loads the tokenizer whenever tokenizer_hf() is called. You must add a shared cache field inside ModelDeploymentCard to ensure all clones reuse the same tokenizer handle.

lib/llm/src/backend.rs (2)

156-160: Avoid panic: unwrap() on decoder error.

process_token_ids() can error; unwrap() will crash the pipeline. Handle and pass-through/log instead.

Apply this patch:

-                    let result = state.decoder.process_token_ids(&data.token_ids).unwrap();
+                    let result = match state.decoder.process_token_ids(&data.token_ids) {
+                        Ok(r) => r,
+                        Err(err) => {
+                            log::warn!("decoder failed; passing through upstream data: {err:?}");
+                            // keep upstream output untouched
+                            return Some((output, state));
+                        }
+                    };

448-456: String slicing by byte offsets can panic on non-UTF-8 boundaries.

offset comes from a byte search; slicing with [pre_append..offset] may split a UTF‑8 codepoint and panic. Use get() to guard.

-                    let partial_token = if offset >= pre_append {
-                        self.jail[pre_append..offset].to_string()
-                    } else {
-                        "".to_string()
-                    };
+                    let partial_token = if offset >= pre_append {
+                        match self.jail.get(pre_append..offset) {
+                            Some(s) => s.to_string(),
+                            None => String::new(),
+                        }
+                    } else {
+                        String::new()
+                    };
lib/llm/src/preprocessor/prompt/template.rs (1)

36-43: Same Path formatting issue for chat_template_file.

Use .display() for the context string.

-                    let chat_template = std::fs::read_to_string(chat_template_file)
-                        .with_context(|| format!("fs:read_to_string '{}'", chat_template_file))?;
+                    let chat_template = std::fs::read_to_string(chat_template_file)
+                        .with_context(|| {
+                            format!("fs:read_to_string '{}'", chat_template_file.display())
+                        })?;
🧹 Nitpick comments (24)
lib/llm/src/migration.rs (1)

119-131: Avoid stringly-typed error detection

Matching on a hardcoded error prefix is brittle. Prefer a typed error (e.g., a dedicated variant like EngineStreamError::Disconnected) surfaced by the downstream layer, or attach a machine-checkable code via thiserror context and match on that.
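
For illustration only, a minimal thiserror sketch; EngineStreamError and its variants are hypothetical names, not existing types in this crate:

use thiserror::Error;

#[derive(Debug, Error)]
pub enum EngineStreamError {
    #[error("engine stream disconnected: {0}")]
    Disconnected(String),
    #[error(transparent)]
    Other(#[from] anyhow::Error),
}

// Callers match on the variant instead of comparing error strings.
fn is_disconnect(err: &EngineStreamError) -> bool {
    matches!(err, EngineStreamError::Disconnected(_))
}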

lib/llm/src/model_card.rs (2)

464-525: Repo loading path: minor guardrails

Consider logging which source provided context_length and the chosen value to aid debugging heterogeneous repos. Optional.
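
If added, a small helper along these lines would keep the log structured (names are illustrative, not existing code):

fn log_context_length_choice(source: &str, context_length: usize) {
    // Structured fields keep the log greppable across heterogeneous repos.
    tracing::debug!(source, context_length, "selected context_length");
}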


865-872: Path validation: consider canonicalization

check_for_file could optionally canonicalize and return absolute paths to avoid surprises with relative paths and working-directory changes. Low impact.
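
A hedged sketch of what that could look like; the real check_for_file signature may differ:

use std::path::{Path, PathBuf};

fn check_for_file(dir: &Path, name: &str) -> anyhow::Result<Option<PathBuf>> {
    let candidate = dir.join(name);
    if !candidate.exists() {
        return Ok(None);
    }
    // Canonicalize so later reads do not depend on the current working directory.
    Ok(Some(std::fs::canonicalize(candidate)?))
}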

lib/bindings/python/rust/llm/backend.rs (1)

10-10: Remove unused imports to keep CI clean.

Source (and likely Operator) aren’t used in this file.

Apply:

-use dynamo_runtime::pipeline::{Operator, ServiceBackend, ServiceFrontend, Source};
+use dynamo_runtime::pipeline::{ServiceBackend, ServiceFrontend};
lib/llm/src/local_model.rs (2)

199-206: Avoid blocking the async runtime if build() is ever called under load.

If build() may run concurrently on a busy runtime, consider offloading the sync load to a blocking task. Otherwise, ignore.

For example:

-        let mut card =
-            ModelDeploymentCard::load(model_config_path, self.custom_template_path.as_deref())?;
+        let cfg = model_config_path.to_path_buf();
+        let tmpl = self.custom_template_path.clone();
+        let mut card = tokio::task::spawn_blocking(move || {
+            ModelDeploymentCard::load(&cfg, tmpl.as_deref())
+        }).await??;

Also applies to: 253-258


190-197: Doc nit: fix “it’s” → “its”.

Minor grammar in user-facing docs.

-    /// - Load it's ModelDeploymentCard card
+    /// - Load its ModelDeploymentCard
lib/llm/tests/backend.rs (2)

11-11: Pass by reference for clarity.

from_mdc takes &ModelDeploymentCard; pass &mdc to make intent explicit.

-    let operator = Backend::from_mdc(mdc);
+    let operator = Backend::from_mdc(&mdc);

7-8: Drop Tokio runtime annotation since the test is now sync.

Removes unnecessary async scaffolding.

-#[tokio::test]
-async fn test_sequence_factory() {
+#[test]
+fn test_sequence_factory() {
lib/bindings/python/rust/llm/model_card.rs (1)

19-22: Sync load looks good; keep Python API backward-compatible

Return type and error mapping are clean. Consider adding a from_local_path alias to avoid breaking existing Python users.

Add alongside the existing method:

 #[pymethods]
 impl ModelDeploymentCard {
   // Previously called "from_local_path"
   #[staticmethod]
   fn load(path: String, model_name: String) -> PyResult<ModelDeploymentCard> {
     let mut card = RsModelDeploymentCard::load(&path, None).map_err(to_pyerr)?;
     card.set_name(&model_name);
     Ok(ModelDeploymentCard { inner: card })
   }
+
+  /// Back-compat alias for older callers.
+  #[staticmethod]
+  #[pyo3(name = "from_local_path")]
+  fn from_local_path(path: String, model_name: String) -> PyResult<ModelDeploymentCard> {
+    Self::load(path, model_name)
+  }
 }
lib/llm/src/entrypoint/input/endpoint.rs (1)

76-76: Confirm intentional silent degrade when tokenizer load fails

Backend::from_mdc now warns and returns a backend without a tokenizer on failure (see backend.rs Lines 78-89), removing the previous fallible path. If decode validation or token-based routing relies on a tokenizer, this could mask configuration errors at startup.

Option: provide a strict path that bubbles the error (e.g., Backend::try_from_mdc(...) -> Result<Arc<Self>>) and use it here when engine_config expects tokenization. Otherwise, keep current behavior.

-            let backend = Backend::from_mdc(model.card()).into_operator();
+            let backend = Backend::from_mdc(model.card()).into_operator(); // OK if silent degrade is desired
+            // Alternatively (if available):
+            // let backend = Backend::try_from_mdc(model.card())?.into_operator();
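
A sketch of that strict variant, shown as an impl fragment and assuming Backend::from_tokenizer returns the same Arc-wrapped value that from_mdc does (signatures not verified against the actual code):

impl Backend {
    // Strict counterpart to from_mdc: surface tokenizer-load failures to the
    // caller instead of degrading to a tokenizer-less backend.
    pub fn try_from_mdc(mdc: &ModelDeploymentCard) -> anyhow::Result<Arc<Self>> {
        let tokenizer = mdc.tokenizer_hf()?;
        Ok(Self::from_tokenizer(tokenizer))
    }
}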
lib/llm/src/backend.rs (2)

73-76: Expose a way to enable validate_engine_decode.

validate_engine_decode is hardcoded false. Add a builder/flag so tests or debug deployments can enable it.
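
One low-effort option is to derive the flag from the environment at construction time; a sketch with a made-up variable name:

fn validate_engine_decode_from_env() -> bool {
    // DYN_BACKEND_VALIDATE_DECODE is a hypothetical name, not an existing setting.
    std::env::var("DYN_BACKEND_VALIDATE_DECODE")
        .map(|v| v == "1" || v.eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}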


429-441: Reduce noisy debug logs in the inner loop.

Multiple log::debug! per token will flood logs under load. Gate behind a verbose feature flag or remove.
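
If the logs are kept, they can at least be gated so the formatting cost is only paid when debug output is enabled; an illustrative helper, not existing code:

fn log_decoded_tokens(token_ids: &[u32]) {
    if log::log_enabled!(log::Level::Debug) {
        log::debug!("decoded {} token ids: {:?}", token_ids.len(), token_ids);
    }
}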

lib/llm/src/entrypoint/input/grpc.rs (1)

28-62: Inject shared hf_tokenizer into watcher-driven pipelines

In lib/llm/src/discovery/watcher.rs (around line 316) the call to entrypoint::build_routed_pipeline omits the shared hf_tokenizer, so models discovered via etcd will still each load their own tokenizer. Pass the cloned hf_tokenizer into these builder calls to reuse a single tokenizer instance and prevent redundant loads.

lib/llm/tests/model_card.rs (1)

51-56: Make the error-string assertion less brittle.

“unable to extract” can vary across environments. Consider accepting either that phrase or explicit missing file hints.

Apply this diff:

-    assert!(err.contains("unable to extract"));
+    assert!(
+        err.contains("unable to extract") || err.contains("config.json"),
+        "expected missing required files error, got: {err}"
+    );
lib/llm/tests/preprocessor.rs (2)

347-347: Fix test name typo.

Rename “mulit” → “multi” for clarity.

Apply this diff:

-async fn test_mulit_turn_without_system() {
+async fn test_multi_turn_without_system() {

379-379: Fix test name typo.

Same typo in the second test.

Apply this diff:

-async fn test_mulit_turn_with_system() {
+async fn test_multi_turn_with_system() {
lib/llm/src/discovery/watcher.rs (4)

275-276: Elevate and enrich the MDC-load failure log.

This represents a model-not-ready condition; warn (not info) and include the model name for correlation.

Apply this diff:

-                tracing::info!(error = %err, "load_mdc did not complete");
+                tracing::warn!(error = %err, model_name = %model_entry.name, "load_mdc did not complete");

331-331: Add context to readiness log.

Include model name for easier tracing in multi-model setups.

Apply this diff:

-                tracing::info!("Chat completions is ready");
+                tracing::info!(model_name = %model_entry.name, "Chat completions is ready");

357-357: Add context to readiness log.

Mirror the chat readiness improvement.

Apply this diff:

-                tracing::info!("Completions is ready");
+                tracing::info!(model_name = %model_entry.name, "Completions is ready");

404-406: Optional: apply the same “single tokenizer load” to embeddings.

Embeddings still load the tokenizer twice (preprocessor + backend). You can reuse the already loaded tokenizer for parity and startup latency wins.

Apply this diff:

-            let preprocessor = OpenAIPreprocessor::new(card.clone())?.into_operator();
-            let backend = Backend::from_mdc(&card).into_operator();
+            // Reuse a single tokenizer instance for both preprocessor and backend
+            let formatter = PromptFormatter::no_op();
+            let PromptFormatter::OAI(formatter) = formatter;
+            let hf_tok = card.tokenizer_hf()?;
+            let preprocessor =
+                OpenAIPreprocessor::new_with_parts(card.clone(), formatter, hf_tok.clone())?
+                    .into_operator();
+            let backend = Backend::from_tokenizer(hf_tok).into_operator();
lib/llm/src/preprocessor.rs (3)

109-116: Constructor looks good; optional API tweak to avoid cloning the MDC.

new_with_parts consumes ModelDeploymentCard, forcing upstream clone(). Consider taking &ModelDeploymentCard and cloning inside if needed to reduce copies on hot paths.
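
For illustration, the borrow-then-clone-internally shape with stand-in types (not the real OpenAIPreprocessor):

#[derive(Clone)]
struct Card {
    name: String,
}

struct Preprocessor {
    card: Card,
}

impl Preprocessor {
    // Callers pass &Card; only the constructor pays for the clone, and call
    // sites no longer need card.clone().
    fn new_with_parts(card: &Card) -> Self {
        Self { card: card.clone() }
    }
}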


366-374: Avoid extra clone in spawn_blocking for embeddings batch.

You clone input_strs only to move it into the closure. Move it directly and build refs inside.

-            let input_strs: Vec<String> = arr.to_vec();
-            let encodings = tokio::task::spawn_blocking({
-                let tokenizer = self.tokenizer.clone();
-                let strs = input_strs.clone();
-                move || {
-                    tokenizer.encode_batch(&strs.iter().map(|s| s.as_str()).collect::<Vec<_>>())
-                }
-            })
+            let input_strs: Vec<String> = arr.to_vec();
+            let encodings = tokio::task::spawn_blocking({
+                let tokenizer = self.tokenizer.clone();
+                move || {
+                    let refs: Vec<&str> = input_strs.iter().map(|s| s.as_str()).collect();
+                    tokenizer.encode_batch(&refs)
+                }
+            })
             .await??;

101-106: Deprecate OpenAIPreprocessor::new or document its extra tokenizer load

  • OpenAIPreprocessor::new(mdc) remains in three places:
    • lib/bindings/python/rust/llm/preprocessor.rs:33
    • lib/llm/src/entrypoint/input/batch.rs:70
    • lib/llm/src/discovery/watcher.rs:404
  • Add #[deprecated(note = "reloads tokenizer JSON—prefer new_with_parts for cache reuse")] on new and update these call sites to use new_with_parts (or keep new only with a clear doc-warning); a minimal sketch of the attribute follows this list.
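
A minimal sketch of the attribute on a stand-in type (the real constructor signature is not reproduced here):

struct PreprocessorSketch;

impl PreprocessorSketch {
    #[deprecated(note = "reloads tokenizer JSON - prefer new_with_parts for cache reuse")]
    fn new() -> Self {
        Self
    }

    fn new_with_parts() -> Self {
        Self
    }
}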
lib/llm/src/entrypoint/input/common.rs (1)

211-214: Avoid cloning the entire ModelDeploymentCard if not necessary.

Both new_with_parts(card.clone(), ..) and Migration::from_mdc(card.clone()) force clones. If those constructors can take &ModelDeploymentCard, switch to borrowing to reduce copies.

Also applies to: 280-282

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e0bafce and d69902a.

⛔ Files ignored due to path filters (1)
  • lib/runtime/examples/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (18)
  • lib/bindings/python/rust/llm/backend.rs (1 hunks)
  • lib/bindings/python/rust/llm/model_card.rs (1 hunks)
  • lib/bindings/python/rust/llm/preprocessor.rs (1 hunks)
  • lib/llm/src/backend.rs (2 hunks)
  • lib/llm/src/discovery/watcher.rs (6 hunks)
  • lib/llm/src/entrypoint/input/batch.rs (1 hunks)
  • lib/llm/src/entrypoint/input/common.rs (11 hunks)
  • lib/llm/src/entrypoint/input/endpoint.rs (1 hunks)
  • lib/llm/src/entrypoint/input/grpc.rs (2 hunks)
  • lib/llm/src/entrypoint/input/http.rs (2 hunks)
  • lib/llm/src/local_model.rs (1 hunks)
  • lib/llm/src/migration.rs (1 hunks)
  • lib/llm/src/model_card.rs (9 hunks)
  • lib/llm/src/preprocessor.rs (2 hunks)
  • lib/llm/src/preprocessor/prompt/template.rs (3 hunks)
  • lib/llm/tests/backend.rs (1 hunks)
  • lib/llm/tests/model_card.rs (4 hunks)
  • lib/llm/tests/preprocessor.rs (7 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
PR: ai-dynamo/dynamo#2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.

Applied to files:

  • lib/bindings/python/rust/llm/preprocessor.rs
  • lib/bindings/python/rust/llm/backend.rs
🧬 Code graph analysis (18)
lib/llm/src/migration.rs (2)
lib/llm/src/backend.rs (2)
  • from_mdc (79-90)
  • new (359-396)
lib/llm/src/preprocessor.rs (1)
  • new (101-107)
lib/bindings/python/rust/llm/preprocessor.rs (2)
lib/llm/src/preprocessor.rs (1)
  • new (101-107)
lib/bindings/python/rust/lib.rs (3)
  • new (320-345)
  • new (999-1003)
  • to_pyerr (130-135)
lib/bindings/python/rust/llm/backend.rs (1)
lib/llm/src/backend.rs (1)
  • from_mdc (79-90)
lib/bindings/python/rust/llm/model_card.rs (2)
lib/llm/src/model_card.rs (1)
  • load (375-389)
lib/bindings/python/rust/lib.rs (1)
  • to_pyerr (130-135)
lib/llm/src/discovery/watcher.rs (4)
lib/llm/src/model_card.rs (1)
  • tokenizer_hf (224-234)
lib/llm/src/preprocessor/prompt.rs (1)
  • no_op (125-127)
lib/llm/src/preprocessor.rs (2)
  • new_with_parts (109-129)
  • new (101-107)
lib/llm/src/backend.rs (1)
  • from_mdc (79-90)
lib/llm/src/backend.rs (4)
lib/llm/src/tokenizers/hf.rs (2)
  • from_tokenizer (35-37)
  • from (81-83)
lib/llm/src/tokenizers.rs (6)
  • tokenizer (352-354)
  • from (118-120)
  • from (127-129)
  • new (188-203)
  • new (280-287)
  • new (507-515)
lib/bindings/python/rust/llm/backend.rs (1)
  • new (23-29)
lib/llm/src/preprocessor.rs (1)
  • new (101-107)
lib/llm/tests/model_card.rs (2)
lib/bindings/python/rust/llm/model_card.rs (1)
  • load (19-23)
lib/llm/src/model_card.rs (1)
  • load (375-389)
lib/llm/src/preprocessor/prompt/template.rs (3)
lib/llm/src/preprocessor.rs (1)
  • new (101-107)
lib/llm/src/model_card.rs (3)
  • from_gguf (425-462)
  • from_gguf (704-741)
  • from_gguf (771-776)
lib/llm/src/preprocessor/prompt/template/tokcfg.rs (1)
  • from_gguf (97-159)
lib/llm/src/entrypoint/input/batch.rs (1)
lib/llm/src/preprocessor.rs (1)
  • new (101-107)
lib/llm/src/preprocessor.rs (4)
lib/llm/src/backend.rs (3)
  • new (359-396)
  • from_mdc (79-90)
  • from_tokenizer (69-77)
lib/llm/src/preprocessor/prompt/template.rs (1)
  • from_mdc (20-59)
lib/llm/src/model_card.rs (1)
  • mdcsum (204-207)
lib/llm/src/tokenizers/hf.rs (1)
  • from_tokenizer (35-37)
lib/llm/src/entrypoint/input/endpoint.rs (1)
lib/llm/src/backend.rs (1)
  • from_mdc (79-90)
lib/llm/src/entrypoint/input/grpc.rs (3)
lib/llm/src/model_card.rs (1)
  • tokenizer_hf (224-234)
lib/llm/src/local_model.rs (1)
  • card (331-333)
lib/llm/src/entrypoint/input/common.rs (5)
  • build_routed_pipeline (226-257)
  • build_pipeline (175-178)
  • build_pipeline (194-224)
  • build_pipeline (334-337)
  • build_pipeline (353-353)
lib/llm/src/model_card.rs (1)
lib/bindings/python/rust/llm/model_card.rs (1)
  • load (19-23)
lib/llm/tests/preprocessor.rs (2)
lib/llm/src/model_card.rs (1)
  • load (375-389)
lib/llm/src/preprocessor/prompt/template.rs (1)
  • from_mdc (20-59)
lib/llm/src/entrypoint/input/http.rs (3)
lib/llm/src/model_card.rs (1)
  • tokenizer_hf (224-234)
lib/llm/src/local_model.rs (1)
  • card (331-333)
lib/llm/src/entrypoint/input/common.rs (5)
  • build_routed_pipeline (226-257)
  • build_pipeline (175-178)
  • build_pipeline (194-224)
  • build_pipeline (334-337)
  • build_pipeline (353-353)
lib/llm/src/local_model.rs (2)
lib/bindings/python/rust/llm/model_card.rs (1)
  • load (19-23)
lib/llm/src/model_card.rs (1)
  • load (375-389)
lib/llm/tests/backend.rs (2)
lib/llm/src/model_card.rs (1)
  • load (375-389)
lib/llm/src/backend.rs (1)
  • from_mdc (79-90)
lib/llm/src/entrypoint/input/common.rs (4)
lib/llm/src/backend.rs (2)
  • from_mdc (79-90)
  • from_tokenizer (69-77)
lib/llm/src/preprocessor/prompt/template.rs (1)
  • from_mdc (20-59)
lib/llm/src/preprocessor.rs (1)
  • new_with_parts (109-129)
lib/llm/src/model_card.rs (1)
  • load (375-389)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (31)
lib/llm/src/model_card.rs (2)

371-389: Sync load path looks good

Synchronous dispatch between local dir and GGUF is clear and preserves prior semantics.


904-921: Targeted unit tests 👍

Good coverage that exercises HF config parsing on real samples.

lib/bindings/python/rust/llm/preprocessor.rs (1)

32-39: Nice: removed ad-hoc runtime, now using sync constructor

Constructor is simpler and avoids a Tokio runtime just for init.

lib/bindings/python/rust/llm/backend.rs (2)

23-27: Borrowing MDC for Backend::from_mdc is correct and matches the new sync API.

This aligns with the borrowed, synchronous constructor and avoids unnecessary moves. Looks good.


24-28: Python binding uses Rust Backend::from_mdc without reloading tokenizer JSON
Verified no direct Tokenizer::from_file, HfTokenizer::from_file, or tokenizer_hf() calls in lib/bindings/python/rust/llm—the Python layer delegates to the Rust Backend, which handles the tokenizer exactly once.

lib/llm/src/local_model.rs (1)

256-256: Sync ModelDeploymentCard::load call is appropriate here.

Startup path doing a single blocking load is fine and simplifies flow.

lib/llm/tests/backend.rs (1)

9-9: LGTM on switching to sync load.

Test reads cleaner with the synchronous ModelDeploymentCard::load.

lib/llm/src/entrypoint/input/grpc.rs (2)

93-114: Good: single tokenizer load shared by chat and completions.

Loads tokenizer once, clones the handle, and threads it through both pipelines. Matches PR goal.


134-147: Good: StaticCore path also reuses the same tokenizer handle.

Consistent with StaticRemote; avoids double JSON load.

lib/llm/src/preprocessor/prompt/template.rs (1)

55-57: GGUF path LGTM.

Borrowing gguf_path and using ChatTemplate::from_gguf keeps this sync API clean.

lib/llm/tests/model_card.rs (4)

11-12: Sync API migration LGTM.

Switch to synchronous load and get_model_info is correct and matches API changes.


23-23: Error-path test LGTM.

Direct Result use without await is correct.


29-34: Tokenizer-kind assertion LGTM.

Covers both JSON and GGUF.


39-45: Prompt-formatter assertion LGTM.

Matches expected artifact type.

lib/llm/tests/preprocessor.rs (7)

60-64: Sync MDC load LGTM.

Constructing MDC synchronously and setting name/context is correct.


286-287: Borrowed MDC usage LGTM.

Passing &ModelDeploymentCard to from_mdc avoids cloning and IO.


318-319: Repeat pattern LGTM.

Same efficient borrowing for the tools test.


355-356: Repeat pattern LGTM.

Consistent with sync APIs.


387-388: Repeat pattern LGTM.

Consistent usage for system+multi-turn case.


425-426: Repeat pattern LGTM.

Consistent usage for system+tools case.


468-469: Explicit borrow LGTM.

Using &mdc here is correct and consistent.

lib/llm/src/discovery/watcher.rs (6)

179-180: Deletion log LGTM.

Structured field is helpful for ops.


185-186: Error log LGTM.

Using error = %e yields better formatting.


311-313: Great: load tokenizer once and reuse.

Meets the PR objective to avoid duplicate 10 MiB JSON loads.


325-331: Pass-through of tokenizer to chat builder LGTM.

Prevents a second load on the chat path.


337-341: Preprocessor constructed with shared tokenizer LGTM.

Avoids redundant tokenizer re-loading in the completions path.


352-353: Forwarding tokenizer to completions pipeline LGTM.

Ensures a single tokenizer instance for both chat and completions.

lib/llm/src/entrypoint/input/common.rs (2)

134-146: Nice: single tokenizer load for chat pipeline (meets PR objective).

You obtain hf_tokenizer once and share it downstream. This avoids the second 10 MiB JSON read and aligns chat/completions readiness.


194-199: Verify downstream callers updated: no internal call sites for build_pipeline or build_routed_pipeline were found—ensure all entrypoints, tests, and plugins pass the new hf_tokenizer parameter.

lib/llm/src/entrypoint/input/http.rs (2)

122-134: LGTM: tokenizer loaded once and shared across chat and completions.

Static Remote and Static Core paths now pass a single tokenizer_hf handle (cloned) into both pipelines. This removes the redundant disk IO and aligns readiness timing.

Also applies to: 137-143, 172-185


123-134: Minor: propagate formatter mismatch as early error if needed.

If PromptFormatter::from_mdc can return non-OAI variants, ensure downstream builders handle that gracefully (they currently will via anyhow at construction sites in common.rs after you apply the let-else fix).
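
For reference, the let-else shape being suggested, shown on a hypothetical two-variant enum so the snippet stands alone (the real PromptFormatter lives in lib/llm/src/preprocessor/prompt/template.rs):

enum PromptFormatter {
    OAI(String),
    Other(String),
}

fn require_oai(formatter: PromptFormatter) -> anyhow::Result<String> {
    // Surfaces an unsupported variant as a descriptive error at construction
    // time rather than a panic at the destructuring site.
    let PromptFormatter::OAI(inner) = formatter else {
        anyhow::bail!("expected an OpenAI-style prompt formatter");
    };
    Ok(inner)
}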

@grahamking grahamking merged commit cb5a657 into main Sep 5, 2025
11 checks passed
@grahamking grahamking deleted the gk-single-tok-load branch September 5, 2025 20:48
GavinZhu-GMI pushed a commit to GavinZhu-GMI/dynamo that referenced this pull request Sep 8, 2025
nnshah1 pushed a commit that referenced this pull request Sep 8, 2025
nnshah1 pushed a commit that referenced this pull request Sep 8, 2025
indrajit96 pushed a commit that referenced this pull request Sep 9, 2025