
Conversation

@GuanLuo
Contributor

@GuanLuo GuanLuo commented Aug 27, 2025

Overview:

This PR introduces a new tensor request / response type for the backend; tensors are the common exchange format in non-LLM inference. With #2638, Dynamo already has a frontend that handles tensor requests / responses (KServe gRPC protocol) but converts them to OpenAI completion messages; with this PR we can enable more traditional, non-LLM use cases. A client-side sketch of the intended flow follows.
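For concreteness, a minimal client-side sketch of the flow this enables, assuming a KServe-v2-compatible gRPC client (tritonclient is used here purely for illustration) pointed at the Dynamo KServe frontend; the URL/port, model name, and I/O names/shapes are placeholders, and the real values come from the model's registered tensor config described under Details:

# Hypothetical client example: the frontend speaks the KServe v2 gRPC protocol,
# so any KServe-v2 client works; tritonclient is used here for illustration only.
# The URL, model name, and tensor names/shapes below are placeholders.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8787")

# Query the I/O names, shapes, and data types the backend exposed via runtime_config.
metadata = client.get_model_metadata(model_name="my_tensor_model")
print(metadata)

# Prepare an input tensor and run a unary inference.
inp = grpcclient.InferInput("INPUT0", [1, 4], "FP32")
inp.set_data_from_numpy(np.ones((1, 4), dtype=np.float32))
result = client.infer(model_name="my_tensor_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))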

NOTE:

Details:

  • Introduce NvCreateTensorRequest / NvCreateTensorResponse message types that a backend can consume and produce
  • Introduce ModelType::Tensor and ModelInput::Tensor, which can be passed during register_llm(). An engine that interacts with tensor messages will be created and registered only if both are set. The user may eventually register the same model for both tensor and text input under different endpoints, but that requires validation; for now we restrict the tensor types so they cannot be combined with the existing LLM inference types.
  • Extend the KServe gRPC frontend to convert KServe gRPC messages from / to tensor messages. Since the frontend can also convert OpenAI completion messages, it checks the registered model type and performs the conversion in the order: 1. tensor; 2. completion.
  • As the tensor type is more relaxed, the user will need to query model metadata (i.e. I/O names, shapes, and data types) to prepare the input properly. For this PR, we exploit the runtime_config field in ModelDeploymentCard, which the backend can pass during the register_llm() call. When the ModelMetadata / ModelConfig endpoints are invoked, the runtime_config is fetched and converted to the corresponding messages. If a model's type is tensor, we expect the runtime_config to be the Pythonized representation of a TensorModelConfig as defined in lib/llm/src/protocols/tensor.rs (see the registration sketch after this list). In a future PR, we should define a proper field in ModelDeploymentCard, similar to how the tokenizer config is fetched and stored.
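A minimal backend-side registration sketch, assuming the Python bindings added in this PR (ModelInput, ModelType, ModelRuntimeConfig.set_tensor_model_config, register_llm); the import path for register_llm, the argument names/order, and the config field names are illustrative rather than exact signatures:

# Illustrative sketch only: the argument order/names and the register_llm import
# path are assumptions, not the exact binding signature.
from dynamo._core import ModelInput, ModelType, ModelRuntimeConfig
from dynamo.llm import register_llm  # assumed import location

async def register_tensor_model(endpoint):
    # endpoint: an already-created backend endpoint from the DistributedRuntime.
    runtime_config = ModelRuntimeConfig()
    # Pythonized TensorModelConfig (see lib/llm/src/protocols/tensor.rs);
    # the field names and values below are placeholders.
    runtime_config.set_tensor_model_config({
        "name": "my_tensor_model",
        "inputs": [{"name": "INPUT0", "data_type": "FP32", "shape": [-1, 4]}],
        "outputs": [{"name": "OUTPUT0", "data_type": "FP32", "shape": [-1, 4]}],
    })

    # Both Tensor flags must be set for the tensor engine to be created and registered.
    await register_llm(
        ModelInput.Tensor,
        ModelType.Tensor,  # exposed as Tensor / TensorBased depending on the binding
        endpoint,
        "my_tensor_model",
        runtime_config=runtime_config,
    )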

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • End-to-end Tensor model support: registration, routing, management, and streaming inference with request‑ID annotation, telemetry, and disconnect handling.
    • API surface: Tensor recognized across gRPC/KServe and HTTP metrics; model listing, metadata/config, and streaming paths added.
    • Python API: ModelInput/ModelType expose Tensor/TensorBased; runtime tensor_model_config can be set/retrieved from Python.
  • Documentation

    • Docstrings updated to list Tensor where applicable.
  • Tests

    • New tests covering Tensor registration, config propagation, inference, streaming, and error cases.

@copy-pr-bot

copy-pr-bot bot commented Aug 27, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@grahamking
Contributor

@ryanolson Could you take a look?

@GuanLuo GuanLuo force-pushed the gluo/dis-540-full-tensor-support branch from dac73a6 to 226031b on September 3, 2025 00:07
@GuanLuo GuanLuo force-pushed the gluo/dis-540-full-tensor-support branch from 226031b to d63d1d5 on September 10, 2025 02:07
@GuanLuo GuanLuo marked this pull request as ready for review September 10, 2025 18:13
@GuanLuo GuanLuo requested review from a team as code owners September 10, 2025 18:13
@coderabbitai
Contributor

coderabbitai bot commented Sep 10, 2025

Walkthrough

Adds first-class Tensor support across the codebase: new tensor protocol types and engines, ModelInput/ModelType Tensor variants, runtime config field and Python bindings, gRPC/HTTP routing and conversions for tensor flows, discovery/watcher and model manager tensor plumbing, and tests for tensor inference and streaming.

Changes

Cohort / File(s) Summary
Python bindings: enums & registration
lib/bindings/python/rust/lib.rs, lib/bindings/python/src/dynamo/_core.pyi
Expose ModelInput::Tensor and ModelType::TensorBased to Python; update register_llm to map ModelInput::Tensor; update docstrings to include Tensor.
Python runtime config methods
lib/bindings/python/rust/llm/local_model.rs
Add ModelRuntimeConfig.set_tensor_model_config and get_tensor_model_config for PyDict ↔ internal tensor_model_config conversions.
Python tensor tests
lib/bindings/python/tests/test_tensor.py
New test registering a Tensor model, round-tripping ModelRuntimeConfig.tensor_model_config, calling register_llm with Tensor types, and optional end-to-end serving callback.
Model types & runtime config
lib/llm/src/model_type.rs, lib/llm/src/local_model/runtime_config.rs
Add ModelType::TensorBased flag and supports_tensor(); add ModelInput::Tensor; add tensor_model_config: Option<tensor::TensorModelConfig> to ModelRuntimeConfig.
Protocols & types
lib/llm/src/protocols.rs, lib/llm/src/protocols/tensor.rs, lib/llm/src/types.rs
Add protocols::tensor module: DataType, FlattenTensor, TensorMetadata, TensorModelConfig, Tensor container, NvCreateTensorRequest/Response, annotations/NvExt providers, DeltaAggregator and stream folding; re-export NV types and add generic tensor engine aliases.
Discovery & watcher
lib/llm/src/discovery/model_manager.rs, lib/llm/src/discovery/watcher.rs
Add tensor_engines to ModelManager with list/add/remove/get APIs; watcher registers/removes tensor models, emits ModelUpdate for TensorBased, and handles Tensor+TensorBased registration path.
gRPC/KServe services
lib/llm/src/grpc/service.rs, lib/llm/src/grpc/service/tensor.rs, lib/llm/src/grpc/service/kserve.rs, lib/llm/src/grpc/service/openai.rs
Add tensor gRPC module; implement tensor_response_stream, disconnect monitoring, TryFrom conversions between inference and tensor types, DataType parsing/Display/to_kserve, KServe routing for tensor vs completions, and OpenAI↔inference conversion helpers.
HTTP metrics & endpoints
lib/llm/src/http/service/metrics.rs
Add Endpoint::Tensor variant and map Display / as_str() to "tensor".
Tests: service-level
lib/llm/tests/kserve_service.rs, lib/llm/tests/http-service.rs
Add TensorEngine test harness and KServe tensor tests (metadata/config, infer, streaming); add Endpoint::Tensor arm (todo) in http-service tests.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant KServe
  participant State
  participant Manager
  participant Engine
  Note over Client,KServe: Tensor inference (single or streaming)
  Client->>KServe: ModelInferRequest
  KServe->>State: resolve model type (is_tensor_model)
  State->>Manager: get_tensor_engine(model)
  Manager-->>State: TensorStreamingEngine
  KServe->>Engine: NvCreateTensorRequest (with annotations/request_id)
  activate Engine
  loop streaming responses
    Engine-->>KServe: Annotated<NvCreateTensorResponse>
    KServe->>KServe: process annotations / metrics
    KServe-->>Client: Model(Infer|Stream)Response
  end
  deactivate Engine
  Note over KServe: grpc_monitor_for_disconnects watches context and ends stream on disconnect

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Possibly related PRs

Poem

I twitch my whiskers at a tensor stream,
Carrots encoded in a binary dream.
From Python bolts to Rusty root I hop,
I echo shapes until the streams all stop.
— a rabbit, nibbling on bytes 🥕

Pre-merge checks

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 59.42%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title Check (✅ Passed): The title "feat: tensor type for generic inference." is concise, uses conventional commit style, and accurately summarizes the primary change of adding tensor request/response types and model support for generic (non-LLM) inference, so it clearly reflects the main scope of the PR.
  • Description Check (✅ Passed): The PR description provides a clear Overview and detailed Details explaining the new tensor message types, registration behavior, frontend conversion logic, and runtime_config usage, but the "Where should the reviewer start?" section is empty and the Related Issues field still contains a placeholder, so reviewer guidance and issue linkage are incomplete.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
lib/llm/src/discovery/watcher.rs (1)

150-159: Fix duplicate-add path: has_model_any ignores Embedding/Tensor.

This block relies on ModelManager::has_model_any to short-circuit for existing models, but that method currently only checks chat/completions. New embeddings/tensor endpoints for an existing model name will fall through and attempt to re-add, yielding ModelAlreadyExists errors and noisy logs.

Apply this diff to include embeddings and tensor in has_model_any (in model_manager.rs):

--- a/lib/llm/src/discovery/model_manager.rs
+++ b/lib/llm/src/discovery/model_manager.rs
@@ -69,9 +69,12 @@ impl ModelManager {
     pub fn has_model_any(&self, model: &str) -> bool {
-        self.chat_completion_engines.read().unwrap().contains(model)
-            || self.completion_engines.read().unwrap().contains(model)
+        self.chat_completion_engines.read().unwrap().contains(model)
+            || self.completion_engines.read().unwrap().contains(model)
+            || self.embeddings_engines.read().unwrap().contains(model)
+            || self.tensor_engines.read().unwrap().contains(model)
     }
lib/llm/src/discovery/model_manager.rs (1)

69-72: Update has_model_any to include Embedding/Tensor.

Prevents duplicate add attempts in watcher for those types (see watcher.rs Lines 150-159).
Apply:

-    pub fn has_model_any(&self, model: &str) -> bool {
-        self.chat_completion_engines.read().unwrap().contains(model)
-            || self.completion_engines.read().unwrap().contains(model)
-    }
+    pub fn has_model_any(&self, model: &str) -> bool {
+        self.chat_completion_engines.read().unwrap().contains(model)
+            || self.completion_engines.read().unwrap().contains(model)
+            || self.embeddings_engines.read().unwrap().contains(model)
+            || self.tensor_engines.read().unwrap().contains(model)
+    }
🧹 Nitpick comments (19)
lib/llm/tests/http-service.rs (1)

209-216: Avoid leaving todo!() in test match arms

Using todo!() will panic if compute_index is called with new endpoints during future test expansion. Either fully support the new endpoints here or make the intent explicit with unreachable!/clear error to reduce surprise.

Apply one of:

-        Endpoint::Embeddings => todo!(),
-        Endpoint::Responses => todo!(),
-        Endpoint::Tensor => todo!(),
+        Endpoint::Embeddings | Endpoint::Responses | Endpoint::Tensor => {
+            unreachable!("Only Completions/ChatCompletions are indexed in this test")
+        }

Or generalize the test to compute indices from a fixed ordered slice including the new endpoints and expand the expected array accordingly.

lib/llm/src/model_type.rs (1)

101-104: Comment: clarify endpoint mapping for Tensor

Note reads that Tensor doesn’t map to HTTP EndpointType. Consider adding a brief rationale (e.g., “served via gRPC/KServe only”) or a TODO linking to where Tensor endpoints would be exposed if/when HTTP support lands.

lib/llm/src/local_model/runtime_config.rs (1)

26-35: Fix typos and tighten comment wording around tensor_model_config

Minor nits in comments (“convinence/convinent”, “workout”) and clarity around future direction.

Proposed comment edits:

-    // Provide tensor model config in the case where the model type is Tensor.
-    // Currently use JSON object for convinence, the programmatic way is to
-    // define the model config struct as part of the tensor protocol and
-    // import it here.
-    // [gluo TODO] switch to ModelConfig if desired and workout a way to
-    // prepare it in a convinent way, the protobuf library used by tonic
-    // doesn't provide JSON parsing.
+    // Tensor model configuration used when the model type is Tensor.
+    // We currently use a Rust struct (see tensor::TensorModelConfig) rather than raw JSON for clarity and type-safety.
+    // [gluo TODO] If we switch to a dedicated ModelConfig, figure out a convenient way to construct it;
+    // the prost/tonic protobuf stack doesn't provide JSON parsing.
lib/bindings/python/rust/lib.rs (1)

162-168: Pre-validate Tensor combinations for clearer Python errors

Improve UX by failing fast if ModelType/ModelInput combinations are inconsistent (mixing Tensor with non-Tensor) rather than surfacing a lower-level attach error.

 fn register_llm<'p>(
@@
-    let model_input = match model_input {
+    let model_input = match model_input {
         ModelInput::Text => llm_rs::model_type::ModelInput::Text,
         ModelInput::Tokens => llm_rs::model_type::ModelInput::Tokens,
         ModelInput::Tensor => llm_rs::model_type::ModelInput::Tensor,
     };
 
     let model_type_obj = model_type.inner;
+    // Validate Tensor combos early for a clearer Python-side error
+    if model_type_obj.supports_tensor()
+        && !matches!(model_input, llm_rs::model_type::ModelInput::Tensor)
+    {
+        return Err(PyErr::new::<PyException, _>(
+            "Tensor models must use ModelInput.Tensor",
+        ));
+    }
+    if !model_type_obj.supports_tensor()
+        && matches!(model_input, llm_rs::model_type::ModelInput::Tensor)
+    {
+        return Err(PyErr::new::<PyException, _>(
+            "ModelInput.Tensor requires ModelType.Tensor",
+        ));
+    }
lib/bindings/python/rust/llm/local_model.rs (1)

55-62: Setter doesn’t need a Python token

_py isn’t used. Simplify the API by removing it.

Alternative:

-fn set_tensor_model_config(
-    &mut self,
-    _py: Python<'_>,
-    tensor_model_config: &Bound<'_, PyDict>,
-) -> PyResult<()> {
+fn set_tensor_model_config(
+    &mut self,
+    tensor_model_config: &Bound<'_, PyDict>,
+) -> PyResult<()> {
lib/bindings/python/tests/test_tensor.py (4)

13-13: Make TEST_END_TO_END parsing robust

Current code treats "0" as truthy. Parse env var to a boolean.

-TEST_END_TO_END = os.environ.get("TEST_END_TO_END", 0)
+TEST_END_TO_END = os.getenv("TEST_END_TO_END", "0").lower() in ("1", "true", "yes")

49-51: Silence unused context arg in test generator

Rename to _context to satisfy linters.

-async def generate(request, context):
+async def generate(request, _context):

4-4: Correct or remove the usage comment

This test file references dynamo.mocker; update to a pytest invocation or remove to avoid confusion.

-# Usage: `python -m dynamo.mocker --model-path /data/models/Qwen3-0.6B-Q8_0.gguf --extra-engine-args args.json`
+# Usage: `pytest -k test_tensor -s` (or run as a module locally for debug)

33-43: Avoid relying on a real HF model in unit tests

Registering with "Qwen/Qwen3-0.6B" can make tests flaky/offline-hostile. Prefer a local stub/noop engine or a feature flag/mocked loader for tensor tests.

If helpful, I can sketch a minimal stub engine and a feature gate for tests.

lib/llm/src/discovery/watcher.rs (1)

217-236: Removal bookkeeping extended for tensor — looks consistent.

The tensor remove path mirrors the existing logic and extends the debug line. One minor ask: the “removed” flags are only raised when the last model of that type is removed. If the intent is per-model removal signaling (not just “no more of this type in the system”), consider emitting ModelUpdate::Removed regardless, or change the flag names to reflect the “last-of-type removed” semantics. Not blocking.

Would you like a quick patch to emit per-model-type removal events irrespective of “last-of-type”?

Also applies to: 238-262

lib/llm/src/grpc/service/tensor.rs (2)

35-45: Docstring says we fold non-streaming, but code always returns a stream.

Either (a) update the comment to reflect that callers do the folding, or (b) add a unary helper that collects the stream when streaming == false. Example helper (outside this function):

pub async fn tensor_unary_response(
    state: Arc<kserve::State>,
    request: NvCreateTensorRequest,
) -> Result<NvCreateTensorResponse, Status> {
    let mut s = tensor_response_stream(state, request, false).await?;
    futures::pin_mut!(s);
    let mut final_resp = None;
    while let Some(item) = s.next().await {
        final_resp = Some(item.into_inner()); // adjust if Annotated exposes inner
    }
    final_resp.ok_or_else(|| Status::internal("empty response"))
}

55-59: Include model name in not_found error.

Improves observability.

-        .map_err(|_| Status::not_found("model not found"))?;
+        .map_err(|_| Status::not_found(format!("model not found: {}", model)))?;
lib/llm/tests/kserve_service.rs (1)

190-234: TensorEngine repeat logic: sanitize count and avoid i32 ranges

repeat_count is i32 and used directly in 0..repeat_count. Negative values yield an empty range; large values could DOS the test. Convert to usize with saturation and cap to a small max (e.g., 10) to keep tests deterministic.

-            let stream = async_stream::stream! {
-                for _ in 0..repeat_count {
+            let stream = async_stream::stream! {
+                let repeat = repeat_count.max(0) as usize;
+                let repeat = repeat.min(10);
+                for _ in 0..repeat {
                     yield Annotated::from_data(NvCreateTensorResponse {
                         id: request.id.clone(),
                         model: request.model.clone(),
                         tensors: request.tensors.clone(),
                     });
                 }
             };
lib/llm/src/grpc/service/kserve.rs (5)

198-216: Misleading comment and error text on tensor path

Comment says “Fallback handling by assuming the model is OpenAI Completions model” and error string says “completions” on the tensor branch. Update both to “tensor” for clarity.

-        // [gluo TODO] refactor to reuse code, inference logic is largely the same
-        let mut reply = if self.state().is_tensor_model(&model) {
-            // Fallback handling by assuming the model is OpenAI Completions model
+        // [gluo TODO] refactor to reuse code, inference logic is largely the same
+        let mut reply = if self.state().is_tensor_model(&model) {
+            // Tensor path
             let tensor_request: NvCreateTensorRequest = request
                 .try_into()
                 .map_err(|e| Status::invalid_argument(format!("Failed to parse request: {}", e)))?;
@@
-            let tensor_response = NvCreateTensorResponse::from_annotated_stream(stream)
+            let tensor_response = NvCreateTensorResponse::from_annotated_stream(stream)
                 .await
                 .map_err(|e| {
-                    tracing::error!("Failed to fold completions stream: {:?}", e);
-                    Status::internal(format!("Failed to fold completions stream: {}", e))
+                    tracing::error!("Failed to fold tensor stream: {:?}", e);
+                    Status::internal(format!("Failed to fold tensor stream: {}", e))
                 })?;

405-481: Metadata path: good bifurcation by model type

Tensor vs. completions mapping is clear and matches tests. Consider extracting the tensor-metadata mapping into a helper to avoid duplication with model_config().


487-576: Config path: mirrors metadata; consider shared builder

Same mapping duplication as metadata; a shared helper would reduce drift.


90-93: Minor: accept &str in is_tensor_model to avoid cloning

Taking &str is more ergonomic and avoids requiring a String at call sites.

-    fn is_tensor_model(&self, model: &String) -> bool {
-        self.manager.list_tensor_models().contains(model)
+    fn is_tensor_model(&self, model: &str) -> bool {
+        self.manager.list_tensor_models().contains(&model.to_string())
     }

1006-1031: Improve error context in DataType::from_str

Include the offending string in the error to aid debugging.

-            _ => Err(anyhow::anyhow!("Invalid data type")),
+            _ => Err(anyhow::anyhow!(format!("Invalid data type: {}", s))),
lib/llm/src/protocols/tensor.rs (1)

22-31: Make DataType Copy

DataType is small and immutable; deriving Copy avoids unnecessary moves/clones.

-#[derive(Debug, Serialize, Clone, Eq, PartialEq, Deserialize)]
+#[derive(Debug, Serialize, Clone, Copy, Eq, PartialEq, Deserialize)]
 pub enum DataType {
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ae38bd4 and d63d1d5.

📒 Files selected for processing (17)
  • lib/bindings/python/rust/lib.rs (3 hunks)
  • lib/bindings/python/rust/llm/local_model.rs (1 hunks)
  • lib/bindings/python/src/dynamo/_core.pyi (1 hunks)
  • lib/bindings/python/tests/test_tensor.py (1 hunks)
  • lib/llm/src/discovery/model_manager.rs (8 hunks)
  • lib/llm/src/discovery/watcher.rs (5 hunks)
  • lib/llm/src/grpc/service.rs (1 hunks)
  • lib/llm/src/grpc/service/kserve.rs (11 hunks)
  • lib/llm/src/grpc/service/tensor.rs (1 hunks)
  • lib/llm/src/http/service/metrics.rs (3 hunks)
  • lib/llm/src/local_model/runtime_config.rs (2 hunks)
  • lib/llm/src/model_type.rs (5 hunks)
  • lib/llm/src/protocols.rs (1 hunks)
  • lib/llm/src/protocols/tensor.rs (1 hunks)
  • lib/llm/src/types.rs (1 hunks)
  • lib/llm/tests/http-service.rs (1 hunks)
  • lib/llm/tests/kserve_service.rs (6 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
PR: ai-dynamo/dynamo#2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.

Applied to files:

  • lib/llm/src/local_model/runtime_config.rs
📚 Learning: 2025-08-21T17:23:02.836Z
Learnt from: michaelfeil
PR: ai-dynamo/dynamo#2591
File: lib/bindings/python/rust/http.rs:0-0
Timestamp: 2025-08-21T17:23:02.836Z
Learning: In lib/bindings/python/rust/http.rs, the enable_endpoint method uses EndpointType::all() to dynamically support all available endpoint types with case-insensitive matching, which is more maintainable than hardcoded match statements for endpoint type mappings.

Applied to files:

  • lib/llm/tests/http-service.rs
  • lib/llm/src/model_type.rs
🧬 Code graph analysis (12)
lib/llm/src/grpc/service.rs (1)
lib/llm/src/grpc/service/kserve.rs (1)
  • tensor (847-852)
lib/llm/src/protocols.rs (1)
lib/llm/src/grpc/service/kserve.rs (1)
  • tensor (847-852)
lib/llm/src/grpc/service/tensor.rs (4)
lib/llm/src/grpc/service/kserve.rs (3)
  • tensor (847-852)
  • state (130-132)
  • new (57-63)
lib/llm/src/http/service/disconnect.rs (1)
  • create_connection_monitor (100-121)
lib/llm/src/protocols/tensor.rs (1)
  • annotations (113-117)
lib/llm/src/http/service/metrics.rs (4)
  • new (107-210)
  • new (312-333)
  • new (406-415)
  • mark_ok (335-337)
lib/bindings/python/rust/llm/local_model.rs (1)
lib/bindings/python/rust/engine.rs (1)
  • depythonize (318-318)
lib/llm/src/types.rs (1)
lib/llm/src/grpc/service/kserve.rs (1)
  • tensor (847-852)
lib/llm/src/discovery/watcher.rs (2)
lib/bindings/python/src/dynamo/_core.pyi (4)
  • ModelType (850-852)
  • ModelInput (846-848)
  • client (232-236)
  • new (108-126)
lib/runtime/src/pipeline/network/egress/push_router.rs (1)
  • from_client_with_threshold (104-135)
lib/bindings/python/rust/lib.rs (2)
lib/bindings/python/src/dynamo/_core.pyi (2)
  • ModelInput (846-848)
  • ModelType (850-852)
lib/llm/src/model_card.rs (2)
  • model_type (553-553)
  • model_type (752-754)
lib/llm/src/grpc/service/kserve.rs (3)
lib/llm/src/grpc/service/tensor.rs (1)
  • tensor_response_stream (37-112)
lib/llm/src/protocols/tensor.rs (1)
  • from_annotated_stream (141-180)
lib/llm/src/grpc/service/openai.rs (2)
  • get_parsing_options (193-198)
  • completion_response_stream (44-126)
lib/bindings/python/tests/test_tensor.py (3)
lib/bindings/python/src/dynamo/_core.pyi (4)
  • ModelInput (846-848)
  • ModelRuntimeConfig (465-469)
  • ModelType (850-852)
  • DistributedRuntime (31-54)
lib/bindings/python/rust/lib.rs (6)
  • register_llm (147-217)
  • get (445-461)
  • namespace (368-373)
  • create_service (563-569)
  • endpoint (555-561)
  • serve_endpoint (575-601)
lib/bindings/python/rust/llm/local_model.rs (2)
  • set_tensor_model_config (55-62)
  • get_tensor_model_config (64-71)
lib/llm/tests/kserve_service.rs (3)
lib/bindings/python/src/dynamo/_core.pyi (3)
  • ModelRuntimeConfig (465-469)
  • ModelInput (846-848)
  • ModelType (850-852)
lib/llm/src/grpc/service/kserve.rs (3)
  • tensor (847-852)
  • new (57-63)
  • manager (78-80)
lib/llm/src/local_model/runtime_config.rs (1)
  • new (38-40)
lib/llm/src/model_type.rs (3)
lib/bindings/python/src/dynamo/_core.pyi (2)
  • ModelType (850-852)
  • ModelInput (846-848)
lib/llm/src/http/service/metrics.rs (3)
  • as_str (376-384)
  • as_str (388-393)
  • as_str (397-402)
lib/llm/src/endpoint_type.rs (1)
  • as_str (32-39)
lib/llm/src/protocols/tensor.rs (1)
lib/llm/src/grpc/service/tensor.rs (1)
  • annotations (81-94)
🪛 GitHub Actions: Rust pre-merge checks
lib/bindings/python/rust/llm/local_model.rs

[error] 52-52: cargo fmt check failed. Detected formatting differences in function set_tensor_model_config signature. Run 'cargo fmt' to fix formatting.

🪛 Ruff (0.12.2)
lib/bindings/python/tests/test_tensor.py

49-49: Unused function argument: context

(ARG001)

🪛 GitHub Actions: NVIDIA Dynamo Github Validation
lib/llm/src/discovery/model_manager.rs

[error] 96-96: Rust compile error: unwrap() called on RwLockReadGuard; unwrap() is not a method on the read guard. The code uses 'self.tensor_engines.read().unwrap().list()', which fails to compile. The compiler suggests using 'wrap()' instead.


[error] 131-131: Rust compile error: unwrap() called on RwLockWriteGuard; unwrap() is not available on the write guard. The compiler suggests using 'wrap' instead.


[error] 151-151: Rust compile error: unwrap() called on RwLockWriteGuard; unwrap() is not available on the write guard. The compiler suggests using 'wrap' instead.


[error] 194-194: Rust compile error: unwrap() called on RwLockReadGuard; unwrap() is not available on the read guard. The compiler suggests using 'wrap' instead.

🔇 Additional comments (38)
lib/llm/src/http/service/metrics.rs (3)

56-58: Add Tensor endpoint — looks correct

New Endpoint::Tensor variant is consistent with the rest of the enum and doc comment matches usage.


370-371: Display formatting for Tensor is consistent

"tensor" string matches naming conventions for other endpoints.


382-383: as_str mapping for Tensor is correct

Label value "tensor" aligns with metrics label expectations.

lib/llm/src/model_type.rs (5)

49-50: Introduce Tensor bitflag — OK

Adding Tensor = 1<<3 preserves existing flags and leaves room for future growth.


67-70: supports_tensor helper — OK

Parity with other supports_* helpers; improves readability at call sites.


82-85: Include "tensor" in as_vec output

Order remains stable (chat, completions, embedding, tensor). Good.


120-122: Add ModelInput::Tensor — OK

Enum addition is consistent with serde/Display usage.


129-130: ModelInput::as_str for Tensor — OK

Returns "tensor" as expected.

lib/llm/src/grpc/service.rs (1)

6-6: Expose tensor gRPC module — OK

Public mod export is minimal and aligns with the new service surface.

lib/llm/src/protocols.rs (1)

28-28: Expose tensor protocol module — OK

Makes tensor protocol types available to dependents.

lib/bindings/python/src/dynamo/_core.pyi (2)

847-847: Docstring includes Tensor in ModelInput — OK


851-851: Docstring includes Tensor in ModelType — OK

lib/llm/src/local_model/runtime_config.rs (1)

8-9: Import tensor protocol — OK

lib/bindings/python/rust/lib.rs (3)

162-166: Map Python ModelInput::Tensor to Rust — OK


299-303: Expose ModelType.Tensor to Python — OK

Bitflag classattr mirrors Rust side and composes via or as expected.


320-321: Add Python enum value for ModelInput::Tensor — OK

lib/llm/src/types.rs (2)

76-92: Tensor type aliases look consistent and additive

Re-exports and engine aliases align with existing OpenAI sections. Naming and Annotated usage are consistent.


88-91: Alias is correct; no separate stream response type exists
NvCreateTensorResponse is the only response type defined in lib/llm/src/protocols/tensor.rs and supports streaming via its from_annotated_stream implementation.

lib/bindings/python/rust/llm/local_model.rs (1)

64-71: LGTM on get_tensor_model_config

Conversion and Option handling look correct; error mapping is consistent with the rest of the bindings.

lib/llm/src/discovery/watcher.rs (4)

35-36: Tensor protocol import is correct and scoped.

Importing NvCreateTensorRequest/Response here keeps watcher concerns local. No issues.


58-63: Include Tensor in ALL_MODEL_TYPES.

Good addition; ensures ModelUpdate emits for tensor as well.


432-445: Tensor + Tensor routing path looks correct.

The PushRouter generic parameters align with NvCreateTensorRequest and Annotated. Registration to manager via add_tensor_model is as expected.


448-452: Unsupported configuration message updated — good.

Including “Tensor+Tensor” improves operator feedback.

lib/llm/src/discovery/model_manager.rs (5)

18-19: Correct engine type import.

types::generic::tensor::TensorStreamingEngine import aligns with the watcher registration.


40-41: Add tensor_engines field — consistent with other engine maps.

Initialization and locking semantics mirror existing maps.


59-60: Initialize tensor_engines in new() — OK.


74-80: Include tensor models in model_display_names — OK.


95-97: Align RwLock import and unwrap usage
Remove all .unwrap() on .read()/.write() if you’ve imported parking_lot::RwLock (its read()/write() return guards directly), or switch the file to std::sync::RwLock and retain the unwraps. Ensure the use at the top of lib/llm/src/discovery/model_manager.rs matches your chosen lock API.

lib/llm/src/grpc/service/tensor.rs (3)

80-95: Annotation injection — LGTM.

request_id annotation is correctly prefixed to the stream when requested.


121-152: grpc_monitor_for_disconnects — solid.

Arming/disarming semantics and inflight guard handling are correct.


165-177: UUID handling — LGTM.

Gracefully accepts client-provided UUID or generates a new v4.

lib/llm/tests/kserve_service.rs (3)

287-299: Duplicate model entry for "split" likely fine, but consider consolidating

You register runtime metadata for split via save_model_entry after add_completions_model. Looks OK for tests, but you may not need both calls if the manager prevents duplicates; consider documenting intended override behavior.


1147-1400: Great, thorough tensor-path coverage

End-to-end tests for metadata/config, unary infer, error on multi-response in unary, and stream infer are solid and mirror the intended routing behaviors.


1402-1455: Helper asserts mirror echo semantics and ID plumbing

validate_tensor_response checks name, version, id and I/O parity; aligns with the conversion path. LGTM.

lib/llm/src/grpc/service/kserve.rs (2)

300-373: Streaming path correctly preserves request_id and templating; mirrors unary path

The demux note is fair; current behavior is acceptable for now.


994-1004: KServe dtype mapping looks correct

BOOL/UINT32/INT32/FP32/BYTES → TypeBool/TypeUint32/TypeInt32/TypeFp32/TypeString matches expectations.

lib/llm/src/protocols/tensor.rs (2)

67-81: Good: NvExtProvider/AnnotationsProvider implemented consistently

Annotations pass-through via NvExt is clean and keeps parity with other protocols.


140-181: Aggregator behavior and error messages are appropriate

Single-response folding with explicit “Multiple responses in non-streaming mode” aligns with tests and expected semantics.

@GuanLuo GuanLuo force-pushed the gluo/dis-540-full-tensor-support branch from d63d1d5 to 53689af on September 16, 2025 00:05
@GuanLuo GuanLuo force-pushed the gluo/dis-540-full-tensor-support branch from 53689af to 164044f on September 16, 2025 00:08
Signed-off-by: Guan Luo <[email protected]>
@grahamking
Contributor

/ok to test a6054a8

@grahamking
Contributor

@GuanLuo Could you provide an example of how we use this new ModelInput type?

  • Can I use any model? Or something generic-tensor specific (if that makes sense)?
  • Can I do a normal chat completions request with curl?

Contributor

@rmccorm4 rmccorm4 left a comment


LGTM - most comments are just questions for better understanding.

  1. We need docs for both the KServe frontend and the tensor-based model features, describing how to opt in/out and what their current limitations or assumptions are.
  2. I notice a lot of TODOs in the code. If you write up detailed GitHub issues describing some of the TODOs, it is very likely (from past history so far) that an OSS contributor will try to fix or implement them. Even if they don't get picked up, the issues will auto-sync to Linear and be helpful for others to pick up in the future too.

@oandreeva-nv
Contributor

OK to merge, conditional on the CI failures not being due to this PR; it'd also be nice to have some of the logic extracted into smaller methods.

@GuanLuo
Contributor Author

GuanLuo commented Sep 23, 2025

/ok to test 41fdf07

Co-authored-by: Ryan McCormick <[email protected]>
Signed-off-by: GuanLuo <[email protected]>
@GuanLuo GuanLuo enabled auto-merge (squash) September 23, 2025 22:08
@GuanLuo
Contributor Author

GuanLuo commented Sep 24, 2025

/ok to test 2fd767c

@GuanLuo GuanLuo merged commit 6ba64c3 into main Sep 24, 2025
17 of 18 checks passed
@GuanLuo GuanLuo deleted the gluo/dis-540-full-tensor-support branch September 24, 2025 07:31
athreesh pushed a commit that referenced this pull request Sep 24, 2025
Signed-off-by: Guan Luo <[email protected]>
Signed-off-by: GuanLuo <[email protected]>
Co-authored-by: Olga Andreeva <[email protected]>
Co-authored-by: Ryan McCormick <[email protected]>
jasonqinzhou pushed a commit that referenced this pull request Sep 24, 2025
Signed-off-by: Guan Luo <[email protected]>
Signed-off-by: GuanLuo <[email protected]>
Co-authored-by: Olga Andreeva <[email protected]>
Co-authored-by: Ryan McCormick <[email protected]>
Signed-off-by: Jason Zhou <[email protected]>
jasonqinzhou pushed a commit that referenced this pull request Sep 24, 2025
Signed-off-by: Guan Luo <[email protected]>
Signed-off-by: GuanLuo <[email protected]>
Co-authored-by: Olga Andreeva <[email protected]>
Co-authored-by: Ryan McCormick <[email protected]>
Signed-off-by: Jason Zhou <[email protected]>
kylehh pushed a commit that referenced this pull request Sep 25, 2025
Signed-off-by: Guan Luo <[email protected]>
Signed-off-by: GuanLuo <[email protected]>
Co-authored-by: Olga Andreeva <[email protected]>
Co-authored-by: Ryan McCormick <[email protected]>
Signed-off-by: Kyle H <[email protected]>