feat: tensor type for generic inference. #2746
Conversation
@ryanolson Could you take a look?
dac73a6 to 226031b (Compare)
226031b to d63d1d5 (Compare)
Walkthrough
Adds first-class Tensor support across the codebase: new tensor protocol types and engines, ModelInput/ModelType Tensor variants, runtime config field and Python bindings, gRPC/HTTP routing and conversions for tensor flows, discovery/watcher and model manager tensor plumbing, and tests for tensor inference and streaming.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant KServe
    participant State
    participant Manager
    participant Engine
    Note over Client,KServe: Tensor inference (single or streaming)
    Client->>KServe: ModelInferRequest
    KServe->>State: resolve model type (is_tensor_model)
    State->>Manager: get_tensor_engine(model)
    Manager-->>State: TensorStreamingEngine
    KServe->>Engine: NvCreateTensorRequest (with annotations/request_id)
    activate Engine
    loop streaming responses
        Engine-->>KServe: Annotated<NvCreateTensorResponse>
        KServe->>KServe: process annotations / metrics
        KServe-->>Client: Model(Infer|Stream)Response
    end
    deactivate Engine
    Note over KServe: grpc_monitor_for_disconnects watches context and ends stream on disconnect
```
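Any KServe-gRPC-compatible client can drive this flow. As a hedged illustration (not part of this PR), a tritonclient-based call might look like the sketch below; the server address, model name, and tensor names are assumptions:

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Assumed address of the Dynamo KServe gRPC frontend and an illustrative model name.
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Build one FP32 input tensor; the name and shape are illustrative only.
inp = grpcclient.InferInput("INPUT0", [1, 4], "FP32")
inp.set_data_from_numpy(np.ones((1, 4), dtype=np.float32))

# Sends a ModelInferRequest under the hood and returns the response wrapper.
result = client.infer(model_name="my_tensor_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))
```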
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~70 minutes
Possibly related PRs
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
lib/llm/src/discovery/watcher.rs (1)
150-159: Fix duplicate-add path: has_model_any ignores Embedding/Tensor.
This block relies on ModelManager::has_model_any to short-circuit for existing models, but that method currently only checks chat/completions. New embeddings/tensor endpoints for an existing model name will fall through and attempt to re-add, yielding ModelAlreadyExists errors and noisy logs.
Apply this diff to include embeddings and tensor in has_model_any (in model_manager.rs):
```diff
--- a/lib/llm/src/discovery/model_manager.rs
+++ b/lib/llm/src/discovery/model_manager.rs
@@ -69,9 +69,12 @@ impl ModelManager {
     pub fn has_model_any(&self, model: &str) -> bool {
-        self.chat_completion_engines.read().unwrap().contains(model)
-            || self.completion_engines.read().unwrap().contains(model)
+        self.chat_completion_engines.read().unwrap().contains(model)
+            || self.completion_engines.read().unwrap().contains(model)
+            || self.embeddings_engines.read().unwrap().contains(model)
+            || self.tensor_engines.read().unwrap().contains(model)
     }
```
lib/llm/src/discovery/model_manager.rs (1)
69-72: Update has_model_any to include Embedding/Tensor.
Prevents duplicate add attempts in watcher for those types (see watcher.rs Lines 150-159).
Apply:
```diff
-    pub fn has_model_any(&self, model: &str) -> bool {
-        self.chat_completion_engines.read().unwrap().contains(model)
-            || self.completion_engines.read().unwrap().contains(model)
-    }
+    pub fn has_model_any(&self, model: &str) -> bool {
+        self.chat_completion_engines.read().unwrap().contains(model)
+            || self.completion_engines.read().unwrap().contains(model)
+            || self.embeddings_engines.read().unwrap().contains(model)
+            || self.tensor_engines.read().unwrap().contains(model)
+    }
```
🧹 Nitpick comments (19)
lib/llm/tests/http-service.rs (1)
209-216: Avoid leaving todo!() in test match arms
Using todo!() will panic if compute_index is called with new endpoints during future test expansion. Either fully support the new endpoints here or make the intent explicit with unreachable!/clear error to reduce surprise.
Apply one of:
```diff
-            Endpoint::Embeddings => todo!(),
-            Endpoint::Responses => todo!(),
-            Endpoint::Tensor => todo!(),
+            Endpoint::Embeddings | Endpoint::Responses | Endpoint::Tensor => {
+                unreachable!("Only Completions/ChatCompletions are indexed in this test")
+            }
```
Or generalize the test to compute indices from a fixed ordered slice including the new endpoints and expand the expected array accordingly.
lib/llm/src/model_type.rs (1)
101-104: Comment: clarify endpoint mapping for Tensor
Note reads that Tensor doesn’t map to HTTP EndpointType. Consider adding a brief rationale (e.g., “served via gRPC/KServe only”) or a TODO linking to where Tensor endpoints would be exposed if/when HTTP support lands.
lib/llm/src/local_model/runtime_config.rs (1)
26-35: Fix typos and tighten comment wording around tensor_model_config
Minor nits in comments (“convinence/convinent”, “workout”) and clarity around future direction.
Proposed comment edits:
```diff
-    // Provide tensor model config in the case where the model type is Tensor.
-    // Currently use JSON object for convinence, the programmatic way is to
-    // define the model config struct as part of the tensor protocol and
-    // import it here.
-    // [gluo TODO] switch to ModelConfig if desired and workout a way to
-    // prepare it in a convinent way, the protobuf library used by tonic
-    // doesn't provide JSON parsing.
+    // Tensor model configuration used when the model type is Tensor.
+    // We currently use a Rust struct (see tensor::TensorModelConfig) rather than raw JSON for clarity and type-safety.
+    // [gluo TODO] If we switch to a dedicated ModelConfig, figure out a convenient way to construct it;
+    // the prost/tonic protobuf stack doesn't provide JSON parsing.
```
lib/bindings/python/rust/lib.rs (1)
162-168: Pre-validate Tensor combinations for clearer Python errors
Improve UX by failing fast if ModelType/ModelInput combinations are inconsistent (mixing Tensor with non-Tensor) rather than surfacing a lower-level attach error.
```diff
 fn register_llm<'p>(
@@
-    let model_input = match model_input {
+    let model_input = match model_input {
         ModelInput::Text => llm_rs::model_type::ModelInput::Text,
         ModelInput::Tokens => llm_rs::model_type::ModelInput::Tokens,
         ModelInput::Tensor => llm_rs::model_type::ModelInput::Tensor,
     };
     let model_type_obj = model_type.inner;
+    // Validate Tensor combos early for a clearer Python-side error
+    if model_type_obj.supports_tensor()
+        && !matches!(model_input, llm_rs::model_type::ModelInput::Tensor)
+    {
+        return Err(PyErr::new::<PyException, _>(
+            "Tensor models must use ModelInput.Tensor",
+        ));
+    }
+    if !model_type_obj.supports_tensor()
+        && matches!(model_input, llm_rs::model_type::ModelInput::Tensor)
+    {
+        return Err(PyErr::new::<PyException, _>(
+            "ModelInput.Tensor requires ModelType.Tensor",
+        ));
+    }
```
lib/bindings/python/rust/llm/local_model.rs (1)
55-62: Setter doesn’t need a Python token
_py isn’t used. Simplify the API by removing it.
Alternative:
```diff
-fn set_tensor_model_config(
-    &mut self,
-    _py: Python<'_>,
-    tensor_model_config: &Bound<'_, PyDict>,
-) -> PyResult<()> {
+fn set_tensor_model_config(
+    &mut self,
+    tensor_model_config: &Bound<'_, PyDict>,
+) -> PyResult<()> {
```
lib/bindings/python/tests/test_tensor.py (4)
13-13: Make TEST_END_TO_END parsing robust
Current code treats "0" as truthy. Parse env var to a boolean.
```diff
-TEST_END_TO_END = os.environ.get("TEST_END_TO_END", 0)
+TEST_END_TO_END = os.getenv("TEST_END_TO_END", "0").lower() in ("1", "true", "yes")
```
49-51: Silence unused context arg in test generator
Rename to _context to satisfy linters.
```diff
-async def generate(request, context):
+async def generate(request, _context):
```
4-4: Correct or remove the usage comment
This test file references dynamo.mocker; update to a pytest invocation or remove to avoid confusion.
```diff
-# Usage: `python -m dynamo.mocker --model-path /data/models/Qwen3-0.6B-Q8_0.gguf --extra-engine-args args.json`
+# Usage: `pytest -k test_tensor -s` (or run as a module locally for debug)
```
33-43: Avoid relying on a real HF model in unit tests
Registering with "Qwen/Qwen3-0.6B" can make tests flaky/offline-hostile. Prefer a local stub/noop engine or a feature flag/mocked loader for tensor tests.
If helpful, I can sketch a minimal stub engine and a feature gate for tests.
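As a starting point, a minimal echo-style stub might look like the sketch below; the handler signature follows the test's generate(request, _context), while the dict keys and the TEST_END_TO_END gating are assumptions for illustration, not the exact test API:

```python
import os

# Hypothetical echo stub for tensor tests: no model weights, no network access.
# The request/response are treated as plain dicts; the key names are assumptions.
async def echo_generate(request, _context):
    # Mirror the request tensors back as a single response.
    yield {
        "model": request.get("model"),
        "tensors": request.get("tensors", []),
    }

# Hypothetical feature gate: only exercise the real HF model when explicitly requested.
RUN_REAL_MODEL = os.getenv("TEST_END_TO_END", "0").lower() in ("1", "true", "yes")
```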
lib/llm/src/discovery/watcher.rs (1)
217-236: Removal bookkeeping extended for tensor — looks consistent.
The tensor remove path mirrors the existing logic and extends the debug line. One minor ask: the “removed” flags are only raised when the last model of that type is removed. If the intent is per-model removal signaling (not just “no more of this type in the system”), consider emitting ModelUpdate::Removed regardless, or change the flag names to reflect the “last-of-type removed” semantics. Not blocking.
Would you like a quick patch to emit per-model-type removal events irrespective of “last-of-type”?
Also applies to: 238-262
lib/llm/src/grpc/service/tensor.rs (2)
35-45: Docstring says we fold non-streaming, but code always returns a stream.
Either (a) update the comment to reflect that callers do the folding, or (b) add a unary helper that collects the stream when streaming == false. Example helper (outside this function):
```rust
pub async fn tensor_unary_response(
    state: Arc<kserve::State>,
    request: NvCreateTensorRequest,
) -> Result<NvCreateTensorResponse, Status> {
    let mut s = tensor_response_stream(state, request, false).await?;
    futures::pin_mut!(s);
    let mut final_resp = None;
    while let Some(item) = s.next().await {
        final_resp = Some(item.into_inner()); // adjust if Annotated exposes inner
    }
    final_resp.ok_or_else(|| Status::internal("empty response"))
}
```
55-59: Include model name in not_found error.
Improves observability.
```diff
-        .map_err(|_| Status::not_found("model not found"))?;
+        .map_err(|_| Status::not_found(format!("model not found: {}", model)))?;
```
lib/llm/tests/kserve_service.rs (1)
190-234: TensorEngine repeat logic: sanitize count and avoid i32 ranges
repeat_count is i32 and used directly in 0..repeat_count. Negative values yield an empty range; large values could DOS the test. Convert to usize with saturation and cap to a small max (e.g., 10) to keep tests deterministic.
```diff
-        let stream = async_stream::stream! {
-            for _ in 0..repeat_count {
+        let stream = async_stream::stream! {
+            let repeat = repeat_count.max(0) as usize;
+            let repeat = repeat.min(10);
+            for _ in 0..repeat {
                 yield Annotated::from_data(NvCreateTensorResponse {
                     id: request.id.clone(),
                     model: request.model.clone(),
                     tensors: request.tensors.clone(),
                 });
             }
         };
```
lib/llm/src/grpc/service/kserve.rs (5)
198-216: Misleading comment and error text on tensor path
Comment says “Fallback handling by assuming the model is OpenAI Completions model” and error string says “completions” on the tensor branch. Update both to “tensor” for clarity.
```diff
-        // [gluo TODO] refactor to reuse code, inference logic is largely the same
-        let mut reply = if self.state().is_tensor_model(&model) {
-            // Fallback handling by assuming the model is OpenAI Completions model
+        // [gluo TODO] refactor to reuse code, inference logic is largely the same
+        let mut reply = if self.state().is_tensor_model(&model) {
+            // Tensor path
             let tensor_request: NvCreateTensorRequest = request
                 .try_into()
                 .map_err(|e| Status::invalid_argument(format!("Failed to parse request: {}", e)))?;
@@
-            let tensor_response = NvCreateTensorResponse::from_annotated_stream(stream)
+            let tensor_response = NvCreateTensorResponse::from_annotated_stream(stream)
                 .await
                 .map_err(|e| {
-                    tracing::error!("Failed to fold completions stream: {:?}", e);
-                    Status::internal(format!("Failed to fold completions stream: {}", e))
+                    tracing::error!("Failed to fold tensor stream: {:?}", e);
+                    Status::internal(format!("Failed to fold tensor stream: {}", e))
                 })?;
```
405-481: Metadata path: good bifurcation by model type
Tensor vs. completions mapping is clear and matches tests. Consider extracting the tensor-metadata mapping into a helper to avoid duplication with model_config().
487-576: Config path: mirrors metadata; consider shared builder
Same mapping duplication as metadata; a shared helper would reduce drift.
90-93: Minor: accept &str in is_tensor_model to avoid cloning
Taking &str is more ergonomic and avoids requiring a String at call sites.
```diff
-    fn is_tensor_model(&self, model: &String) -> bool {
-        self.manager.list_tensor_models().contains(model)
+    fn is_tensor_model(&self, model: &str) -> bool {
+        self.manager.list_tensor_models().contains(&model.to_string())
     }
```
1006-1031: Improve error context in DataType::from_str
Include the offending string in the error to aid debugging.
```diff
-            _ => Err(anyhow::anyhow!("Invalid data type")),
+            _ => Err(anyhow::anyhow!(format!("Invalid data type: {}", s))),
```
lib/llm/src/protocols/tensor.rs (1)
22-31: Make DataType Copy
DataType is small and immutable; deriving Copy avoids unnecessary moves/clones.
```diff
-#[derive(Debug, Serialize, Clone, Eq, PartialEq, Deserialize)]
+#[derive(Debug, Serialize, Clone, Copy, Eq, PartialEq, Deserialize)]
 pub enum DataType {
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
lib/bindings/python/rust/lib.rs (3 hunks)
lib/bindings/python/rust/llm/local_model.rs (1 hunks)
lib/bindings/python/src/dynamo/_core.pyi (1 hunks)
lib/bindings/python/tests/test_tensor.py (1 hunks)
lib/llm/src/discovery/model_manager.rs (8 hunks)
lib/llm/src/discovery/watcher.rs (5 hunks)
lib/llm/src/grpc/service.rs (1 hunks)
lib/llm/src/grpc/service/kserve.rs (11 hunks)
lib/llm/src/grpc/service/tensor.rs (1 hunks)
lib/llm/src/http/service/metrics.rs (3 hunks)
lib/llm/src/local_model/runtime_config.rs (2 hunks)
lib/llm/src/model_type.rs (5 hunks)
lib/llm/src/protocols.rs (1 hunks)
lib/llm/src/protocols/tensor.rs (1 hunks)
lib/llm/src/types.rs (1 hunks)
lib/llm/tests/http-service.rs (1 hunks)
lib/llm/tests/kserve_service.rs (6 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
PR: ai-dynamo/dynamo#2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.
Applied to files:
lib/llm/src/local_model/runtime_config.rs
📚 Learning: 2025-08-21T17:23:02.836Z
Learnt from: michaelfeil
PR: ai-dynamo/dynamo#2591
File: lib/bindings/python/rust/http.rs:0-0
Timestamp: 2025-08-21T17:23:02.836Z
Learning: In lib/bindings/python/rust/http.rs, the enable_endpoint method uses EndpointType::all() to dynamically support all available endpoint types with case-insensitive matching, which is more maintainable than hardcoded match statements for endpoint type mappings.
Applied to files:
lib/llm/tests/http-service.rs
lib/llm/src/model_type.rs
🧬 Code graph analysis (12)
lib/llm/src/grpc/service.rs (1)
lib/llm/src/grpc/service/kserve.rs (1)
tensor(847-852)
lib/llm/src/protocols.rs (1)
lib/llm/src/grpc/service/kserve.rs (1)
tensor(847-852)
lib/llm/src/grpc/service/tensor.rs (4)
lib/llm/src/grpc/service/kserve.rs (3)
tensor(847-852), state(130-132), new(57-63)
lib/llm/src/http/service/disconnect.rs (1)
create_connection_monitor(100-121)
lib/llm/src/protocols/tensor.rs (1)
annotations(113-117)
lib/llm/src/http/service/metrics.rs (4)
new(107-210), new(312-333), new(406-415), mark_ok(335-337)
lib/bindings/python/rust/llm/local_model.rs (1)
lib/bindings/python/rust/engine.rs (1)
depythonize(318-318)
lib/llm/src/types.rs (1)
lib/llm/src/grpc/service/kserve.rs (1)
tensor(847-852)
lib/llm/src/discovery/watcher.rs (2)
lib/bindings/python/src/dynamo/_core.pyi (4)
ModelType(850-852), ModelInput(846-848), client(232-236), new(108-126)
lib/runtime/src/pipeline/network/egress/push_router.rs (1)
from_client_with_threshold(104-135)
lib/bindings/python/rust/lib.rs (2)
lib/bindings/python/src/dynamo/_core.pyi (2)
ModelInput(846-848), ModelType(850-852)
lib/llm/src/model_card.rs (2)
model_type(553-553), model_type(752-754)
lib/llm/src/grpc/service/kserve.rs (3)
lib/llm/src/grpc/service/tensor.rs (1)
tensor_response_stream(37-112)
lib/llm/src/protocols/tensor.rs (1)
from_annotated_stream(141-180)
lib/llm/src/grpc/service/openai.rs (2)
get_parsing_options(193-198), completion_response_stream(44-126)
lib/bindings/python/tests/test_tensor.py (3)
lib/bindings/python/src/dynamo/_core.pyi (4)
ModelInput(846-848), ModelRuntimeConfig(465-469), ModelType(850-852), DistributedRuntime(31-54)
lib/bindings/python/rust/lib.rs (6)
register_llm(147-217), get(445-461), namespace(368-373), create_service(563-569), endpoint(555-561), serve_endpoint(575-601)
lib/bindings/python/rust/llm/local_model.rs (2)
set_tensor_model_config(55-62), get_tensor_model_config(64-71)
lib/llm/tests/kserve_service.rs (3)
lib/bindings/python/src/dynamo/_core.pyi (3)
ModelRuntimeConfig(465-469), ModelInput(846-848), ModelType(850-852)
lib/llm/src/grpc/service/kserve.rs (3)
tensor(847-852), new(57-63), manager(78-80)
lib/llm/src/local_model/runtime_config.rs (1)
new(38-40)
lib/llm/src/model_type.rs (3)
lib/bindings/python/src/dynamo/_core.pyi (2)
ModelType(850-852), ModelInput(846-848)
lib/llm/src/http/service/metrics.rs (3)
as_str(376-384), as_str(388-393), as_str(397-402)
lib/llm/src/endpoint_type.rs (1)
as_str(32-39)
lib/llm/src/protocols/tensor.rs (1)
lib/llm/src/grpc/service/tensor.rs (1)
annotations(81-94)
🪛 GitHub Actions: Rust pre-merge checks
lib/bindings/python/rust/llm/local_model.rs
[error] 52-52: cargo fmt check failed. Detected formatting differences in function set_tensor_model_config signature. Run 'cargo fmt' to fix formatting.
🪛 Ruff (0.12.2)
lib/bindings/python/tests/test_tensor.py
49-49: Unused function argument: context
(ARG001)
🪛 GitHub Actions: NVIDIA Dynamo Github Validation
lib/llm/src/discovery/model_manager.rs
[error] 96-96: Rust compile error: unwrap() called on RwLockReadGuard; unwrap() is not a method on the read guard. The code uses 'self.tensor_engines.read().unwrap().list()', which fails to compile. The compiler suggests using 'wrap()' instead.
[error] 131-131: Rust compile error: unwrap() called on RwLockWriteGuard; unwrap() is not available on the write guard. The compiler suggests using 'wrap' instead.
[error] 151-151: Rust compile error: unwrap() called on RwLockWriteGuard; unwrap() is not available on the write guard. The compiler suggests using 'wrap' instead.
[error] 194-194: Rust compile error: unwrap() called on RwLockReadGuard; unwrap() is not available on the read guard. The compiler suggests using 'wrap' instead.
🔇 Additional comments (38)
lib/llm/src/http/service/metrics.rs (3)
56-58: Add Tensor endpoint — looks correct
New Endpoint::Tensor variant is consistent with the rest of the enum and doc comment matches usage.
370-371: Display formatting for Tensor is consistent
"tensor" string matches naming conventions for other endpoints.
382-383: as_str mapping for Tensor is correct
Label value "tensor" aligns with metrics label expectations.
lib/llm/src/model_type.rs (5)
49-50: Introduce Tensor bitflag — OK
Adding Tensor = 1<<3 preserves existing flags and leaves room for future growth.
67-70: supports_tensor helper — OK
Parity with other supports_* helpers; improves readability at call sites.
82-85: Include "tensor" in as_vec output
Order remains stable (chat, completions, embedding, tensor). Good.
120-122: Add ModelInput::Tensor — OK
Enum addition is consistent with serde/Display usage.
129-130: ModelInput::as_str for Tensor — OK
Returns "tensor" as expected.
lib/llm/src/grpc/service.rs (1)
6-6: Expose tensor gRPC module — OK
Public mod export is minimal and aligns with the new service surface.
lib/llm/src/protocols.rs (1)
28-28: Expose tensor protocol module — OK
Makes tensor protocol types available to dependents.
lib/bindings/python/src/dynamo/_core.pyi (2)
847-847: Docstring includes Tensor in ModelInput — OK
851-851: Docstring includes Tensor in ModelType — OK
lib/llm/src/local_model/runtime_config.rs (1)
8-9: Import tensor protocol — OK
lib/bindings/python/rust/lib.rs (3)
162-166: Map Python ModelInput::Tensor to Rust — OK
299-303: Expose ModelType.Tensor to Python — OK
Bitflag classattr mirrors Rust side and composes via or as expected.
320-321: Add Python enum value for ModelInput::Tensor — OK
lib/llm/src/types.rs (2)
76-92: Tensor type aliases look consistent and additive
Re-exports and engine aliases align with existing OpenAI sections. Naming and Annotated usage are consistent.
88-91: Alias is correct; no separate stream response type exists
NvCreateTensorResponse is the only response type defined in lib/llm/src/protocols/tensor.rs and supports streaming via its from_annotated_stream implementation.
lib/bindings/python/rust/llm/local_model.rs (1)
64-71: LGTM on get_tensor_model_config
Conversion and Option handling look correct; error mapping is consistent with the rest of the bindings.
lib/llm/src/discovery/watcher.rs (4)
35-36: Tensor protocol import is correct and scoped.
Importing NvCreateTensorRequest/Response here keeps watcher concerns local. No issues.
58-63: Include Tensor in ALL_MODEL_TYPES.
Good addition; ensures ModelUpdate emits for tensor as well.
432-445: Tensor + Tensor routing path looks correct.
The PushRouter generic parameters align with NvCreateTensorRequest and Annotated. Registration to manager via add_tensor_model is as expected.
448-452: Unsupported configuration message updated — good.
Including “Tensor+Tensor” improves operator feedback.
lib/llm/src/discovery/model_manager.rs (5)
18-19: Correct engine type import.
types::generic::tensor::TensorStreamingEngine import aligns with the watcher registration.
40-41: Add tensor_engines field — consistent with other engine maps.
Initialization and locking semantics mirror existing maps.
59-60: Initialize tensor_engines in new() — OK.
74-80: Include tensor models in model_display_names — OK.
95-97: Align RwLock import and unwrap usage
Remove all .unwrap() on .read()/.write() if you’ve imported parking_lot::RwLock (its read()/write() return guards directly), or switch the file to std::sync::RwLock and retain the unwraps. Ensure the use at the top of lib/llm/src/discovery/model_manager.rs matches your chosen lock API.
lib/llm/src/grpc/service/tensor.rs (3)
80-95: Annotation injection — LGTM.
request_id annotation is correctly prefixed to the stream when requested.
121-152: grpc_monitor_for_disconnects — solid.
Arming/disarming semantics and inflight guard handling are correct.
165-177: UUID handling — LGTM.
Gracefully accepts client-provided UUID or generates a new v4.
lib/llm/tests/kserve_service.rs (3)
287-299: Duplicate model entry for "split" likely fine, but consider consolidating
You register runtime metadata for split via save_model_entry after add_completions_model. Looks OK for tests, but you may not need both calls if the manager prevents duplicates; consider documenting intended override behavior.
1147-1400: Great, thorough tensor-path coverage
End-to-end tests for metadata/config, unary infer, error on multi-response in unary, and stream infer are solid and mirror the intended routing behaviors.
1402-1455: Helper asserts mirror echo semantics and ID plumbing
validate_tensor_response checks name, version, id and I/O parity; aligns with the conversion path. LGTM.
lib/llm/src/grpc/service/kserve.rs (2)
300-373: Streaming path correctly preserves request_id and templating; mirrors unary path
The demux note is fair; current behavior is acceptable for now.
994-1004: KServe dtype mapping looks correct
BOOL/UINT32/INT32/FP32/BYTES → TypeBool/TypeUint32/TypeInt32/TypeFp32/TypeString matches expectations.
lib/llm/src/protocols/tensor.rs (2)
67-81: Good: NvExtProvider/AnnotationsProvider implemented consistently
Annotations pass-through via NvExt is clean and keeps parity with other protocols.
140-181: Aggregator behavior and error messages are appropriate
Single-response folding with explicit “Multiple responses in non-streaming mode” aligns with tests and expected semantics.
d63d1d5 to 53689af (Compare)
Signed-off-by: Guan Luo <[email protected]>
53689af to 164044f (Compare)
Signed-off-by: Guan Luo <[email protected]>
/ok to test a6054a8
@GuanLuo Could you provide an example of how we use this new
rmccorm4 left a comment
LGTM - most comments are just questions for better understanding.
- We need docs for both kserve frontend and tensor-based model related features, describing how to opt in/out, and what their current limitations or assumptions are
- I notice a lot of TODOs in the code. If you write up some detailed GitHub issues describing some of the TODOs, it is very likely (from past history so far) that an OSS contributor will try to fix/implement them. Even if they don't get picked up, these will auto-sync to Linear and be helpful for others to pick up in the future too.
ok to merge conditional on ci failing not due to this pr + it'd be nice to have some logic extracted into smaller methods
Signed-off-by: Guan Luo <[email protected]>
Signed-off-by: Guan Luo <[email protected]>
/ok to test 41fdf07
Co-authored-by: Ryan McCormick <[email protected]> Signed-off-by: GuanLuo <[email protected]>
/ok to test 2fd767c
Signed-off-by: Guan Luo <[email protected]> Signed-off-by: GuanLuo <[email protected]> Co-authored-by: Olga Andreeva <[email protected]> Co-authored-by: Ryan McCormick <[email protected]>
Signed-off-by: Guan Luo <[email protected]> Signed-off-by: GuanLuo <[email protected]> Co-authored-by: Olga Andreeva <[email protected]> Co-authored-by: Ryan McCormick <[email protected]> Signed-off-by: Jason Zhou <[email protected]>
Signed-off-by: Guan Luo <[email protected]> Signed-off-by: GuanLuo <[email protected]> Co-authored-by: Olga Andreeva <[email protected]> Co-authored-by: Ryan McCormick <[email protected]> Signed-off-by: Jason Zhou <[email protected]>
Signed-off-by: Guan Luo <[email protected]> Signed-off-by: GuanLuo <[email protected]> Co-authored-by: Olga Andreeva <[email protected]> Co-authored-by: Ryan McCormick <[email protected]> Signed-off-by: Kyle H <[email protected]>
Overview:
This PR introduces a new request / response type, tensor, which is commonly used in non-LLM inference, for the backend. With #2638 Dynamo now has a frontend that handles tensor requests / responses (KServe gRPC protocol) but converts them to OpenAI completion messages; with this PR we can enable more traditional use cases.

NOTE: the tensor support currently lives in llm and there is use of utilities in llm (i.e. register_llm()) for convenience, but the tensor type can be used for generic inference and we should revisit the placement / usage later.

Details:
- New NvCreateTensorRequest / NvCreateTensorResponse message types that a backend can consume and produce.
- New ModelType::Tensor, ModelInputType::Tensor which can be passed during register_llm(). An engine that interacts with tensor messages will be created and registered if both are set. The user may register the same model for both tensor and text input under different endpoints, but we need to validate that; for now the tensor types can not be set together with existing types for LLM inference (a usage sketch follows this list).
- New runtime_config field in ModelDeploymentCard, which the backend can pass during the register_llm() call. When invoking the ModelMetadata / ModelConfig endpoints, the runtime_config will be fetched and converted to the corresponding messages. If a model type is tensor, we expect the runtime_config to be the Pythonized representation of a TensorModelConfig as defined in lib/llm/src/protocols/tensor.rs. In a future PR, we should properly define a field in ModelDeploymentCard similar to how the tokenizer config is fetched and stored.
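For illustration, a backend-side registration could look roughly like the sketch below, loosely based on lib/bindings/python/tests/test_tensor.py. The import paths, argument order of register_llm(), namespace/endpoint names, tensor config fields, and the request/response dict shape are assumptions for the example, not the exact API:

```python
# Hedged sketch: register a tensor model and serve an echo handler.
from dynamo.llm import ModelInput, ModelRuntimeConfig, ModelType, register_llm
from dynamo.runtime import DistributedRuntime

async def worker(runtime: DistributedRuntime):
    component = runtime.namespace("dyn").component("backend")  # names are illustrative
    await component.create_service()
    endpoint = component.endpoint("generate")

    # runtime_config carries the Pythonized TensorModelConfig
    # (see lib/llm/src/protocols/tensor.rs); field names here are assumptions.
    runtime_config = ModelRuntimeConfig()
    runtime_config.set_tensor_model_config({
        "name": "my_tensor_model",
        "inputs": [{"name": "INPUT0", "data_type": "FP32", "shape": [-1]}],
        "outputs": [{"name": "OUTPUT0", "data_type": "FP32", "shape": [-1]}],
    })

    # Both the Tensor model input and Tensor model type must be set for the tensor engine path.
    await register_llm(
        ModelInput.Tensor,
        ModelType.Tensor,
        endpoint,
        "my_tensor_model",
        runtime_config=runtime_config,
    )

    async def generate(request, _context):
        # Echo the request tensors back; dict keys are illustrative.
        yield {"model": request["model"], "tensors": request["tensors"]}

    await endpoint.serve_endpoint(generate)
```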
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Documentation
Tests