feat: tensor type for generic inference. #2746
Conversation
@ryanolson Could you take a look?
dac73a6 to 226031b (Compare)
226031b to d63d1d5 (Compare)
Walkthrough
Adds first-class Tensor support across the codebase: new tensor protocol types and engines, ModelInput/ModelType Tensor variants, runtime config field and Python bindings, gRPC/HTTP routing and conversions for tensor flows, discovery/watcher and model manager tensor plumbing, and tests for tensor inference and streaming.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant KServe
    participant State
    participant Manager
    participant Engine
    Note over Client,KServe: Tensor inference (single or streaming)
    Client->>KServe: ModelInferRequest
    KServe->>State: resolve model type (is_tensor_model)
    State->>Manager: get_tensor_engine(model)
    Manager-->>State: TensorStreamingEngine
    KServe->>Engine: NvCreateTensorRequest (with annotations/request_id)
    activate Engine
    loop streaming responses
        Engine-->>KServe: Annotated<NvCreateTensorResponse>
        KServe->>KServe: process annotations / metrics
        KServe-->>Client: Model(Infer|Stream)Response
    end
    deactivate Engine
    Note over KServe: grpc_monitor_for_disconnects watches context and ends stream on disconnect
```
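Any KServe-gRPC-compatible client can drive this flow. As a hedged illustration (not part of this PR), a tritonclient-based call might look like the sketch below; the server address, model name, and tensor names are assumptions:

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Assumed address of the Dynamo KServe gRPC frontend and an illustrative model name.
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Build one FP32 input tensor; the name and shape are illustrative only.
inp = grpcclient.InferInput("INPUT0", [1, 4], "FP32")
inp.set_data_from_numpy(np.ones((1, 4), dtype=np.float32))

# Sends a ModelInferRequest under the hood and returns the response wrapper.
result = client.infer(model_name="my_tensor_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))
```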
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~70 minutes
Possibly related PRs
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
lib/llm/src/discovery/watcher.rs (1)
150-159: Fix duplicate-add path: has_model_any ignores Embedding/Tensor.
This block relies on ModelManager::has_model_any to short-circuit for existing models, but that method currently only checks chat/completions. New embeddings/tensor endpoints for an existing model name will fall through and attempt to re-add, yielding ModelAlreadyExists errors and noisy logs.
Apply this diff to include embeddings and tensor in has_model_any (in model_manager.rs):
```diff
--- a/lib/llm/src/discovery/model_manager.rs
+++ b/lib/llm/src/discovery/model_manager.rs
@@ -69,9 +69,12 @@ impl ModelManager {
     pub fn has_model_any(&self, model: &str) -> bool {
-        self.chat_completion_engines.read().unwrap().contains(model)
-            || self.completion_engines.read().unwrap().contains(model)
+        self.chat_completion_engines.read().unwrap().contains(model)
+            || self.completion_engines.read().unwrap().contains(model)
+            || self.embeddings_engines.read().unwrap().contains(model)
+            || self.tensor_engines.read().unwrap().contains(model)
     }
```
lib/llm/src/discovery/model_manager.rs (1)
69-72: Update has_model_any to include Embedding/Tensor.
Prevents duplicate add attempts in watcher for those types (see watcher.rs Lines 150-159).
Apply:
```diff
-    pub fn has_model_any(&self, model: &str) -> bool {
-        self.chat_completion_engines.read().unwrap().contains(model)
-            || self.completion_engines.read().unwrap().contains(model)
-    }
+    pub fn has_model_any(&self, model: &str) -> bool {
+        self.chat_completion_engines.read().unwrap().contains(model)
+            || self.completion_engines.read().unwrap().contains(model)
+            || self.embeddings_engines.read().unwrap().contains(model)
+            || self.tensor_engines.read().unwrap().contains(model)
+    }
```
🧹 Nitpick comments (19)
lib/llm/tests/http-service.rs (1)
209-216: Avoid leaving todo!() in test match arms
Using todo!() will panic if compute_index is called with new endpoints during future test expansion. Either fully support the new endpoints here or make the intent explicit with unreachable!/clear error to reduce surprise.
Apply one of:
```diff
-            Endpoint::Embeddings => todo!(),
-            Endpoint::Responses => todo!(),
-            Endpoint::Tensor => todo!(),
+            Endpoint::Embeddings | Endpoint::Responses | Endpoint::Tensor => {
+                unreachable!("Only Completions/ChatCompletions are indexed in this test")
+            }
```
Or generalize the test to compute indices from a fixed ordered slice including the new endpoints and expand the expected array accordingly.
lib/llm/src/model_type.rs (1)
101-104: Comment: clarify endpoint mapping for Tensor
Note reads that Tensor doesn’t map to HTTP EndpointType. Consider adding a brief rationale (e.g., “served via gRPC/KServe only”) or a TODO linking to where Tensor endpoints would be exposed if/when HTTP support lands.
lib/llm/src/local_model/runtime_config.rs (1)
26-35: Fix typos and tighten comment wording around tensor_model_config
Minor nits in comments (“convinence/convinent”, “workout”) and clarity around future direction.
Proposed comment edits:
```diff
-    // Provide tensor model config in the case where the model type is Tensor.
-    // Currently use JSON object for convinence, the programmatic way is to
-    // define the model config struct as part of the tensor protocol and
-    // import it here.
-    // [gluo TODO] switch to ModelConfig if desired and workout a way to
-    // prepare it in a convinent way, the protobuf library used by tonic
-    // doesn't provide JSON parsing.
+    // Tensor model configuration used when the model type is Tensor.
+    // We currently use a Rust struct (see tensor::TensorModelConfig) rather than raw JSON for clarity and type-safety.
+    // [gluo TODO] If we switch to a dedicated ModelConfig, figure out a convenient way to construct it;
+    // the prost/tonic protobuf stack doesn't provide JSON parsing.
```
lib/bindings/python/rust/lib.rs (1)
162-168: Pre-validate Tensor combinations for clearer Python errors
Improve UX by failing fast if ModelType/ModelInput combinations are inconsistent (mixing Tensor with non-Tensor) rather than surfacing a lower-level attach error.
```diff
 fn register_llm<'p>(
@@
-    let model_input = match model_input {
+    let model_input = match model_input {
         ModelInput::Text => llm_rs::model_type::ModelInput::Text,
         ModelInput::Tokens => llm_rs::model_type::ModelInput::Tokens,
         ModelInput::Tensor => llm_rs::model_type::ModelInput::Tensor,
     };
     let model_type_obj = model_type.inner;
+    // Validate Tensor combos early for a clearer Python-side error
+    if model_type_obj.supports_tensor()
+        && !matches!(model_input, llm_rs::model_type::ModelInput::Tensor)
+    {
+        return Err(PyErr::new::<PyException, _>(
+            "Tensor models must use ModelInput.Tensor",
+        ));
+    }
+    if !model_type_obj.supports_tensor()
+        && matches!(model_input, llm_rs::model_type::ModelInput::Tensor)
+    {
+        return Err(PyErr::new::<PyException, _>(
+            "ModelInput.Tensor requires ModelType.Tensor",
+        ));
+    }
```
lib/bindings/python/rust/llm/local_model.rs (1)
55-62: Setter doesn’t need a Python token
_py isn’t used. Simplify the API by removing it.
Alternative:
```diff
-fn set_tensor_model_config(
-    &mut self,
-    _py: Python<'_>,
-    tensor_model_config: &Bound<'_, PyDict>,
-) -> PyResult<()> {
+fn set_tensor_model_config(
+    &mut self,
+    tensor_model_config: &Bound<'_, PyDict>,
+) -> PyResult<()> {
```
lib/bindings/python/tests/test_tensor.py (4)
13-13: Make TEST_END_TO_END parsing robust
Current code treats "0" as truthy. Parse env var to a boolean.
```diff
-TEST_END_TO_END = os.environ.get("TEST_END_TO_END", 0)
+TEST_END_TO_END = os.getenv("TEST_END_TO_END", "0").lower() in ("1", "true", "yes")
```
49-51: Silence unused context arg in test generator
Rename to _context to satisfy linters.
```diff
-async def generate(request, context):
+async def generate(request, _context):
```
4-4: Correct or remove the usage comment
This test file references dynamo.mocker; update to a pytest invocation or remove to avoid confusion.
```diff
-# Usage: `python -m dynamo.mocker --model-path /data/models/Qwen3-0.6B-Q8_0.gguf --extra-engine-args args.json`
+# Usage: `pytest -k test_tensor -s` (or run as a module locally for debug)
```
33-43: Avoid relying on a real HF model in unit tests
Registering with "Qwen/Qwen3-0.6B" can make tests flaky/offline-hostile. Prefer a local stub/noop engine or a feature flag/mocked loader for tensor tests.
If helpful, I can sketch a minimal stub engine and a feature gate for tests.
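As a starting point, a minimal echo-style stub might look like the sketch below; the handler signature follows the test's generate(request, _context), while the dict keys and the TEST_END_TO_END gating are assumptions for illustration, not the exact test API:

```python
import os

# Hypothetical echo stub for tensor tests: no model weights, no network access.
# The request/response are treated as plain dicts; the key names are assumptions.
async def echo_generate(request, _context):
    # Mirror the request tensors back as a single response.
    yield {
        "model": request.get("model"),
        "tensors": request.get("tensors", []),
    }

# Hypothetical feature gate: only exercise the real HF model when explicitly requested.
RUN_REAL_MODEL = os.getenv("TEST_END_TO_END", "0").lower() in ("1", "true", "yes")
```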
lib/llm/src/discovery/watcher.rs (1)
217-236: Removal bookkeeping extended for tensor — looks consistent.
The tensor remove path mirrors the existing logic and extends the debug line. One minor ask: the “removed” flags are only raised when the last model of that type is removed. If the intent is per-model removal signaling (not just “no more of this type in the system”), consider emitting ModelUpdate::Removed regardless, or change the flag names to reflect the “last-of-type removed” semantics. Not blocking.
Would you like a quick patch to emit per-model-type removal events irrespective of “last-of-type”?
Also applies to: 238-262
lib/llm/src/grpc/service/tensor.rs (2)
35-45: Docstring says we fold non-streaming, but code always returns a stream.
Either (a) update the comment to reflect that callers do the folding, or (b) add a unary helper that collects the stream when streaming == false. Example helper (outside this function):
```rust
pub async fn tensor_unary_response(
    state: Arc<kserve::State>,
    request: NvCreateTensorRequest,
) -> Result<NvCreateTensorResponse, Status> {
    let mut s = tensor_response_stream(state, request, false).await?;
    futures::pin_mut!(s);
    let mut final_resp = None;
    while let Some(item) = s.next().await {
        final_resp = Some(item.into_inner()); // adjust if Annotated exposes inner
    }
    final_resp.ok_or_else(|| Status::internal("empty response"))
}
```
55-59: Include model name in not_found error.
Improves observability.
```diff
-        .map_err(|_| Status::not_found("model not found"))?;
+        .map_err(|_| Status::not_found(format!("model not found: {}", model)))?;
```
lib/llm/tests/kserve_service.rs (1)
190-234: TensorEngine repeat logic: sanitize count and avoid i32 ranges
repeat_count is i32 and used directly in 0..repeat_count. Negative values yield an empty range; large values could DOS the test. Convert to usize with saturation and cap to a small max (e.g., 10) to keep tests deterministic.
```diff
-        let stream = async_stream::stream! {
-            for _ in 0..repeat_count {
+        let stream = async_stream::stream! {
+            let repeat = repeat_count.max(0) as usize;
+            let repeat = repeat.min(10);
+            for _ in 0..repeat {
                 yield Annotated::from_data(NvCreateTensorResponse {
                     id: request.id.clone(),
                     model: request.model.clone(),
                     tensors: request.tensors.clone(),
                 });
             }
         };
```
lib/llm/src/grpc/service/kserve.rs (5)
198-216: Misleading comment and error text on tensor path
Comment says “Fallback handling by assuming the model is OpenAI Completions model” and error string says “completions” on the tensor branch. Update both to “tensor” for clarity.
```diff
-        // [gluo TODO] refactor to reuse code, inference logic is largely the same
-        let mut reply = if self.state().is_tensor_model(&model) {
-            // Fallback handling by assuming the model is OpenAI Completions model
+        // [gluo TODO] refactor to reuse code, inference logic is largely the same
+        let mut reply = if self.state().is_tensor_model(&model) {
+            // Tensor path
             let tensor_request: NvCreateTensorRequest = request
                 .try_into()
                 .map_err(|e| Status::invalid_argument(format!("Failed to parse request: {}", e)))?;
@@
-            let tensor_response = NvCreateTensorResponse::from_annotated_stream(stream)
+            let tensor_response = NvCreateTensorResponse::from_annotated_stream(stream)
                 .await
                 .map_err(|e| {
-                    tracing::error!("Failed to fold completions stream: {:?}", e);
-                    Status::internal(format!("Failed to fold completions stream: {}", e))
+                    tracing::error!("Failed to fold tensor stream: {:?}", e);
+                    Status::internal(format!("Failed to fold tensor stream: {}", e))
                 })?;
```
405-481: Metadata path: good bifurcation by model type
Tensor vs. completions mapping is clear and matches tests. Consider extracting the tensor-metadata mapping into a helper to avoid duplication with model_config().
487-576: Config path: mirrors metadata; consider shared builder
Same mapping duplication as metadata; a shared helper would reduce drift.
90-93: Minor: accept &str in is_tensor_model to avoid cloning
Taking &str is more ergonomic and avoids requiring a String at call sites.
```diff
-    fn is_tensor_model(&self, model: &String) -> bool {
-        self.manager.list_tensor_models().contains(model)
+    fn is_tensor_model(&self, model: &str) -> bool {
+        self.manager.list_tensor_models().contains(&model.to_string())
     }
```
1006-1031: Improve error context in DataType::from_str
Include the offending string in the error to aid debugging.
```diff
-            _ => Err(anyhow::anyhow!("Invalid data type")),
+            _ => Err(anyhow::anyhow!(format!("Invalid data type: {}", s))),
```
lib/llm/src/protocols/tensor.rs (1)
22-31: Make DataType Copy
DataType is small and immutable; deriving Copy avoids unnecessary moves/clones.
```diff
-#[derive(Debug, Serialize, Clone, Eq, PartialEq, Deserialize)]
+#[derive(Debug, Serialize, Clone, Copy, Eq, PartialEq, Deserialize)]
 pub enum DataType {
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
lib/bindings/python/rust/lib.rs (3 hunks)
lib/bindings/python/rust/llm/local_model.rs (1 hunks)
lib/bindings/python/src/dynamo/_core.pyi (1 hunks)
lib/bindings/python/tests/test_tensor.py (1 hunks)
lib/llm/src/discovery/model_manager.rs (8 hunks)
lib/llm/src/discovery/watcher.rs (5 hunks)
lib/llm/src/grpc/service.rs (1 hunks)
lib/llm/src/grpc/service/kserve.rs (11 hunks)
lib/llm/src/grpc/service/tensor.rs (1 hunks)
lib/llm/src/http/service/metrics.rs (3 hunks)
lib/llm/src/local_model/runtime_config.rs (2 hunks)
lib/llm/src/model_type.rs (5 hunks)
lib/llm/src/protocols.rs (1 hunks)
lib/llm/src/protocols/tensor.rs (1 hunks)
lib/llm/src/types.rs (1 hunks)
lib/llm/tests/http-service.rs (1 hunks)
lib/llm/tests/kserve_service.rs (6 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
PR: ai-dynamo/dynamo#2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.
Applied to files:
lib/llm/src/local_model/runtime_config.rs
📚 Learning: 2025-08-21T17:23:02.836Z
Learnt from: michaelfeil
PR: ai-dynamo/dynamo#2591
File: lib/bindings/python/rust/http.rs:0-0
Timestamp: 2025-08-21T17:23:02.836Z
Learning: In lib/bindings/python/rust/http.rs, the enable_endpoint method uses EndpointType::all() to dynamically support all available endpoint types with case-insensitive matching, which is more maintainable than hardcoded match statements for endpoint type mappings.
Applied to files:
lib/llm/tests/http-service.rs
lib/llm/src/model_type.rs
🧬 Code graph analysis (12)
lib/llm/src/grpc/service.rs (1)
lib/llm/src/grpc/service/kserve.rs (1)
tensor(847-852)
lib/llm/src/protocols.rs (1)
lib/llm/src/grpc/service/kserve.rs (1)
tensor(847-852)
lib/llm/src/grpc/service/tensor.rs (4)
lib/llm/src/grpc/service/kserve.rs (3)
tensor(847-852), state(130-132), new(57-63)
lib/llm/src/http/service/disconnect.rs (1)
create_connection_monitor(100-121)
lib/llm/src/protocols/tensor.rs (1)
annotations(113-117)
lib/llm/src/http/service/metrics.rs (4)
new(107-210), new(312-333), new(406-415), mark_ok(335-337)
lib/bindings/python/rust/llm/local_model.rs (1)
lib/bindings/python/rust/engine.rs (1)
depythonize(318-318)
lib/llm/src/types.rs (1)
lib/llm/src/grpc/service/kserve.rs (1)
tensor(847-852)
lib/llm/src/discovery/watcher.rs (2)
lib/bindings/python/src/dynamo/_core.pyi (4)
ModelType(850-852), ModelInput(846-848), client(232-236), new(108-126)
lib/runtime/src/pipeline/network/egress/push_router.rs (1)
from_client_with_threshold(104-135)
lib/bindings/python/rust/lib.rs (2)
lib/bindings/python/src/dynamo/_core.pyi (2)
ModelInput(846-848), ModelType(850-852)
lib/llm/src/model_card.rs (2)
model_type(553-553), model_type(752-754)
lib/llm/src/grpc/service/kserve.rs (3)
lib/llm/src/grpc/service/tensor.rs (1)
tensor_response_stream(37-112)
lib/llm/src/protocols/tensor.rs (1)
from_annotated_stream(141-180)
lib/llm/src/grpc/service/openai.rs (2)
get_parsing_options(193-198), completion_response_stream(44-126)
lib/bindings/python/tests/test_tensor.py (3)
lib/bindings/python/src/dynamo/_core.pyi (4)
ModelInput(846-848), ModelRuntimeConfig(465-469), ModelType(850-852), DistributedRuntime(31-54)
lib/bindings/python/rust/lib.rs (6)
register_llm(147-217), get(445-461), namespace(368-373), create_service(563-569), endpoint(555-561), serve_endpoint(575-601)
lib/bindings/python/rust/llm/local_model.rs (2)
set_tensor_model_config(55-62), get_tensor_model_config(64-71)
lib/llm/tests/kserve_service.rs (3)
lib/bindings/python/src/dynamo/_core.pyi (3)
ModelRuntimeConfig(465-469), ModelInput(846-848), ModelType(850-852)
lib/llm/src/grpc/service/kserve.rs (3)
tensor(847-852), new(57-63), manager(78-80)
lib/llm/src/local_model/runtime_config.rs (1)
new(38-40)
lib/llm/src/model_type.rs (3)
lib/bindings/python/src/dynamo/_core.pyi (2)
ModelType(850-852), ModelInput(846-848)
lib/llm/src/http/service/metrics.rs (3)
as_str(376-384), as_str(388-393), as_str(397-402)
lib/llm/src/endpoint_type.rs (1)
as_str(32-39)
lib/llm/src/protocols/tensor.rs (1)
lib/llm/src/grpc/service/tensor.rs (1)
annotations(81-94)
🪛 GitHub Actions: Rust pre-merge checks
lib/bindings/python/rust/llm/local_model.rs
[error] 52-52: cargo fmt check failed. Detected formatting differences in function set_tensor_model_config signature. Run 'cargo fmt' to fix formatting.
🪛 Ruff (0.12.2)
lib/bindings/python/tests/test_tensor.py
49-49: Unused function argument: context
(ARG001)
🪛 GitHub Actions: NVIDIA Dynamo Github Validation
lib/llm/src/discovery/model_manager.rs
[error] 96-96: Rust compile error: unwrap() called on RwLockReadGuard; unwrap() is not a method on the read guard. The code uses 'self.tensor_engines.read().unwrap().list()', which fails to compile. The compiler suggests using 'wrap()' instead.
[error] 131-131: Rust compile error: unwrap() called on RwLockWriteGuard; unwrap() is not available on the write guard. The compiler suggests using 'wrap' instead.
[error] 151-151: Rust compile error: unwrap() called on RwLockWriteGuard; unwrap() is not available on the write guard. The compiler suggests using 'wrap' instead.
[error] 194-194: Rust compile error: unwrap() called on RwLockReadGuard; unwrap() is not available on the read guard. The compiler suggests using 'wrap' instead.
🔇 Additional comments (38)
lib/llm/src/http/service/metrics.rs (3)
56-58: Add Tensor endpoint — looks correct
New Endpoint::Tensor variant is consistent with the rest of the enum and doc comment matches usage.
370-371: Display formatting for Tensor is consistent
"tensor" string matches naming conventions for other endpoints.
382-383: as_str mapping for Tensor is correct
Label value "tensor" aligns with metrics label expectations.
lib/llm/src/model_type.rs (5)
49-50: Introduce Tensor bitflag — OK
Adding Tensor = 1<<3 preserves existing flags and leaves room for future growth.
67-70: supports_tensor helper — OK
Parity with other supports_* helpers; improves readability at call sites.
82-85: Include "tensor" in as_vec output
Order remains stable (chat, completions, embedding, tensor). Good.
120-122: Add ModelInput::Tensor — OK
Enum addition is consistent with serde/Display usage.
129-130: ModelInput::as_str for Tensor — OK
Returns "tensor" as expected.
lib/llm/src/grpc/service.rs (1)
6-6: Expose tensor gRPC module — OK
Public mod export is minimal and aligns with the new service surface.
lib/llm/src/protocols.rs (1)
28-28: Expose tensor protocol module — OK
Makes tensor protocol types available to dependents.
lib/bindings/python/src/dynamo/_core.pyi (2)
847-847: Docstring includes Tensor in ModelInput — OK
851-851: Docstring includes Tensor in ModelType — OK
lib/llm/src/local_model/runtime_config.rs (1)
8-9: Import tensor protocol — OK
lib/bindings/python/rust/lib.rs (3)
162-166: Map Python ModelInput::Tensor to Rust — OK
299-303: Expose ModelType.Tensor to Python — OK
Bitflag classattr mirrors Rust side and composes via or as expected.
320-321: Add Python enum value for ModelInput::Tensor — OK
lib/llm/src/types.rs (2)
76-92: Tensor type aliases look consistent and additive
Re-exports and engine aliases align with existing OpenAI sections. Naming and Annotated usage are consistent.
88-91: Alias is correct; no separate stream response type exists
NvCreateTensorResponse is the only response type defined in lib/llm/src/protocols/tensor.rs and supports streaming via its from_annotated_stream implementation.
lib/bindings/python/rust/llm/local_model.rs (1)
64-71: LGTM on get_tensor_model_config
Conversion and Option handling look correct; error mapping is consistent with the rest of the bindings.
lib/llm/src/discovery/watcher.rs (4)
35-36: Tensor protocol import is correct and scoped.
Importing NvCreateTensorRequest/Response here keeps watcher concerns local. No issues.
58-63: Include Tensor in ALL_MODEL_TYPES.
Good addition; ensures ModelUpdate emits for tensor as well.
432-445: Tensor + Tensor routing path looks correct.
The PushRouter generic parameters align with NvCreateTensorRequest and Annotated. Registration to manager via add_tensor_model is as expected.
448-452: Unsupported configuration message updated — good.
Including “Tensor+Tensor” improves operator feedback.
lib/llm/src/discovery/model_manager.rs (5)
18-19: Correct engine type import.
types::generic::tensor::TensorStreamingEngine import aligns with the watcher registration.
40-41: Add tensor_engines field — consistent with other engine maps.
Initialization and locking semantics mirror existing maps.
59-60: Initialize tensor_engines in new() — OK.
74-80: Include tensor models in model_display_names — OK.
95-97: Align RwLock import and unwrap usage
Remove all .unwrap() on .read()/.write() if you’ve imported parking_lot::RwLock (its read()/write() return guards directly), or switch the file to std::sync::RwLock and retain the unwraps. Ensure the use at the top of lib/llm/src/discovery/model_manager.rs matches your chosen lock API.
lib/llm/src/grpc/service/tensor.rs (3)
80-95: Annotation injection — LGTM.
request_id annotation is correctly prefixed to the stream when requested.
121-152: grpc_monitor_for_disconnects — solid.
Arming/disarming semantics and inflight guard handling are correct.
165-177: UUID handling — LGTM.
Gracefully accepts client-provided UUID or generates a new v4.
lib/llm/tests/kserve_service.rs (3)
287-299: Duplicate model entry for "split" likely fine, but consider consolidating
You register runtime metadata for split via save_model_entry after add_completions_model. Looks OK for tests, but you may not need both calls if the manager prevents duplicates; consider documenting intended override behavior.
1147-1400: Great, thorough tensor-path coverage
End-to-end tests for metadata/config, unary infer, error on multi-response in unary, and stream infer are solid and mirror the intended routing behaviors.
1402-1455: Helper asserts mirror echo semantics and ID plumbing
validate_tensor_response checks name, version, id and I/O parity; aligns with the conversion path. LGTM.
lib/llm/src/grpc/service/kserve.rs (2)
300-373: Streaming path correctly preserves request_id and templating; mirrors unary path
The demux note is fair; current behavior is acceptable for now.
994-1004: KServe dtype mapping looks correct
BOOL/UINT32/INT32/FP32/BYTES → TypeBool/TypeUint32/TypeInt32/TypeFp32/TypeString matches expectations.
lib/llm/src/protocols/tensor.rs (2)
67-81: Good: NvExtProvider/AnnotationsProvider implemented consistently
Annotations pass-through via NvExt is clean and keeps parity with other protocols.
140-181: Aggregator behavior and error messages are appropriate
Single-response folding with explicit “Multiple responses in non-streaming mode” aligns with tests and expected semantics.
d63d1d5 to 53689af (Compare)
Signed-off-by: Guan Luo <[email protected]>
53689af to 164044f (Compare)
Signed-off-by: Guan Luo <[email protected]>
/ok to test a6054a8
@GuanLuo Could you provide an example of how we use this new
rmccorm4 left a comment
LGTM - most comments are just questions for better understanding.
- We need docs for both kserve frontend and tensor-based model related features, describing how to opt in/out, and what their current limitations or assumptions are
- I notice a lot of TODOs in the code. If you write up some detailed GitHub issues describing some of the TODOs, it is very likely (from past history so far) that an OSS contributor will try to fix/implement them. Even if they don't get picked up, these will auto-sync to Linear and be helpful for others to pick up in the future too.
ok to merge conditional on ci failing not due to this pr + it'd be nice to have some logic extracted into smaller methods
Signed-off-by: Guan Luo <[email protected]>
Signed-off-by: Guan Luo <[email protected]>
/ok to test 41fdf07
Co-authored-by: Ryan McCormick <[email protected]> Signed-off-by: GuanLuo <[email protected]>
/ok to test 2fd767c
Signed-off-by: Guan Luo <[email protected]> Signed-off-by: GuanLuo <[email protected]> Co-authored-by: Olga Andreeva <[email protected]> Co-authored-by: Ryan McCormick <[email protected]>
Signed-off-by: Guan Luo <[email protected]> Signed-off-by: GuanLuo <[email protected]> Co-authored-by: Olga Andreeva <[email protected]> Co-authored-by: Ryan McCormick <[email protected]> Signed-off-by: Jason Zhou <[email protected]>
Signed-off-by: Guan Luo <[email protected]> Signed-off-by: GuanLuo <[email protected]> Co-authored-by: Olga Andreeva <[email protected]> Co-authored-by: Ryan McCormick <[email protected]> Signed-off-by: Jason Zhou <[email protected]>
Signed-off-by: Guan Luo <[email protected]> Signed-off-by: GuanLuo <[email protected]> Co-authored-by: Olga Andreeva <[email protected]> Co-authored-by: Ryan McCormick <[email protected]> Signed-off-by: Kyle H <[email protected]>
Overview:
This PR introduces a new request / response type, tensor, which is commonly used in non-LLM inference, for the backend. With #2638 Dynamo now has a frontend that handles tensor requests / responses (KServe gRPC protocol) but converts them to OpenAI completion messages; with this PR we can enable more traditional use cases.

NOTE: the tensor support currently lives in llm and there is use of utilities in llm (i.e. register_llm()) for convenience, but the tensor type can be used for generic inference and we should revisit the placement / usage later.

Details:
- New NvCreateTensorRequest / NvCreateTensorResponse message types that a backend can consume and produce.
- New ModelType::Tensor, ModelInputType::Tensor which can be passed during register_llm(). An engine that interacts with tensor messages will be created and registered if both are set. The user may register the same model for both tensor and text input under different endpoints, but we need to validate that; for now the tensor types can not be set together with existing types for LLM inference (a usage sketch follows this list).
- New runtime_config field in ModelDeploymentCard, which the backend can pass during the register_llm() call. When invoking the ModelMetadata / ModelConfig endpoints, the runtime_config will be fetched and converted to the corresponding messages. If a model type is tensor, we expect the runtime_config to be the Pythonized representation of a TensorModelConfig as defined in lib/llm/src/protocols/tensor.rs. In a future PR, we should properly define a field in ModelDeploymentCard similar to how the tokenizer config is fetched and stored.
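For illustration, a backend-side registration could look roughly like the sketch below, loosely based on lib/bindings/python/tests/test_tensor.py. The import paths, argument order of register_llm(), namespace/endpoint names, tensor config fields, and the request/response dict shape are assumptions for the example, not the exact API:

```python
# Hedged sketch: register a tensor model and serve an echo handler.
from dynamo.llm import ModelInput, ModelRuntimeConfig, ModelType, register_llm
from dynamo.runtime import DistributedRuntime

async def worker(runtime: DistributedRuntime):
    component = runtime.namespace("dyn").component("backend")  # names are illustrative
    await component.create_service()
    endpoint = component.endpoint("generate")

    # runtime_config carries the Pythonized TensorModelConfig
    # (see lib/llm/src/protocols/tensor.rs); field names here are assumptions.
    runtime_config = ModelRuntimeConfig()
    runtime_config.set_tensor_model_config({
        "name": "my_tensor_model",
        "inputs": [{"name": "INPUT0", "data_type": "FP32", "shape": [-1]}],
        "outputs": [{"name": "OUTPUT0", "data_type": "FP32", "shape": [-1]}],
    })

    # Both the Tensor model input and Tensor model type must be set for the tensor engine path.
    await register_llm(
        ModelInput.Tensor,
        ModelType.Tensor,
        endpoint,
        "my_tensor_model",
        runtime_config=runtime_config,
    )

    async def generate(request, _context):
        # Echo the request tensors back; dict keys are illustrative.
        yield {"model": request["model"], "tensors": request["tensors"]}

    await endpoint.serve_endpoint(generate)
```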
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Documentation
Tests