-
Notifications
You must be signed in to change notification settings - Fork 691
feat: Add a checksum to ModelDeploymentCard fields #2934
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughIntroduces a new CheckedFile abstraction with Blake3 checksums and serde support, updates model_card and template processing to use it, adjusts NATS object store APIs to borrow Url, and enables blake3 mmap/rayon features. Adds a new module, updates public enums and function signatures, and removes a license header. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor C as Caller
participant MC as ModelCard/Template Logic
participant CF as CheckedFile
participant RT as NATS Runtime
Note over CF: New abstraction: local Path or remote Url<br/>with Blake3 checksum
C->>MC: Prepare artifact for upload
MC->>CF: path()
alt Has local path
MC->>RT: object_store_upload(filepath, &url)
RT-->>MC: Ok
MC->>CF: move_to_url(url)
else No local path (already URL)
MC-->>C: Skip upload
end
sequenceDiagram
autonumber
actor C as Caller
participant MC as ModelCard/Template Logic
participant CF as CheckedFile
participant RT as NATS Runtime
C->>MC: Ensure local file for processing
MC->>CF: url()
alt Has remote Url
MC->>RT: object_store_download(&url, filepath)
RT-->>MC: Ok
MC->>CF: move_to_disk(filepath)
MC->>CF: checksum_matches(filepath)
alt Checksum ok
MC-->>C: Proceed
else Checksum mismatch
MC-->>C: Error
end
else Has local path
MC-->>C: Use existing path
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
lib/runtime/src/transports/nats.rs (1)
419-432: url_to_bucket_and_key drops nested path segments (breaks keys with slashes).Only the first two segments are used; additional segments are ignored, corrupting keys like nats://h/bucket/dir1/dir2/file. Join the remainder into the key.
Apply:
-pub fn url_to_bucket_and_key(url: &Url) -> anyhow::Result<(String, String)> { +pub fn url_to_bucket_and_key(url: &Url) -> anyhow::Result<(String, String)> { let Some(mut path_segments) = url.path_segments() else { anyhow::bail!("No path in NATS URL: {url}"); }; let Some(bucket) = path_segments.next() else { anyhow::bail!("No bucket in NATS URL: {url}"); }; - let Some(key) = path_segments.next() else { - anyhow::bail!("No key in NATS URL: {url}"); - }; - Ok((bucket.to_string(), key.to_string())) + let key = { + let first = path_segments.next().ok_or_else(|| anyhow::anyhow!("No key in NATS URL: {url}"))?; + let mut k = first.to_string(); + for seg in path_segments { + k.push('/'); + k.push_str(seg); + } + k + }; + Ok((bucket.to_string(), key)) }lib/llm/src/model_card.rs (1)
204-207: mdcsum includes volatile fields and unwraps; compute a stable hash and avoid panicAs written, the checksum changes whenever last_published changes and can panic on serialization failure. If the goal is “same files” parity, hash only the CheckedFile checksums (and variant tags) and exclude volatile fields.
- pub fn mdcsum(&self) -> String { - let json = self.to_json().unwrap(); - format!("{}", blake3::hash(json.as_bytes())) - } + pub fn mdcsum(&self) -> String { + // Stable, file-oriented digest: ignore volatile fields like `last_published`. + let mut hasher = blake3::Hasher::new(); + if let Some(ModelInfoType::HfConfigJson(f)) = &self.model_info { + hasher.update(b"model_info:hf_config_json:"); + hasher.update(f.checksum_hex().as_bytes()); + } + if let Some(GenerationConfig::HfGenerationConfigJson(f)) = &self.gen_config { + hasher.update(b"gen_config:hf_generation_config_json:"); + hasher.update(f.checksum_hex().as_bytes()); + } + if let Some(TokenizerKind::HfTokenizerJson(f)) = &self.tokenizer { + hasher.update(b"tokenizer:hf_tokenizer_json:"); + hasher.update(f.checksum_hex().as_bytes()); + } + if let Some(PromptFormatterArtifact::HfTokenizerConfigJson(f)) = &self.prompt_formatter { + hasher.update(b"prompt_formatter:hf_tokenizer_config_json:"); + hasher.update(f.checksum_hex().as_bytes()); + } + if let Some(PromptFormatterArtifact::HfChatTemplate(f)) = &self.chat_template_file { + hasher.update(b"chat_template:hf_chat_template:"); + hasher.update(f.checksum_hex().as_bytes()); + } + hasher.finalize().to_string() + }If checksum_hex() doesn’t exist on CheckedFile, we should add a public getter for a canonical hex digest. I can follow up with that change.
🧹 Nitpick comments (11)
lib/llm/Cargo.toml (1)
97-97: Consider gating rayon behind a crate feature to trim build size when not needed.If hashing large files isn’t on all build paths, expose a cargo feature (e.g., fast-hash) and move blake3’s "rayon" under it.
lib/runtime/src/transports/nats.rs (1)
228-259: Make data upload/download APIs borrow Url too for consistency.object_store_upload_data and object_store_download_data still take Url by value. Switch to &Url to align with the new API and avoid moves.
- pub async fn object_store_upload_data<T>(&self, data: &T, nats_url: Url) -> anyhow::Result<()> + pub async fn object_store_upload_data<T>(&self, data: &T, nats_url: &Url) -> anyhow::Result<()> ... - let (bucket_name, key) = url_to_bucket_and_key(&nats_url)?; + let (bucket_name, key) = url_to_bucket_and_key(nats_url)?;- pub async fn object_store_download_data<T>(&self, nats_url: Url) -> anyhow::Result<T> + pub async fn object_store_download_data<T>(&self, nats_url: &Url) -> anyhow::Result<T> ... - let (bucket_name, key) = url_to_bucket_and_key(&nats_url)?; + let (bucket_name, key) = url_to_bucket_and_key(nats_url)?;Also applies to: 254-279
lib/llm/src/preprocessor/prompt/template.rs (1)
41-57: Same restriction for chat template — confirm and consider preserving newlines.
- HfChatTemplate now requires a local path; confirm that’s intended across all call sites.
- Replacing '\n' may alter valid Jinja. Consider keeping original content unless you observed template issues.
lib/llm/src/common/checked_file.rs (3)
117-124: Typo in serialized struct name.serialize_struct is labeled "Checksum" for CheckedFile; use "CheckedFile".
- let mut cf = serializer.serialize_struct("Checksum", 2)?; + let mut cf = serializer.serialize_struct("CheckedFile", 2)?;
137-147: URL vs local-path detection can misclassify Windows-like strings.Url::parse succeeds on any “scheme:…” string. Consider classifying as URL only for known schemes (e.g., nats|http|https|s3|gs|file) or when url.has_host()/scheme() matches expected.
Example:
- match Url::parse(&temp.path) { - Ok(url) => CheckedFile { path: Either::Right(url), checksum: temp.checksum }, + match Url::parse(&temp.path) { + Ok(url) if matches!(url.scheme(), "nats" | "http" | "https" | "s3" | "gs" | "file") => { + CheckedFile { path: Either::Right(url), checksum: temp.checksum } + } Err(_) => CheckedFile { path: Either::Left(PathBuf::from(temp.path)), checksum: temp.checksum, }, }
169-176: Threshold heuristic: consider using blake3’s own mmap fallback logic.blake3::Hasher::update_mmap_rayon already decides when to mmap vs. fall back. Calling it directly (for sufficiently large files) or using update_reader for small ones can simplify and avoid arbitrary 10 MB thresholds.
Refs: blake3 Hasher docs. (docs.rs)
lib/llm/src/model_card.rs (5)
224-231: Clarify invariant and error when tokenizer is URL-backedtokenizer_hf() bails if the tokenizer is a URL. That’s fine, but the error is generic. Make it actionable by including the URL and suggesting move_from_nats() first.
- let p = checked_file.path().ok_or_else(|| { - anyhow::anyhow!("URL tokenizer cannot be loaded by HF tokenizer") - })?; + let p = checked_file.path().ok_or_else(|| { + let u = checked_file.url(); + anyhow::anyhow!( + "Tokenizer is URL-backed ({:?}); call move_from_nats() before tokenizer_hf()", + u + ) + })?; ```<!-- review_comment_end --> --- `312-326`: **Clean up downloaded file on checksum mismatch to avoid leaving bad artifacts** If checksum validation fails, remove the just-downloaded file before bailing. ```diff - if !src_file.checksum_matches(&target) { - anyhow::bail!( + if !src_file.checksum_matches(&target) { + let _ = tokio::fs::remove_file(&target).await; + anyhow::bail!( "Invalid {} in NATS for {}, checksum does not match.", $filename, self.display_name ); } ```<!-- review_comment_end --> --- `570-577`: **ModelInfoType::get_model_info: enforce local path, but consider hinting remediation** Good to require a local path here. Consider hinting the caller to invoke move_from_nats() when hitting the error to reduce churn. <!-- review_comment_end --> --- `626-631`: **Avoid temporary PathBufs in parent().unwrap_or(...), and dedupe generation_config path creation** Minor cleanup/readability: prefer Path::new("") over allocating a temporary PathBuf, and compute generation_config.json path once. ```diff - let file_path = file.as_ref(); + let file_path = file.as_ref(); + let gen_cfg_path = file_path + .parent() + .unwrap_or_else(|| std::path::Path::new("")) + .join("generation_config.json"); @@ - let bos_token_id = crate::file_json_field::<TokenIdType>( - &Path::join( - file_path.parent().unwrap_or(&PathBuf::from("")), - "generation_config.json", - ), - "bos_token_id", - ) + let bos_token_id = crate::file_json_field::<TokenIdType>(&gen_cfg_path, "bos_token_id") @@ - crate::file_json_field( - &Path::join( - file_path.parent().unwrap_or(&PathBuf::from("")), - "generation_config.json", - ), - "eos_token_id", - ) + crate::file_json_field(&gen_cfg_path, "eos_token_id") @@ - path = %file_path.display(), + path = %file_path.display(), ```<!-- review_comment_end --> Also applies to: 644-647, 693-696, 684-685 --- `878-881`: **Test nits: pass Path instead of String to from_json_file** The new signature accepts AsRef<Path>. Avoiding String allocations is cleaner. ```diff - let config = HFConfig::from_json_file(config_file.display().to_string())?; + let config = HFConfig::from_json_file(&config_file)?;Apply the same change to the Llama-4 test.
Also applies to: 887-890
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
Cargo.lockis excluded by!**/*.locklib/bindings/python/Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (6)
lib/llm/Cargo.toml(1 hunks)lib/llm/src/common.rs(1 hunks)lib/llm/src/common/checked_file.rs(1 hunks)lib/llm/src/model_card.rs(16 hunks)lib/llm/src/preprocessor/prompt/template.rs(1 hunks)lib/runtime/src/transports/nats.rs(2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-13T22:07:24.843Z
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
Applied to files:
lib/runtime/src/transports/nats.rs
🧬 Code graph analysis (2)
lib/llm/src/preprocessor/prompt/template.rs (1)
lib/llm/src/local_model.rs (1)
display_name(340-342)
lib/llm/src/model_card.rs (2)
lib/llm/src/local_model.rs (2)
path(335-337)display_name(340-342)lib/runtime/src/transports/nats.rs (2)
object_store_upload(176-191)object_store_download(194-212)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: pre-merge-rust (.)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (10)
lib/llm/Cargo.toml (1)
97-97: Blake3 features align with code path using update_mmap_rayon — good.Enabling "mmap" and "rayon" is correct for Hasher::update_mmap_rayon used in this PR. Docs confirm those methods are gated by these features.
Refs: blake3 Hasher docs and crate features. (docs.rs)
lib/runtime/src/transports/nats.rs (2)
175-191: Borrowing Url avoids unnecessary clones — good change.Signature change to take &Url for uploads is an improvement and keeps ownership at the callsite.
193-212: Symmetric improvement on download — good.Matching the upload API, borrowing &Url here is consistent and reduces moves.
lib/llm/src/common.rs (1)
4-4: Module exposure LGTM.Publicly exporting checked_file is appropriate given downstream usage.
lib/llm/src/preprocessor/prompt/template.rs (1)
26-33: New restriction: tokenizer_config must be a local path — confirm intent.This bails on URLs. If remote/MDC URLs were previously valid, this is a breaking change; confirm that workers will always materialize these locally before formatting.
lib/llm/src/common/checked_file.rs (1)
112-125: Url crateserdefeature is already enabled
Cargo.toml includesurl = { version = "2.5", features = ["serde"] }, soCheckedFile::serializewill compile as is.lib/llm/src/model_card.rs (4)
270-275: NATS transfers for generation_config: looks goodMacro usage and variant selection are consistent for upload/download of generation_config.json.
Also applies to: 329-334
505-507: Custom template ingest: good move to CheckedFilefrom_disk() here ensures the checksum is recorded at ingest time.
804-810: Repo loaders migrated to CheckedFile: solidfrom_repo() helpers now precompute checksums at ingest. Error contexts are clear and specific.
Also applies to: 812-818, 820-836, 838-844
39-44: ModelInfoType and related enums need backward-compatibility for legacy string payloadslib/llm/src/model_card.rs definitions for ModelInfoType (39–44), TokenizerKind (46–53), PromptFormatterArtifact (66–72), and GenerationConfig (83–89) now wrap CheckedFile instead of PathBuf, so existing serialized strings/paths will fail to deserialize. Implement either:
- custom Deserialize for these enums to accept both legacy string/URL and CheckedFile (using CheckedFile::from_disk/from_url)
- a one-time migration to rewrite persisted MDCs to the new shape
⛔ Skipped due to learnings
Learnt from: GuanLuo PR: ai-dynamo/dynamo#2714 File: lib/llm/src/discovery/model_entry.rs:38-42 Timestamp: 2025-09-02T16:46:54.015Z Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.
e4f202c to
e28b0fd
Compare
This ensures frontend and worker are using exactly the same files. Signed-off-by: Graham King <[email protected]>
Signed-off-by: Graham King <[email protected]>
e28b0fd to
9147f31
Compare
Signed-off-by: Graham King <[email protected]>
|
/ok to test |
@grahamking, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/ |
|
/ok to test f5f39c9 |
Signed-off-by: Graham King <[email protected]> Signed-off-by: Indrajit Bhosale <[email protected]>
Signed-off-by: Graham King <[email protected]>
Signed-off-by: Graham King <[email protected]> Signed-off-by: hongkuanz <[email protected]>
This ensures frontend and worker are using exactly the same files.
What is old is new again.
The ModelDeploymentCard looks like this now:
Summary by CodeRabbit