-
Notifications
You must be signed in to change notification settings - Fork 659
feat: enhance sanity_check.py with NATS, etcd, and HF model check #3271
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: enhance sanity_check.py with NATS, etcd, and HF model check #3271
Conversation
WalkthroughAdds new NodeInfo types for NATS, etcd, and Hugging Face cache; integrates them into the Dynamo system tree. Updates DynamoInfo and PythonPackageInfo constructors (adds terse and import_error). Propagates terse mode, adjusts inclusion logic, adds HTTP-based NATS/etcd stats fetching, updates framework display labels, and augments TensorRT-LLM import diagnostics. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant U as Caller
participant D as DynamoInfo
participant S as System Tree
participant H as HuggingFaceInfo
participant N as NatsInfo
participant E as EtcdInfo
Note over U,D: Initialize with flags<br/>thorough_check, terse
U->>D: new DynamoInfo(thorough_check, terse)
D->>S: Build tree root
D->>H: Add HuggingFaceInfo (always)
alt terse == false
D->>N: Add NatsInfo
D->>E: Add EtcdInfo
else terse == true
D->>N: Add only if warnings/errors
D->>E: Add only if warnings/errors
end
D-->>U: Assembled Dynamo tree
sequenceDiagram
autonumber
participant N as NatsInfo
participant NE as NATS HTTP Endpoint
participant E as EtcdInfo
participant EE as etcd HTTP Endpoint
Note over N: Collect stats (thorough mode)
N->>NE: GET /metrics or /varz/connz/subsz
alt endpoint available
NE-->>N: Stats JSON/metrics
N->>N: Parse & format details
else unavailable
N->>N: Record fallback status
end
Note over E: Collect stats (thorough mode)
E->>EE: GET /metrics/health/leader
alt endpoint available
EE-->>E: Stats/health JSON
E->>E: Parse & format details
else unavailable
E->>E: Record fallback status
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
Pre-merge checks❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🧹 Nitpick comments (5)
deploy/dynamo_check.py (5)
757-795: ETCD_ENDPOINTS parsing ignores multiple endpoints and scheme nuancesOnly the first endpoint is used. In HA setups, consider trying each endpoint until one responds and preserving scheme (http/https).
I can provide a helper that yields (host, port, scheme) for each endpoint and tries them with timeouts.
1131-1137: Support standard HF cache env vars (HF_HOME/HF_HUB_CACHE/HUGGINGFACE_HUB_CACHE)
HF_CACHEis non-standard. Add common envs so paths resolve correctly on typical installs.- hf_cache_path = os.environ.get("HF_CACHE") - if hf_cache_path: - hf_cache_path = os.path.join(os.path.expanduser(hf_cache_path), "hub") - else: - hf_cache_path = os.path.expanduser("~/.cache/huggingface/hub") + hf_cache_path = ( + os.environ.get("HUGGINGFACE_HUB_CACHE") + or os.environ.get("HF_HUB_CACHE") + or (os.path.join(os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface")), "hub")) + ) + hf_cache_path = os.path.expanduser(hf_cache_path)
1190-1267: HF model listing cost control and naming
- In thorough mode you compute per-model sizes; good that you cap output to 5, but size-walking can still be heavy for very large repos. Consider skipping size for >N models or summing
blobs/once.- The “models--org--name” decoding is correct; consider also handling
snapshots/layout for better size attribution.
320-326: Hard-coded PDT offset breaks outside PDT and during DST transitionsUse
zoneinfofor local timezone conversion or print UTC to avoid inaccuracies.- # Convert to PDT (UTC-7) - dt_pdt = dt_utc - datetime.timedelta(hours=7) - return dt_pdt.strftime("%Y-%m-%d %H:%M:%S PDT") + try: + from zoneinfo import ZoneInfo + tz = ZoneInfo(os.environ.get("TZ", "America/Los_Angeles")) + return dt_utc.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S %Z") + except Exception: + return dt_utc.strftime("%Y-%m-%d %H:%M:%S UTC")
599-613: HTTP fetch hardening and logging (urllib)Numerous
except Exception: passaround urllib calls hide actionable errors and trigger S310/BLE001/TRY300 lint warnings. At minimum, log at DEBUG and ensure scheme is http/https (you already build the URLs, but guard future changes).Want a small helper
_http_get_json(url, timeout=2)with scheme check and debug logging, then replace call sites?
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
deploy/dynamo_check.py(12 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: keivenchang
PR: ai-dynamo/dynamo#3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
🧬 Code graph analysis (1)
deploy/dynamo_check.py (1)
lib/runtime/src/system_status_server.rs (1)
response(300-305)
🪛 Gitleaks (8.28.0)
deploy/dynamo_check.py
[high] 52-52: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🪛 Ruff (0.13.1)
deploy/dynamo_check.py
488-489: try-except-pass detected, consider logging the exception
(S110)
488-488: Do not catch blind exception: Exception
(BLE001)
496-496: Consider moving this statement to an else block
(TRY300)
514-514: Consider moving this statement to an else block
(TRY300)
515-515: Do not catch blind exception: Exception
(BLE001)
580-580: Do not catch blind exception: Exception
(BLE001)
599-599: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
600-600: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
603-603: Do not catch blind exception: Exception
(BLE001)
609-609: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
610-610: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
613-614: try-except-pass detected, consider logging the exception
(S110)
613-613: Do not catch blind exception: Exception
(BLE001)
662-662: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
663-663: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
676-676: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
677-677: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
714-714: Do not catch blind exception: Exception
(BLE001)
771-772: try-except-pass detected, consider logging the exception
(S110)
771-771: Do not catch blind exception: Exception
(BLE001)
785-785: Possible binding to all interfaces
(S104)
789-790: try-except-pass detected, consider logging the exception
(S110)
789-789: Do not catch blind exception: Exception
(BLE001)
806-806: Consider moving this statement to an else block
(TRY300)
807-807: Do not catch blind exception: Exception
(BLE001)
911-911: Do not catch blind exception: Exception
(BLE001)
928-928: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
929-929: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
932-933: try-except-pass detected, consider logging the exception
(S110)
932-932: Do not catch blind exception: Exception
(BLE001)
947-947: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
948-948: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
955-956: try-except-pass detected, consider logging the exception
(S110)
955-955: Do not catch blind exception: Exception
(BLE001)
961-961: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
962-962: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
974-975: try-except-pass detected, consider logging the exception
(S110)
974-974: Do not catch blind exception: Exception
(BLE001)
980-980: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
981-981: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
985-985: Do not catch blind exception: Exception
(BLE001)
990-994: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
995-995: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
999-1000: try-except-pass detected, consider logging the exception
(S110)
999-999: Do not catch blind exception: Exception
(BLE001)
1002-1002: Consider moving this statement to an else block
(TRY300)
1003-1004: try-except-pass detected, consider logging the exception
(S110)
1003-1003: Do not catch blind exception: Exception
(BLE001)
1052-1052: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
1053-1053: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
1056-1056: Do not catch blind exception: Exception
(BLE001)
1068-1068: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
1069-1069: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
1088-1088: Do not catch blind exception: Exception
(BLE001)
1099-1103: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
1104-1104: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
1121-1122: try-except-pass detected, consider logging the exception
(S110)
1121-1121: Do not catch blind exception: Exception
(BLE001)
1217-1217: Do not catch blind exception: Exception
(BLE001)
1224-1224: Do not catch blind exception: Exception
(BLE001)
1228-1229: try-except-pass detected, consider logging the exception
(S110)
1228-1228: Do not catch blind exception: Exception
(BLE001)
1238-1238: Loop control variable dirnames not used within loop body
Rename unused dirnames to _dirnames
(B007)
1246-1247: try-except-pass detected, consider logging the exception
(S110)
1246-1246: Do not catch blind exception: Exception
(BLE001)
2509-2509: Do not catch blind exception: Exception
(BLE001)
2574-2575: try-except-pass detected, consider logging the exception
(S110)
2574-2574: Do not catch blind exception: Exception
(BLE001)
2577-2578: try-except-pass detected, consider logging the exception
(S110)
2577-2577: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: trtllm
- GitHub Check: sglang
- GitHub Check: vllm
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (5)
deploy/dynamo_check.py (5)
3048-3064: Good integration of etcd/NATS into DynamoInfo with terse gatingThe inclusion logic and propagation of
terseandthorough_checklook correct.Please confirm CLI help/docs mention that etcd/NATS checks run in terse mode but only display on warn/error.
432-433: Terse mode inclusion list looks good; minor: keep HF in error-only setGood call adding HF cache to terse error-only checks. No changes needed.
52-52: Gitleaks “Generic API Key” is likely a false positive from example outputThe detected string appears to be a sample NATS server ID in the doc block, not a credential. Consider adding an allowlist entry or adjust the example to avoid tripping Gitleaks.
1093-1121: Drop the etcd range change—current code already returns all keys
Per etcd v3 API, setting bothkeyandrange_endto'\0'(base64 "AA==") returns the full keyspace (etcd.io).Likely an incorrect or invalid review comment.
651-713: Mirror varz fallback for subsz and cap subject list
- Retry
subszandsubsz?subs=1requests againsthttp://{host}:8222on exception, just like varz.- Limit
dynamo_subjectsto a maximum of 50 entries to prevent excessively large outputs.
2cf0dca to
abe8032
Compare
abe8032 to
bc0ba12
Compare
deploy/dynamo_check.py
Outdated
| missing_text = ", ".join(missing_frameworks) | ||
| self.desc = missing_text | ||
|
|
||
| def _check_tensorrt_llm_installation(self) -> Optional[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to add this back because there is a corner case where the GPU is disconnected (Docker issue) and then importing tensorrt_llm fails. But we still need to detect that this is installed.
bc0ba12 to
b3970a3
Compare
b3970a3 to
fd0ede5
Compare
- Add etcd service monitoring with key listing and cluster info - Add NATS service monitoring with subject and subscription details - Add Hugging Face cache inspection showing cached models - Keep dynamo_check.py as symlink to sanity_check.py for compatibility - Reorder checks: services, file system, HF cache, then dev tools Signed-off-by: Keiven Chang <[email protected]>
9738837 to
9b8e7b0
Compare
Overview:
Enhance sanity_check.py with infrastructure monitoring for NATS, etcd, and Hugging Face model cache.
Details:
Example 1:
Example 2:
Where should the reviewer start?
deploy/sanity_check.py - Focus on NatsInfo, EtcdInfo, and HuggingFaceInfo classes and their integration into DynamoInfo section
Related Issues:
/coderabbit profile chill
Summary by CodeRabbit
New Features
Improvements