Skip to content

Conversation

@keivenchang
Copy link
Contributor

@keivenchang keivenchang commented Sep 27, 2025

Overview:

Enhance sanity_check.py with infrastructure monitoring for NATS, etcd, and Hugging Face model cache.

Details:

  • Add NATS service monitoring with connection status, uptime, message stats, and Dynamo-specific subjects
  • Add etcd service monitoring with cluster info, version, and complete key listing with revision data
  • Add Hugging Face model cache detection with download timestamps and sizes
  • Use actual environment variables (NATS_SERVER, ETCD_ENDPOINTS) from codebase for configuration consistency
  • Update documentation with realistic examples matching actual system output

Example 1:

├─ ⚠️ Hugging Face Cache: ~/.cache/huggingface/hub not available
└─ Dynamo: $HOME/dynamo, SHA: 4d326d4a5, Date: 2025-09-27 11:10:10 PDT
   ├─ ⚠️ etcd: service not available on localhost:2379
   ├─ ⚠️ NATS: service not available on localhost:4222

Example 2:

├─ ✅ Hugging Face Cache: 5 models in $HOME/.cache/huggingface/hub
│  ├─ Model 1: EricB/mistralrs_tests, downloaded=2025-07-23 17:46:49 PDT, size=20.6 MB
│  ├─ Model 2: Qwen/Qwen3-0.6B, downloaded=2025-07-26 10:03:25 PDT, size=1.41 GB
│  ├─ Model 3: TinyLlama/TinyLlama-1.1B-Chat-v1.0, downloaded=2025-09-16 09:47:01 PDT, size=2.05 GB
│  ├─ Model 4: deepseek-ai/DeepSeek-R1-Distill-Llama-8B, downloaded=2025-08-09 13:33:57 PDT, size=15.0 GB
│  └─ Model 5: llava-hf/llava-1.5-7b-hf, downloaded=2025-08-22 09:57:11 PDT, size=4.01 MB
└─ Dynamo: $HOME/dynamo, SHA: 9d73be125, Date: 2025-09-26 16:01:01 PDT
   ├─ ✅ etcd: service running on localhost:2379
   │  ├─ Version: 3.6.1
   │  ├─ Cluster Members: 1
   │  └─ Keys: 10 keys found
   ├─ ✅ NATS: service running on localhost:4222
   │  ├─ Server: NA4B6QC5MG5YT3ZY7UPBLNQYNG6QXZLQEGILXT3BQSETPZVMQ6EI5NKN
   │  ├─ Uptime: 8d18h59m5s
   │  ├─ Connections: 3

Where should the reviewer start?

deploy/sanity_check.py - Focus on NatsInfo, EtcdInfo, and HuggingFaceInfo classes and their integration into DynamoInfo section

Related Issues:

/coderabbit profile chill

Summary by CodeRabbit

  • New Features

    • System info now surfaces NATS and etcd status, including connections, uptime, and subscriptions (with deeper details in thorough mode).
    • Adds Hugging Face cache visibility under System info, optionally listing cached models in thorough mode.
    • Python package panel now highlights import errors and hints at detected TensorRT-LLM installs when imports fail.
  • Improvements

    • Terse mode suppresses NATS/etcd sections unless warnings/errors are present.
    • More consistent framework naming in the UI (e.g., SGLang, TensorRT-LLM).
    • More resilient diagnostics with graceful fallbacks if monitoring endpoints are unavailable.

@keivenchang keivenchang requested a review from a team as a code owner September 27, 2025 18:13
@github-actions github-actions bot added the feat label Sep 27, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 27, 2025

Walkthrough

Adds new NodeInfo types for NATS, etcd, and Hugging Face cache; integrates them into the Dynamo system tree. Updates DynamoInfo and PythonPackageInfo constructors (adds terse and import_error). Propagates terse mode, adjusts inclusion logic, adds HTTP-based NATS/etcd stats fetching, updates framework display labels, and augments TensorRT-LLM import diagnostics.

Changes

Cohort / File(s) Summary
New infrastructure info nodes
deploy/dynamo_check.py
Added classes: NatsInfo, EtcdInfo, HuggingFaceInfo; gather and render status, with thorough mode showing detailed stats and listings.
DynamoInfo constructor and tree integration
deploy/dynamo_check.py
DynamoInfo.init now accepts terse: bool; propagates terse to children; conditionally includes NATS/etcd nodes in terse mode; integrates HuggingFaceInfo under System info.
NATS/etcd HTTP stats helpers
deploy/dynamo_check.py
Introduced internal HTTP fetch and formatting helpers for NATS/etcd endpoints with graceful fallbacks when unavailable.
PythonPackageInfo signature
deploy/dynamo_check.py
PythonPackageInfo.init gains import_error: Optional[str]; display/status reflect import failures (e.g., TensorRT-LLM).
Framework labels and diagnostics
deploy/dynamo_check.py
Updated display labels: “SGLang”, “TensorRT-LLM”; added helper to detect potential TensorRT-LLM installation path on import failure; optional extra Cargo debug in thorough mode.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as Caller
  participant D as DynamoInfo
  participant S as System Tree
  participant H as HuggingFaceInfo
  participant N as NatsInfo
  participant E as EtcdInfo
  Note over U,D: Initialize with flags<br/>thorough_check, terse
  U->>D: new DynamoInfo(thorough_check, terse)
  D->>S: Build tree root
  D->>H: Add HuggingFaceInfo (always)
  alt terse == false
    D->>N: Add NatsInfo
    D->>E: Add EtcdInfo
  else terse == true
    D->>N: Add only if warnings/errors
    D->>E: Add only if warnings/errors
  end
  D-->>U: Assembled Dynamo tree
Loading
sequenceDiagram
  autonumber
  participant N as NatsInfo
  participant NE as NATS HTTP Endpoint
  participant E as EtcdInfo
  participant EE as etcd HTTP Endpoint
  Note over N: Collect stats (thorough mode)
  N->>NE: GET /metrics or /varz/connz/subsz
  alt endpoint available
    NE-->>N: Stats JSON/metrics
    N->>N: Parse & format details
  else unavailable
    N->>N: Record fallback status
  end
  Note over E: Collect stats (thorough mode)
  E->>EE: GET /metrics/health/leader
  alt endpoint available
    EE-->>E: Stats/health JSON
    E->>E: Parse & format details
  else unavailable
    E->>E: Record fallback status
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I twitch my ears at NATS that sing,
et-cetera etcd joins the ring—
Hugging Face caches hum in tune,
terse or thorough, we’ll know soon.
If imports trip, I sniff the trail,
TensorRT-LLM won’t fail my tail—
hop, hop, the checks prevail! 🐇✨

Pre-merge checks

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title Check ⚠️ Warning The PR title mentions "sanity_check.py" but the actual file being modified according to the raw summary is "deploy/dynamo_check.py". While the title correctly identifies the three main features being added (NATS, etcd, and HF model checks) and accurately describes the enhancement being made, the incorrect filename is a factual error that makes the title misleading about which file is being changed. This is a significant detail that could cause confusion for reviewers scanning pull request history. Update the PR title to reference the correct filename: "feat: enhance dynamo_check.py with NATS, etcd, and HF model check". This maintains the clear description of the features being added while accurately reflecting which file is being modified.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed The pull request description follows the template structure with all major sections present and substantively filled out. The Overview section clearly describes the enhancement to infrastructure monitoring, the Details section provides comprehensive information with specific examples showing the expected output format for NATS, etcd, and Hugging Face cache monitoring, and the "Where should the reviewer start?" section provides specific guidance on key classes and files to review. However, the "Related Issues" section deviates from the template requirement, which specifies using action keywords (Closes/Fixes/Resolves/Relates to) followed by a GitHub issue number (e.g., "#xxx"), but instead contains what appears to be a configuration command ("/coderabbit profile chill"). This represents a minor formatting deviation that does not significantly impact the overall quality and completeness of the description.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (5)
deploy/dynamo_check.py (5)

757-795: ETCD_ENDPOINTS parsing ignores multiple endpoints and scheme nuances

Only the first endpoint is used. In HA setups, consider trying each endpoint until one responds and preserving scheme (http/https).

I can provide a helper that yields (host, port, scheme) for each endpoint and tries them with timeouts.


1131-1137: Support standard HF cache env vars (HF_HOME/HF_HUB_CACHE/HUGGINGFACE_HUB_CACHE)

HF_CACHE is non-standard. Add common envs so paths resolve correctly on typical installs.

-        hf_cache_path = os.environ.get("HF_CACHE")
-        if hf_cache_path:
-            hf_cache_path = os.path.join(os.path.expanduser(hf_cache_path), "hub")
-        else:
-            hf_cache_path = os.path.expanduser("~/.cache/huggingface/hub")
+        hf_cache_path = (
+            os.environ.get("HUGGINGFACE_HUB_CACHE")
+            or os.environ.get("HF_HUB_CACHE")
+            or (os.path.join(os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface")), "hub"))
+        )
+        hf_cache_path = os.path.expanduser(hf_cache_path)

1190-1267: HF model listing cost control and naming

  • In thorough mode you compute per-model sizes; good that you cap output to 5, but size-walking can still be heavy for very large repos. Consider skipping size for >N models or summing blobs/ once.
  • The “models--org--name” decoding is correct; consider also handling snapshots/ layout for better size attribution.

320-326: Hard-coded PDT offset breaks outside PDT and during DST transitions

Use zoneinfo for local timezone conversion or print UTC to avoid inaccuracies.

-        # Convert to PDT (UTC-7)
-        dt_pdt = dt_utc - datetime.timedelta(hours=7)
-        return dt_pdt.strftime("%Y-%m-%d %H:%M:%S PDT")
+        try:
+            from zoneinfo import ZoneInfo
+            tz = ZoneInfo(os.environ.get("TZ", "America/Los_Angeles"))
+            return dt_utc.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S %Z")
+        except Exception:
+            return dt_utc.strftime("%Y-%m-%d %H:%M:%S UTC")

599-613: HTTP fetch hardening and logging (urllib)

Numerous except Exception: pass around urllib calls hide actionable errors and trigger S310/BLE001/TRY300 lint warnings. At minimum, log at DEBUG and ensure scheme is http/https (you already build the URLs, but guard future changes).

Want a small helper _http_get_json(url, timeout=2) with scheme check and debug logging, then replace call sites?

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d73be1 and 4d326d4.

📒 Files selected for processing (1)
  • deploy/dynamo_check.py (12 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: keivenchang
PR: ai-dynamo/dynamo#3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
🧬 Code graph analysis (1)
deploy/dynamo_check.py (1)
lib/runtime/src/system_status_server.rs (1)
  • response (300-305)
🪛 Gitleaks (8.28.0)
deploy/dynamo_check.py

[high] 52-52: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🪛 Ruff (0.13.1)
deploy/dynamo_check.py

488-489: try-except-pass detected, consider logging the exception

(S110)


488-488: Do not catch blind exception: Exception

(BLE001)


496-496: Consider moving this statement to an else block

(TRY300)


514-514: Consider moving this statement to an else block

(TRY300)


515-515: Do not catch blind exception: Exception

(BLE001)


580-580: Do not catch blind exception: Exception

(BLE001)


599-599: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


600-600: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


603-603: Do not catch blind exception: Exception

(BLE001)


609-609: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


610-610: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


613-614: try-except-pass detected, consider logging the exception

(S110)


613-613: Do not catch blind exception: Exception

(BLE001)


662-662: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


663-663: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


676-676: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


677-677: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


714-714: Do not catch blind exception: Exception

(BLE001)


771-772: try-except-pass detected, consider logging the exception

(S110)


771-771: Do not catch blind exception: Exception

(BLE001)


785-785: Possible binding to all interfaces

(S104)


789-790: try-except-pass detected, consider logging the exception

(S110)


789-789: Do not catch blind exception: Exception

(BLE001)


806-806: Consider moving this statement to an else block

(TRY300)


807-807: Do not catch blind exception: Exception

(BLE001)


911-911: Do not catch blind exception: Exception

(BLE001)


928-928: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


929-929: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


932-933: try-except-pass detected, consider logging the exception

(S110)


932-932: Do not catch blind exception: Exception

(BLE001)


947-947: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


948-948: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


955-956: try-except-pass detected, consider logging the exception

(S110)


955-955: Do not catch blind exception: Exception

(BLE001)


961-961: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


962-962: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


974-975: try-except-pass detected, consider logging the exception

(S110)


974-974: Do not catch blind exception: Exception

(BLE001)


980-980: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


981-981: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


985-985: Do not catch blind exception: Exception

(BLE001)


990-994: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


995-995: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


999-1000: try-except-pass detected, consider logging the exception

(S110)


999-999: Do not catch blind exception: Exception

(BLE001)


1002-1002: Consider moving this statement to an else block

(TRY300)


1003-1004: try-except-pass detected, consider logging the exception

(S110)


1003-1003: Do not catch blind exception: Exception

(BLE001)


1052-1052: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


1053-1053: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


1056-1056: Do not catch blind exception: Exception

(BLE001)


1068-1068: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


1069-1069: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


1088-1088: Do not catch blind exception: Exception

(BLE001)


1099-1103: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


1104-1104: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


1121-1122: try-except-pass detected, consider logging the exception

(S110)


1121-1121: Do not catch blind exception: Exception

(BLE001)


1217-1217: Do not catch blind exception: Exception

(BLE001)


1224-1224: Do not catch blind exception: Exception

(BLE001)


1228-1229: try-except-pass detected, consider logging the exception

(S110)


1228-1228: Do not catch blind exception: Exception

(BLE001)


1238-1238: Loop control variable dirnames not used within loop body

Rename unused dirnames to _dirnames

(B007)


1246-1247: try-except-pass detected, consider logging the exception

(S110)


1246-1246: Do not catch blind exception: Exception

(BLE001)


2509-2509: Do not catch blind exception: Exception

(BLE001)


2574-2575: try-except-pass detected, consider logging the exception

(S110)


2574-2574: Do not catch blind exception: Exception

(BLE001)


2577-2578: try-except-pass detected, consider logging the exception

(S110)


2577-2577: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: trtllm
  • GitHub Check: sglang
  • GitHub Check: vllm
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (5)
deploy/dynamo_check.py (5)

3048-3064: Good integration of etcd/NATS into DynamoInfo with terse gating

The inclusion logic and propagation of terse and thorough_check look correct.

Please confirm CLI help/docs mention that etcd/NATS checks run in terse mode but only display on warn/error.


432-433: Terse mode inclusion list looks good; minor: keep HF in error-only set

Good call adding HF cache to terse error-only checks. No changes needed.


52-52: Gitleaks “Generic API Key” is likely a false positive from example output

The detected string appears to be a sample NATS server ID in the doc block, not a credential. Consider adding an allowlist entry or adjust the example to avoid tripping Gitleaks.


1093-1121: Drop the etcd range change—current code already returns all keys
Per etcd v3 API, setting both key and range_end to '\0' (base64 "AA==") returns the full keyspace (etcd.io).

Likely an incorrect or invalid review comment.


651-713: Mirror varz fallback for subsz and cap subject list

  • Retry subsz and subsz?subs=1 requests against http://{host}:8222 on exception, just like varz.
  • Limit dynamo_subjects to a maximum of 50 entries to prevent excessively large outputs.

@keivenchang keivenchang force-pushed the keivenchang/dynamo_check.py-check-nats-etcd-cache branch from 2cf0dca to abe8032 Compare September 30, 2025 03:19
@keivenchang keivenchang changed the title feat: enhance dynamo_check.py with comprehensive NATS, etcd, HF feat: enhance dynamo_check.py with NATS, etcd, HF model check Sep 30, 2025
@keivenchang keivenchang force-pushed the keivenchang/dynamo_check.py-check-nats-etcd-cache branch from abe8032 to bc0ba12 Compare September 30, 2025 15:41
missing_text = ", ".join(missing_frameworks)
self.desc = missing_text

def _check_tensorrt_llm_installation(self) -> Optional[str]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to add this back because there is a corner case where the GPU is disconnected (Docker issue) and then importing tensorrt_llm fails. But we still need to detect that this is installed.

@keivenchang keivenchang force-pushed the keivenchang/dynamo_check.py-check-nats-etcd-cache branch from bc0ba12 to b3970a3 Compare October 4, 2025 04:50
@keivenchang keivenchang force-pushed the keivenchang/dynamo_check.py-check-nats-etcd-cache branch from b3970a3 to fd0ede5 Compare October 7, 2025 21:27
- Add etcd service monitoring with key listing and cluster info
- Add NATS service monitoring with subject and subscription details
- Add Hugging Face cache inspection showing cached models
- Keep dynamo_check.py as symlink to sanity_check.py for compatibility
- Reorder checks: services, file system, HF cache, then dev tools

Signed-off-by: Keiven Chang <[email protected]>
@keivenchang keivenchang force-pushed the keivenchang/dynamo_check.py-check-nats-etcd-cache branch from 9738837 to 9b8e7b0 Compare October 17, 2025 01:02
@keivenchang keivenchang changed the title feat: enhance dynamo_check.py with NATS, etcd, HF model check feat: enhance sanity_check.py with NATS, etcd, and HF model check Oct 17, 2025
@keivenchang keivenchang self-assigned this Oct 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant