Skip to content

Conversation

@nnshah1
Copy link
Contributor

@nnshah1 nnshah1 commented Sep 2, 2025

Overview:

Adds basic docs for health checks and logging with examples.

Where should the reviewer start?

logging.md and health_check.md

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • Documentation
    • Added Health Check guide covering frontend and worker liveness/health endpoints (/live, /health, /ready), behavior before/after worker registration, and readiness based on served endpoints.
    • Documented configuration via environment variables (e.g., DYN_SYSTEM_ENABLED, DYN_SYSTEM_PORT, DYN_SYSTEM_HEALTH_PATH, DYN_SYSTEM_LIVE_PATH, DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS) with examples and sample responses.
    • Added Logging guide detailing readable and JSONL formats, optional span events, distributed tracing fields (trace_id, span_id), x-request-id propagation, env-based and TOML configuration, and example outputs.
    • Linked related architecture and backend docs.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Sep 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nnshah1 nnshah1 changed the title updating incremental for review docs: health check and structured logs Sep 2, 2025
@github-actions github-actions bot added the docs label Sep 2, 2025
@nnshah1 nnshah1 marked this pull request as ready for review September 2, 2025 17:16
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 2, 2025

Walkthrough

Adds two new documentation guides: health_check and logging. They describe frontend/worker health endpoints, liveness/readiness behavior, JSON responses, and environment configuration; and logging formats (text/JSONL), tracing fields, span events, config via env vars/TOML, with examples. No code or public API changes.

Changes

Cohort / File(s) Summary
Docs: Health & Logging Guides
docs/guides/health_check.md, docs/guides/logging.md
Introduces guides for health checks (frontend/worker liveness/health, readiness, env vars, sample requests/responses) and logging (text/JSONL formats, tracing fields, span events, env/TOML config, examples).

Sequence Diagram(s)

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

Poem

A rabbit taps logs with a gentle paw,
Hums “/live” and “/health” with careful awe.
JSON hops, spans blink bright,
Trace IDs twinkle in the night.
Docs now mapped, carrots ready—
Systems up; burrow steady. 🥕✨


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Signed-off-by: nnshah1 <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
@nnshah1 nnshah1 requested a review from keivenchang September 2, 2025 17:21
@nnshah1 nnshah1 enabled auto-merge (squash) September 2, 2025 17:22
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (21)
docs/guides/logging.md (11)

22-26: Tighten wording and fix grammar.

Replace awkward phrasing and clarify feature behavior.

-Dynamo provides structured logging in both text as well as JSONL. When
-JSONL is enabled logs additionally contain `span` creation and exit
-events as well as support for `trace_id` and `span_id` fields for
-distributed tracing.
+Dynamo provides structured logging in two formats: readable text and JSONL.
+When JSONL is enabled, logs can include span creation/exit events and emit
+trace context fields (`trace_id`, `span_id`) for distributed tracing.

41-44: Add language to fenced code block (markdownlint MD040).

-```
+```bash
 export DYN_LOG="info,dynamo_runtime::system_status_server:trace"
 export DYN_LOGGING_JSONL="0"

---

`48-54`: **Add language to fenced code block (markdownlint MD040).**

```diff
-```
+```text
 2025-09-02T15:50:01.770028Z  INFO main.init: VllmWorker for Qwen/Qwen3-0.6B has been initialized
 2025-09-02T15:50:01.770195Z  INFO main.init: Reading Events from tcp://127.0.0.1:21555
 2025-09-02T15:50:01.770265Z  INFO main.init: Getting engine runtime configuration metadata from vLLM engine...
 2025-09-02T15:50:01.770316Z  INFO main.get_engine_cache_info: Cache config values: {'num_gpu_blocks': 24064}
 2025-09-02T15:50:01.770358Z  INFO main.get_engine_cache_info: Scheduler config values: {'max_num_seqs': 256, 'max_num_batched_tokens': 2048}

---

`60-63`: **Add language to fenced code block (markdownlint MD040).**

```diff
-```
+```bash
 export DYN_LOG="info,dynamo_runtime::system_status_server:trace"
 export DYN_LOGGING_JSONL="1"

---

`77-80`: **Fix typo and clarify condition.**

“greate” → “greater”; make the condition explicit.

```diff
-When `DYN_LOGGING_JSONL` with `DYN_LOG` set to greate than or equal to
-`info` level trace information is added to all spans along with
-`SPAN_CREATED` and `SPAN_CLOSED` events.
+When `DYN_LOGGING_JSONL=1` and `DYN_LOG` is set to `info` or higher, the logger
+emits span metadata, including `SPAN_CREATED` and `SPAN_CLOSED` events.

83-86: Add language to fenced code block (markdownlint MD040).

-```
+```bash
 curl -d '{"model": "Qwen/Qwen3-0.6B", "max_completion_tokens": 2049, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' http://localhost:8080/v1/chat/completions

---

`89-103`: **Add language to fenced code block; these are JSON logs.**

```diff
-```
+```json
 # Span Created in HTTP Frontend

 {"time":"2025-09-02T16:38:06.656503Z","level":"INFO","file":"/workspace/lib/runtime/src/logging.rs","line":248,"target":"dynamo_runtime::logging","message":"SPAN_CREATED","method":"POST","span_id":"6959a1b2d1ee41a5","span_name":"http-request","trace_id":"425ef761ca5b44c795b4c912f1d84b39","uri":"/v1/chat/completions","version":"HTTP/1.1"}
 ...

---

`105-109`: **Add language to fenced code block (markdownlint MD040).**

```diff
-```
+```bash
 curl -d '{"model": "Qwen/Qwen3-0.6B", "max_completion_tokens": 2049, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' -H 'x-request-id: 8372eac7-5f43-4d76-beca-0a94cfb311d0' http://localhost:8080/v1/chat/completions

---

`111-111`: **Resolve duplicate heading (markdownlint MD024).**

Rename to differentiate from the earlier “Example Logs”.

```diff
-### Example Logs
+### Example Logs (with x-request-id)

113-127: Add language to fenced code block; these are JSON logs.

-```
+```json
 # Span Created in HTTP Frontend
 ...

---

`29-35`: **Consider adding a minimal TOML example since a config path is documented.**

Helps users bootstrap a file quickly.

```diff
 | `DYN_LOGGING_CONFIG_PATH`          | Path to custom TOML logging configuration file            | `DYN_LOGGING_CONFIG_PATH=/path/to/config.toml`|
 
+### Example TOML configuration
+
+```toml
+[log]
+level = "info"
+use_local_tz = false
+jsonl = true
+
+[targets]
+# per-target overrides
+"dynamo_runtime::system_status_server" = "trace"
+```
docs/guides/health_check.md (10)

22-25: Fix grammar.

-Dynamo provides health check and liveness HTTP endpoints for each component which
-can be used configure startup, liveness and readiness probes in
-orchestration frameworks such as Kubernetes.
+Dynamo exposes health and liveness HTTP endpoints for each component, which
+can be used to configure startup, liveness, and readiness probes in
+orchestration frameworks such as Kubernetes.

33-35: Add language to fenced code block (markdownlint MD040).

-```
+```bash
 curl -s localhost:8080/live -q |  jq

---

`39-44`: **Add language to fenced code block (markdownlint MD040).**

```diff
-```
+```json
 {
   "message": "Service is live",
   "status": "live"
 }

---

`59-61`: **Add language to fenced code block (markdownlint MD040).**

```diff
-```
+```bash
 curl -s localhost:8080/health -q |  jq

---

`66-72`: **Add language to fenced code block (markdownlint MD040).**

```diff
-```
+```json
 {
   "instances": [],
   "message": "No endpoints available",
   "status": "unhealthy"
 }

---

`77-112`: **Add language to fenced code block (markdownlint MD040).**

```diff
-```
+```json
 {
   "endpoints": [
     "dyn://dynamo.backend.generate"
   ],
   "instances": [
     {
       "component": "backend",
       "endpoint": "clear_kv_blocks",
       "instance_id": 7587888160958628000,
       "namespace": "dynamo",
       "transport": {
         "nats_tcp": "dynamo_backend.clear_kv_blocks-694d98147d54be25"
       }
     },
     ...
   ],
   "status": "healthy"
 }

---

`116-121`: **Polish wording.**

```diff
-Health checks for components other than the frontend are enabled
-selectively based on environment variables. If a health check for a
-component is enabled the starting status can be set along with the set
-of endpoints that are required to be served before the component is
-declared `ready`.
+Health checks for non-frontend components are enabled selectively via environment variables.
+When enabled, you can set the initial status and the set of endpoints that must be served
+before the component is declared `ready`.

141-146: Add language to fenced code block (markdownlint MD040).

-```
+```bash
 export DYN_SYSTEM_ENABLED="true"
 export DYN_SYSTEM_STARTING_HEALTH_STATUS="notready"
 export DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS="[\"generate\"]"
 export DYN_SYSTEM_PORT=9090

---

`158-168`: **Add language to fenced code block (markdownlint MD040).**

```diff
-```
+```json
 {
   "endpoints": {
     "generate": "notready"
   },
   "status": "notready",
   "uptime": {
     "nanos": 775381996,
     "secs": 2
   }
 }

---

`173-185`: **Add language to fenced code block (markdownlint MD040).**

```diff
-```
+```json
 {
   "endpoints": {
     "clear_kv_blocks": "ready",
     "generate": "ready",
     "load_metrics": "ready"
   },
   "status": "ready",
   "uptime": {
     "nanos": 435707697,
     "secs": 55
   }
 }

</blockquote></details>

</blockquote></details>

<details>
<summary>📜 Review details</summary>

**Configuration used**: Path: .coderabbit.yaml

**Review profile**: CHILL

**Plan**: Pro

**💡 Knowledge Base configuration:**

- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 5bbbeae3596eb6b0babe66be352497dd8cd88cf8 and 89b3f1af439d328512c3c98ceacf91f7ed671e71.

</details>

<details>
<summary>📒 Files selected for processing (2)</summary>

* `docs/guides/health_check.md` (1 hunks)
* `docs/guides/logging.md` (1 hunks)

</details>

<details>
<summary>🧰 Additional context used</summary>

<details>
<summary>🪛 LanguageTool</summary>

<details>
<summary>docs/guides/health_check.md</summary>

[grammar] ~22-~22: There might be a mistake here.
Context: ... HTTP endpoints for each component which can be used configure startup, liveness ...

(QB_NEW_EN)

---

[grammar] ~23-~23: There might be a mistake here.
Context: ...tartup, liveness and readiness probes in orchestration frameworks such as Kuberne...

(QB_NEW_EN)

---

[grammar] ~28-~28: There might be a mistake here.
Context: ...nt reports a status of `live` as long as the service is running.  #### Example Re...

(QB_NEW_EN)

---

[grammar] ~49-~49: There might be a mistake here.
Context: ...During initial startup the frontend will report `unhealthy` until workers have be...

(QB_NEW_EN)

---

[grammar] ~50-~50: Ensure spelling is correct
Context: ...until workers have been initialized and registred with the frontend. Once workers have bee...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

---

[grammar] ~51-~51: There might be a mistake here.
Context: ...orkers have been registered the `health` endpoint additionally will list registre...

(QB_NEW_EN)

---

[grammar] ~52-~52: Ensure spelling is correct
Context: ...health` endpoint additionally will list registred endpoints and instances.    #### Exampl...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

---

[grammar] ~117-~117: There might be a mistake here.
Context: ...nment variables. If a health check for a component is enabled the starting status...

(QB_NEW_EN)

---

[grammar] ~118-~118: There might be a mistake here.
Context: ...ing status can be set along with the set of endpoints that are required to be ser...

(QB_NEW_EN)

---

[grammar] ~119-~119: There might be a mistake here.
Context: ...red to be served before the component is declared `ready`.  Once all endpoints de...

(QB_NEW_EN)

---

[grammar] ~122-~122: There might be a mistake here.
Context: ... `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS` are served the component transitions to ...

(QB_NEW_EN)

---

[grammar] ~123-~123: There might be a mistake here.
Context: ...transitions to a `ready` state until the component is shutdown.  > **Note**: Both...

(QB_NEW_EN)

---

[grammar] ~132-~132: There might be a mistake here.
Context: ...                                      | `true`, `false`                           | | `DYN_SYSTEM_PORT`        | Specifies th...

(QB_NEW_EN)

---

[grammar] ~133-~133: There might be a mistake here.
Context: ...090`                                   | | `DYN_SYSTEM_STARTING_HEALTH_STATUS`   ...

(QB_NEW_EN)

---

[grammar] ~134-~134: There might be a mistake here.
Context: ...tem (ready/not ready).                | `ready`, `notready`      | | `DYN_SYSTEM_HEALTH_PATH`               ...

(QB_NEW_EN)

---

[grammar] ~135-~135: There might be a mistake here.
Context: ...          | `/custom/health`           | | `DYN_SYSTEM_LIVE_PATH`                ...

(QB_NEW_EN)

---

[grammar] ~136-~136: There might be a mistake here.
Context: ...           | `/custom/live`            | | `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS...

(QB_NEW_EN)

---

[grammar] ~154-~154: There might be a mistake here.
Context: ...alth -q |  jq ```  #### Example Response Before endpoints are being served:  ``` ...

(QB_NEW_EN)

</details>
<details>
<summary>docs/guides/logging.md</summary>

[grammar] ~22-~22: There might be a mistake here.
Context: ...ging in both text as well as JSONL. When JSONL is enabled logs additionally conta...

(QB_NEW_EN)

---

[grammar] ~23-~23: There might be a mistake here.
Context: ...ionally contain `span` creation and exit events as well as support for `trace_id`...

(QB_NEW_EN)

---

[grammar] ~24-~24: There might be a mistake here.
Context: ... for `trace_id` and `span_id` fields for distributed tracing.  ## Environment Var...

(QB_NEW_EN)

---

[grammar] ~77-~77: Ensure spelling is correct
Context: ...YN_LOGGING_JSONL` with `DYN_LOG` set to greate than or equal to `info` level trace inf...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

---

[grammar] ~77-~77: There might be a mistake here.
Context: ...`DYN_LOG` set to greate than or equal to `info` level trace information is added ...

(QB_NEW_EN)

---

[grammar] ~78-~78: There might be a mistake here.
Context: ...rmation is added to all spans along with `SPAN_CREATED` and `SPAN_CLOSED` events....

(QB_NEW_EN)

</details>

</details>
<details>
<summary>🪛 markdownlint-cli2 (0.17.2)</summary>

<details>
<summary>docs/guides/health_check.md</summary>

41-41: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

48-48: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

60-60: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

83-83: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

89-89: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

107-107: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

111-111: Multiple headings with the same content

(MD024, no-duplicate-heading)

---

113-113: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>
<details>
<summary>docs/guides/logging.md</summary>

41-41: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

48-48: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

60-60: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

83-83: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

89-89: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

107-107: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

111-111: Multiple headings with the same content

(MD024, no-duplicate-heading)

---

113-113: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>

</details>

<details>
<summary>⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)</summary>

* GitHub Check: Build and Test - dynamo

</details>

<details>
<summary>🔇 Additional comments (3)</summary><blockquote>

<details>
<summary>docs/guides/logging.md (1)</summary><blockquote>

`131-134`: **All cross-doc links verified**

All linked targets in `docs/guides/logging.md` resolve correctly and no broken paths were found.

</blockquote></details>
<details>
<summary>docs/guides/health_check.md (2)</summary><blockquote>

`126-126`: **Confirm behavior: /live and /ready returning identical payloads.**

If intentional, call it out earlier; if not, clarify differences to avoid misconfiguring probes.

Would you like me to add Kubernetes probe examples once confirmed?

---

`189-191`: **Cross-doc links resolve.** All referenced files exist at their respective paths.

</blockquote></details>

</blockquote></details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

nnshah1 and others added 2 commits September 2, 2025 10:29
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Neelay Shah <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Neelay Shah <[email protected]>
Signed-off-by: Neelay Shah <[email protected]>
Signed-off-by: Neelay Shah <[email protected]>
Signed-off-by: Neelay Shah <[email protected]>
Signed-off-by: Neelay Shah <[email protected]>
Signed-off-by: Neelay Shah <[email protected]>
Signed-off-by: Neelay Shah <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
Signed-off-by: Neelay Shah <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
Signed-off-by: Neelay Shah <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
Signed-off-by: Neelay Shah <[email protected]>
Signed-off-by: Neelay Shah <[email protected]>
Signed-off-by: Neelay Shah <[email protected]>
@nnshah1 nnshah1 requested a review from a team as a code owner September 10, 2025 17:17
Signed-off-by: nnshah1 <[email protected]>
Signed-off-by: Neelay Shah <[email protected]>
@grahamking
Copy link
Contributor

For future me:

  • This makes healthy and ready basically equivalent. They mean the server is running.
  • A component shouldn't depend on another component for it's health, so we redefine healthy here as not "I can serve a request" (before) but "I am ready to handle HTTP connections".

@nnshah1
Copy link
Contributor Author

nnshah1 commented Sep 10, 2025

/ok to test

@copy-pr-bot
Copy link

copy-pr-bot bot commented Sep 10, 2025

/ok to test

@nnshah1, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@nnshah1
Copy link
Contributor Author

nnshah1 commented Sep 10, 2025

/ok to test 9715dab

@nnshah1
Copy link
Contributor Author

nnshah1 commented Sep 10, 2025

/ok to test 9715dab

@nnshah1 nnshah1 merged commit 5d729c1 into main Sep 10, 2025
16 of 17 checks passed
@nnshah1 nnshah1 deleted the nnshah1/docs branch September 10, 2025 20:47
#### Example Request

```
curl -s localhost:8080/live -q | jq
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think someone (maybe @PeaBrane?) recently switched all the FE to be 8000. May want to check with Rudy.

The frontend liveness endpoint reports a status of `live` as long as
the service is running.

> **Note**: Frontend liveness doesn't depend on worker health or liveness only on the Frontend service itself.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"... doesn't depend on worker health or liveness only ..."

=>

"... doesn't depend on worker health or liveness,(comma) only ..."

nnshah1 added a commit that referenced this pull request Sep 10, 2025
ayushag-nv pushed a commit that referenced this pull request Sep 15, 2025
zhongdaor-nv pushed a commit that referenced this pull request Sep 15, 2025
saturley-hall pushed a commit that referenced this pull request Sep 16, 2025
Signed-off-by: nnshah1 <[email protected]>
Co-authored-by: mohammedabdulwahhab <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants