chore: Add EKS docs and tweak backend READMEs #2157

athreesh · 2025-07-28T22:17:21Z

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

Summary by CodeRabbit

Documentation
- Major restructuring and expansion of backend deployment guides for TensorRT-LLM and vLLM, including clearer quick start instructions, feature support matrices, and improved navigation.
- Enhanced architecture documentation for distributed runtime with updated examples and clarified component roles.
- Added comprehensive, step-by-step guides for deploying Dynamo Cloud and vLLM on Amazon EKS with EFS integration, including setup, configuration, and validation instructions.

copy-pr-bot · 2025-07-28T22:17:25Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2025-07-28T22:21:26Z

Walkthrough

This update introduces significant improvements and expansions to documentation across several components. The TensorRT-LLM and vLLM backend READMEs are extensively restructured for clarity, modularity, and completeness, with new feature support matrices and deployment instructions. Three new EKS deployment guides are added, detailing cluster setup, Dynamo Cloud installation, and vLLM deployment. The distributed runtime architecture documentation is modernized and clarified.

Changes

Cohort / File(s)	Change Summary
TensorRT-LLM Backend Documentation `components/backends/trtllm/README.md`	Major restructuring and expansion: simplified title, detailed Table of Contents, new Feature Support Matrix, rewritten Quick Start, clearer deployment examples (single node, advanced, disaggregation, KV cache), improved architecture diagrams, and modular organization.
vLLM Backend Documentation `components/backends/vllm/README.md`	Extensive revision: new Table of Contents, Feature Support Matrix, improved guidance on releases, expanded Quick Start, explicit deployment examples, updated Kubernetes instructions referencing public images, and removal of outdated content.
Distributed Runtime Architecture `docs/architecture/distributed_runtime.md`	Updated and clarified the description of Dynamo's DistributedRuntime, modernized deployment examples, generalized worker/component explanations, updated instantiation details, and removed outdated or overly specific content.
EKS Deployment Guides `examples/deployments/EKS/Create_EKS_EFS.md`, `examples/deployments/EKS/Deploy_Dynamo_Cloud.md`, `examples/deployments/EKS/Deploy_VLLM_example.md`	Added three new markdown guides: (1) EKS cluster creation with EFS integration, (2) Dynamo Cloud installation on EKS from source (including Docker/ECR/Helm steps), and (3) step-by-step vLLM backend deployment and testing on EKS.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Documentation
    participant Kubernetes Cluster
    participant Dynamo Cloud
    participant vLLM/TRTLLM Backend

    User->>Documentation: Follow EKS setup guide
    User->>Kubernetes Cluster: Create EKS cluster with EFS
    User->>Documentation: Follow Dynamo Cloud install guide
    User->>Kubernetes Cluster: Deploy Dynamo Cloud components
    User->>Documentation: Follow vLLM/TRTLLM deployment guide
    User->>Kubernetes Cluster: Deploy vLLM/TRTLLM backend
    User->>Dynamo Cloud: Send inference request
    Dynamo Cloud->>vLLM/TRTLLM Backend: Route request
    vLLM/TRTLLM Backend-->>Dynamo Cloud: Return result
    Dynamo Cloud-->>User: Return inference response

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

docs: Reference Deepseek R1 configs in TRTLLM README #1414: Adds Deepseek R1 references to the TRTLLM README, directly related to the expanded deployment documentation for Deepseek R1 in this PR.
refactor: Move TRTLLM example to the component/backends #1976: Relocates TRTLLM example files and updates documentation paths, complementing this PR’s content overhaul of the TRTLLM backend README.
docs: Add multi-node TRTLLM worker example (Deepseek R1) #1511: Adds a multi-node Deepseek R1 TRTLLM example, which aligns with this PR’s expanded and clarified multi-node deployment documentation for the same backend.

Poem

🐇
In docs and guides, we hop anew,
With clusters, clouds, and YAML too!
From vLLM to TRTLLM,
Clearer paths for every gem.
EKS and EFS, all set to go—
Deploy with ease, let knowledge flow!
🥕

Note

⚡️ Unit Test Generation is now available in beta!

Learn more here, or try it out under "Finishing Touches" below.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 1

🔭 Outside diff range comments (1)

components/backends/trtllm/README.md (1)
201-204: Typo in link path

kv-cache-tranfer.md → kv-cache-transfer.md
- see the [KV cache transfer guide](./kv-cache-tranfer.md).
+ see the [KV cache transfer guide](./kv-cache-transfer.md).

🧹 Nitpick comments (16)

examples/deployments/EKS/Deploy_VLLM_example.md (2)
28-41: Second fenced block is missing its language identifier.

Same MD040 issue as above – add bash (or shell) so the CI linter passes.
-```
+```bash
 kubectl port-forward deployment/vllm-agg-router-frontend 8080:8000 -n dynamo-cloud
 curl localhost:8080/v1/chat/completions \
 ...
   }'
-```
+```
6-8: Make the “edit YAML” step reproducible & non-interactive.

vim in scripted docs breaks copy-paste automation. Suggest showing a one-liner or a yq/sed example instead, e.g.:
# yq eval '.metadata.namespace = "dynamo-cloud" | .spec.template.spec.containers[0].image = env(IMAGE)' -i agg_router.yaml
This keeps the guide copy-paste friendly.
examples/deployments/EKS/Deploy_Dynamo_Cloud.md (3)
5-5: Typo: “repositoriy”.
-Create 1 ECR repositoriy
+Create an ECR repository
8-10: Add language tags to fenced blocks (MD040).

Applies to every block in this file; adding bash once per section will clear the linter noise.

69-77: Secret creation duplicated; will error on second run.

kubectl create secret ... appears both before and after dependency build. Either drop the second block or switch to kubectl apply.
components/backends/vllm/README.md (1)
55-56: Grammar nit – duplicate article.
-Below we provide a guide that lets you run all of our the common deployment patterns on a single node. 
+Below we provide a guide that lets you run all of the common deployment patterns on a single node.
docs/architecture/distributed_runtime.md (3)
29-31: Minor grammar: plural agreement.
-...in practice, each dynamo components typically are deployed...
+...in practice, each Dynamo component is typically deployed...
58-60: Spelling: “Componenet”.
-... `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that the endpoints of different workers of the same type (i.e., two `PrefillWorker`s in one deployment) share the same `Namespace`, `Componenet`, and `Endpoint` name. 
+... share the same `Namespace`, `Component`, and `Endpoint` name.
78-79: Dangling back-tick at end of sentence.
-... `/components/backends`. ` for full implementation details.
+... `/components/backends` for full implementation details.
examples/deployments/EKS/Create_EKS_EFS.md (3)
7-12: Add language identifiers to fenced blocks (MD040).

Same issue repeats throughout the file; apply bash (or yaml where appropriate) to silence the linter.

112-112: Style: “inside of” → “inside”.
-...accessed by processes running inside of said pods...
+...accessed by processes running inside said pods...
125-125: Concise wording: “are able to” → “can”.
-...whether your nodes are able to access the created EFS file system.
+...whether your nodes can access the created EFS file system.
components/backends/trtllm/README.md (4)
36-41: Broken anchor in Table of Contents

Single Node Examples points to #single-node-deployments, but the section header renders to #single-node-examples.
This breaks intra-doc navigation.
- - [Single Node Examples](#single-node-deployments)
+ - [Single Node Examples](#single-node-examples)
36-45: List indentation violates MD007

Markdown-lint flags the bullets in the TOC. Add two spaces after the hyphen (or disable the rule locally) for consistent list-indentation.

67-70: Minor grammar tweak

all of our the common → all of the common.
-Below we provide a guide that lets you run all of our the common deployment patterns on a single node.
+Below we provide a guide that lets you run all of the common deployment patterns on a single node.
73-77: Command assumes repo root

docker compose -f deploy/docker-compose.yml … only works when run from the repository root, but the README sits three levels deep. Add an explicit cd $DYNAMO_HOME (or a relative path) to avoid copy-paste failures.
- docker compose -f deploy/docker-compose.yml up -d
+ # Run from the repository root
+ docker compose -f deploy/docker-compose.yml up -d

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cfc6178 and ceb1564.

📒 Files selected for processing (6)

components/backends/trtllm/README.md (8 hunks)
components/backends/vllm/README.md (3 hunks)
docs/architecture/distributed_runtime.md (3 hunks)
examples/deployments/EKS/Create_EKS_EFS.md (1 hunks)
examples/deployments/EKS/Deploy_Dynamo_Cloud.md (1 hunks)
examples/deployments/EKS/Deploy_VLLM_example.md (1 hunks)

🧰 Additional context used

🧠 Learnings (3)

examples/deployments/EKS/Deploy_VLLM_example.md (1)

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

components/backends/trtllm/README.md (4)

Learnt from: GuanLuo
PR: #1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.

Learnt from: fsaady
PR: #1730
File: examples/sglang/slurm_jobs/job_script_template.j2:59-59
Timestamp: 2025-07-02T13:20:28.800Z
Learning: In the SLURM job script template at examples/sglang/slurm_jobs/job_script_template.j2, the --total_nodes parameter represents the total nodes per worker type (prefill or decode), not the total nodes in the entire cluster. Each worker type needs to know its own group size for distributed coordination.

Learnt from: PeaBrane
PR: #1409
File: examples/router_standalone/worker.py:171-186
Timestamp: 2025-06-08T08:30:45.126Z
Learning: Example code in the examples/ directory may intentionally use hard-coded values or simplified implementations that wouldn't be appropriate for production code, but are acceptable for demonstration and testing purposes.

Learnt from: nnshah1
PR: #2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.

docs/architecture/distributed_runtime.md (4)

Learnt from: nnshah1
PR: #1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The @dynamo_worker() decorator in the dynamo codebase returns a wrapper that automatically injects the runtime parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature async def get_metrics(runtime, log_dir) decorated with @dynamo_worker() can be called as get_metrics(log_dir) because the decorator wrapper injects the runtime parameter.

Learnt from: fsaady
PR: #1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.

Learnt from: GuanLuo
PR: #1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.

Learnt from: PeaBrane
PR: #1409
File: examples/router_standalone/worker.py:171-186
Timestamp: 2025-06-08T08:30:45.126Z
Learning: Example code in the examples/ directory may intentionally use hard-coded values or simplified implementations that wouldn't be appropriate for production code, but are acceptable for demonstration and testing purposes.

🪛 markdownlint-cli2 (0.17.2)

components/backends/trtllm/README.md

49-49: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

50-50: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

examples/deployments/EKS/Deploy_Dynamo_Cloud.md

7-7: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

14-14: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

25-25: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

35-35: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

43-43: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

58-58: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

66-66: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

88-88: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

examples/deployments/EKS/Create_EKS_EFS.md

5-5: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

13-13: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

27-27: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

46-46: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/2157/merge) by athreesh.

docs/architecture/distributed_runtime.md

[error] 31-31: pre-commit hook 'trailing-whitespace' failed. Trailing whitespace was fixed in this file.

components/backends/vllm/README.md

[error] 52-52: pre-commit hook 'trailing-whitespace' failed. Trailing whitespace was fixed in this file.

🪛 LanguageTool

examples/deployments/EKS/Create_EKS_EFS.md

[style] ~112-~112: This phrase is redundant. Consider using “inside”.
Context: ... and then accessed by processes running inside of said pods as if they were part of the p...

(OUTSIDE_OF)

[style] ~125-~125: As a shorter alternative for ‘able to’, consider using “can”.
Context: ...ly. This will affect whether your nodes are able to access the created EFS file system. ##...

(BE_ABLE_TO)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: pre-merge-rust (lib/bindings/python)
GitHub Check: pre-merge-rust (lib/runtime/examples)
GitHub Check: pre-merge-rust (.)
GitHub Check: Build and Test - vllm

🔇 Additional comments (4)

examples/deployments/EKS/Deploy_VLLM_example.md (1)

5-9: Add a language identifier to satisfy markdown‐lint (MD040).

Without the specifier, the pre-commit hook will keep flagging the file.
[ suggest_nitpick ]
-```
+```bash
 cd dynamo/components/backends/vllm/deploy
 vim agg_router.yaml    #under metadata add namespace: dynamo-cloud and change image to your built base image
 kubectl apply -f agg_router.yaml
-```
+```
examples/deployments/EKS/Deploy_Dynamo_Cloud.md (1)

59-64: Namespace inconsistency may lead to orphaned CRDs.

You install CRDs into default but everything else lives in ${NAMESPACE} (dynamo-cloud). Unless there is a deliberate reason, target the same namespace to avoid confusion.
-  --namespace default \
+  --namespace ${NAMESPACE} \
components/backends/trtllm/README.md (2)

61-66: Inconsistent “GB200 Support” row

The ✅ icon claims support while the Notes column says Not supported. Please verify and make the row unambiguous.

205-210: Potentially broken client link

../llm/README.md#client resolves to components/backends/llm/README.md, which likely does not exist.
Confirm the correct target (maybe ../../llm/README.md or the repo-root README) and update the path.

examples/deployments/EKS/Deploy_Dynamo_Cloud.md

coderabbitai · 2025-07-28T22:24:48Z

Walkthrough

This update introduces substantial improvements to documentation across multiple components. It restructures and clarifies the README files for the TensorRT-LLM and vLLM backends, updates the distributed runtime architecture documentation, and adds three new EKS deployment guides for Dynamo Cloud and vLLM. No code or configuration logic is changed; all modifications are limited to documentation.

Changes

Cohort / File(s)	Change Summary
TensorRT-LLM Backend Documentation `components/backends/trtllm/README.md`	Major restructuring and expansion of the README: simplified title, new Table of Contents, detailed feature support matrix, clarified quick start, reorganized deployment and advanced examples, improved formatting, and removal of outdated sections. No changes to code or configs.
vLLM Backend Documentation `components/backends/vllm/README.md`	Extensive revision of the README: added feature support matrix, improved structure and clarity, updated deployment and Kubernetes instructions, explicit usage of public container images, and reorganized example commands. No code changes.
Distributed Runtime Architecture `docs/architecture/distributed_runtime.md`	Revised and clarified architectural documentation: updated deployment examples, clarified component roles and instantiation, improved explanation of namespaces and endpoints, and removed outdated build notes. No code changes.
EKS Deployment Guides (New Docs) `examples/deployments/EKS/Create_EKS_EFS.md`, `examples/deployments/EKS/Deploy_Dynamo_Cloud.md`, `examples/deployments/EKS/Deploy_VLLM_example.md`	Added three new markdown files providing step-by-step guides for setting up EKS with EFS, deploying Dynamo Cloud, and deploying vLLM on EKS. Includes prerequisites, configuration, deployment, validation, and testing instructions. No code or config changes.

Sequence Diagram(s)

No sequence diagrams are generated, as all changes are limited to documentation content and structure.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

docs: Reference Deepseek R1 configs in TRTLLM README #1414: Also modifies the TRTLLM README, adding references to Deepseek R1 configs; related by file but not overlapping in content.
refactor: Move TRTLLM example to the component/backends #1976: Refactors directory paths and URLs in the TRTLLM documentation and scripts, overlapping in documentation clarity and file location.
docs: MTP + TensorRT LLM + DS R1 disaggregated example #1473: Introduces new multi-node disaggregated examples for TRTLLM; related contextually as both PRs improve TRTLLM documentation.

Poem

In burrows deep, the docs were old,
But now they're clear, concise, and bold!
With tables, guides, and EKS in view,
Deployments hop along—brand new.
A rabbit’s cheer for docs so neat—
These changes make our warren complete!
🐇✨

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (20)

examples/deployments/EKS/Create_EKS_EFS.md (6)
7-12: Add apt-get update before installing unzip.

Without refreshing package indices, some CI images fail to locate the unzip package.
 sudo apt install unzip
+sudo apt-get update -y
+sudo apt-get install unzip -y
17-21: Use the latest stable kubectl instead of pinning to 1.30.0.

Hard-coding a minor version risks drift between control-plane and client. Prefer the upstream pattern:
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
52-54: Remind readers to save the YAML to eks_cluster_config.yaml.

The subsequent eksctl create cluster -f eks_cluster_config.yaml assumes this filename, but the guide never instructs users to write the file.

112-114: Minor wording & redundancy fixes.

“too large to loaded” ➜ “too large to be loaded”
“inside of said pods” ➜ “inside the pods”

125-126: Concise wording.

Replace “are able to access” with “can access”.

135-149: Consider adding reclaimPolicy and replacing hard-coded fileSystemId.
reclaimPolicy: Retain        # prevents accidental data loss
parameters:
  fileSystemId: <YOUR_FS_ID> # make it explicit this must be edited
examples/deployments/EKS/Deploy_VLLM_example.md (2)
6-8: Use a generic editor placeholder instead of forcing vim.
-vim agg_router.yaml    #under metadata add namespace: dynamo-cloud and change image to your built base image
+$EDITOR agg_router.yaml   # edit metadata.namespace and image fields
14-23: Clarify that pod names and ages will differ.

Real outputs vary per cluster; suggesting users rely on exact names can confuse copy-pastes.
examples/deployments/EKS/Deploy_Dynamo_Cloud.md (3)
5-10: Typo & pluralisation.

“Create 1 ECR repositoriy” ➜ “Create an ECR repository”.

26-29: Remove trailing slash from registry URL.

Some Docker versions treat the trailing / as part of the registry hostname and fail auth.
-docker login --username AWS --password-stdin <ECR_REGISTRY>/
+docker login --username AWS --password-stdin <ECR_REGISTRY>
67-77: Duplicate secret creation.

Namespace and docker-imagepullsecret are created twice (lines 44–49 and 71–76). Factor this into a single step to avoid “already exists” warnings.
components/backends/vllm/README.md (2)
14-20: Fetch tags before git describe to avoid detached-HEAD errors.
git fetch --tags
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
55-56: Grammar: remove extra “the”.

“run all of our the common deployment patterns” ➜ “run all of our common deployment patterns”.
components/backends/trtllm/README.md (2)

24-26: Capitalize product name.

“dynamo” ➜ “Dynamo”.

69-70: Grammar: remove extra “the”.

Same phrasing issue as the vLLM README.
docs/architecture/distributed_runtime.md (5)
22-22: Capitalize language names & use “e.g.” instead of “i.e.”

“Rust” and “Python” are proper nouns and should be capitalized. The parenthetical gives an example rather than an exhaustive list, so “e.g.” is the appropriate Latin abbreviation.
-Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in rust (`/lib/runtime`) and exposed to other programming languages via bindings (i.e., python bindings can be found in `/lib/bindings/python`). `DistributedRuntime` follows a hierarchical structure:
+Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in Rust (`/lib/runtime`) and exposed to other programming languages via bindings (e.g., Python bindings can be found in `/lib/bindings/python`). `DistributedRuntime` follows a hierarchical structure:
33-35: Minor grammar fix – subject/verb agreement
-Then, it route the request to the `Worker`.
+Then, it routes the request to the `Worker`.
38-38: Namespace example contradicts preceding explanation

The paragraph says all components share a namespace such as vllm-v1-agg or sglang-agg, yet the example code uses runtime.namespace("dynamo"). This is confusing for readers and may lead them to create inconsistent namespaces.
-… programmatically (e.g., `runtime.namespace("dynamo").component("worker")`), …
+… programmatically (e.g., `runtime.namespace("vllm-v1-agg").component("worker")`), …
78-79: Remove stray back-tick and tighten wording

There is an extraneous back-tick at the end of the sentence, and “directories in … for full implementation details” reads awkwardly.
- - Python: We also provide complete examples of using `DistributedRuntime` for communication and Dynamo's LLM library for prompt templates and (de)tokenization to deploy Dynamo graphs. Please refer to the directories in `/components/backends`. `
+- Python: Complete examples that combine `DistributedRuntime` with Dynamo’s LLM utilities (prompt templates, (de)tokenization, etc.) are available in `/components/backends/`; see those directories for full implementation details.
59-59: Typo: “Componenet” → “Component”

Small spelling error that might hinder searchability in the rendered docs.
-  - etcd Registration: The endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that the endpoints of different workers of the same type (i.e., two `PrefillWorker`s in one deployment) share the same `Namespace`, `Componenet`, and `Endpoint` name. They are distinguished by their different primary `lease_id` of their `DistributedRuntime`.
+  - etcd Registration: The endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that the endpoints of different workers of the same type (i.e., two `PrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` name. They are distinguished by their different primary `lease_id` of their `DistributedRuntime`.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cfc6178 and ceb1564.

📒 Files selected for processing (6)

components/backends/trtllm/README.md (8 hunks)
components/backends/vllm/README.md (3 hunks)
docs/architecture/distributed_runtime.md (3 hunks)
examples/deployments/EKS/Create_EKS_EFS.md (1 hunks)
examples/deployments/EKS/Deploy_Dynamo_Cloud.md (1 hunks)
examples/deployments/EKS/Deploy_VLLM_example.md (1 hunks)

🧰 Additional context used

🧠 Learnings (4)

examples/deployments/EKS/Deploy_VLLM_example.md (1)

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

examples/deployments/EKS/Deploy_Dynamo_Cloud.md (1)

Learnt from: julienmancuso
PR: #2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:92-98
Timestamp: 2025-07-18T16:04:31.771Z
Learning: CRD schemas in files like deploy/cloud/helm/crds/templates/*.yaml are auto-generated from Kubernetes library upgrades and should not be manually modified as changes would be overwritten during regeneration.

docs/architecture/distributed_runtime.md (4)

Learnt from: nnshah1
PR: #1444
File: tests/fault_tolerance/utils/metrics.py:30-32
Timestamp: 2025-07-01T13:55:03.940Z
Learning: The @dynamo_worker() decorator in the dynamo codebase returns a wrapper that automatically injects the runtime parameter before calling the wrapped function. This means callers only need to provide the non-runtime parameters, while the decorator handles injecting the runtime argument automatically. For example, a function with signature async def get_metrics(runtime, log_dir) decorated with @dynamo_worker() can be called as get_metrics(log_dir) because the decorator wrapper injects the runtime parameter.

Learnt from: fsaady
PR: #1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.

Learnt from: GuanLuo
PR: #1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.

Learnt from: PeaBrane
PR: #1409
File: examples/router_standalone/worker.py:171-186
Timestamp: 2025-06-08T08:30:45.126Z
Learning: Example code in the examples/ directory may intentionally use hard-coded values or simplified implementations that wouldn't be appropriate for production code, but are acceptable for demonstration and testing purposes.

components/backends/trtllm/README.md (3)

Learnt from: GuanLuo
PR: #1371
File: examples/llm/benchmarks/vllm_multinode_setup.sh:18-25
Timestamp: 2025-06-05T01:46:15.509Z
Learning: In multi-node setups with head/worker architecture, the head node typically doesn't need environment variables pointing to its own services (like NATS_SERVER, ETCD_ENDPOINTS) because local processes can access them via localhost. Only worker nodes need these environment variables to connect to the head node's external IP address.

Learnt from: fsaady
PR: #1730
File: examples/sglang/slurm_jobs/job_script_template.j2:59-59
Timestamp: 2025-07-02T13:20:28.800Z
Learning: In the SLURM job script template at examples/sglang/slurm_jobs/job_script_template.j2, the --total_nodes parameter represents the total nodes per worker type (prefill or decode), not the total nodes in the entire cluster. Each worker type needs to know its own group size for distributed coordination.

Learnt from: PeaBrane
PR: #1409
File: examples/router_standalone/worker.py:171-186
Timestamp: 2025-06-08T08:30:45.126Z
Learning: Example code in the examples/ directory may intentionally use hard-coded values or simplified implementations that wouldn't be appropriate for production code, but are acceptable for demonstration and testing purposes.

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/2157/merge) by athreesh.

components/backends/vllm/README.md

[error] 52-52: pre-commit hook 'trailing-whitespace' failed. Trailing whitespace was fixed in this file.

docs/architecture/distributed_runtime.md

[error] 31-31: pre-commit hook 'trailing-whitespace' failed. Trailing whitespace was fixed in this file.

🪛 LanguageTool

examples/deployments/EKS/Create_EKS_EFS.md

[style] ~112-~112: This phrase is redundant. Consider using “inside”.
Context: ... and then accessed by processes running inside of said pods as if they were part of the p...

(OUTSIDE_OF)

[style] ~125-~125: As a shorter alternative for ‘able to’, consider using “can”.
Context: ...ly. This will affect whether your nodes are able to access the created EFS file system. ##...

(BE_ABLE_TO)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: pre-merge-rust (lib/bindings/python)
GitHub Check: pre-merge-rust (lib/runtime/examples)
GitHub Check: pre-merge-rust (.)

🔇 Additional comments (1)

examples/deployments/EKS/Create_EKS_EFS.md (1)

44-46: Warn about g6e.48xlarge & EFA availability across regions.

That SKU is still spot-market-only in most regions; cluster creation silently stalls if capacity is unavailable. Add a footnote or alternative (e.g., p5.48xlarge) to reduce first-time friction.

components/backends/vllm/README.md

examples/deployments/EKS/Deploy_Dynamo_Cloud.md

examples/deployments/EKS/Deploy_VLLM_example.md

components/backends/trtllm/README.md

components/backends/vllm/README.md

components/backends/trtllm/README.md

Co-authored-by: Ryan McCormick <[email protected]> Signed-off-by: Anish <[email protected]>

components/backends/trtllm/README.md

Co-authored-by: Ryan McCormick <[email protected]> Signed-off-by: Anish <[email protected]>

components/backends/trtllm/README.md

examples/deployments/EKS/Deploy_Dynamo_Cloud.md

rmccorm4

LGTM other than any unresolved comments

Co-authored-by: Ryan McCormick <[email protected]> Signed-off-by: Anish <[email protected]>

ishandhanani

Left some comments across the board

docs/architecture/distributed_runtime.md

examples/deployments/EKS/Deploy_Dynamo_Cloud.md

examples/deployments/EKS/Deploy_VLLM_example.md

docs/architecture/distributed_runtime.md

…dynamo into readmes-eks-additions

…stributed runtime docs

…anch up to date

addressed Ishan's comments, will add a Github issue to add SGLang/TRT-LLM YAMLs to EKS example

athreesh added 4 commits July 23, 2025 17:38

removed SDK mentions and slight tidying up

6eeda55

Merge branch 'main' of https://github.com/ai-dynamo/dynamo

581cc7f

added an EKS example

d3a66e9

Merge branch 'main' of https://github.com/ai-dynamo/dynamo

ceb1564

athreesh requested review from a team, nnshah1 and whoisj as code owners July 28, 2025 22:17

pull-request-size bot added the size/XL label Jul 28, 2025

coderabbitai bot reviewed Jul 28, 2025

View reviewed changes

examples/deployments/EKS/Deploy_Dynamo_Cloud.md Show resolved Hide resolved

slight typo with dynamo-run vs. new dynamo.frontend

c1ad0ef

coderabbitai bot reviewed Jul 28, 2025

View reviewed changes

components/backends/vllm/README.md Show resolved Hide resolved

examples/deployments/EKS/Deploy_Dynamo_Cloud.md Show resolved Hide resolved

examples/deployments/EKS/Deploy_VLLM_example.md Show resolved Hide resolved

athreesh added 4 commits July 28, 2025 15:28

tiny nitpick changes

bd5461a

readme updates

6c20510

vllm readme change

de0539b

fix: Remove trailing whitespace (pre-commit hook)

9e2f306

athreesh enabled auto-merge (squash) July 29, 2025 04:24