@justinvyu

Description

Updates the Vicuna Lightning DeepSpeed example to run with Ray Train V2.

Related issues

Types of change

  • Bug fix 🐛
  • New feature ✨
  • Enhancement 🚀
  • Code refactoring 🔧
  • Documentation update 📖
  • Chore 🧹
  • Style 🎨

Checklist

Does this PR introduce breaking changes?

  • Yes ⚠️
  • No

Testing:

  • Added/updated tests for my changes
  • Tested the changes manually
  • This PR is not tested ❌ (please explain why)

Code Quality:

  • Signed off every commit (git commit -s)
  • Ran pre-commit hooks (setup guide)

Documentation:

  • Updated documentation (if applicable) (contribution guide)
  • Added new APIs to doc/source/ (if applicable)

Additional context

Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Justin Yu <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR updates the cluster configuration for the Vicuna example to use 16 worker GPUs for training, which is a valid change. However, this introduces an inconsistency with the documentation in the accompanying Jupyter Notebook, which has not been updated. I've left a specific comment on the configuration file change. Please update the notebook documentation to match the new cluster setup to avoid user confusion.

Comment on lines 6 to 12
    instance_type: m5.4xlarge

  worker_node_types:
  - name: worker_node
    instance_type: g5.4xlarge
-   min_workers: 15
-   max_workers: 15
+   min_workers: 16
+   max_workers: 16

high

These changes correctly adjust the cluster to use 16 worker GPUs for training. However, this makes the documentation in the corresponding notebook (vicuna_13b_lightning_deepspeed_finetune.ipynb) outdated and misleading.

The notebook's 'Cluster Setting' section needs to be updated to reflect:

  • The head node is now m5.4xlarge (a CPU instance).
  • There are now 16 worker nodes.
  • The tip about using a GPU head node for inference is no longer accurate and should be revised, as inference will now run on a worker node.

Please update the notebook to ensure the example remains consistent and clear for users.
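
One way to keep the notebook text and the cluster config from drifting apart again is a small consistency check. The sketch below is illustrative only: `worker_gpu_count`, `WORKER_NODE_TYPES`, and `GPUS_PER_INSTANCE` are hypothetical names, the dictionary mirrors the worker fields shown in the diff above, and the one-GPU-per-`g5.4xlarge` figure is an assumption taken from the "16 worker GPUs" wording in this review.

```python
# Hypothetical mirror of the worker section of the compute config in the diff.
WORKER_NODE_TYPES = [
    {
        "name": "worker_node",
        "instance_type": "g5.4xlarge",
        "min_workers": 16,
        "max_workers": 16,
    }
]

# Assumption: one GPU per g5.4xlarge node, consistent with "16 worker GPUs".
GPUS_PER_INSTANCE = {"g5.4xlarge": 1}


def worker_gpu_count(worker_node_types, gpus_per_instance):
    """Return the total number of worker GPUs for a fixed-size cluster.

    Raises ValueError if a node group can autoscale (min != max), because
    the GPU count is then not a single well-defined number.
    """
    total = 0
    for node in worker_node_types:
        if node["min_workers"] != node["max_workers"]:
            raise ValueError(f"{node['name']} is not fixed-size")
        total += node["min_workers"] * gpus_per_instance[node["instance_type"]]
    return total


if __name__ == "__main__":
    print(worker_gpu_count(WORKER_NODE_TYPES, GPUS_PER_INSTANCE))
```

The returned value is the number that the notebook's 'Cluster Setting' section and the training worker count would both need to agree with.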


@JasonLi1909 JasonLi1909 left a comment

Looks good to me! I kicked off the release test so we should make sure it passes there too.

"name": "stderr",
"output_type": "stream",
"text": [
"2025-10-15 15:50:45,333\tINFO worker.py:1833 -- Connecting to existing Ray cluster at address: 10.0.171.127:6379...\n",

We should delete these output cells here and elsewhere so they don't show up in the docs.
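
For reference, clearing outputs can be done with the standard library alone, since an `.ipynb` file is JSON. This is a minimal sketch, not the approach used in this PR: `strip_outputs` and `strip_notebook_file` are hypothetical helper names, and the code assumes the nbformat 4 layout, where each code cell carries `outputs` and `execution_count` keys.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Clear outputs and execution counts from every code cell, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def strip_notebook_file(path: str) -> None:
    """Rewrite the notebook at `path` with all code-cell outputs removed."""
    with open(path, "r", encoding="utf-8") as f:
        notebook = json.load(f)
    strip_outputs(notebook)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(notebook, f, indent=1, ensure_ascii=False)
        f.write("\n")
```

Running `jupyter nbconvert --clear-output --inplace` on the notebook should give a similar result if nbconvert is available.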

Signed-off-by: Justin Yu <[email protected]>
@ray-gardener ray-gardener bot added the docs, train, and release-test labels Oct 16, 2025
Signed-off-by: Justin Yu <[email protected]>

Signed-off-by: Justin Yu <[email protected]>

Signed-off-by: Justin Yu <[email protected]>

@justinvyu justinvyu enabled auto-merge (squash) October 22, 2025 21:46
@github-actions github-actions bot added the go label (add ONLY when ready to merge, run all tests) Oct 22, 2025
@justinvyu justinvyu merged commit e7a79ba into ray-project:master Oct 22, 2025
8 checks passed
JasonLi1909 pushed a commit to JasonLi1909/ray that referenced this pull request Oct 23, 2025
Updates the vicuna lightning deepspeed example to run w/ Train V2.

---------

Signed-off-by: Justin Yu <[email protected]>
aslonnie pushed a commit that referenced this pull request Oct 23, 2025
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 27, 2025
Updates the vicuna lightning deepspeed example to run w/ Train V2.

---------

Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: xgui <[email protected]>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Updates the vicuna lightning deepspeed example to run w/ Train V2.

---------

Signed-off-by: Justin Yu <[email protected]>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Updates the vicuna lightning deepspeed example to run w/ Train V2.

---------

Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Updates the vicuna lightning deepspeed example to run w/ Train V2.

---------

Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>