@xinyuangui2 xinyuangui2 commented Nov 6, 2025

Description

The timeout is caused by moto-server, which mocks S3. Remove the remote storage for now.
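
As a rough illustration of the change described above, the sketch below limits a test to local storage and leaves the mocked S3 backend out. The names `STORAGE_KINDS`, `make_storage_path`, and `run_for_all_backends` are hypothetical and are not taken from test_result.py; this is only a sketch of the idea, not the code in this pull request.

```python
import tempfile
from pathlib import Path

# Hypothetical list of storage backends a test runs against.
# A "mock_s3" entry (served by moto-server) is left out for now
# because the mock server caused timeouts.
STORAGE_KINDS = ["local"]


def make_storage_path(kind: str, base_dir: Path) -> str:
    """Return a storage path for the given backend (local only in this sketch)."""
    if kind != "local":
        raise ValueError(f"Unsupported storage kind in this sketch: {kind!r}")
    path = base_dir / "results"
    path.mkdir(parents=True, exist_ok=True)
    return str(path)


def run_for_all_backends() -> list:
    """Create one storage path per enabled backend inside a temp directory."""
    created = []
    with tempfile.TemporaryDirectory() as tmp:
        for kind in STORAGE_KINDS:
            storage_path = make_storage_path(kind, Path(tmp))
            created.append((kind, Path(storage_path).is_dir()))
    return created
```

Re-enabling the remote backend later would mean adding it back to the list and giving `make_storage_path` a branch for it.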

@xinyuangui2 xinyuangui2 requested a review from a team as a code owner November 6, 2025 17:21
@xinyuangui2 xinyuangui2 requested a review from justinvyu November 6, 2025 17:21
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request effectively introduces a skip_validation parameter to the StorageContext class, allowing the omission of unnecessary folder creation for read-only operations. The changes are well-implemented, and the usage in result.py and test_result.py correctly leverages this new functionality. The code is clean and directly addresses the stated problem, improving efficiency for read-only scenarios.
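
To make the behavior described in this review concrete, here is a minimal sketch of a `skip_validation` flag on a toy class. `ToyStorageContext`, `VALIDATION_MARKER`, and `list_files` are hypothetical names for illustration; this is not Ray's `StorageContext`, and the real constructor may do more or differ in detail.

```python
import tempfile
from pathlib import Path

VALIDATION_MARKER = ".validation_marker"  # hypothetical marker file name


class ToyStorageContext:
    """Toy stand-in for a storage context; not Ray's StorageContext."""

    def __init__(self, experiment_dir: str, skip_validation: bool = False):
        self.experiment_dir = Path(experiment_dir)
        self.skip_validation = skip_validation
        if not skip_validation:
            # Writers create the experiment directory and a marker file up front.
            self.experiment_dir.mkdir(parents=True, exist_ok=True)
            (self.experiment_dir / VALIDATION_MARKER).touch()

    def list_files(self) -> list:
        """Read-only operation: never creates anything on disk."""
        if not self.experiment_dir.exists():
            return []
        return sorted(p.name for p in self.experiment_dir.iterdir())


def demo() -> tuple:
    """Compare a read-only context with a validating one."""
    with tempfile.TemporaryDirectory() as tmp:
        reader = ToyStorageContext(f"{tmp}/read_only", skip_validation=True)
        reader_created_dir = reader.experiment_dir.exists()
        writer = ToyStorageContext(f"{tmp}/writable")
        return reader_created_dir, writer.list_files()
```

In this sketch the read-only context leaves the filesystem untouched, while the default context creates the directory and the marker.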

@cursor cursor bot left a comment

Bug: Validation Skip Causes Persist-Checkpoint Failure

The persist_current_checkpoint method unconditionally calls self._check_validation_file() (line 473), but when a StorageContext is created with skip_validation=True, the validation file is never created. This creates an inconsistency where a StorageContext initialized with skip_validation=True (intended for read-only operations) will fail if persist_current_checkpoint is called, with a confusing error message about the validation file not existing. The method should either check if validation was skipped before calling _check_validation_file(), or document that persist_current_checkpoint cannot be used on contexts created with skip_validation=True.

python/ray/train/v2/_internal/execution/storage.py#L450-L500

def persist_current_checkpoint(
    self, checkpoint: "Checkpoint", checkpoint_dir_name: str
) -> "Checkpoint":
    """Persists a given checkpoint to the current checkpoint path on the filesystem.

    This method copies the checkpoint files to the storage location.
    It's up to the user to delete the original checkpoint files if desired.
    For example, the original directory is typically a local temp directory.

    Args:
        checkpoint: The checkpoint to persist to
            (fs, experiment_fs_path / checkpoint_dir_name).

    Returns:
        Checkpoint: A Checkpoint pointing to the persisted checkpoint location.
    """
    # TODO(justinvyu): Fix this cyclical import.
    from ray.train import Checkpoint

    checkpoint_fs_path = self.build_checkpoint_path_from_name(checkpoint_dir_name)

    logger.debug(
        "Copying checkpoint files to storage path:\n"
        "({source_fs}, {source}) -> ({dest_fs}, {destination})".format(
            source=checkpoint.path,
            destination=checkpoint_fs_path,
            source_fs=checkpoint.filesystem,
            dest_fs=self.storage_filesystem,
        )
    )

    # Raise an error if the storage path is not accessible when
    # attempting to upload a checkpoint from a remote worker.
    # Ex: If storage_path is a local path, then a validation marker
    # will only exist on the head node but not the worker nodes.
    self._check_validation_file()

    self.storage_filesystem.create_dir(checkpoint_fs_path)
    _pyarrow_fs_copy_files(
        source=checkpoint.path,
        destination=checkpoint_fs_path,
        source_filesystem=checkpoint.filesystem,
        destination_filesystem=self.storage_filesystem,
    )

    persisted_checkpoint = Checkpoint(
        filesystem=self.storage_filesystem,
        path=checkpoint_fs_path,
    )
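
The first option raised in the comment above (checking whether validation was skipped before calling `_check_validation_file()`) could look roughly like the following. This is a sketch on a toy class using only the standard library: `GuardedStorageContext`, `MARKER`, and the error messages are hypothetical, and local directories stand in for the pyarrow filesystems used by the real method.

```python
import shutil
import tempfile
from pathlib import Path

MARKER = ".validation_marker"  # hypothetical marker file name


class GuardedStorageContext:
    """Toy stand-in for a storage context; not Ray's StorageContext."""

    def __init__(self, experiment_dir: str, skip_validation: bool = False):
        self.experiment_dir = Path(experiment_dir)
        self.skip_validation = skip_validation
        if not skip_validation:
            self.experiment_dir.mkdir(parents=True, exist_ok=True)
            (self.experiment_dir / MARKER).touch()

    def _check_validation_file(self) -> None:
        if not (self.experiment_dir / MARKER).exists():
            raise RuntimeError(f"Validation marker not found in {self.experiment_dir}")

    def persist_current_checkpoint(self, source_dir: str, checkpoint_dir_name: str) -> Path:
        # Fail early with an explicit message instead of a confusing
        # "validation file does not exist" error.
        if self.skip_validation:
            raise RuntimeError(
                "This context was created with skip_validation=True and is "
                "read-only; it cannot persist checkpoints."
            )
        self._check_validation_file()
        destination = self.experiment_dir / checkpoint_dir_name
        shutil.copytree(source_dir, destination, dirs_exist_ok=True)
        return destination


def demo() -> tuple:
    """Persist with a validating context, then try the same with a read-only one."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "local_ckpt"
        source.mkdir()
        (source / "model.bin").write_bytes(b"weights")

        writer = GuardedStorageContext(f"{tmp}/exp")
        persisted = writer.persist_current_checkpoint(str(source), "checkpoint_000")
        copied = (persisted / "model.bin").exists()

        reader = GuardedStorageContext(f"{tmp}/exp", skip_validation=True)
        try:
            reader.persist_current_checkpoint(str(source), "checkpoint_001")
            message = ""
        except RuntimeError as err:
            message = str(err)
        return copied, message
```

The other option mentioned in the comment, documenting that `persist_current_checkpoint` cannot be used on such contexts, requires no code change.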

@ray-gardener ray-gardener bot added the train (Ray Train Related Issue) label Nov 6, 2025
@xinyuangui2 xinyuangui2 changed the title from "[Train] Remove unnecessary folder creation in StorageContext" to "[Train] Try to fix the timeout for test_result" Nov 6, 2025
@matthewdeng matthewdeng enabled auto-merge (squash) November 6, 2025 23:40
@github-actions github-actions bot added the go (add ONLY when ready to merge, run all tests) label Nov 6, 2025
@matthewdeng matthewdeng merged commit 00803d9 into ray-project:master Nov 7, 2025
8 checks passed
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
The timeout is due to `moto-server` which mocks the s3. Remove the
remote storage for now.

---------

Signed-off-by: xgui <[email protected]>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Signed-off-by: xgui <[email protected]>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: xgui <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
Signed-off-by: xgui <[email protected]>
Signed-off-by: YK <[email protected]>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
Signed-off-by: xgui <[email protected]>
Labels

go (add ONLY when ready to merge, run all tests)
train (Ray Train Related Issue)

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

2 participants