-
Notifications
You must be signed in to change notification settings - Fork 7k
[Train] Try to fix the timeout for test_result #58432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Train] Try to fix the timeout for test_result #58432
Conversation
Signed-off-by: xgui <[email protected]>
Signed-off-by: xgui <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
The pull request effectively introduces a skip_validation parameter to the StorageContext class, allowing the omission of unnecessary folder creation for read-only operations. The changes are well-implemented, and the usage in result.py and test_result.py correctly leverages this new functionality. The code is clean and directly addresses the stated problem, improving efficiency for read-only scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Validation Skip Causes Persist-Checkpoint Failure
The persist_current_checkpoint method unconditionally calls self._check_validation_file() (line 473), but when a StorageContext is created with skip_validation=True, the validation file is never created. This creates an inconsistency where a StorageContext initialized with skip_validation=True (intended for read-only operations) will fail if persist_current_checkpoint is called, with a confusing error message about the validation file not existing. The method should either check if validation was skipped before calling _check_validation_file(), or document that persist_current_checkpoint cannot be used on contexts created with skip_validation=True.
python/ray/train/v2/_internal/execution/storage.py#L450-L500
ray/python/ray/train/v2/_internal/execution/storage.py
Lines 450 to 500 in 1fe182e
| def persist_current_checkpoint( | |
| self, checkpoint: "Checkpoint", checkpoint_dir_name: str | |
| ) -> "Checkpoint": | |
| """Persists a given checkpoint to the current checkpoint path on the filesystem. | |
| This method copies the checkpoint files to the storage location. | |
| It's up to the user to delete the original checkpoint files if desired. | |
| For example, the original directory is typically a local temp directory. | |
| Args: | |
| checkpoint: The checkpoint to persist to | |
| (fs, experiment_fs_path / checkpoint_dir_name). | |
| Returns: | |
| Checkpoint: A Checkpoint pointing to the persisted checkpoint location. | |
| """ | |
| # TODO(justinvyu): Fix this cyclical import. | |
| from ray.train import Checkpoint | |
| checkpoint_fs_path = self.build_checkpoint_path_from_name(checkpoint_dir_name) | |
| logger.debug( | |
| "Copying checkpoint files to storage path:\n" | |
| "({source_fs}, {source}) -> ({dest_fs}, {destination})".format( | |
| source=checkpoint.path, | |
| destination=checkpoint_fs_path, | |
| source_fs=checkpoint.filesystem, | |
| dest_fs=self.storage_filesystem, | |
| ) | |
| ) | |
| # Raise an error if the storage path is not accessible when | |
| # attempting to upload a checkpoint from a remote worker. | |
| # Ex: If storage_path is a local path, then a validation marker | |
| # will only exist on the head node but not the worker nodes. | |
| self._check_validation_file() | |
| self.storage_filesystem.create_dir(checkpoint_fs_path) | |
| _pyarrow_fs_copy_files( | |
| source=checkpoint.path, | |
| destination=checkpoint_fs_path, | |
| source_filesystem=checkpoint.filesystem, | |
| destination_filesystem=self.storage_filesystem, | |
| ) | |
| persisted_checkpoint = Checkpoint( | |
| filesystem=self.storage_filesystem, | |
| path=checkpoint_fs_path, | |
| ) |
Signed-off-by: xgui <[email protected]>
Signed-off-by: xgui <[email protected]>
Signed-off-by: xgui <[email protected]>
Signed-off-by: xgui <[email protected]>
The timeout is due to `moto-server` which mocks the s3. Remove the remote storage for now. --------- Signed-off-by: xgui <[email protected]>
The timeout is due to `moto-server` which mocks the s3. Remove the remote storage for now. --------- Signed-off-by: xgui <[email protected]>
The timeout is due to `moto-server` which mocks the s3. Remove the remote storage for now. --------- Signed-off-by: xgui <[email protected]> Signed-off-by: Aydin Abiar <[email protected]>
The timeout is due to `moto-server` which mocks the s3. Remove the remote storage for now. --------- Signed-off-by: xgui <[email protected]> Signed-off-by: YK <[email protected]>
The timeout is due to `moto-server` which mocks the s3. Remove the remote storage for now. --------- Signed-off-by: xgui <[email protected]>
Description
The timeout is due to
moto-serverwhich mocks the s3. Remove the remote storage for now.Related issues
Additional information