
Release test long_running_distributed_pytorch_pbt_failure.aws failed #50149

Closed · can-anyscale opened this issue Jan 31, 2025 · 3 comments

Labels: bug, ml, P0, ray-test-bot, release-test, stability, triage, weekly-release-blocker

can-anyscale (Collaborator) commented on Jan 31, 2025:

Release test long_running_distributed_pytorch_pbt_failure.aws failed. See https://buildkite.com/ray-project/release/builds/31485#0194b5e6-dba7-42cc-ae01-2182bad0a160 for more details.

Managed by OSS Test Policy

can-anyscale added the bug, ml, P0, ray-test-bot, release-test, stability, triage, and weekly-release-blocker labels on Jan 31, 2025

can-anyscale (Collaborator, Author) commented:

Blamed commit: 87c865e found by bisect job https://buildkite.com/ray-project/release-tests-bisect/builds/1980

justinvyu (Contributor) commented:

This test seems to have been marked as failing because of an unrelated issue, but the underlying release test has been hitting the following error for a while:

4-12-19 18:39:14,016 ERROR tune_controller.py:1331 -- Trial task failed for trial TorchTrainer_cf52a_00001
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 2773, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 920, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=1043985, ip=10.0.43.13, actor_id=037c75df9ec1cfba1543600602000000, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 57, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next() (pid=802103, ip=10.0.51.209, actor_id=e9e83558eb01bb1635696b3602000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x78f49f5be5e0>)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/examples/pytorch/tune_cifar_torch_pbt_example.py", line 90, in train_func
    checkpoint_dict = cpickle.load(fp)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/storage.py", line 381, in _load_from_bytes
    return torch.load(io.BytesIO(b))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/serialization.py", line 1040, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/serialization.py", line 1272, in _legacy_load
    result = unpickler.load()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/serialization.py", line 1205, in persistent_load
    obj = restore_location(obj, location)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/serialization.py", line 390, in default_restore_location
    result = fn(storage, location)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/serialization.py", line 265, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/serialization.py", line 256, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on CUDA device '
RuntimeError: Attempting to deserialize object on CUDA device 1 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device.

The checkpoint is saved from CUDA device 1 but loaded on a worker with only 1 visible CUDA device. This is because the cluster setup is 3 nodes with 2 GPUs each, and each trial uses 3 GPUs: two of the trial's workers share a node and see devices 0 and 1, while the third worker runs alone on a node and sees only device 0. A checkpoint written by the worker on device 1 therefore cannot be deserialized by the worker that sees a single device.

This should be fixed by moving the state dict to CPU first before saving.
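A rough sketch of that fix is below (illustrative only; the variable names and checkpoint layout are placeholders, not the actual code in tune_cifar_torch_pbt_example.py). Because the checkpoint in the traceback is deserialized through cloudpickle, which ends up calling torch.load internally without a map_location argument (see the torch/storage.py frames above), making the saved tensors CPU-resident is more robust than relying on map_location at load time; map_location only helps when torch.load is called directly.

    # Illustrative sketch of the proposed fix; not the release test's actual code.
    import torch
    import torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Linear(8, 2).to(device)  # on the failing worker this could live on cuda:1

    # Save: copy every tensor to CPU first so the checkpoint does not pin
    # itself to a particular CUDA device index.
    cpu_state_dict = {k: v.cpu() for k, v in model.state_dict().items()}
    torch.save({"model": cpu_state_dict}, "checkpoint.pt")

    # Load: when torch.load is called directly, map_location="cpu" also protects
    # a worker that sees fewer GPUs than the one that wrote the checkpoint.
    checkpoint = torch.load("checkpoint.pt", map_location="cpu")
    restored = nn.Linear(8, 2)
    restored.load_state_dict(checkpoint["model"])
    restored.to(device)  # then move onto whichever device is visible locally

With the state dict stored as CPU tensors, the cloudpickle round trip in the traceback no longer requires the writer's CUDA device index to exist on the restoring worker.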

aslonnie assigned justinvyu (after briefly self-assigning) on Feb 6, 2025