
Release test long_running_distributed_pytorch_pbt_failure.aws failed #50149

Closed · can-anyscale opened this issue Jan 31, 2025 · 3 comments

Labels: bug, ml, P0, ray-test-bot, release-test, stability, triage, weekly-release-blocker

can-anyscale (Collaborator) commented on Jan 31, 2025:

Release test long_running_distributed_pytorch_pbt_failure.aws failed. See https://buildkite.com/ray-project/release/builds/31485#0194b5e6-dba7-42cc-ae01-2182bad0a160 for more details.

Managed by OSS Test Policy

can-anyscale added the bug, ml, P0, ray-test-bot, release-test, stability, triage, and weekly-release-blocker labels on Jan 31, 2025

can-anyscale (Collaborator, Author) commented:

Blamed commit: 87c865e found by bisect job https://buildkite.com/ray-project/release-tests-bisect/builds/1980

justinvyu (Contributor) commented:

This test seems to have been marked as failing because of an unrelated issue, but the underlying release test has been hitting the following error for a while:

4-12-19 18:39:14,016 ERROR tune_controller.py:1331 -- Trial task failed for trial TorchTrainer_cf52a_00001
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 2773, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 920, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=1043985, ip=10.0.43.13, actor_id=037c75df9ec1cfba1543600602000000, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 57, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next() (pid=802103, ip=10.0.51.209, actor_id=e9e83558eb01bb1635696b3602000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x78f49f5be5e0>)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/examples/pytorch/tune_cifar_torch_pbt_example.py", line 90, in train_func
    checkpoint_dict = cpickle.load(fp)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/storage.py", line 381, in _load_from_bytes
    return torch.load(io.BytesIO(b))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/serialization.py", line 1040, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/serialization.py", line 1272, in _legacy_load
    result = unpickler.load()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/serialization.py", line 1205, in persistent_load
    obj = restore_location(obj, location)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/serialization.py", line 390, in default_restore_location
    result = fn(storage, location)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/serialization.py", line 265, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/serialization.py", line 256, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on CUDA device '
RuntimeError: Attempting to deserialize object on CUDA device 1 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device.

The checkpoint is saved from CUDA device 1 but loaded on a worker with only 1 visible CUDA device. This is because the cluster setup is 3 nodes with 2 GPUs each, and each trial uses 3 GPUs: two of the trial's workers share a node and see devices 0 and 1, while the third worker runs alone on a node and sees only device 0. A checkpoint written by the worker on device 1 therefore cannot be deserialized by the worker that sees a single device.

This should be fixed by moving the state dict to CPU first before saving.
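A rough sketch of that fix is below (illustrative only; the variable names and checkpoint layout are placeholders, not the actual code in tune_cifar_torch_pbt_example.py). Because the checkpoint in the traceback is deserialized through cloudpickle, which ends up calling torch.load internally without a map_location argument (see the torch/storage.py frames above), making the saved tensors CPU-resident is more robust than relying on map_location at load time; map_location only helps when torch.load is called directly.

    # Illustrative sketch of the proposed fix; not the release test's actual code.
    import torch
    import torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Linear(8, 2).to(device)  # on the failing worker this could live on cuda:1

    # Save: copy every tensor to CPU first so the checkpoint does not pin
    # itself to a particular CUDA device index.
    cpu_state_dict = {k: v.cpu() for k, v in model.state_dict().items()}
    torch.save({"model": cpu_state_dict}, "checkpoint.pt")

    # Load: when torch.load is called directly, map_location="cpu" also protects
    # a worker that sees fewer GPUs than the one that wrote the checkpoint.
    checkpoint = torch.load("checkpoint.pt", map_location="cpu")
    restored = nn.Linear(8, 2)
    restored.load_state_dict(checkpoint["model"])
    restored.to(device)  # then move onto whichever device is visible locally

With the state dict stored as CPU tensors, the cloudpickle round trip in the traceback no longer requires the writer's CUDA device index to exist on the restoring worker.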

aslonnie assigned justinvyu (after briefly self-assigning) on Feb 6, 2025