test_ddp_correctness_with_gradient_as_bucket_view fails for multi-device CUDA #8841

Open
ysiraichi opened this issue Mar 17, 2025 · 0 comments
Labels: bug, xla:gpu

ysiraichi (Collaborator) commented Mar 17, 2025

🐛 Bug

The test TestXrtDistributedDataParallel.test_ddp_correctness_with_gradient_as_bucket_view, skipped by #8593, was introduced in #8521, after GPU CI had already been disabled in #8286. I tried running it at commit bc3bf1f (the commit corresponding to #8521 merged into master), and it also failed there. So I'm fairly sure it never worked on GPU CI in the first place.

The failure is specific to multi-GPU runs; on a single GPU, the test passes.
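
For context, the check that fails boils down to comparing the parameters of a CPU reference model against those of an XLA model wrapped in DDP with gradient_as_bucket_view=True. Below is a minimal sketch, not the actual test code: the toy model, the elided training loop, and the env-var repro hint are illustrative assumptions; only the final allclose assertion mirrors what distributed_util.assert_all_close does.

```python
# Hypothetical sketch of the kind of check the test performs.
import copy

import torch
import torch.distributed as dist
import torch.nn as nn
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the 'xla' process group backend


def _mp_fn(index):
  dist.init_process_group('xla', init_method='xla://')
  device = xm.xla_device()

  # CPU reference model and an XLA copy wrapped in DDP with the option under test.
  cpu_model = nn.Linear(16, 16)
  ddp_model = nn.parallel.DistributedDataParallel(
      copy.deepcopy(cpu_model).to(device),
      gradient_as_bucket_view=True)

  # ... run identical optimization steps on cpu_model and ddp_model ...

  # The assertion that fails (see distributed_util.assert_all_close in the traceback):
  for param_a, param_b in zip(cpu_model.parameters(), ddp_model.parameters()):
    assert torch.allclose(param_a.cpu(), param_b.cpu(), atol=1e-3)


if __name__ == '__main__':
  # Assuming the usual PJRT GPU setup, e.g.:
  #   PJRT_DEVICE=CUDA GPU_NUM_DEVICES=2 python repro.py
  torch_xla.launch(_mp_fn)
```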

Running tests under Python 3.10.16: /usr/local/bin/python
[ RUN      ] TestXrtDistributedDataParallel.test_ddp_correctness_with_gradient_as_bucket_view
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1742049964.886415  190778 coordination_service_agent.cc:367] Coordination agent has successfully connected.
I0000 00:00:1742049964.886745  190778 coordination_service_agent.cc:440] Polling for error from coordination service. This is a long-running RPC that will return only if an error is encountered or cancelled (e.g. due to shutdown).
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1742049964.887096  190775 coordination_service_agent.cc:367] Coordination agent has successfully connected.
I0000 00:00:1742049964.887259  190775 coordination_service_agent.cc:440] Polling for error from coordination service. This is a long-running RPC that will return only if an error is encountered or cancelled (e.g. due to shutdown).
I0000 00:00:1742049965.217900  190778 service.cc:152] XLA service 0x5bbd97cb7220 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1742049965.218049  190778 service.cc:160]   StreamExecutor device (0): Quadro RTX 8000, Compute Capability 7.5
I0000 00:00:1742049965.230750  190778 se_gpu_pjrt_client.cc:972] Using BFC allocator.
I0000 00:00:1742049965.230917  190778 gpu_helpers.cc:129] XLA backend allocating 38067290112 bytes on device 0 for BFCAllocator.
I0000 00:00:1742049965.231123  190778 gpu_helpers.cc:169] XLA backend will use up to 12689096704 bytes on device 0 for CollectiveBFCAllocator.
I0000 00:00:1742049965.234545  190775 service.cc:152] XLA service 0x563bff371280 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1742049965.234683  190775 service.cc:160]   StreamExecutor device (0): Quadro RTX 8000, Compute Capability 7.5
I0000 00:00:1742049965.236253  190775 se_gpu_pjrt_client.cc:972] Using BFC allocator.
I0000 00:00:1742049965.236362  190775 gpu_helpers.cc:129] XLA backend allocating 38066896896 bytes on device 1 for BFCAllocator.
I0000 00:00:1742049965.236519  190775 gpu_helpers.cc:169] XLA backend will use up to 12688965632 bytes on device 1 for CollectiveBFCAllocator.
I0000 00:00:1742049965.308203  190778 cuda_dnn.cc:529] Loaded cuDNN version 90101
I0000 00:00:1742049965.308558  190775 cuda_dnn.cc:529] Loaded cuDNN version 90101
I0000 00:00:1742049986.440682  190778 coordination_service_agent.cc:617] Coordination agent has initiated Shutdown().
I0000 00:00:1742049986.469940  190775 coordination_service_agent.cc:617] Coordination agent has initiated Shutdown().
I0000 00:00:1742049986.471929  190775 coordination_service_agent.cc:638] Coordination agent has successfully shut down.
I0000 00:00:1742049986.472198  190778 coordination_service_agent.cc:638] Coordination agent has successfully shut down.
I0000 00:00:1742049986.472486  191292 coordination_service_agent.cc:447] Cancelling error polling because the service or the agent is shutting down.
I0000 00:00:1742049986.472827  191239 coordination_service_agent.cc:447] Cancelling error polling because the service or the agent is shutting down.
[  FAILED  ] TestXrtDistributedDataParallel.test_ddp_correctness_with_gradient_as_bucket_view
======================================================================
FAIL: test_ddp_correctness_with_gradient_as_bucket_view (__main__.TestXrtDistributedDataParallel)
TestXrtDistributedDataParallel.test_ddp_correctness_with_gradient_as_bucket_view
----------------------------------------------------------------------
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
    return [fn(*args) for args in chunk]
  File "pytorch/xla/torch_xla/_internal/pjrt.py", line 77, in _run_thread_per_device
    replica_results = list(
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "pytorch/xla/torch_xla/_internal/pjrt.py", line 70, in _thread_fn
    return fn()
  File "pytorch/xla/torch_xla/_internal/pjrt.py", line 185, in __call__
    self.fn(runtime.global_ordinal(), *self.args, **self.kwargs)
  File "pytorch/xla/test/torch_distributed/test_ddp.py", line 32, in _ddp_correctness
    util.ddp_correctness(
  File "pytorch/xla/test/distributed_util.py", line 154, in ddp_correctness
    assert_all_close(cpu_model.parameters(), ddp_model.parameters())
  File "pytorch/xla/test/distributed_util.py", line 71, in assert_all_close
    assert torch.allclose(param_a.cpu(), param_b.cpu(), atol=1e-3)
AssertionError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "pytorch/xla/test/torch_distributed/test_ddp.py", line 42, in test_ddp_correctness_with_gradient_as_bucket_view
    torch_xla.launch(self._ddp_correctness, args=(False, FLAGS.debug, True))
  File "pytorch/xla/torch_xla/torch_xla.py", line 233, in launch
    xmp.spawn(fn, args=args, nprocs=nprocs, start_method=start_method)
  File "pytorch/xla/torch_xla/distributed/xla_multiprocessing.py", line 39, in spawn
    return pjrt.spawn(fn, nprocs, start_method, args)
  File "pytorch/xla/torch_xla/_internal/pjrt.py", line 213, in spawn
    run_multiprocess(spawn_fn, start_method=start_method)
  File "pytorch/xla/torch_xla/_internal/pjrt.py", line 169, in run_multiprocess
    replica_results = list(
  File "pytorch/xla/torch_xla/_internal/pjrt.py", line 170, in <genexpr>
    itertools.chain.from_iterable(
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 575, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
AssertionError

----------------------------------------------------------------------
Ran 1 test in 32.094s

FAILED (failures=1)

Environment

  • Reproducible on XLA backend [CPU/TPU/CUDA]: CUDA
  • torch_xla version: failing since bc3bf1f, and still failing at 6804c65

cc @yaochengji @qihqi @tengyifei

ysiraichi added the bug and xla:gpu labels on Mar 17, 2025