test_ddp_correctness_with_gradient_as_bucket_view
🐛 Bug

The test TestXrtDistributedDataParallel.test_ddp_correctness_with_gradient_as_bucket_view, skipped by #8593, was introduced in #8521. That was after GPU CI was disabled in #8286. I tried running it at commit bc3bf1f (the commit corresponding to #8521 being merged into master), but it also failed there. So I'm fairly sure it never worked under GPU CI.
The test fails only when multiple GPUs are available; on a single GPU it passes.
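For context, here is a minimal repro sketch of what the test exercises, reconstructed from the traceback below: torch_xla.launch spawns one process per GPU, each of which wraps a model in DistributedDataParallel with gradient_as_bucket_view=True, trains, and then compares parameters against a CPU baseline. The model, optimizer, and loop here are hypothetical stand-ins; only the launch/DDP wiring is taken from the test. The full failing output follows.

```python
# Hedged repro sketch, NOT the actual test: the toy model, loss, and step
# loop are hypothetical stand-ins for what distributed_util.py builds.
# The xla process group and gradient_as_bucket_view=True mirror the test.
import torch
import torch.distributed as dist
import torch.nn as nn
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the 'xla' dist backend
from torch.nn.parallel import DistributedDataParallel as DDP


def _mp_fn(index):
  dist.init_process_group('xla', init_method='xla://')
  device = xm.xla_device()
  model = nn.Linear(8, 8).to(device)  # hypothetical toy model
  ddp_model = DDP(model, gradient_as_bucket_view=True)
  optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-2)
  for _ in range(3):
    optimizer.zero_grad()
    loss = ddp_model(torch.randn(4, 8, device=device)).sum()
    loss.backward()
    optimizer.step()
    xm.mark_step()


if __name__ == '__main__':
  # The failing test launches via torch_xla.launch, one process per GPU.
  torch_xla.launch(_mp_fn)
```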
```
Running tests under Python 3.10.16: /usr/local/bin/python
[ RUN      ] TestXrtDistributedDataParallel.test_ddp_correctness_with_gradient_as_bucket_view
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1742049964.886415 190778 coordination_service_agent.cc:367] Coordination agent has successfully connected.
I0000 00:00:1742049964.886745 190778 coordination_service_agent.cc:440] Polling for error from coordination service. This is a long-running RPC that will return only if an error is encountered or cancelled (e.g. due to shutdown).
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1742049964.887096 190775 coordination_service_agent.cc:367] Coordination agent has successfully connected.
I0000 00:00:1742049964.887259 190775 coordination_service_agent.cc:440] Polling for error from coordination service. This is a long-running RPC that will return only if an error is encountered or cancelled (e.g. due to shutdown).
I0000 00:00:1742049965.217900 190778 service.cc:152] XLA service 0x5bbd97cb7220 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1742049965.218049 190778 service.cc:160]   StreamExecutor device (0): Quadro RTX 8000, Compute Capability 7.5
I0000 00:00:1742049965.230750 190778 se_gpu_pjrt_client.cc:972] Using BFC allocator.
I0000 00:00:1742049965.230917 190778 gpu_helpers.cc:129] XLA backend allocating 38067290112 bytes on device 0 for BFCAllocator.
I0000 00:00:1742049965.231123 190778 gpu_helpers.cc:169] XLA backend will use up to 12689096704 bytes on device 0 for CollectiveBFCAllocator.
I0000 00:00:1742049965.234545 190775 service.cc:152] XLA service 0x563bff371280 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1742049965.234683 190775 service.cc:160]   StreamExecutor device (0): Quadro RTX 8000, Compute Capability 7.5
I0000 00:00:1742049965.236253 190775 se_gpu_pjrt_client.cc:972] Using BFC allocator.
I0000 00:00:1742049965.236362 190775 gpu_helpers.cc:129] XLA backend allocating 38066896896 bytes on device 1 for BFCAllocator.
I0000 00:00:1742049965.236519 190775 gpu_helpers.cc:169] XLA backend will use up to 12688965632 bytes on device 1 for CollectiveBFCAllocator.
I0000 00:00:1742049965.308203 190778 cuda_dnn.cc:529] Loaded cuDNN version 90101
I0000 00:00:1742049965.308558 190775 cuda_dnn.cc:529] Loaded cuDNN version 90101
I0000 00:00:1742049986.440682 190778 coordination_service_agent.cc:617] Coordination agent has initiated Shutdown().
I0000 00:00:1742049986.469940 190775 coordination_service_agent.cc:617] Coordination agent has initiated Shutdown().
I0000 00:00:1742049986.471929 190775 coordination_service_agent.cc:638] Coordination agent has successfully shut down.
I0000 00:00:1742049986.472198 190778 coordination_service_agent.cc:638] Coordination agent has successfully shut down.
I0000 00:00:1742049986.472486 191292 coordination_service_agent.cc:447] Cancelling error polling because the service or the agent is shutting down.
I0000 00:00:1742049986.472827 191239 coordination_service_agent.cc:447] Cancelling error polling because the service or the agent is shutting down.
```
```
[  FAILED  ] TestXrtDistributedDataParallel.test_ddp_correctness_with_gradient_as_bucket_view
======================================================================
FAIL: test_ddp_correctness_with_gradient_as_bucket_view (__main__.TestXrtDistributedDataParallel)
TestXrtDistributedDataParallel.test_ddp_correctness_with_gradient_as_bucket_view
----------------------------------------------------------------------
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
    return [fn(*args) for args in chunk]
  File "pytorch/xla/torch_xla/_internal/pjrt.py", line 77, in _run_thread_per_device
    replica_results = list(
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "pytorch/xla/torch_xla/_internal/pjrt.py", line 70, in _thread_fn
    return fn()
  File "pytorch/xla/torch_xla/_internal/pjrt.py", line 185, in __call__
    self.fn(runtime.global_ordinal(), *self.args, **self.kwargs)
  File "pytorch/xla/test/torch_distributed/test_ddp.py", line 32, in _ddp_correctness
    util.ddp_correctness(
  File "pytorch/xla/test/distributed_util.py", line 154, in ddp_correctness
    assert_all_close(cpu_model.parameters(), ddp_model.parameters())
  File "pytorch/xla/test/distributed_util.py", line 71, in assert_all_close
    assert torch.allclose(param_a.cpu(), param_b.cpu(), atol=1e-3)
AssertionError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "pytorch/xla/test/torch_distributed/test_ddp.py", line 42, in test_ddp_correctness_with_gradient_as_bucket_view
    torch_xla.launch(self._ddp_correctness, args=(False, FLAGS.debug, True))
  File "pytorch/xla/torch_xla/torch_xla.py", line 233, in launch
    xmp.spawn(fn, args=args, nprocs=nprocs, start_method=start_method)
  File "pytorch/xla/torch_xla/distributed/xla_multiprocessing.py", line 39, in spawn
    return pjrt.spawn(fn, nprocs, start_method, args)
  File "pytorch/xla/torch_xla/_internal/pjrt.py", line 213, in spawn
    run_multiprocess(spawn_fn, start_method=start_method)
  File "pytorch/xla/torch_xla/_internal/pjrt.py", line 169, in run_multiprocess
    replica_results = list(
  File "pytorch/xla/torch_xla/_internal/pjrt.py", line 170, in <genexpr>
    itertools.chain.from_iterable(
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 575, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
AssertionError
----------------------------------------------------------------------
Ran 1 test in 32.094s

FAILED (failures=1)
```
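The assertion that fires is the parameter comparison in pytorch/xla/test/distributed_util.py (line 71 in the traceback): each DDP-trained parameter is compared against the corresponding CPU-model parameter with torch.allclose at atol=1e-3. Here is a sketch of that check, with a hypothetical max-abs-diff printout added as a debugging aid (not part of the original test):

```python
# Sketch of the failing check from distributed_util.py's assert_all_close;
# the diff printout is a hypothetical debugging aid, not in the original.
import torch

def assert_all_close(params_a, params_b, atol=1e-3):
  for param_a, param_b in zip(params_a, params_b):
    a, b = param_a.cpu(), param_b.cpu()
    if not torch.allclose(a, b, atol=atol):
      # Shows how far the DDP replica's parameters drift from the CPU model's.
      print('max abs diff:', (a - b).abs().max().item())
    assert torch.allclose(a, b, atol=atol)
```

With gradient_as_bucket_view=True, DDP makes each parameter's .grad a view into the reducer's flat bucket buffer rather than a separate tensor, so the multi-GPU-only drift may point at how those bucket views interact with XLA lazy tensors during all-reduce; that is a guess, not a diagnosis.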
Environment

- GPU: 2× Quadro RTX 8000 (Compute Capability 7.5), CUDA, cuDNN 90101 (per the log above)
- Python: 3.10.16
- torch_xla: master, also failing at commit bc3bf1f
cc @yaochengji @qihqi @tengyifei