🐛 Bug
`torch_xla.core.xla_model.mesh_reduce(...)` results in a `RuntimeError`:

```
tensorflow/compiler/xla/xla_client/mesh_service.cc:243 : Failed to meet rendezvous 'eval_lguids': Received message larger than max (5602816 vs. 4194304) (8)
```
For context: I'm adapting @jysohn23's run_glue_tpu.py to MS-MARCO's passage ranking dataset, which is much larger than the individual GLUE datasets. I believe the equivalent lines in run_glue_tpu.py are 271-272:
```python
preds = xm.mesh_reduce("eval_preds", preds, np.concatenate)
out_label_ids = xm.mesh_reduce("eval_out_label_ids", out_label_ids, np.concatenate)
```
This is probably caused by grpc's default max send/receive message limits. Adding `grpc.max_send_message_length=1000000000,grpc.max_receive_message_length=1000000000` to `os.environ['TF_GRPC_DEFAULT_OPTIONS']` in `_setup_grpc()` in `torch_xla/__init__.py` might fix it.
Building from source doesn't work on Colab, so I wasn't able to test whether this actually helps. I'm also unsure whether grpc can accept limits as large as 1 GB, or whether there is some other way to work around this. (Note: I used a subset of the dataset, which produced the 5602816-byte message above, so by extrapolation the full dataset would need a limit at least that large.) In any case, a 4 MB cap seems too small for larger datasets. Thanks!
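For anyone who wants to try this without rebuilding, here is an untested sketch of the workaround: override `TF_GRPC_DEFAULT_OPTIONS` before `torch_xla` is imported, since `_setup_grpc()` reads that env var at import time. The 1 GB figure and the comma-separated option format are assumptions based on the existing contents of the env var; the option names (`grpc.max_send_message_length`, `grpc.max_receive_message_length`) are standard grpc channel arguments.

```python
import os

# Hypothetical workaround sketch: raise grpc's message-size caps to ~1 GB.
# Must run BEFORE `import torch_xla`, which reads this env var in _setup_grpc().
LIMIT = 1_000_000_000  # assumed value; untested whether grpc accepts it

extra = (
    f"grpc.max_send_message_length={LIMIT},"
    f"grpc.max_receive_message_length={LIMIT}"
)

# Append to any options torch_xla (or the user) already set, keeping the
# existing comma-separated key=value format.
existing = os.environ.get("TF_GRPC_DEFAULT_OPTIONS", "")
os.environ["TF_GRPC_DEFAULT_OPTIONS"] = f"{existing},{extra}" if existing else extra
```

If this works, the nicer fix would probably be to make the limit configurable inside `_setup_grpc()` itself rather than forcing users to pre-set the env var.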
Environment
- Reproducible on XLA backend [CPU/TPU]: TPU
- torch_xla version: latest nightly
- Running on Google Colab