
xm.mesh_reduce results in RuntimeError concerning message size #1924

@ronakice

Description

🐛 Bug

torch_xla.core.xla_model.mesh_reduce(...) results in a RuntimeError:

tensorflow/compiler/xla/xla_client/mesh_service.cc:243 : Failed to meet rendezvous 'eval_lguids': Received message larger than max (5602816 vs. 4194304) (8)

For context, I'm adapting @jysohn23's run_glue_tpu.py to MS-MARCO's passage ranking dataset (much larger than the individual GLUE datasets). I believe the equivalent lines of code in run_glue_tpu.py are 271-272:

preds = xm.mesh_reduce("eval_preds", preds, np.concatenate)
out_label_ids = xm.mesh_reduce("eval_out_label_ids", out_label_ids, np.concatenate)

This probably has to do with gRPC's default maximum send/receive message size of 4 MB (4194304 bytes). Adding grpc.max_send_message_length=1000000000,grpc.max_receive_message_length=1000000000 to os.environ['TF_GRPC_DEFAULT_OPTIONS'] in _setup_grpc() in torch_xla/__init__.py might help.
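Concretely, the idea is for TF_GRPC_DEFAULT_OPTIONS to end up containing the larger limits in addition to whatever _setup_grpc() already sets. A minimal sketch of that (untested; the 1 GB values are my guess at a "large enough" limit, and I'm not reproducing the actual _setup_grpc() body here):

import os

# Sketch only: append larger gRPC message-size limits to whatever options
# are already configured. The 1 GB values are assumed, not verified.
extra = ("grpc.max_send_message_length=1000000000,"
         "grpc.max_receive_message_length=1000000000")
current = os.environ.get("TF_GRPC_DEFAULT_OPTIONS", "")
os.environ["TF_GRPC_DEFAULT_OPTIONS"] = current + "," + extra if current else extra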

Building from source doesn't work on Colab, so I wasn't able to test whether that change helps. I'm also not sure gRPC will even accept limits as large as 1 GB (note that I used a subset of the dataset, which is what produces the 5602816 bytes in the error, so a limit of at least that size would be needed, and the full dataset would need more), or whether there is another way to circumvent this issue; one idea is sketched below. Either way, a 4 MB limit seems a bit too small for larger datasets. Thanks!
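One way I could imagine circumventing it without touching the gRPC limits (completely untested; the helper and the chunk size below are hypothetical) is to run mesh_reduce over row-wise slices small enough that each rendezvous message stays under the 4 MB cap:

import numpy as np
import torch_xla.core.xla_model as xm

def chunked_mesh_reduce(tag, array, chunk_rows=10000):
    # Hypothetical helper: reduce the per-core array in row-wise slices, one
    # mesh_reduce call (with a distinct tag) per slice, so each message stays
    # well below the 4194304-byte default. Assumes every core produces the
    # same number of slices; otherwise the rendezvous would block.
    pieces = []
    for i, start in enumerate(range(0, array.shape[0], chunk_rows)):
        chunk = array[start:start + chunk_rows]
        pieces.append(xm.mesh_reduce("{}_{}".format(tag, i), chunk, np.concatenate))
    return np.concatenate(pieces)

preds = chunked_mesh_reduce("eval_preds", preds)
out_label_ids = chunked_mesh_reduce("eval_out_label_ids", out_label_ids)

The resulting row order differs from a single mesh_reduce (slices are interleaved across cores rather than whole per-core arrays), but since preds and labels are sliced identically, the pairing is preserved for metric computation.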

Environment

  • Reproducible on XLA backend [CPU/TPU]: TPU
  • torch_xla version: latest nightly
  • Running on Google Colab
