Conversation

swong3-sc (Collaborator):

Scope of work done

  • Adds two DMP-related tests to models_test.py to verify that we can wrap a model with DMP (DistributedModelParallel): one test for the forward pass and one for gradient flow (a rough sketch of the idea is included below).
  • Small fix to handle tensors being returned as Awaitable objects, which previously prevented DMP wrapping.
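
For context, a minimal sketch of the forward-pass idea, using torchrec's stock EmbeddingBagCollection as a stand-in for the model under test (the table config, port, and names here are illustrative, not the actual models_test.py code):

import torch
import torch.distributed as dist
from torchrec import EmbeddingBagCollection, EmbeddingBagConfig, KeyedJaggedTensor
from torchrec.distributed.model_parallel import DistributedModelParallel as DMP
from torchrec.distributed.types import Awaitable

# Single-process ("gloo", world_size=1) group, mirroring the unit-test setup.
if not dist.is_initialized():
    dist.init_process_group(
        backend="gloo", init_method="tcp://localhost:29500", rank=0, world_size=1
    )

# Illustrative embedding table; the real tests wrap the repo's own model.
ebc = EmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(
            name="t1", embedding_dim=8, num_embeddings=100, feature_names=["f1"]
        )
    ],
    device=torch.device("meta"),
)
model = DMP(module=ebc, device=torch.device("cpu"))

# Two-sample batch for feature "f1": lengths [2, 1] index into the 3 values.
features = KeyedJaggedTensor.from_lengths_sync(
    keys=["f1"],
    values=torch.tensor([1, 2, 3]),
    lengths=torch.tensor([2, 1]),
)
out = model(features)
if isinstance(out, Awaitable):  # DMP can hand back a LazyAwaitable
    out = out.wait()

dist.destroy_process_group()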

Where is the documentation for this feature?: N/A

Did you add automated tests or write a test plan?

Yes, added two unit tests.

Updated Changelog.md? NO

Ready for code review?: YES

@kmontemayor2-sc (Collaborator) left a comment:

Thanks Sam!

"""
Test that DMP-wrapped LightGCN produces the same output as non-wrapped model. Note: We only test with a single process for unit test.
"""
from torchrec.distributed.model_parallel import DistributedModelParallel as DMP
kmontemayor2-sc (Collaborator):

nit. import at the top of the file?

swong3-sc (Author):

Yeah, changed.

Comment on lines 294 to 300
if not dist.is_initialized():
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://localhost:29500",
        rank=0,
        world_size=1,  # Single process for unit test
    )
kmontemayor2-sc (Collaborator):

Let's clean up the process group after every test, like we do here?

kmontemayor2-sc (Collaborator):

This way we can get rid of the try/catch here.

swong3-sc (Author):

Thanks, I added a tearDown method.
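
For reference, a tearDown along these lines (class name is illustrative) releases the default group after each test:

import unittest

import torch.distributed as dist


class ModelsTest(unittest.TestCase):  # illustrative class name
    def tearDown(self) -> None:
        # Destroy the default process group so each test can initialize its
        # own group without hitting "process group already initialized".
        if dist.is_initialized():
            dist.destroy_process_group()
        super().tearDown()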

kmontemayor2-sc (Collaborator):

Can we follow the pattern here to test against world size > 1?

swong3-sc (Author):

Added a test to do so. I think we should discuss the nature of this test when you get back, i.e. CPU vs. CUDA. I went ahead with a world size of 2 on CPU, so we aren't really testing the sharding here, just that wrapping works with a larger world size. A rough sketch of the approach is below.
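
Roughly, the multi-process variant spawns two CPU/gloo ranks; the port and helper names below are illustrative, not the actual test code:

import torch.distributed as dist
import torch.multiprocessing as mp


def _run_forward(rank: int, world_size: int, init_method: str) -> None:
    # Each spawned process joins the same gloo group, runs the DMP-wrapped
    # forward pass (elided here), then cleans up its process group.
    dist.init_process_group(
        backend="gloo", init_method=init_method, rank=rank, world_size=world_size
    )
    try:
        assert dist.get_world_size() == world_size
        # ... build the model, wrap with DMP(device=torch.device("cpu")), run forward ...
    finally:
        dist.destroy_process_group()


def test_dmp_forward_world_size_two() -> None:
    world_size = 2
    mp.spawn(
        _run_forward,
        args=(world_size, "tcp://localhost:29501"),  # port is illustrative
        nprocs=world_size,
        join=True,
    )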

swong3-sc (Author):

This PR is not urgent, though, so we can discuss later.

Comment on lines +297 to +299
# When using DMP, EmbeddingBagCollection returns an Awaitable that needs to be resolved
if isinstance(embeddings_0, Awaitable):
    embeddings_0 = embeddings_0.wait()
kmontemayor2-sc (Collaborator):

It does seem unfortunate/surprising that the rest of our code and/or PyG code doesn't support this type of tensor.

Can we add a TODO to look into this?

swong3-sc (Author):

I will add a TODO. This seems to be expected behavior introduced by TorchRec's sharding:

https://docs.pytorch.org/tutorials/intermediate/torchrec_intro_tutorial.html#gpu-training-with-lazyawaitable

if not dist.is_initialized():
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://localhost:29500",
kmontemayor2-sc (Collaborator):

Let's also use get_process_group_init_method so we can always have a free port?

swong3-sc (Author):

Changed to this.
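
For anyone following along, a hypothetical stand-in for such a helper (not the actual get_process_group_init_method implementation) simply asks the OS for an unused port:

import socket


def _free_port_init_method(host: str = "localhost") -> str:
    # Bind to port 0 so the OS assigns a free port, then build the tcp://
    # init_method string for torch.distributed.init_process_group.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        port = s.getsockname()[1]
    return f"tcp://{host}:{port}"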
