fix: ep clipping with no ep grads #1541
Conversation
tianyu-l
left a comment
Thanks, had a minor comment.
```python
                ep_grads, norm_type, error_if_nonfinite, foreach
            ).full_tensor()
        )
        # ep_grads may be an empty list, in which case get_total_norm returns tensor(0.), a non-DTensor
```
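A minimal sketch of the failure mode under discussion, assuming PyTorch's public `torch.nn.utils.get_total_norm` (the empty-input behavior shown is exactly what the comment above describes):

```python
from torch.nn.utils import get_total_norm

# With no gradients to reduce, get_total_norm returns a plain scalar
# Tensor, not a DTensor, so it has no .full_tensor() method.
total_norm = get_total_norm([], norm_type=2.0)
print(total_norm)  # tensor(0.)
print(hasattr(total_norm, "full_tensor"))  # False -> .full_tensor() would raise AttributeError
```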
> This edge case can occur in PP+EP setups when the model uses some fully dense and some MoE layers (like DSv3), in which case some PP ranks may not be assigned any MoE layers.

Oh, makes sense to me. Could you actually put this example edge case in the comment too? I think it'd be very helpful.
> I suppose it is possible that `non_ep_grads` could also be empty, but I can only imagine this happening in extreme cases, so I did not change the `non_ep_grads` code.
I think this is not possible if a PP stage always
- contains any non-MoE params
- contains full MoE modules -- the shared expert and router.gate will be `non_ep_params` anyways
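A hypothetical sketch of the split being described, assuming a name-based partition (`split_grads` and the `.experts.` substring check are illustrative only; torchtitan's actual partitioning logic may differ):

```python
# Hypothetical illustration: partition gradients by whether the
# parameter belongs to a routed expert (expert-parallel) or not.
def split_grads(named_params):
    ep_grads, non_ep_grads = [], []
    for name, param in named_params:
        if param.grad is None:
            continue
        # Only routed experts are expert-parallel; the shared expert and
        # router.gate are replicated, so they land in non_ep_grads
        # whenever a stage owns a full MoE module.
        if ".experts." in name:
            ep_grads.append(param.grad)
        else:
            non_ep_grads.append(param.grad)
    return ep_grads, non_ep_grads
```

Under this split, `non_ep_grads` is empty only if a stage owns neither dense params nor a full MoE module, matching the reasoning above.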
Expanded the comment.
> I think this is not possible if a PP stage always [...]

Yeah, I was imagining very extreme cases where PP is very granularly applied and somehow a PP rank only ends up owning MoE layers and nothing else. Can't happen for any model or parallelism you could set up with torchtitan:main today, for sure. I was mostly just explaining why I only touched the `ep_grads` code.
Need anything else from me on this one, @tianyu-l?
oh sorry forgot to merge :)
Np, thanks!
The current EP grad clipping logic assumes that when using EP, all of the norms returned by `torch.nn.utils.get_total_norm` are DTensors. This assumption can be violated, and the subsequent `full_tensor` call can correspondingly fail, in the edge case where the `ep_grads` list is empty, in which case `get_total_norm` returns `tensor(0.)`, a non-DTensor.

torchtitan/torchtitan/distributed/utils.py, lines 421 to 423 in a1fdd7e
This edge case can occur in PP+EP setups when the model uses some fully dense and some MoE layers (like DSv3), in which case some PP ranks may not be assigned any MoE layers.
I suppose it is possible that `non_ep_grads` could also be empty, but I can only imagine this happening in extreme cases, so I did not change the `non_ep_grads` code.

CC @tianyu-l
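A minimal sketch of a guarded clipping path, assuming the fix takes the form of a DTensor type check (the helper name `_ep_total_norm` is hypothetical; only `ep_grads` and the `get_total_norm(...).full_tensor()` call come from the snippet referenced above):

```python
from torch.distributed.tensor import DTensor
from torch.nn.utils import get_total_norm

def _ep_total_norm(ep_grads, norm_type=2.0, error_if_nonfinite=False, foreach=None):
    total_norm = get_total_norm(ep_grads, norm_type, error_if_nonfinite, foreach)
    # ep_grads may be an empty list, in which case get_total_norm returns
    # tensor(0.), a plain Tensor with no .full_tensor() method. This can
    # happen in PP+EP when a PP rank is assigned only dense layers.
    if isinstance(total_norm, DTensor):
        total_norm = total_norm.full_tensor()
    return total_norm
```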