Add main grad before fwd pass #1142

vedanuj · 2023-10-04T15:31:56Z

Adds main_grad to FlatParameter before the forward pass.

To be used with https://github.com/fairinternal/xlformers/pull/1418

mkardas · 2023-10-04T16:00:37Z

fairscale/nn/data_parallel/fully_sharded_data_parallel.py

-        assert param.grad is not None, param.shape
-        if param.grad.requires_grad:
-            raise RuntimeError("FSDP only works with gradients that don't require gradients")
+        # assert param.grad is not None, param.shape


Perhaps some check is needed to make sure parameters are not shared (as would be the case with weights tying)?
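
One way such a check could look is sketched below. This is only an illustration, not code from this PR: `find_shared_parameters` and `TiedLM` are hypothetical names, and the sketch assumes that a tied weight is the same `nn.Parameter` object registered on more than one module.

```python
from collections import defaultdict
from typing import List

import torch.nn as nn


def find_shared_parameters(model: nn.Module) -> List[List[str]]:
    """Return groups of qualified names that refer to the same Parameter object."""
    names_by_id = defaultdict(list)
    for module_name, module in model.named_modules():
        # recurse=False lists only parameters registered directly on this module,
        # so a tied weight is reported once per module that holds it.
        for param_name, param in module.named_parameters(recurse=False):
            full_name = f"{module_name}.{param_name}" if module_name else param_name
            names_by_id[id(param)].append(full_name)
    return [names for names in names_by_id.values() if len(names) > 1]


class TiedLM(nn.Module):
    """Hypothetical toy model with an embedding tied to the output projection."""

    def __init__(self, vocab: int = 16, dim: int = 8) -> None:
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.out = nn.Linear(dim, vocab, bias=False)
        self.out.weight = self.embed.weight  # weight tying


shared = find_shared_parameters(TiedLM())
```

A guard built on this could raise or fall back to the `param.grad` path whenever the returned list is non-empty.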

mkardas · 2023-10-04T16:00:54Z

fairscale/nn/data_parallel/fully_sharded_data_parallel.py

-                param.grad = None
+                if param.main_grad is not None:
+                    grad = param.main_grad
+                    param.main_grad = None


Doesn't .main_grad need to be restored somewhere before next forward?

awgu · 2023-10-04T16:17:30Z

If we construct flat_param.main_grad before forward and set individual param.main_grad as view into flat_param.main_grad before forward, then we hold the flat_param.main_grad in memory for all FSDP instances going into backward, which may increase peak memory.

An alternative is to construct flat_param.main_grad (as zeros or empty, depending on whether TE adds to or copies into the memory) in the pre-backward hook, and to set each param.main_grad as a view into flat_param.main_grad only there.

fairscale/fairscale/nn/data_parallel/fully_sharded_data_parallel.py

Line 1521 in 0f2229e

def _pre_backward_hook(*unused: Any) -> None:

One option could be to add this logic to _prep_grads_for_backward():

fairscale/fairscale/nn/data_parallel/fully_sharded_data_parallel.py

Line 1561 in 0f2229e

self._prep_grads_for_backward()
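
The lazy-allocation idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not fairscale code: `attach_main_grad_views` and `release_main_grad` are hypothetical helpers that something like a pre-backward hook could call, and the buffer is zero-filled on the assumption that the consumer accumulates into it.

```python
import torch
import torch.nn as nn


def attach_main_grad_views(params, dtype=torch.float32):
    """Allocate one flat buffer and expose per-parameter views as .main_grad."""
    total = sum(p.numel() for p in params)
    # Zeros assume the consumer adds into the buffer; torch.empty would do if it overwrites.
    flat_main_grad = torch.zeros(total, dtype=dtype, device=params[0].device)
    offset = 0
    for p in params:
        n = p.numel()
        # The slice is contiguous, so the view shares storage with the flat buffer.
        p.main_grad = flat_main_grad[offset:offset + n].view_as(p)
        offset += n
    return flat_main_grad


def release_main_grad(params):
    """Drop the views so the flat buffer can be freed once nothing else holds it."""
    for p in params:
        p.main_grad = None


layer = nn.Linear(4, 3).to(torch.bfloat16)
params = list(layer.parameters())
flat = attach_main_grad_views(params)
params[0].main_grad.add_(1.0)  # a write through the view lands in the flat buffer
```

Allocating only at this point means the fp32 buffer exists from pre-backward until it is released, rather than across the whole forward pass.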

awgu · 2023-10-04T16:30:37Z

fairscale/nn/data_parallel/fully_sharded_data_parallel.py

-                param.grad.data = param.grad.data.float()
+                if param.grad is not None:
+                    if param.main_grad is not None:
+                        param.main_grad.copy_(param.grad.float())


nit: torch can upcast and copy in one kernel:

Suggested change

-                        param.main_grad.copy_(param.grad.float())
+                        param.main_grad.copy_(param.grad)

Existing: upcast kernel + copy kernel
New: only upcast kernel

Correctness Example

>>> t_fp32 = torch.empty((4,))
>>> t_bf16 = torch.randn((4,), dtype=torch.bfloat16)
>>> t_fp32
tensor([-8.3762e-20, 3.0801e-41, -1.3043e-16, 3.0801e-41])
>>> t_bf16
tensor([-1.3516, -0.5156, -0.6055, 0.3535], dtype=torch.bfloat16)
>>> t_fp32.copy_(t_bf16)
tensor([-1.3516, -0.5156, -0.6055, 0.3535])
>>> t_fp32
tensor([-1.3516, -0.5156, -0.6055, 0.3535])
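
The same point can be checked as a small script. This is a sketch with made-up tensor names; it compares the two-step path with the direct `copy_` and relies on the bf16 to fp32 conversion being exact.

```python
import torch

torch.manual_seed(0)
grad_bf16 = torch.randn(1024, dtype=torch.bfloat16)

# Existing path: materialize an fp32 temporary, then copy it into the buffer.
main_grad_two_step = torch.empty(1024, dtype=torch.float32)
main_grad_two_step.copy_(grad_bf16.float())

# Suggested path: copy_ converts dtype while copying, with no temporary.
main_grad_direct = torch.empty(1024, dtype=torch.float32)
main_grad_direct.copy_(grad_bf16)

same = torch.equal(main_grad_two_step, main_grad_direct)
```

Besides saving a kernel launch, the direct form also avoids allocating the intermediate fp32 tensor.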

vedanuj · 2023-10-05T22:59:02Z

@awgu From my testing, the changes in FlattenParamsWrapper still seem to be necessary; otherwise it complains that .main_grad is missing on the parameter.

@jspark1105 I have borrowed some changes from your PR #1136 to update the view when reallocating the zero buffers for main_grad.

awgu · 2023-10-06T15:30:46Z

fairscale/nn/data_parallel/fully_sharded_data_parallel.py

@@ -1721,35 +1722,48 @@ def _post_backward_hook(self, param: Parameter, *unused: Any) -> None:
        # reductions in post_backward stream.
        self._streams["post_backward"].wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self._streams["post_backward"]):
-            orig_grad_data = param.grad.data
+            if param.main_grad is not None and not param.main_grad.eq(0).all():


Are we concerned that this param.main_grad.eq(0).all() might be a CPU sync? Perhaps it is not much of a concern if we already have CPU syncs for rate-limiting FSDP.

vedanuj · 2023-10-06

Is there another way I can check whether main_grad is nonzero without doing a CPU sync?

jspark1105 · 2023-10-06

Are we checking whether this is all zeros in order to skip modules that didn't use main_grad?

vedanuj · 2023-10-06

Yes, because all parameters have .main_grad, so I am not sure how else to make sure we skip the ones that do not have their grads stored in .main_grad.
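
One possible direction, offered only as a sketch and not as what this PR does: keep the "was main_grad used" result as a device tensor and select with `torch.where`, instead of branching on it in Python (branching on a tensor forces a host read). `select_grad_without_sync` is a hypothetical helper name.

```python
import torch


def select_grad_without_sync(main_grad: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Pick main_grad if it holds any nonzero value, else the upcast grad."""
    # 0-dim bool tensor that stays on main_grad's device; no Python-level bool() is taken.
    used = main_grad.ne(0).any()
    # torch.where broadcasts the 0-dim condition over both candidates.
    return torch.where(used, main_grad, grad.to(main_grad.dtype))


grad = torch.tensor([0.5, -1.0], dtype=torch.bfloat16)
unused_main = torch.zeros(2)
used_main = torch.tensor([2.0, 3.0])

picked_unused = select_grad_without_sync(unused_main, grad)
picked_used = select_grad_without_sync(used_main, grad)
```

The trade-off is that both candidates are always materialized, so this spends extra memory and compute to avoid the sync. A Python-side flag set by whichever code path writes into main_grad would avoid both, at the cost of more bookkeeping.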

vedanuj changed the base branch from main to ngoyal_changes_for_pp_fp8 October 4, 2023 15:32

facebook-github-bot added the CLA Signed label Oct 4, 2023

vedanuj requested review from jspark1105, tmarkstrum, ngoyal2707, awgu and jianyuh and removed request for tmarkstrum October 4, 2023 15:32

vedanuj mentioned this pull request Oct 4, 2023

Add main_grad #1140

Open

10 tasks

mkardas reviewed Oct 4, 2023

awgu reviewed Oct 4, 2023

jianyuh reviewed Oct 4, 2023

jspark1105 mentioned this pull request Oct 4, 2023

Fp8 all gather hack #1136

Open

awgu reviewed Oct 6, 2023

vedanuj added 5 commits October 9, 2023 21:18

changes for main_grad before fwd

88589c1

move changes after orig_grad_data

f9083cf

address comments

c116927

ensure grads are not downcasted to bf16

e8f54b6

guard main_grad None

8cf28fa

vedanuj force-pushed the add_main_grad_before_fwd branch from 0be2e5e to 8cf28fa Compare October 10, 2023 04:19

jspark1105 mentioned this pull request May 28, 2024

[FSDPv1] Only perform cat() during last microbatch backward() within FlattenParamsWrapper #1184

Merged
