Conversation

@dbaranchuk (Contributor)

TODO #1: double check that both memory_efficient_backward and has_fp16_weights options work properly.
TODO #2: make a clearer PR description.

This PR provides two features:

  1. A memory-efficient backward option:
  • Stores int8 weights in row-major format.
  • In the forward pass, the row-major matrix has to be cast to the Turing/Ampere layout at every iteration. This adds noticeable computational overhead for inference, so we suggest using this option only for training.
  • In the backward pass, we transform the row-major weight matrix into fp16 weights and perform an fp16 matmul with the fp16 grad_outputs. Note that we do not store the fp16 weights but compute them efficiently on the fly, so the overall computational and memory overheads are negligible (see the sketch after this list).
  2. Casting the inputs and outputs of the 8-bit layer to fp16 and back to the original input dtype, respectively. This allows us to use models of arbitrary dtypes without converting them to fp16.
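
A rough sketch of the backward idea, for illustration only (the CB/SCB names follow the snippets quoted further down this thread; this is not the PR's exact code):

```python
import torch

def int8_backward_sketch(grad_output: torch.Tensor,
                         CB: torch.Tensor,        # int8 weights, row-major, shape [out, in]
                         SCB: torch.Tensor,       # per-row absmax quantization scales
                         grad_shape: torch.Size) -> torch.Tensor:
    # Rebuild fp16 weights on the fly from the stored int8 matrix and its
    # row-wise scales; no persistent fp16 copy is kept, so the extra memory
    # cost is only transient.
    W_fp16 = CB.to(torch.float16) * (SCB.unsqueeze(1) / 127.0).to(torch.float16)
    # fp16 matmul with the fp16 grad_output gives the input gradient.
    grad_A = torch.mm(grad_output.to(torch.float16), W_fp16).view(grad_shape)
    return grad_A
```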

@TimDettmers (Collaborator) left a comment:

Looks good to me overall. Please also include a test that checks for a minimal difference between fp16 gradients and int8->fp16 gradients. You can either add it to the existing tests or create a new one.
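
For reference, a standalone sketch of such a check; it round-trips the weights through row-wise absmax int8 by hand rather than calling the bitsandbytes layer, and the shapes, thresholds and device (CUDA, fp16) are illustrative only:

```python
import torch

def test_int8_vs_fp16_grads():
    # Compare the input gradient of a plain fp16 matmul against the gradient
    # obtained when the weight goes through an int8 quantize/dequantize round trip.
    torch.manual_seed(0)
    x = torch.randn(32, 64, dtype=torch.float16, device="cuda", requires_grad=True)
    W = torch.randn(128, 64, dtype=torch.float16, device="cuda")
    grad_out = torch.randn(32, 128, dtype=torch.float16, device="cuda")

    # reference: fp16 gradient
    (x @ W.t() * grad_out).sum().backward()
    grad_ref = x.grad.clone()
    x.grad = None

    # int8 round trip of the weight (absmax quantization per output row)
    scale = W.abs().amax(dim=1, keepdim=True) / 127.0
    W_deq = torch.clamp((W / scale).round(), -127, 127).to(torch.int8).to(torch.float16) * scale

    (x @ W_deq.t() * grad_out).sum().backward()
    grad_q = x.grad

    atol = (0.05 * grad_ref.abs().mean()).item()
    torch.testing.assert_allclose(grad_q, grad_ref, rtol=0, atol=atol)
```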

Comment on lines 230 to 232
# Cast A to fp16
A_dtype = A.dtype
A = A.to(torch.float16)
Collaborator:

This is exactly what we talked about; this should take care of bfloat16! One thing we need to think about is some kind of warning when this is first executed, to make people aware that a cast happens and that the quantization operation is performed in fp16.

Additionally, it would be great to have bf16 tests that verify everything works correctly with those inputs. I think you just need to change one line (and check that everything else is executed correctly).
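
A minimal sketch of the requested warn-on-cast behaviour (the function name and message are illustrative, not the PR's actual code; with Python's default warning filter, an identical warning from the same location is only shown once):

```python
import warnings
import torch

def maybe_cast_to_fp16(A: torch.Tensor) -> torch.Tensor:
    # Warn when the input is silently cast, so users know the 8-bit matmul
    # and its quantization run in fp16.
    if A.dtype != torch.float16:
        warnings.warn(f"Input of dtype {A.dtype} will be cast to float16; "
                      "the 8-bit matmul is performed in fp16.")
        return A.to(torch.float16)
    return A
```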

@justheuristic (Contributor) commented on Sep 17, 2022:

  • added a warnings.warn the first time A is cast to a different dtype
  • added FP16 tests; fine-tuned the typecasts to fit almost all thresholds
    • increased atol from 0.0175 to 0.02 in one specific case for BF16 (this does not affect FP16)
    • e.g. grad_bias is now computed on the original grad_output before it is cast (and before it loses precision); see the sketch after this list
  • memory_efficient_backward will set Linear8bitLt.weight.requires_grad = False by default
  • added a test for the backward pass with memory_efficient_backward
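
As a sketch of that grad_bias ordering (illustrative names, not the PR's exact code):

```python
import torch

def bias_grad_then_cast(grad_output: torch.Tensor, bias_dtype: torch.dtype):
    # Reduce grad_output for the bias gradient while it is still in its
    # original dtype, so nothing is lost to the fp16 cast ...
    grad_bias = grad_output.sum(0, dtype=bias_dtype)
    # ... and only then cast grad_output to fp16 for the weight/input matmuls.
    grad_output_fp16 = grad_output.to(torch.float16)
    return grad_bias, grad_output_fp16
```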

output = output.to(A.dtype).add_(bias)

# 4. Mixed-precision decomposition matmul
if coo_tensorA is not None and subA is not None:
@justheuristic (Contributor) commented on Sep 17, 2022:

Curiously, when I tried to replace the next line from

output += torch.matmul(subA, state.subB)

to

output.addmm_(subA, state.subB)

the precision would drop and the tests would fail.
I have no idea why; the dtypes of output, subA and subB are always equal (tested).
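
For what it's worth, here is a standalone way to compare the two variants against an fp32 reference (random data, illustrative shapes, assumes a CUDA device; this is not the PR's test):

```python
import torch

torch.manual_seed(0)
out = torch.randn(256, 1024, dtype=torch.float16, device="cuda")
subA = torch.randn(256, 32, dtype=torch.float16, device="cuda")
subB = torch.randn(32, 1024, dtype=torch.float16, device="cuda")

# fp32 reference for the accumulation
ref = out.float() + subA.float() @ subB.float()

# variant 1: separate fp16 matmul followed by in-place add
out_matmul = out.clone()
out_matmul += torch.matmul(subA, subB)

# variant 2: fused in-place addmm_
out_addmm = out.clone()
out_addmm.addmm_(subA, subB)

print("matmul+add max err:", (out_matmul.float() - ref).abs().max().item())
print("addmm_     max err:", (out_addmm.float() - ref).abs().max().item())
```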

Collaborator:

I cannot remember if I stumbled upon the same thing. I remember trying to make this matrix multiplication more efficient but failed. What is the increase that you see in errors?

It does not make much sense to me, since in cuBLAS you perform (A @ B) + D = C and the result of A @ B is in fp32, so the entire operation should be more precise. The same goes for fused multiply-add in general, which is more precise than a multiplication followed by an addition. It might be some weird tensor core issue, but it makes no sense to me.

If the error is only smaller some of the time and it has more variance, it would still be okay to have this. I believe it would be a good chunk faster.

@justheuristic (Contributor)

Ran all tests 10 times to check for stability

[screenshots: test results from the 10 runs]

@TimDettmers (Collaborator) left a comment:

Looks all good to me. Let's discuss this briefly tomorrow. I am curious if we can get the .addmm_ to work. Otherwise, just a couple of questions on the test performance. Overall great work! Thank you so much, Yozh!

SCB = (state.SCB.unsqueeze(1) / 127.0).half()
CB *= SCB
grad_A = torch.mm(grad_output, CB).view(ctx.grad_shape)
CB = state.CB.to(ctx.dtype_A, copy=True).mul_(state.SCB.unsqueeze(1).div(127.0))
Collaborator:

Not sure how PyTorch implements div, but multiplication is about 30x faster than division. Since we apply it over a matrix this might make a tiny but significant difference. So .mul(1/127.0) might be better here.
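
For example, a sketch of the suggested form (CB/SCB names follow the quoted line above; illustrative only, not the exact change):

```python
import torch

def dequantize_cb(CB: torch.Tensor, SCB: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # Scale by the reciprocal of 127 with a multiply rather than a divide,
    # since elementwise multiplication is considerably cheaper than division.
    return CB.to(dtype, copy=True).mul_(SCB.unsqueeze(1).mul(1.0 / 127.0))
```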

Contributor:

applied, thanks

(o1 * grad_proj).sum().backward()
grad_ref = grad_proj.flatten(2) @ w2.half() @ w1.half()
scale = grad_ref.abs().mean()
assert torch.allclose(b1.grad, grad_ref, rtol=0, atol=0.05 * scale)
Collaborator:

I remember I had some tests that used the relative difference normalized by the standard deviation, which is similar to this. What is the range of errors that you see? It might also be good to test for a maximum of k elements that exceed a threshold. This helps to differentiate worst-case vs. general performance.

Contributor:

Got it, I've added a separate assert for the outliers:

        torch.testing.assert_allclose(b1.grad, grad_ref, rtol=0, atol=0.05 * scale)
        idx = torch.isclose(b1.grad, grad_ref, atol=0.01 * scale, rtol=0.1)
        assert (idx == 0).sum().item() <= b1.numel() * 0.005


@TimDettmers TimDettmers marked this pull request as ready for review September 20, 2022 04:08
@TimDettmers (Collaborator)

All good! Thank you both!

@TimDettmers TimDettmers merged commit 439f2b0 into bitsandbytes-foundation:main Sep 20, 2022