[FSDP2][NF4Tensor][2/n] implement torch.chunk and other ops #150

weifengpy · 2024-04-19T20:37:32Z

why FSDP needs these ops

torch.chunk / aten.split.Tensor: dim-0 sharding of parameters via torch.chunk(tensor, world_size, dim=0)
tensor.new_zeros / aten.new_zeros.default: allocate storage for padded params
tensor[:end_idx] / aten.slice.Tensor and tensor.copy_: copy sharded params into padded params
tensor.view(-1) / aten.view.default: flatten N-D tensors into 1-D
torch.as_strided(tensor, orig_size) / aten.as_strided.default: restore 1-D tensors to N-D
tensor.pin_memory: move a CPU tensor to pinned memory for non-blocking D2H copy
tensor.cpu(): move a GPU tensor to CPU
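
The size arithmetic behind the dim-0 sharding and padding steps listed above can be sketched with plain Python lists standing in for tensors. This is only an illustration, assuming torch.chunk's documented behavior (every chunk has ceil(n / chunks) rows except possibly the last, and fewer chunks may be returned). The helper names chunk_sizes and padded_shard are hypothetical and are not part of torchao or FSDP.

```python
import math


def chunk_sizes(dim0: int, world_size: int) -> list:
    """Sizes along dim 0 that torch.chunk(t, world_size, dim=0) would produce."""
    per_rank = math.ceil(dim0 / world_size)
    sizes = []
    remaining = dim0
    while remaining > 0:
        take = min(per_rank, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


def padded_shard(rows: list, rank: int, world_size: int) -> list:
    """Copy this rank's shard into a zero-filled buffer of the padded size.

    Mirrors new_zeros (allocate) followed by padded[:end_idx].copy_(shard).
    """
    per_rank = math.ceil(len(rows) / world_size)
    shard = rows[rank * per_rank:(rank + 1) * per_rank]  # may be short or empty
    padded = [0] * per_rank        # stands in for new_zeros
    padded[:len(shard)] = shard    # stands in for slice + copy_
    return padded
```

With 5 rows on 3 ranks, the last rank holds a single real row plus one row of padding. With 6 rows on 4 ranks, only three chunks exist, so the fourth rank holds padding only.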

unit test: pytest test/dtypes/test_nf4.py

run fsdp in TorchTune

git clone https://github.com/weifengpy/torchtune.git
cd torchtune && pip install -e ".[dev]"
tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf --hf-token <HF_TOKEN>
tune run --nnodes 1 --nproc_per_node 2 lora_finetune_distributed --config recipes/configs/llama2/7B_qlora_single_device.yaml max_steps_per_epoch=1

user flow and gaps

step 1: load llama2/3 from HF checkpoints. gap: memory spikes in NF4Tensor.from_tensor (see [NF4][FSDP2] DTensor + fused adam on cpu #205)
step 2: training loop
- numerics: [NF4][FSDP2]: enable multi-gpu CI #202
- perf, and comparison with answer.ai: https://github.com/pytorch/ao/issues/203
- memory: 500 MB unexplained
- torch.compile(TransformerBlock)
- cpu offloading with DTensor + fused adam: [NF4][FSDP2] DTensor + fused adam on cpu #205
step 3: save checkpoint: verify whether DTensor(NF4Tensor).full_tensor + torch.save works for NF4Tensor
step 4: load checkpoint to resume finetuning: verify whether torch.load + DTensor(NF4Tensor).distributed works

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

msaroufim

Thank you! Made a first pass and can do a second one tomorrow morning

msaroufim · 2024-04-29T23:18:33Z

test/dtypes/test_nf4.py

+    @unittest.skipIf(not torch.cuda.is_available(), "Need CUDA available")
+    def test_to_cpu(self):
+        nf4_tensor = to_nf4(torch.randn(512 * 512, device='cuda'))
+        nf4_tensor.cpu()


So is this just testing against crashes, or do we also expect nf4_tensor.device to be cpu?

Good catch. This only tests against crashes, but I will add an assertion that nf4_tensor.device.type == 'cpu'.

msaroufim · 2024-04-29T23:23:45Z

test/dtypes/test_nf4.py

+                torch.as_strided(nf4_tensor, nf4_tensor.size(), stride, nf4_tensor.storage_offset())
+
+    @unittest.skipIf(not torch.cuda.is_available(), "Need CUDA available")
+    def test_pin_memory(self):


I think you mentioned this briefly last week, but could you remind me how you figured out that these were the functions that needed to be tested? (I'm thinking ahead to a tutorial for someone who wants to upstream some new exotic dtype and get it working with fsdp.) That's probably a good candidate for what I mean when I say we should add another smoke test so we know for sure FSDP will work.

So I ran the tests locally and they all worked, and fast! This gives me confidence that the nf4 tensor now supports many new ops, but it doesn't give me confidence that fsdp won't break in some way.

I was hoping we could have a smoke test of the sort fsdp(torch.nn.Sequential(LinearNF4(64,64))) that would ensure nothing breaks and that fsdp doesn't silently drop the dtype, since that functionality wasn't tested for fsdp 1 and we had to rely on twitter to get that signal.

Agree that we need a smoke test on fsdp(model). I'm not sure how to set up a multi-GPU test in torchao, though. Are there .ci files to change? Is there an example in torchao? I am happy to fill the actual logic into the template. As a reference, FSDP tests in pytorch are done like this: pytorch/test/distributed/_composable/fsdp/test_fully_shard_training.py

Something identical should work on the machines we have in CI; every commit already runs on 4 A10Gs (linux.g5.12xlarge). There is no existing example since this is our first distributed test.

Let's just do this first thing when we meet tomorrow.

msaroufim · 2024-04-29T23:30:00Z

torchao/dtypes/nf4tensor.py

 def noop_detach(func, *args, **kwargs):
    return args[0][0]


+@implements(


more of a n00b q to @drisspg: what's up with all the args[0]? I feel like there's some sort of contract I can't quite parse.

EDIT: It's the NF4 tensor. Could we add a comment somewhere to make this clearer?

I updated the PR with nf4tensor = args[0] at the beginning to make it clearer.
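
For readers new to this pattern, here is a minimal sketch of an op table with a registration decorator, where each handler receives the op plus a tuple of positional arguments and names args[0] up front. It assumes the handler signature (op, args, kwargs) seen in the diff; OPS_TABLE, FakeNF4, nf4_view and dispatch are hypothetical stand-ins, and plain strings replace the real aten ops.

```python
from typing import Any, Callable, Dict

OPS_TABLE: Dict[Any, Callable] = {}  # stand-in for NF4_OPS_TABLE


def implements(ops):
    """Register the decorated function as the handler for each op in `ops`."""
    def decorator(fn):
        for op in ops:
            OPS_TABLE[op] = fn
        return fn
    return decorator


class FakeNF4:
    """Hypothetical stand-in for NF4Tensor with a single inner payload."""
    def __init__(self, quantized_data):
        self.quantized_data = quantized_data


@implements(["view"])
def nf4_view(op_name, args, kwargs=None):
    nf4tensor = args[0]   # by convention the subclass instance comes first
    new_shape = args[1]   # the remaining positional args belong to the op
    return FakeNF4((nf4tensor.quantized_data, new_shape))


def dispatch(op_name, *args, **kwargs):
    """Look up the handler and pass the positional args as one tuple."""
    return OPS_TABLE[op_name](op_name, args, kwargs)
```

Naming args[0] once at the top of each handler makes the contract visible without a separate comment.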

msaroufim · 2024-04-29T23:33:37Z

torchao/dtypes/nf4tensor.py

+            self.scaler_block_size,
+            self.scaler_mean,
+            self.nf4,
+            mesh.get_group().size(),


n00b q: what is this doing?

Also, more generally, I don't follow what the 2 fsdp tests are trying to do. I think in fsdp_post_all_gather you are testing to make sure nf4 tensors are preserved and not silently cast to some other type.

> I don't follow what the 2 fsdp tests are trying to do

This is core logic in nf4tensor.py. The unit tests live in another file, test_nf4.py.

fsdp_pre_all_gather returns a tuple of two things:

tuple[0] holds quantized_scalers, quantization_factor, and quantized_data. These are the inputs to the all-gather.

tuple[1] holds SubclassTensorArgs, block_size, etc.: the metadata needed to reconstruct the NF4Tensor. mesh.get_group().size() is the group size for the all-gather (the number of GPUs). It helps restore NF4Tensor.size. E.g., with 2 GPUs, all-gathering tensor(512) returns tensor(512 x 2).
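
The role of the group size in the metadata can be sketched as follows, with Python lists standing in for the inner tensors. This is not the torchao implementation: ShardMetadata, pre_all_gather, fake_all_gather and post_all_gather are hypothetical names, and the "all-gather" is a plain concatenation.

```python
from dataclasses import dataclass


@dataclass
class ShardMetadata:
    """Hypothetical metadata carried alongside the all-gather inputs."""
    local_size: int   # numel of the local shard
    block_size: int   # stands in for block_size, scaler_block_size, etc.
    group_size: int   # number of ranks, i.e. mesh.get_group().size()


def pre_all_gather(local_shard: list, block_size: int, group_size: int):
    """Return (all-gather inputs, metadata), mirroring the two-element tuple."""
    inputs = (local_shard,)
    metadata = ShardMetadata(len(local_shard), block_size, group_size)
    return inputs, metadata


def fake_all_gather(per_rank_inputs: list) -> list:
    """Concatenate every rank's shard, as an all-gather along dim 0 would."""
    gathered = []
    for shard in per_rank_inputs:
        gathered.extend(shard)
    return gathered


def post_all_gather(gathered: list, metadata: ShardMetadata) -> dict:
    """Rebuild the unsharded description using the group size from metadata."""
    full_size = metadata.local_size * metadata.group_size
    assert len(gathered) == full_size, (
        f"expected {full_size} elements, got {len(gathered)}"
    )
    return {"size": full_size, "block_size": metadata.block_size, "data": gathered}
```

With two ranks holding 512 elements each, the reconstructed size is 512 x 2, which is why the group size has to travel with the metadata.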

msaroufim · 2024-04-29T23:42:14Z

torchao/dtypes/nf4tensor.py

+    scaler_mean = aten_op(args[0].scaler_mean, *args[1:], **kwargs)
+    nf4 = aten_op(args[0].nf4, *args[1:], **kwargs)
+    tensor_meta = SubclassTensorArgs(
+        args[0].size(),


+1, this also confused me. I think what Driss means is to just give a human-readable name to args[0] so it's easier to read the code.

awgu · 2024-04-30T13:42:10Z

torchao/dtypes/nf4tensor.py

+        aten.detach.default,
+    ]
+)
+def nf4_detach(aten_op, args, kwargs=None):


If we make the assumption that requires_grad=False and detach is a no-op, can we add an assertion that checks args[0].requires_grad?

Also, I am not sure that we need to detach all inner tensors. cc: @bdhirsh

awgu · 2024-04-30T13:44:34Z

torchao/dtypes/nf4tensor.py

+        raise NotImplementedError(f"aten.new_zeros(NF4Tensor) with new size {new_size}")
+    ratio = nf4_tensor.numel() // math.prod(new_size)
+
+    assert nf4_tensor.quantized_scalers.size(0) % ratio == 0, f"quantized_scalers.numel() must be divisible by {ratio}"


nit: These assertion messages preferably should include the values (i.e. both nf4_tensor.quantized_scalers.size(0) and ratio) so that they can be more actionable.

awgu · 2024-04-30T13:48:04Z

torchao/dtypes/nf4tensor.py

+    quantization_factor = aten_op(nf4_tensor.quantization_factor, *(args[1:]), **kwargs)
+    quantized_data = aten_op(nf4_tensor.quantized_data, *(args[1:]), **kwargs)
+    return NF4Tensor(
+        SubclassTensorArgs(


nit: I seem to see this pattern a lot where we construct SubclassTensorArgs directly from an existing nf4_tensor. Perhaps, consider making this into a helper to avoid the duplication.

Haha, not a nit at all. Added a util function to keep the code DRY: NF4Tensor(*construct_nf4_args(nf4tensor, updated_attrs))
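
A helper of this shape might look like the sketch below: it reads every constructor argument from an existing instance and lets the caller override only the attributes that changed. The field list, the ToyNF4 class and the name construct_args are assumptions for illustration; the real construct_nf4_args in torchao may differ in fields and ordering.

```python
from typing import Any, Dict, Optional, Tuple

# Attribute order expected by the toy constructor below (an assumption).
FIELD_NAMES = (
    "size",
    "block_size",
    "quantized_scalers",
    "quantization_factor",
    "quantized_data",
)


class ToyNF4:
    """Hypothetical stand-in for NF4Tensor with a handful of attributes."""
    def __init__(self, size, block_size, quantized_scalers,
                 quantization_factor, quantized_data):
        self.size = size
        self.block_size = block_size
        self.quantized_scalers = quantized_scalers
        self.quantization_factor = quantization_factor
        self.quantized_data = quantized_data


def construct_args(tensor: ToyNF4,
                   updated_attrs: Optional[Dict[str, Any]] = None) -> Tuple:
    """Build constructor args from `tensor`, overriding the given attributes."""
    updated_attrs = updated_attrs or {}
    unknown = set(updated_attrs) - set(FIELD_NAMES)
    if unknown:
        raise KeyError(f"unknown attributes: {sorted(unknown)}")
    return tuple(
        updated_attrs.get(name, getattr(tensor, name)) for name in FIELD_NAMES
    )
```

An op handler then only states what it changed, for example ToyNF4(*construct_args(t, {"quantized_data": new_data})), and everything else is carried over from the source tensor.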

awgu · 2024-04-30T13:49:43Z

torchao/dtypes/nf4tensor.py

+            assert (
+                quantized_scalers.untyped_storage().data_ptr()
+                == out.quantized_scalers.untyped_storage().data_ptr() and
+                quantization_factor.untyped_storage().data_ptr()
+                == out.quantization_factor.untyped_storage().data_ptr() and
+                quantized_data.untyped_storage().data_ptr()
+                == out.quantized_data.untyped_storage().data_ptr()
+            ), f"Expects out's data to be the all-gather output"


We may consider removing these asserts (in the future) especially if tracing through this becomes an issue. In theory, NF4Tensor should not need to make this kind of assert, but for now, it might be helpful for debugging as the FSDP extension is still in its early stages.

awgu · 2024-04-30T13:52:23Z

torchao/dtypes/nf4tensor.py

+            )
+        )
+    ) and len(args) == 2:
+        # Tensor.to(device, non_blocking)


Does this mean that if we tried to use __torch_dispatch__, we would not be able to tell that it is simply .to(device, non_blocking=True) without a dtype argument/dtype change?

What is the story for dequantization? Namely, what is the outer NF4Tensor's dtype, and what happens when you call .to(dtype) with that same dtype? (e.g. if NF4Tensor.dtype == torch.bfloat16, what if you call NF4Tensor.to(torch.bfloat16)?)

weifengpy · 2024-04-30T22:06:50Z

torchao/dtypes/nf4tensor.py


 NF4_OPS_TABLE: Dict[Any, Any] = {}

+INNER_TENSOR_NAMES_FOR_FSDP = ["quantized_scalers", "quantization_factor", "quantized_data"]


I exclude two tiny tensors: nf4 (numel=16) and scaler_mean (numel=1).
When the number of GPUs exceeds numel, we would need to implement padding for the inner tensors. It's not worth the time in my opinion.

This seems like it'd apply to more than just FSDP. Is that correct?

> This seems like it'd apply to more than just FSDP. Is that correct?

It applies to the general distributed case, where we shard a single tensor across N GPUs. I can change the name to INNER_TENSOR_NAMES_FOR_SHARDING if that's clearer.

cpuhrsch · 2024-05-01T16:17:16Z

torchao/dtypes/nf4tensor.py

+    assert nf4tensor.quantized_scalers.size(0) % ratio == 0, f"quantized_scalers.numel() must be divisible by {ratio}"
+    quantized_scalers = aten_op(nf4tensor.quantized_scalers, [nf4tensor.quantized_scalers.size(0) // ratio], **kwargs)
+
+    assert nf4tensor.quantization_factor.size(0) % ratio == 0, f"quantization_factor.size(0) must be divisible by {ratio}"


Maybe these asserts could be unified?

Good suggestion. I removed the duplicate asserts with a for loop over the inner tensors.
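
A unified check along these lines could look like the following sketch, which also puts the offending values in the message, as suggested earlier in the review. Lists stand in for the inner tensors, len() stands in for tensor.size(0), and the function name check_divisible is hypothetical.

```python
from types import SimpleNamespace

INNER_TENSOR_NAMES = ["quantized_scalers", "quantization_factor", "quantized_data"]


def check_divisible(nf4tensor, ratio: int) -> dict:
    """Return each inner tensor's new dim-0 size, asserting divisibility by ratio."""
    new_sizes = {}
    for name in INNER_TENSOR_NAMES:
        dim0 = len(getattr(nf4tensor, name))  # stands in for tensor.size(0)
        assert dim0 % ratio == 0, (
            f"{name}.size(0) must be divisible by ratio, "
            f"got size(0)={dim0} and ratio={ratio}"
        )
        new_sizes[name] = dim0 // ratio
    return new_sizes


# Example: a toy object with the three inner "tensors".
toy = SimpleNamespace(
    quantized_scalers=[0] * 4,
    quantization_factor=[0] * 4,
    quantized_data=[0] * 128,
)
```

Calling check_divisible(toy, 2) returns the halved sizes, while a ratio of 3 fails on the first inner tensor with both numbers in the error text.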

cpuhrsch · 2024-05-01T18:16:10Z

torchao/dtypes/nf4tensor.py


 NF4_OPS_TABLE: Dict[Any, Any] = {}

+INNER_TENSOR_NAMES_FOR_SHARDING = ["quantized_scalers", "quantization_factor", "quantized_data"]


So this is something FSDP2 requires any Tensor subclass to have defined?

For any Tensor subclass, we prefer reusing __tensor_flatten__ to look up inner tensors. For NF4, we define INNER_TENSOR_NAMES_FOR_SHARDING as a subset of __tensor_flatten__ because scaler_mean and nf4 are too tiny to shard.

Hm, isn't that something you could filter with a numel-based heuristic within FSDP itself, instead of requiring some tensor subclasses to communicate it?

I think the inner tensors that are sharded need to match the torch.chunk implementation in the subclass, so FSDP cannot necessarily determine the tensors to shard by itself. (E.g., if FSDP filtered by numel but the subclass implemented torch.chunk to still shard some tensor smaller than the numel threshold, then there would be a correctness issue.)

Changed to a private constant with a leading underscore, _INNER_TENSOR_NAMES_FOR_SHARDING, after discussion with @cpuhrsch.

msaroufim

Thank you for the heroic work. Let's open up an issue with known gaps

weifengpy · 2024-05-01T22:34:17Z

> Thank you for the heroic work. Let's open up an issue with known gaps

yes, will open issue for the renaming work

weifengpy · 2024-05-02T02:32:29Z

> Thank you for the heroic work. Let's open up an issue with known gaps

opened issues and linked them here

weifengpy and others added 26 commits April 3, 2024 18:18

proof of concept for FSDP2 + NF4Tensor (0a13e6a)
Merge branch 'main' into main (9a56eaa)
fsdp extention for tensor subclass (8180540)
support fp32 (95b03e1)
Merge branch 'pytorch-labs:main' into main (3ac9d81)
UNIT TEST FOR STATE DICT (38461b3)
implement to (bc7a764)
remove torch.override from torch function (8b1d037)
use dtype in compile unit test (7ff6855)
add dtype in all unit test (d9bcf71)
keep original dtype (923bef2)
fix linter (e15d244)
use torch testing @parametrize (d4beb8f)
remove unused import (f41cb3d)
Merge branch 'pytorch-labs:main' into main (952fbdd)
sm8 for fp16 (950d9fd)
remove sm check for fp16 (d4eae0b)
skip 2.2.2 and below for tracing tensor subclass (9444f2c)
Merge branch 'pytorch-labs:main' into main (b2c3c02)
include kwargs (9be2de3)
raise unimplemented (2981393)
Merge branch 'main' into main (3ced998)
Merge branch 'pytorch-labs:main' into main (3f1e19a)
fsdp2 ops (761416a)
better diff layout (c656f1e)
set pg size in metadata (c56d7e2)

facebook-github-bot added the CLA Signed label Apr 19, 2024

weifengpy marked this pull request as draft April 19, 2024 20:38

weifengpy added 2 commits April 19, 2024 13:41

remove irrelevant changes (d656b93)
add unit test (5c4fe2b)

msaroufim reviewed Apr 29, 2024

weifengpy added 2 commits April 29, 2024 19:27

assert cpu device (699079d)
name args[0] as nf4tensor (c8b047c)

awgu reviewed Apr 30, 2024

utils for apply to inner tensors and constructor (925602c)

weifengpy commented Apr 30, 2024

weifengpy added 2 commits April 30, 2024 15:50

use original copy_ (e36ab6c)
decorator for args check (a007027)

weifengpy requested review from msaroufim and cpuhrsch May 1, 2024 01:39

weifengpy mentioned this pull request May 1, 2024

enable QLoRA + FSDP2 pytorch/torchtune#909

Merged

Merge branch 'main' into main (c352552)

cpuhrsch reviewed May 1, 2024

weifengpy and others added 2 commits May 1, 2024 11:08

INNER_TENSOR_NAMES_FOR_SHARDING and unify assert in split and new_zeros (c83fdad)
Merge branch 'pytorch:main' into main (574fecd)

weifengpy requested a review from cpuhrsch May 1, 2024 18:14

cpuhrsch reviewed May 1, 2024

weifengpy and others added 2 commits May 1, 2024 14:20

indicate private constant with _ (f27760b)
Merge branch 'main' into fsdp2ops (b4f51b9)

weifengpy requested a review from cpuhrsch May 1, 2024 22:06

msaroufim approved these changes May 1, 2024

cpuhrsch merged commit ac53d7f into pytorch:main May 1, 2024
13 checks passed

msaroufim mentioned this pull request May 1, 2024

do not land: Create test_distributed #190

Closed

msaroufim mentioned this pull request May 2, 2024

[Tracker] WIP Features for torchao v0.2 #132

Closed

22 tasks

msaroufim mentioned this pull request May 30, 2024

[Tracker] WIP features for torchao 0.3 #252

Closed

19 tasks

dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024

[FSDP2][NF4Tensor][2/n] implement torch.chunk and other ops (pytorch#150) (a049baf)
