
add FSDP QLoRA test and revert failing PR #403

Merged: weifengpy merged 3 commits into pytorch:main on Jun 21, 2024

Conversation

weifengpy (Contributor) commented on Jun 19, 2024:

Fixes the error when running torchtune QLoRA + FSDP2 (#380):
TypeError: nf4_detach() missing 1 required positional argument: 'args'

torchtune repro commands:

tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf --hf-token <HF_TOKEN>
tune run --nnodes 1 --nproc_per_node 4 lora_finetune_fsdp2 --config llama2/7B_qlora enable_activation_checkpointing=False

This PR:
  • reverts the NF4 changes from "Factor out dispatch and layout registration table" (#360)
  • adds an e2e FSDP2 + QLoRA multi-GPU test: pytest -s test/dtypes/test_nf4.py -k test_qlora
  • adds an NF4.clone test: pytest -s test/dtypes/test_nf4.py -k test_tensor_copy. torchtune implemented NF4.clone; this PR upstreams it to torchao, since the unit test's copy.deepcopy(model) needs it (see the sketch after this list)
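As a rough illustration of why copy.deepcopy needs NF4.clone (a sketch, not the PR's actual test code): Tensor.__deepcopy__ copies a parameter by calling clone(), which routes through the NF4 subclass's op table, so the deepcopy fails unless an aten.clone handler is registered. to_nf4 is torchao's NF4 constructor; everything else below is illustrative.

```python
# Sketch: copy.deepcopy on an NF4 parameter exercises the clone handler.
import copy
import torch
from torchao.dtypes.nf4tensor import to_nf4

param = torch.nn.Parameter(to_nf4(torch.randn(64, 64)), requires_grad=False)
# Tensor.__deepcopy__ calls clone() under the hood; without the NF4
# aten.clone implementation upstreamed here, this raises instead of copying.
copied = copy.deepcopy(param)
assert type(copied.data) is type(param.data)
```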


pytorch-bot commented on Jun 19, 2024:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/403

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 unrelated failure)

As of commit a9f6cca with merge base 6b0ca2d:

BROKEN TRUNK: the following job failed but was already failing on the merge base.

👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label on Jun 19, 2024.
weifengpy marked this pull request as draft on June 19, 2024.
msaroufim requested review from jerryzh168 and drisspg on June 19, 2024.
weifengpy marked this pull request as ready for review on June 19, 2024.
@@ -11,10 +11,6 @@
from torch import Tensor
from torch.distributed.device_mesh import DeviceMesh
from torch._prims_common import make_contiguous_strides_for
from torchao.dtypes.utils import (
weifengpy (author) commented on this diff:
#360 consolidated _implements and _ATEN_OP_OR_TORCH_FN_TABLE, but it breaks torchtune; revert for now to unblock torchtune quickly.

A contributor replied:

Do you know how exactly this breaks torchtune? Is it a versioning issue between saved models and this new model?

weifengpy (author) commented on Jun 20, 2024:

The error is TypeError: nf4_detach() missing 1 required positional argument: 'args', so something is incompatible around _ATEN_OP_OR_TORCH_FN_TABLE[func](*args, **kwargs).

The error shows up the first time people start training in torchtune.
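To make the incompatibility concrete, here is a toy reproduction of the failure mode (a hedged sketch, not torchao's actual code): a handler written for the old convention, which receives the op plus a packed args tuple, gets invoked through a consolidated table that unpacks the args.

```python
# Toy repro of the reported TypeError; names mirror torchao's but the
# code is illustrative, not the real implementation.
_ATEN_OP_OR_TORCH_FN_TABLE = {}

def implements(ops):
    def register(handler):
        for op in ops:
            _ATEN_OP_OR_TORCH_FN_TABLE[op] = handler
        return handler
    return register

@implements(["aten.detach.default"])
def nf4_detach(aten_op, args, kwargs=None):
    # Old convention: op first, then the *packed* positional args tuple.
    return args[0]

func, args, kwargs = "aten.detach.default", ("nf4_tensor",), {}
_ATEN_OP_OR_TORCH_FN_TABLE[func](func, args, kwargs)  # old call style: fine
_ATEN_OP_OR_TORCH_FN_TABLE[func](*args, **kwargs)
# TypeError: nf4_detach() missing 1 required positional argument: 'args'
```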

A contributor commented:

@jerryzh168 any thoughts on why this is happening? Otherwise, are you okay with undoing your changes?

A contributor commented:

@drisspg to add a bit more to what @weifengpy already said, the full stack trace is here.

A contributor commented:

[Screenshot: stack trace, 2024-06-21 11:22 AM]

I think there is something weird going on with the __torch_function__ dispatch, where the args end up bound to the handler's aten-op parameter. If someone wants to track down how this should work and why this happens, I am all for it; otherwise I will approve to unblock.
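A hedged sketch of that hypothesis (illustrative, not torchao's real dispatch): __torch_function__ hands the subclass a packed args tuple, so if the lookup layer unpacks it before calling an old-convention handler, every parameter shifts left by one; the first tensor binds to aten_op and nothing is left for args. The table and handler below come from the toy repro above, and the NF4Like subclass is hypothetical.

```python
# Sketch of the suspected __torch_function__ path.
import torch

class NF4Like(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        handler = _ATEN_OP_OR_TORCH_FN_TABLE.get(func)
        if handler is not None:
            # Old convention: pass the packed tuple through unchanged.
            return handler(func, args, kwargs)
            # The consolidated convention, handler(*args, **kwargs), is what
            # mis-binds old-style handlers such as nf4_detach above.
        return super().__torch_function__(func, types, args, kwargs)
```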

def test_qlora_fsdp2(self):
    from torch.distributed._composable.fsdp import CPUOffloadPolicy, OffloadPolicy

    self.run_subtests(
weifengpy (author) commented:

This e2e multi-GPU FSDP + QLoRA test should catch regressions in the future.
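For context, a rough shape of such a subtest (a sketch under assumptions, not the test added in this PR): linear_nf4 and to_nf4 are torchao APIs and fully_shard plus the offload policies come from FSDP2, but ToyQLoRALinear is made up for illustration, and the script assumes a multi-GPU host launched via torchrun.

```python
# Sketch: shard a module holding a frozen NF4 base weight plus trainable
# LoRA adapters with FSDP2, under both offload policies.
import copy
import torch
import torch.nn as nn
from torch.distributed._composable.fsdp import (
    CPUOffloadPolicy,
    OffloadPolicy,
    fully_shard,
)
from torchao.dtypes.nf4tensor import linear_nf4, to_nf4

class ToyQLoRALinear(nn.Module):
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        # Frozen NF4-quantized base weight; only LoRA adapters get grads.
        self.weight = nn.Parameter(
            to_nf4(torch.randn(dim, dim)), requires_grad=False
        )
        self.lora_a = nn.Linear(dim, rank, bias=False)
        self.lora_b = nn.Linear(rank, dim, bias=False)

    def forward(self, x):
        return linear_nf4(x, self.weight) + self.lora_b(self.lora_a(x))

def run_qlora_subtest(offload_policy):
    model = ToyQLoRALinear(64).cuda()
    copy.deepcopy(model)  # exercises NF4.clone (cf. test_tensor_copy)
    fully_shard(model, offload_policy=offload_policy)
    out = model(torch.randn(4, 64, device="cuda"))
    out.sum().backward()  # only the LoRA adapters receive gradients

if __name__ == "__main__":
    torch.distributed.init_process_group(backend="nccl")
    torch.cuda.set_device(
        torch.distributed.get_rank() % torch.cuda.device_count()
    )
    for policy in (OffloadPolicy(), CPUOffloadPolicy()):
        run_qlora_subtest(policy)
    torch.distributed.destroy_process_group()
```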

weifengpy merged commit 2eb08be into pytorch:main on Jun 21, 2024; 12 of 13 checks passed.
dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request on Jul 31, 2024:

* add FSDP QLoRA test and revert failing PR
* check pytorch version and cuda for ci
* revert linter
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request on Dec 9, 2024:

* Add description of commandline quantization vs quantization json recipe