
Clean up FP6-LLM #304

Merged: 23 commits merged into pytorch:main on Jun 9, 2024

Conversation

gau-nernst (Collaborator)

  • Remove the original FP6 quantization code (qtorch and C++ bit-packing)
  • Replace the FP32<->FP6 dtype conversion with @vkuzo's implementation for MX dtypes (a usage sketch follows this list)
    • I also migrated some of my FP32->FP6 rounding test cases to the MX custom cast tests.
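
For context, a minimal sketch of how the new cast path is invoked. The import path and function name are taken from the benchmark script later in this thread; the shape and the assumption that the result is an unpacked integer tensor (one FP6 E3M2 bit pattern per element) are illustrative only.

```python
import torch

from torchao.prototype.mx_formats.custom_cast import f32_to_f6_e3m2_unpacked

fp32_weight = torch.randn(8192, 8192)                # FP32 source tensor
fp6_unpacked = f32_to_f6_e3m2_unpacked(fp32_weight)  # one FP6 E3M2 value per element, not bit-packed
print(fp6_unpacked.shape, fp6_unpacked.dtype)        # same shape; integer storage dtype is assumed here
```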

pytorch-bot bot commented Jun 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/304

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 6f8e7e9 with merge base 000a0fd:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 3, 2024
vkuzo (Contributor) commented Jun 4, 2024

> Replace FP32<->FP6 dtype conversion with @vkuzo's implementation for MX dtypes

I have two questions:

  1. Just curious, have you benchmarked the performance of this change? I have not optimized the MX dtypes for performance yet.
  2. I think it's not ideal to depend on code in the prototype folder. If other people need custom_cast.py, it might be good to move it to a common place outside of the prototype folder.

gau-nernst (Collaborator, Author) commented Jun 4, 2024

@vkuzo

  1. I did benchmark it. IIRC, your implementation is faster than mine on both CPU and GPU (with torch.compile). I will benchmark it again and update the results in this comment. (I also did a correctness comparison: your implementation and mine are bit-identical for all FP16 bit patterns - I can't test all FP32 bit patterns because it would take too long.)
  2. Yes, I want to discuss this with you as well. It would be good to move it to a separate file for low-bit-width floating-point conversion. The FP6-LLM author added support for FP5 E2M2, so I want to support FP5 as well (MX doesn't have FP5).

A question for @msaroufim: is there a guideline for when we should or should not decorate a function with @torch.compile? These functions rely on torch.compile to be fast, but there are cold-start (the first call is slow) and dynamic-shape (can we avoid re-compiling on different shapes, in a guaranteed way?) problems. I haven't looked into these problems much yet, so maybe they are not that significant. I also saw @cpuhrsch's comment on another PR that @torch.compile won't work on Windows - perhaps we need a wrapper around the torch.compile decorator (a sketch of one possible wrapper is below)? But Windows is not officially supported for now, I suppose.
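
A minimal sketch of what such a wrapper could look like (my illustration only; torchao does not necessarily ship a helper with this name or behavior): apply torch.compile where it is expected to work, and fall back to the eager function otherwise.

```python
import platform
from functools import wraps

import torch


def maybe_compile(fn=None, **compile_kwargs):
    """Hypothetical decorator: torch.compile where supported, eager fallback elsewhere."""
    def decorator(f):
        # Triton codegen is unavailable on Windows, so skip compilation there.
        if platform.system() == "Windows":
            return f
        compiled = torch.compile(f, **compile_kwargs)

        @wraps(f)
        def wrapper(*args, **kwargs):
            return compiled(*args, **kwargs)

        return wrapper

    return decorator(fn) if fn is not None else decorator


@maybe_compile
def unpacked_cast_example(x: torch.Tensor) -> torch.Tensor:
    # Placeholder body; the real casts live in torchao.prototype.mx_formats.custom_cast.
    return x
```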

Update: FP32->FP6_E3M2 conversion of an (8192, 8192) matrix (main branch), benchmarked with torch.utils.benchmark.Timer (so the CPU runs use only 1 thread). CPU: Ryzen 5600; GPU: 4070 Ti SUPER.

| device | mode | op | time (ms) |
|--------|---------|-----------------------------|-----------|
| CPU | eager | _to_float6_e3m2_pt | 1702.95 |
| CPU | eager | f32_to_f6_e3m2_unpacked | 1604.09 |
| CPU | compile | _to_float6_e3m2_pt | 445.011 |
| CPU | compile | f32_to_f6_e3m2_unpacked | 214.897 |
| CPU | C++ | to_float6_e3m2_unpacked_cpu | 360.433 |
| CUDA | eager | _to_float6_e3m2_pt | 13.4336 |
| CUDA | eager | f32_to_f6_e3m2_unpacked | 14.9207 |
| CUDA | compile | _to_float6_e3m2_pt | 0.578769 |
| CUDA | compile | f32_to_f6_e3m2_unpacked | 0.577399 |

CUDA is memory-bound so the implementation does not matter much (as long as it is correct). For CPU, your implementation is faster, especially with torch.compile (and faster than my C++ implementation). Though I found that CPU benchmark results tend to vary greatly across CPUs...
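
As a rough sanity check on the memory-bound claim (my back-of-the-envelope numbers, assuming the unpacked output is stored one byte per element and a peak memory bandwidth of roughly 672 GB/s for the 4070 Ti SUPER): the kernel reads 8192 × 8192 × 4 B ≈ 268 MB of FP32 and writes ≈ 67 MB of output, about 336 MB of traffic in total, which takes ≈ 0.5 ms at peak bandwidth. That lines up with the ≈ 0.58 ms measured for both compiled CUDA variants above. The benchmark script used is below.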

```python
from functools import partial

import torch
import pandas as pd
from torch.utils.benchmark import Timer
from torchao.prototype.mx_formats.custom_cast import f32_to_f6_e3m2_unpacked
from torchao.dtypes.float6_e3m2 import _to_float6_e3m2_pt


def benchmark(f, *args):
    # Timer handles warmup and CUDA synchronization; report the median time in ms.
    measurement = Timer(
        stmt="f(*args)",
        globals={"f": f, "args": args},
    ).blocked_autorange()
    return measurement.median * 1000


if __name__ == "__main__":
    M = 8192
    N = 8192
    fp32_weight = torch.randn(M, N)
    fp32_weight_cuda = fp32_weight.cuda()

    # The two FP32 -> FP6 E3M2 implementations being compared (unpacked output).
    functions = [
        ("_to_float6_e3m2_pt", partial(_to_float6_e3m2_pt, no_bit_packing=True)),
        ("f32_to_f6_e3m2_unpacked", f32_to_f6_e3m2_unpacked),
    ]

    results = []
    for name, f in functions:
        # Eager mode, CPU and CUDA.
        results.append(["CPU", "eager", name, benchmark(f, fp32_weight)])
        results.append(["CUDA", "eager", name, benchmark(f, fp32_weight_cuda)])

        # torch.compile'd versions of the same functions.
        results.append(["CPU", "compile", name, benchmark(torch.compile(f), fp32_weight)])
        results.append(["CUDA", "compile", name, benchmark(torch.compile(f), fp32_weight_cuda)])

    df = pd.DataFrame(results, columns=["device", "mode", "op", "time (ms)"])
    df = df.sort_values(["device", "mode"], ascending=[True, False])
    print(df.to_markdown(index=False))
```

msaroufim (Member) commented

So for Windows, the main issue is that torch.compile() code-generates Triton kernels, and Triton hasn't prioritized Windows support. For the Inductor CPU backend this should be less of an issue; I suspect there might be an overly aggressive assert somewhere, though.

I would say that overall everything should be compilable. The cold-start problem is indeed annoying and is actively being worked on; some broader plans have been shared here: https://dev-discuss.pytorch.org/t/how-to-bring-compile-time-down-to-zero-our-plans-and-direction-may-14th-edition/2089

Regarding dynamic shapes, the way I iterate through things is to first eliminate graph breaks and then recompilations. This has been my go-to guide: https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_troubleshooting.rst
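
On the "avoid re-compiles on different shapes" question above, a minimal sketch of the generic torch.compile knobs (nothing torchao-specific; whether the FP6 cast actually avoids recompiles this way would need to be verified):

```python
import torch
import torch._dynamo
from torchao.prototype.mx_formats.custom_cast import f32_to_f6_e3m2_unpacked

# Option 1: ask the compiler to trace with dynamic shapes from the start.
cast_dynamic = torch.compile(f32_to_f6_e3m2_unpacked, dynamic=True)

# Option 2: keep the default compile, but mark the dimensions expected to vary
# so the first trace already treats them symbolically.
cast_default = torch.compile(f32_to_f6_e3m2_unpacked)
x = torch.randn(4096, 8192)
torch._dynamo.mark_dynamic(x, 0)  # this dim may change between calls
torch._dynamo.mark_dynamic(x, 1)

y = cast_default(x)                        # first call triggers compilation
y = cast_default(torch.randn(1024, 2048))  # new shape; ideally served by the same graph
```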

Also, just FYI: we removed the requirement to have branches up to date before merging. There was a breaking change in PyTorch that was just reverted, so please rebase your changes to get rid of the CI flakes.

@gau-nernst gau-nernst marked this pull request as ready for review June 9, 2024 16:19
@gau-nernst gau-nernst requested a review from msaroufim June 9, 2024 16:19
@gau-nernst gau-nernst requested a review from vkuzo June 9, 2024 16:19
msaroufim (Member) left a comment

that's a lot of deletions 🗡️

@msaroufim msaroufim merged commit cd8f647 into pytorch:main Jun 9, 2024
13 checks passed
msaroufim added a commit that referenced this pull request Jun 9, 2024
dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024
* override load from state dict

* fix prefix

* migrate to mx primitive

* remove unneeded code

* comment out test

* remove

* add rounding test for f6_e3m2

* update tests

* remove openmp flag

* update benchmark script

* test negative number

* remove qtorch dep

* fix type casting

* add view

* fix strange pytest behavior

* only skip tests requiring PyTorch 2.4

* remove weight loading magic
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024
* eval and GPTQ work

Summary: fleshing out the eval code so it works reliably, adding CI, and
adding GPTQ. Fixed the defaults for eval/GPTQ so they work meaningfully
without being specified. Note: we need a better way to save/load GPTQ
models since they take so long to quantize; I tried using .so files but
that doesn't seem to work reliably. Also added eval and GPTQ to CI.

Test Plan:

```bash
python eval.py --checkpoint-path checkpoints/$MODEL_REPO/model.pth \
  --device cuda --dtype bfloat16

python eval.py --checkpoint-path checkpoints/$MODEL_REPO/model.pth \
    --dtype bfloat16 --device cuda \
    --quant '{"linear:int4" : {"groupsize" : 32} }' \
    --compile

python eval.py --checkpoint-path checkpoints/$MODEL_REPO/model.pth \
    --dtype bfloat16 --device cuda \
    --quant '{"linear:int4" : {"groupsize" : 32} }'

python eval.py --checkpoint-path checkpoints/$MODEL_REPO/model.pth \
    --dtype bfloat16 --device cuda \
    --quant '{"linear:int4-gptq" : {"groupsize" : 32} }'
```

...running...


* fix language in help doc


* declare scales_and_zeros

---------

Co-authored-by: HDCharles <[email protected]>