Add FP16Act-FP6Weight Linear #223
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/223
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit a8b4dd3 with merge base ad12663. The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hey, I was looking at the fp6 code as well. One thing I was going to bring up: the code has a few utility/kernel files which could be reusable for implementing the next quant type.
I'd opt for generalizing things in a future PR, but I'll let @gau-nernst decide what makes sense for them. @Iron-Bound, which future work were you hoping to build on top of this?
@msaroufim I could hack on CFloat8_1_4_3 and CFloat8_1_5_2 if people think it's valuable?
I haven't followed our float8 work closely, but have you gotten the chance to take a look at https://github.com/pytorch-labs/float8_experimental? Granted, I would like an API that looks like
I will leave it for a future PR to refactor. I don't understand much of the parts involved in the kernel, so I won't be touching them and will leave them as is. Regarding float dtype: the actual FP6 used in FP6_LLM is E3M2, without nan/inf. Two pointers:
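As a quick illustration of what E3M2 without nan/inf means numerically, here is a minimal pure-PyTorch fake-quantization sketch. It assumes exponent bias 3 (so the largest representable magnitude is 28); the kernel itself packs real 6-bit values, so this is not its code, just the rounding behavior:

```python
import torch

def quantize_to_e3m2(x: torch.Tensor) -> torch.Tensor:
    """Round to the nearest FP6 E3M2 value (1 sign + 3 exponent + 2 mantissa bits,
    no inf/nan). Illustrative fake-quantization only; the CUDA kernel packs real
    6-bit values instead of returning floats."""
    x = x.float()
    # With bias 3 the largest normal is 2^4 * 1.75 = 28; clamp instead of overflowing to inf.
    x = x.clamp(-28.0, 28.0)
    # Floor exponent of each element; values below 2^-2 fall into the subnormal
    # range, which has a fixed spacing of 2^-4.
    exp = torch.floor(torch.log2(x.abs().clamp(min=2.0 ** -2)))
    step = 2.0 ** (exp - 2)  # 2 mantissa bits -> spacing of 2^(exp-2)
    return torch.round(x / step) * step
```

In practice a per-channel fp16 scale would presumably be applied first so the weight fits in the representable range before rounding.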
Also, another interesting thing to work on is to replicate
fp6_test.py
@@ -0,0 +1,98 @@
# from https://github.com/usyd-fsalab/fp6_llm/blob/main/tests/python/kernel_test.py
Move the relevant files to either the benchmark or test folder.
Ok so I think to merge this, what we can do is:
- Move the relevant benchmark and test files to either the benchmark or test folder
- In CI either do the numerics check on an op level or at a macro eval level (ideally the first for now; see the sketch after this list)
- We can worry about the subclass and torch.compile stuff in a future PR
- Make sure the acknowledgements to the original repos are crystal clear everywhere
- Make the speedup over the fp16/bf16 baselines clear in the PR description
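For the op-level option, a minimal self-contained sketch of the kind of numerics check meant above, reusing `quantize_to_e3m2` from the earlier snippet and a plain `F.linear` reference in place of the CUDA op (whose Python binding isn't spelled out in this thread):

```python
import torch
import torch.nn.functional as F

def e3m2_roundtrip(w: torch.Tensor) -> torch.Tensor:
    # Per-output-channel absmax scale so the weight fits E3M2's [-28, 28] range,
    # fake-quantize (quantize_to_e3m2 from the sketch above), then scale back.
    scale = w.abs().amax(dim=1, keepdim=True) / 28.0
    return quantize_to_e3m2(w / scale) * scale

def check_fp6_linear_numerics(m=4, k=4096, n=4096):
    torch.manual_seed(0)
    act = torch.randn(m, k)
    weight = torch.randn(n, k)

    out_ref = F.linear(act, weight)                  # full-precision baseline
    out_fp6 = F.linear(act, e3m2_roundtrip(weight))  # simulated FP6-weight path

    # FP6 keeps only ~3 significant bits, so the tolerance has to be relaxed;
    # a relative-error check on the whole output is less brittle than a tight
    # elementwise atol/rtol.
    rel_err = (out_fp6 - out_ref).norm() / out_ref.norm()
    assert rel_err < 0.1, f"relative error too large: {rel_err:.4f}"

check_fp6_linear_numerics()
```

In CI the simulated path would be replaced by the actual fp16act_fp6weight_linear op added in this PR, with the same relaxed-tolerance comparison against the dequantized reference.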
Ok, I think we're ready to merge this. The last thing is to add limitations in the README for small batch sizes (see usyd-fsalab/fp6_llm#8) and explain that this should be used to speed up autoregressive decoding. And for the next PR, let's start to do evals with an end-to-end model; I'm hoping we can leverage this PR for that (#189).
* add files from fp6_llm
* try to port weight packing first
* rename
* rename fp6 weight packing
* add fp16act_fp6weight_linear
* fix function def
* delete duplicate file
* move weight quant file
* rename
* add pytorch interface for fp6 weight dequant
* add fake_fp6 to fp6
* move weight_quant to csrc/cuda due to cuda_fp16.h dependency
* add fake_fp6_to_fp6 test
* add test for fp16act_fp6weight_linear
* add test for fp6_weight_dequant
* Fp6WeightOnlyQuantizedLinearWeight (not working yet)
* skip some tests, since the functions are not built w/o CUDA
* add the original test
* implement transpose and clone so that F.linear will work
* remove print
* remove dequantize
* add notes and some rename
* typo
* small cleanup
* improve tensor subclass and add test (which is failing for torch-compile)
* add note
* add note
* add qtorch as dev requirement
* update error message
* add __repr__ and fix transposed issue
* add fp6 perplexity test
* rename variables
* remove subclass
* add correctness test
* remove unwanted changes
* add apache 2.0 notice
* add benchmark script
* add note about FP6 kernel
* relax tolerance

---------

Co-authored-by: Mark Saroufim <[email protected]>
Closes #208
References:
TODO:
benchmarks/benchmark_fp6.py results - 4070 Ti SUPER, PyTorch 2.3, CUDA 12.1
benchmarks/benchmark_fp6.py results - 4090. Courtesy to @Iron-Bound
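For orientation, a minimal timing harness in the spirit of benchmarks/benchmark_fp6.py (shapes and names here are illustrative assumptions, not the script's contents). It only times the fp16 F.linear baseline, with a comment marking where the FP6 kernel call from this PR would slot in:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def bench_ms(fn, warmup=10, iters=100):
    # CUDA event-based timing with warmup.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

def main(k=8192, n=8192):
    for m in (1, 2, 4, 8, 16, 64, 256):  # small m ~ autoregressive decoding
        act = torch.randn(m, k, dtype=torch.half, device="cuda")
        weight = torch.randn(n, k, dtype=torch.half, device="cuda")
        fp16_ms = bench_ms(lambda: F.linear(act, weight))
        # fp6_ms = bench_ms(lambda: <FP6 kernel call from this PR>)  # placeholder
        print(f"m={m:4d}  fp16 linear: {fp16_ms:.3f} ms")

if __name__ == "__main__":
    main()
```

The small-m rows matter most here, matching the autoregressive-decoding focus and the small-batch limitation noted above.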