Autotuner for int mm Triton kernels #41

cpuhrsch · 2024-03-03T18:44:28Z

[ 8:58PM (nightly20240311py310) /scratch/cpuhrsch/dev/ao/benchmarks - git:intmmbenchmarks1]$ TORCHAO_AUTOTUNER_ENABLE=0 python intmm.py sam_shapes.csv
fn,m,k,n,fp_time,int_mm_time,ratio
<function run_int_mm_benchmark at 0x7f42252f29e0>,32768,3072,768,0.7926271820068359,3.6111566162109376,0.219493992159917
<function run_int_mm_benchmark at 0x7f42252f29e0>,32768,768,2304,0.4677939224243164,2.7787774658203124,0.16834522669710103
<function run_int_mm_benchmark at 0x7f42252f29e0>,32768,768,3072,0.6118912124633789,3.285841979980469,0.1862205231387956
<function run_int_mm_benchmark at 0x7f42252f29e0>,32768,768,768,0.17190912246704101,1.7368576049804687,0.09897709632274326
<function run_int_mm_benchmark at 0x7f42252f29e0>,39200,768,2304,0.579788818359375,3.1330712890625,0.1850544609002062
<function run_int_mm_benchmark at 0x7f42252f29e0>,39200,768,768,0.20057088851928712,1.8391961669921875,0.10905355944020902
<function run_int_scaled_mm_benchmark at 0x7f430a506b00>,32768,3072,768,0.7390310668945312,3.5631512451171874,0.20740940141322164
<function run_int_scaled_mm_benchmark at 0x7f430a506b00>,32768,768,2304,0.8094003295898438,3.315210266113281,0.24414750939426128
<function run_int_scaled_mm_benchmark at 0x7f430a506b00>,32768,768,3072,1.0603622436523437,4.030166931152344,0.2631062836271039
<function run_int_scaled_mm_benchmark at 0x7f430a506b00>,32768,768,768,0.28380159378051756,1.9123712158203126,0.14840298339189378
<function run_int_scaled_mm_benchmark at 0x7f430a506b00>,39200,768,2304,0.9600614166259765,3.763394470214844,0.2551051781109645
<function run_int_scaled_mm_benchmark at 0x7f430a506b00>,39200,768,768,0.3342233657836914,2.057533416748047,0.16243885181312626
[ 8:58PM (nightly20240311py310) /scratch/cpuhrsch/dev/ao/benchmarks - git:intmmbenchmarks1]$ TORCHAO_AUTOTUNER_ENABLE=1 python intmm.py sam_shapes.csv
fn,m,k,n,fp_time,int_mm_time,ratio
INFO:root:Trying to load configs for NVIDIA A100-SXM4-40GB from /scratch/cpuhrsch/dev/ao/torchao/kernel/configs/data_a100.pkl
INFO:root:Loading best configs from file /scratch/cpuhrsch/dev/ao/torchao/kernel/configs/data_a100.pkl
<function run_int_mm_benchmark at 0x7fbfe839a9e0>,32768,3072,768,0.7970508575439453,0.39620609283447267,2.0117077247394524
<function run_int_mm_benchmark at 0x7fbfe839a9e0>,32768,768,2304,0.5419622421264648,0.41020416259765624,1.32120122500571
<function run_int_mm_benchmark at 0x7fbfe839a9e0>,32768,768,3072,0.7071027374267578,0.5336064147949219,1.3251391246833395
<function run_int_mm_benchmark at 0x7fbfe839a9e0>,32768,768,768,0.18099199295043944,0.13851648330688476,1.3066458852369918
<function run_int_mm_benchmark at 0x7fbfe839a9e0>,39200,768,2304,0.5925785446166992,0.45351936340332033,1.306622368160523
<function run_int_mm_benchmark at 0x7fbfe839a9e0>,39200,768,768,0.2048409652709961,0.1618739128112793,1.26543531143162
<function run_int_scaled_mm_benchmark at 0x7fc0cd6bab00>,32768,3072,768,0.7551487731933594,0.36495361328125,2.069163712078135
<function run_int_scaled_mm_benchmark at 0x7fc0cd6bab00>,32768,768,2304,0.8089702606201172,0.4003123092651367,2.0208478278001594
<function run_int_scaled_mm_benchmark at 0x7fc0cd6bab00>,32768,768,3072,1.0633113861083985,0.530063362121582,2.006008077699405
<function run_int_scaled_mm_benchmark at 0x7fc0cd6bab00>,32768,768,768,0.28652544021606446,0.17480703353881835,1.6390956039674311
<function run_int_scaled_mm_benchmark at 0x7fc0cd6bab00>,39200,768,2304,0.9518489837646484,0.4832358551025391,1.9697399804132356
<function run_int_scaled_mm_benchmark at 0x7fc0cd6bab00>,39200,768,768,0.33665023803710936,0.17441791534423828,1.9301356593597783

facebook-github-bot · 2024-03-14T20:53:18Z

@cpuhrsch has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

msaroufim · 2024-03-14T20:58:35Z

About to head into a meeting but will give this a proper read, any chance we could add a test?

facebook-github-bot · 2024-03-19T01:32:00Z

@cpuhrsch has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-03-19T01:40:03Z

@cpuhrsch has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

msaroufim

some nits

msaroufim · 2024-03-19T02:14:14Z

torchao/kernel/README.md

+
+Set this to a nonzero value to enable the kernels generated by the autotuner. This is turned off by default, because it is still an experimental feature and also can take a long time to run.
+
+Searching a new config can take a long time and we'll save the updated data in `data.pkl`. If you'd like to contributed updated configs for your hardware or shapes, please open a pull request.


presumably people won't contribute the pickle file since that's not human readable? Also kind of a security issues for us to host pickle files

I added https://github.com/pytorch-labs/ao/pull/41/files#diff-4986e1d3257adc0a73b17fd6f21ef9d3b2c0eaec9027381e6f2de89e5be0e6b5 to make it easier to inspect. It stores the triton Configs, so it's a bit more difficult to make them human readable by default.

msaroufim · 2024-03-19T02:15:06Z

benchmarks/intmm.py

+
+
+
+def benchmark_in_ms(warmup, iters, f, *args, **kwargs):


put this in benchmark util instead?

Once I add the next benchmark for weight only

benchmarks/intmm_shapes.csv

msaroufim · 2024-03-19T02:16:01Z

benchmarks/sam_shapes.csv

@@ -0,0 +1,7 @@
+m,k,n


presumably you mean shapes of matmuls in sam?

Yes, SAM vit_b batch size 16 to be precise

msaroufim · 2024-03-19T02:16:23Z

test/kernel/test_autotuner.py

+            [
+                ("cuda", torch.bfloat16),
+                ("cuda", torch.bfloat16),
+                # ("cpu", torch.bfloat16),


nit: remove comments

I'll turn those into TODOs. It should also work on CPU.

msaroufim · 2024-03-19T02:17:21Z

torchao/kernel/README.md

@@ -0,0 +1,19 @@
+## Autotuner and custom Triton kernels
+
+### Use case


Is intent to fill this out later? Might be better to open an issue if you'd like to do this later

msaroufim · 2024-03-19T02:18:17Z

torchao/kernel/autotuner.py

+
+    :param fn: Function to benchmark
+    :type fn: Callable
+    :param warmup: Warmup time (in ms)


warmup is not a time it seems to be an int so not in ms

Why can't milliseconds be given in int?

I think this is just saying "run warmup until at least 25ms were spent"

msaroufim · 2024-03-19T02:18:47Z

torchao/kernel/autotuner.py

+
+BEST_CONFIGS = None
+
+AUTOTUNER_DATA_PATH = os.getenv('TORCHAO_AUTOTUNER_DATA_PATH', None)


put all global variables at the top of the file so they're easier to find

msaroufim · 2024-03-19T02:22:21Z

torchao/kernel/intmm_triton.py

+                          int8_powers_of_two,
+                          int8_powers_of_two)], [])
+
+# int8_mm_kernel_configs = [


I wanted to leave these as reference from core. I'll add a comment.

msaroufim · 2024-03-19T02:22:54Z

torchao/kernel/intmm_triton.py

+import triton.language as tl
+import itertools
+import os
+int8_powers_of_two = [32, 64, 128, 256]


do you envision people wanting to add more options here and for int8 kernel configs?

Eventually, yes. Follow up here includes making it more extensible for other kernels. Adding support for mixed precision should help that.

msaroufim

Unblocking for now

msaroufim

Unblocking for now

facebook-github-bot · 2024-03-19T20:14:45Z

@cpuhrsch has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-03-19T21:11:00Z

@cpuhrsch has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

cpuhrsch added 11 commits February 28, 2024 16:21

int_mm benchmarks

ba24af7

Baseline results

1f5a811

Running shapes over night

b8bcf27

Rerun result 1 with empty cache

44c97a6

Filter results_2

579fc5e

Triton matmul

8251780

Store best configs

434b92f

Faster autotuner

e3dcf6d

Inject intmm triton

cacef6b

Scaled int mm

0f2a706

More scaled matmul

395b93c

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 3, 2024

cpuhrsch added 13 commits March 4, 2024 16:36

evict_last

f2bf0fd

Only 1 scales and better Triton code

ea635ed

a100 specific configs based on all configs

ffd0d66

data pkl based on regular configs

75ef774

Enable all configs

404814f

Move configs to torchao

145a498

Read configs from library package

016a839

More environment variables

36bfe59

Benchmark for sam shapes

62bc110

More A100 configs

8ce5707

Revert quant primitives

2797cf3

Merge remote-tracking branch 'origin/main' into intmmbenchmarks1

73f6671

Make autotuner work with compile

20333cf

cpuhrsch marked this pull request as ready for review March 14, 2024 20:47

Merge branch 'main' of github.com:pytorch-labs/ao into intmmbenchmarks1

3f6ddb4

cpuhrsch requested a review from HDCharles March 14, 2024 20:53

cpuhrsch requested a review from msaroufim March 14, 2024 20:55

Make benchmark output a bit more clear

c846b70

cpuhrsch added 5 commits March 14, 2024 20:59

Make benchmark output a bit more clear

8f4d6cc

Merge branch 'main' of github.com:pytorch-labs/ao into intmmbenchmarks1

54976d4

Basic test harness

e590bce

Basic test harness

b376dae

scaled int mm tests

171c491

dev requirements

5437476

drisspg approved these changes Mar 19, 2024

View reviewed changes

msaroufim requested changes Mar 19, 2024

View reviewed changes

msaroufim self-requested a review March 19, 2024 20:00

msaroufim reviewed Mar 19, 2024

View reviewed changes

msaroufim approved these changes Mar 19, 2024

View reviewed changes

cpuhrsch added 2 commits March 19, 2024 20:14

Address comments

ef26555

Merge branch 'main' of github.com:pytorch-labs/ao into intmmbenchmarks1

f4e6b7f

cpuhrsch added 2 commits March 19, 2024 20:55

lintrunner.toml

bc6d23f

Much lint, so wow

9c45710

Merge branch 'main' of github.com:pytorch-labs/ao into intmmbenchmarks1

b4ddebc

cpuhrsch mentioned this pull request Mar 20, 2024

[DUPLICATE] Autotuner for int mm Triton kernels #69

Closed

Merge branch 'main' of github.com:pytorch-labs/ao into intmmbenchmarks1

771fadd

cpuhrsch merged commit efb6514 into main Mar 21, 2024
2 checks passed

dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024

Autotuner for int mm Triton kernels (pytorch#41)

530f71b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Autotuner for int mm Triton kernels #41

Autotuner for int mm Triton kernels #41

cpuhrsch commented Mar 3, 2024 •

edited

Loading

facebook-github-bot commented Mar 14, 2024

msaroufim commented Mar 14, 2024

facebook-github-bot commented Mar 19, 2024

facebook-github-bot commented Mar 19, 2024

msaroufim left a comment •

edited

Loading

msaroufim Mar 19, 2024

cpuhrsch Mar 19, 2024

msaroufim Mar 19, 2024

cpuhrsch Mar 19, 2024

msaroufim Mar 19, 2024

cpuhrsch Mar 19, 2024

msaroufim Mar 19, 2024

cpuhrsch Mar 19, 2024

msaroufim Mar 19, 2024

msaroufim Mar 19, 2024

cpuhrsch Mar 19, 2024

cpuhrsch Mar 19, 2024

msaroufim Mar 19, 2024

cpuhrsch Mar 19, 2024

msaroufim Mar 19, 2024

cpuhrsch Mar 19, 2024

msaroufim Mar 19, 2024

cpuhrsch Mar 19, 2024

msaroufim left a comment

msaroufim left a comment

facebook-github-bot commented Mar 19, 2024

facebook-github-bot commented Mar 19, 2024


		Set this to a nonzero value to enable the kernels generated by the autotuner. This is turned off by default, because it is still an experimental feature and also can take a long time to run.

		Searching a new config can take a long time and we'll save the updated data in `data.pkl`. If you'd like to contributed updated configs for your hardware or shapes, please open a pull request.

		@@ -0,0 +1,19 @@
		## Autotuner and custom Triton kernels

		### Use case


		BEST_CONFIGS = None

		AUTOTUNER_DATA_PATH = os.getenv('TORCHAO_AUTOTUNER_DATA_PATH', None)

Autotuner for int mm Triton kernels #41

Autotuner for int mm Triton kernels #41

Conversation

cpuhrsch commented Mar 3, 2024 • edited Loading

facebook-github-bot commented Mar 14, 2024

msaroufim commented Mar 14, 2024

facebook-github-bot commented Mar 19, 2024

facebook-github-bot commented Mar 19, 2024

msaroufim left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

msaroufim left a comment

Choose a reason for hiding this comment

msaroufim left a comment

Choose a reason for hiding this comment

facebook-github-bot commented Mar 19, 2024

facebook-github-bot commented Mar 19, 2024

cpuhrsch commented Mar 3, 2024 •

edited

Loading

msaroufim left a comment •

edited

Loading