Conversation

ngimel commented Nov 3, 2022

For stupid reasons, ops on int8 are 3 times slower than ops on int, and for another set of stupid reasons we are not using cudaMemset for `zero_`, so using an `int8` buffer in `do_bench` makes it slow.
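A quick way to see the effect being described -- a hedged sketch, not part of the PR itself; the 256 MB buffer size and the iteration count are arbitrary choices for illustration:

```python
import torch

# Zeroing the same number of bytes through an int8 tensor vs. an int32 tensor.
buf_int8 = torch.empty(int(256e6), dtype=torch.int8, device='cuda')       # 256 MB of int8
buf_int32 = torch.empty(int(256e6 // 4), dtype=torch.int, device='cuda')  # 256 MB of int32

for name, buf in [('int8', buf_int8), ('int32', buf_int32)]:
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(100):
        buf.zero_()
    end.record()
    torch.cuda.synchronize()
    print(f"{name}: {start.elapsed_time(end) / 100:.3f} ms per zero_()")
```

Since `do_bench` zeroes a buffer like this on every timed iteration to flush L2, a slow `zero_()` path inflates `do_bench`'s wall-clock time even though it is not included in the reported measurement.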

ngimel commented Nov 4, 2022

I don't think the perf test failures are related; the 170us that the cache-clearing kernel takes (on A100, and even longer on other devices) is still more than enough to cover any launch overheads, so the reported time shouldn't change.

ptillet commented Nov 4, 2022

Let me relaunch

ngimel commented Nov 4, 2022

Locally, some perf tests are failing for me both with and without this change.

ptillet commented Nov 4, 2022

Yeah, that's weird. Sometimes the CI does that. Let me restart the CI server -- this has always solved this sort of flaky perf test :)

ptillet commented Nov 4, 2022

Since it failed even after a CI reboot, while a re-run of the tests on master passed (https://github.com/openai/triton/actions/runs/3392284659), I am guessing that this PR, for some mysterious reason, at least modifies the baseline. It might be related to a discussion I had with @Jokeren about how flushing the L2 cache may affect the performance of matmuls. We haven't got to the bottom of it yet 😅

Jokeren commented Nov 4, 2022

Got you. The issue is that flushing the L2 cache before running a matmul can improve its performance a bit.

ngimel commented Nov 4, 2022

But this PR flushes the L2 cache in the same way it used to (the size of the cache buffer is the same); it just uses a faster kernel to do so.

ptillet commented Nov 4, 2022

Yeah, that's weird. In practice, I guess anything that modifies do_bench could potentially require us to recalibrate the baseline perf numbers in our regression tests. How important is this PR for torch-inductor? If it's not urgent, is it OK for me to keep this PR open and delay merging it until we have validated that Triton-MLIR has the same performance as the current baseline with the current do_bench -- out of an abundance of caution?

ngimel commented Nov 4, 2022

We'd like to have it in, as it noticeably reduces our startup latency, but we can delay for a couple of weeks. Alternatively, what about a fast_flush arg for do_bench?

ptillet commented Nov 4, 2022

Yep, good idea! Let's add an extra argument that preserves the old behavior by default.
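For illustration, a minimal sketch of what that could look like. Only the `fast_flush` flag, its default, and the two buffer dtypes come from this thread; the function signature, repetition counts, and the event-based timing loop are simplified assumptions, not the actual `do_bench` implementation:

```python
import torch

def do_bench(fn, warmup=5, rep=20, fast_flush=False):
    # L2-flush buffer: same 256 MB footprint either way, but an int32 buffer
    # is much faster to zero than an int8 one. fast_flush=False preserves the
    # old int8 behavior by default.
    if fast_flush:
        cache = torch.empty(int(256e6 // 4), dtype=torch.int, device='cuda')
    else:
        cache = torch.empty(int(256e6), dtype=torch.int8, device='cuda')

    for _ in range(warmup):
        fn()

    times = []
    for _ in range(rep):
        cache.zero_()  # evict L2 so every timed run starts from a cold cache
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds
    return sum(times) / len(times)
```

With the default left at `fast_flush=False`, existing callers and the perf baselines keep exactly the old flush path, while callers such as torch-inductor can opt in to the faster int32 buffer.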

ngimel commented Nov 4, 2022

Hm, the failure is really weird; the default behavior is now strictly unchanged.

ptillet commented Nov 4, 2022

I think it's fine to change the baseline for `test_performance.py::test_elementwise[1048576]` to 0.52 instead of 0.53. It's obviously an artifact of having extra control flow in the loop, but not a big deal since it's elementwise.

ngimel commented Nov 5, 2022

lol, and now the observed value is 0.54, so the tests are failing again; this is too delicate!

ngimel commented Nov 5, 2022

btw, there's no control flow in the loop; the cache is created once, before the measurement loop.

ptillet merged commit 0d7e753 into triton-lang:master on Nov 5, 2022
ptillet commented Nov 5, 2022

Thanks! And sorry for the friction :)

ngimel commented Nov 5, 2022

Thanks!

ptillet added a commit that referenced this pull request Dec 20, 2022
For stupid reasons, ops on int8 are 3 times slower than on int, and for
another set of stupid reasons we are not using cudaMemset for `zero_`,
so using `int8` buffer in `do_bench` makes it slow.

Co-authored-by: Philippe Tillet <[email protected]>
eellison commented

> For stupid reasons, ops on int8 are 3 times slower than on int

Is this because of PyTorch casting from int8 -> int32 -> int8? I would have thought `zero_` would be bandwidth-bound anyway.

int3 added a commit to int3/triton-cpu that referenced this pull request Aug 7, 2024
The parameter was introduced in
triton-lang#840, and it looks like it
exists mainly to ease migration. In general there's no reason to use
fast_flush=False, so let's remove it.
Jokeren added a commit that referenced this pull request Oct 7, 2024
The parameter was introduced in
#840, and it looks like it
exists mainly to ease migration. In general there's no reason to use
fast_flush=False, so let's remove it.

---------

Co-authored-by: Keren Zhou <[email protected]>