Conversation

ngimel commented Nov 3, 2022

For stupid reasons, ops on int8 are 3 times slower than ops on int, and for another set of stupid reasons we are not using cudaMemset for `zero_`, so using an `int8` buffer in `do_bench` makes it slow.
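A quick way to see the effect being described -- a hedged sketch, not part of the PR itself; the 256 MB buffer size and the iteration count are arbitrary choices for illustration:

```python
import torch

# Zeroing the same number of bytes through an int8 tensor vs. an int32 tensor.
buf_int8 = torch.empty(int(256e6), dtype=torch.int8, device='cuda')       # 256 MB of int8
buf_int32 = torch.empty(int(256e6 // 4), dtype=torch.int, device='cuda')  # 256 MB of int32

for name, buf in [('int8', buf_int8), ('int32', buf_int32)]:
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(100):
        buf.zero_()
    end.record()
    torch.cuda.synchronize()
    print(f"{name}: {start.elapsed_time(end) / 100:.3f} ms per zero_()")
```

Since `do_bench` zeroes a buffer like this on every timed iteration to flush L2, a slow `zero_()` path inflates `do_bench`'s wall-clock time even though it is not included in the reported measurement.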

ngimel commented Nov 4, 2022

I don't think the perf test failures are related; the 170us that the cache-clearing kernel takes (on A100, and even longer on other devices) is still more than enough to cover any launch overheads, so the reported time shouldn't change.

ptillet commented Nov 4, 2022

Let me relaunch

ngimel commented Nov 4, 2022

Locally, some perf tests are failing for me both with and without this change.

ptillet commented Nov 4, 2022

Yeah, that's weird. Sometimes the CI does that. Let me restart the CI server -- this has always solved this sort of flaky perf test :)

ptillet commented Nov 4, 2022

Since it failed even after a CI reboot, while a re-run of the tests on master passed (https://github.com/openai/triton/actions/runs/3392284659), I am guessing that this PR, for some mysterious reason, at least modifies the baseline. It might be related to a discussion I had with @Jokeren about how flushing the L2 cache may affect the performance of matmuls. We haven't got to the bottom of it yet 😅

Jokeren commented Nov 4, 2022

Got you. The issue is that flushing the L2 cache before running a matmul can improve its performance a bit.

ngimel commented Nov 4, 2022

But this PR flushes the L2 cache in the same way it used to (the size of the cache buffer is the same); it just uses a faster kernel to do so.

ptillet commented Nov 4, 2022

Yeah, that's weird. In practice, I guess anything that modifies do_bench could potentially require us to recalibrate the baseline perf numbers in our regression tests. How important is this PR for torch-inductor? If it's not urgent, is it OK for me to keep this PR open and delay merging it until we have validated that Triton-MLIR has the same performance as the current baseline with the current do_bench -- out of an abundance of caution?

ngimel commented Nov 4, 2022

We'd like to have it in, as it noticeably reduces our startup latency, but we can delay for a couple of weeks. Alternatively, what about a fast_flush arg for do_bench?

ptillet commented Nov 4, 2022

Yep, good idea! Let's add an extra argument that preserves the old behavior by default.
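For illustration, a minimal sketch of what that could look like. Only the `fast_flush` flag, its default, and the two buffer dtypes come from this thread; the function signature, repetition counts, and the event-based timing loop are simplified assumptions, not the actual `do_bench` implementation:

```python
import torch

def do_bench(fn, warmup=5, rep=20, fast_flush=False):
    # L2-flush buffer: same 256 MB footprint either way, but an int32 buffer
    # is much faster to zero than an int8 one. fast_flush=False preserves the
    # old int8 behavior by default.
    if fast_flush:
        cache = torch.empty(int(256e6 // 4), dtype=torch.int, device='cuda')
    else:
        cache = torch.empty(int(256e6), dtype=torch.int8, device='cuda')

    for _ in range(warmup):
        fn()

    times = []
    for _ in range(rep):
        cache.zero_()  # evict L2 so every timed run starts from a cold cache
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds
    return sum(times) / len(times)
```

With the default left at `fast_flush=False`, existing callers and the perf baselines keep exactly the old flush path, while callers such as torch-inductor can opt in to the faster int32 buffer.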

ngimel commented Nov 4, 2022

Hm, the failure is really weird; the default behavior is now strictly unchanged.

ptillet commented Nov 4, 2022

I think it's fine to change the baseline for `test_performance.py::test_elementwise[1048576]` to 0.52 instead of 0.53. It's obviously an artifact of having extra control flow in the loop, but not a big deal since it's elementwise.

ngimel commented Nov 5, 2022

lol, and now the observed value is 0.54, so the tests are failing again; this is too delicate!

ngimel commented Nov 5, 2022

btw, there's no control flow in the loop; the cache is created once, before the measurement loop.

ptillet merged commit 0d7e753 into triton-lang:master on Nov 5, 2022
ptillet commented Nov 5, 2022

Thanks! And sorry for the friction :)

ngimel commented Nov 5, 2022

Thanks!

ptillet added a commit that referenced this pull request Dec 20, 2022
For stupid reasons, ops on int8 are 3 times slower than on int, and for
another set of stupid reasons we are not using cudaMemset for `zero_`,
so using `int8` buffer in `do_bench` makes it slow.

Co-authored-by: Philippe Tillet <[email protected]>
eellison commented

> For stupid reasons, ops on int8 are 3 times slower than on int

Is this because of PyTorch casting from int8 -> int32 -> int8? I would have thought `zero_` would be bandwidth-bound anyway.

int3 added a commit to int3/triton-cpu that referenced this pull request Aug 7, 2024
The parameter was introduced in
triton-lang#840, and it looks like it
exists mainly to ease migration. In general there's no reason to use
fast_flush=False, so let's remove it.
Jokeren added a commit that referenced this pull request Oct 7, 2024
The parameter was introduced in
#840, and it looks like it
exists mainly to ease migration. In general there's no reason to use
fast_flush=False, so let's remove it.

---------

Co-authored-by: Keren Zhou <[email protected]>