[small] use torch.int for autotuning cache #840
Conversation
Force-pushed from b41c93d to 7ec0301
I don't think the perf test failures are related: the 170us that the cache-clearing kernel takes (on A100, even longer on other devices) is still more than enough to cover any launch overheads, so the reported time shouldn't change.
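For context, a minimal sketch (illustrative, not Triton's actual `do_bench`) of the event-based timing pattern being described, where the still-running cache-clearing kernel hides the launch overhead of the benchmarked function:

```python
import torch

def bench_once(fn, cache):
    # The zero_() that flushes L2 is enqueued right before the timed kernel.
    # While it is still executing (~170us on A100), the host asynchronously
    # records the start event and launches fn, so fn's launch overhead is
    # hidden behind the cache-clearing kernel and the events bracket only
    # device execution time.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    cache.zero_()  # L2 flush; keeps the GPU busy while fn is being launched
    start.record()
    fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)  # milliseconds
```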
Let me relaunch.
Locally, some perf tests are failing for me both with and without this change.
Yeah, that's weird. Sometimes the CI does that. Let me restart the CI server -- this has always solved this sort of flaky perf test :)
Since it failed even after a CI reboot while a re-run of the tests on the
Got you. The issue is that flushing the L2 cache before running a matmul can affect the measured performance a bit.
But this PR flushes the L2 cache in the same way it used to (the size of the cache buffer is the same); it just uses a faster kernel to do so.
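For readers without the diff at hand, here is a rough sketch (buffer size and variable names are illustrative) of the change this comment describes, keeping the flushed byte count constant while switching the element type:

```python
import torch

CACHE_BYTES = 256 * 1024 * 1024  # illustrative flush-buffer size

# Previous behavior: one int8 element per byte; zeroing this buffer is the
# slow cache-clearing kernel discussed above.
old_cache = torch.empty(CACHE_BYTES, dtype=torch.int8, device='cuda')

# This PR's approach: the same number of bytes, but 4-byte torch.int elements,
# so zero_() runs noticeably faster while the flushed footprint is unchanged.
new_cache = torch.empty(CACHE_BYTES // 4, dtype=torch.int, device='cuda')

assert old_cache.numel() * old_cache.element_size() == \
       new_cache.numel() * new_cache.element_size()
```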
Yeah, that's weird. In practice, I guess potentially anything that modifies
We'd like to have it in, as it noticeably reduces our startup latency, but we can delay for a couple of weeks. Alternatively, what about
Yep, good idea! Let's add an extra argument that preserves the old behavior by default.
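A possible shape for that compromise, sketched here with a hypothetical helper (the eventual parameter in `do_bench` was named `fast_flush`; the helper itself is illustrative):

```python
import torch

def allocate_flush_buffer(fast_flush: bool = False,
                          cache_bytes: int = 256 * 1024 * 1024):
    # Default keeps the old int8 buffer; passing fast_flush=True opts into the
    # faster 4-byte torch.int buffer of the same total size.
    if fast_flush:
        return torch.empty(cache_bytes // 4, dtype=torch.int, device='cuda')
    return torch.empty(cache_bytes, dtype=torch.int8, device='cuda')
```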
Force-pushed from a0802c2 to 2a7dc22
Hm, the failure is really weird; the default behavior is now strictly unchanged.
I think it's fine to change the baseline for `test_performance.py::test_elementwise[1048576]` to 0.52 instead of 0.53. It's obviously an artifact of having extra control flow in the loop, but not a big deal since it's elementwise.
Force-pushed from b7f7371 to d1180ea
lol, and now the observed value is 0.54, so the tests fail again; this is too delicate!
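To make "too delicate" concrete, here is a hypothetical reduction of this kind of check (the real test_performance.py logic may differ): the observed value has to sit inside a tight band around a hard-coded reference, so drifting down to 0.52 against a 0.53 baseline and drifting up to 0.54 against a 0.52 baseline both trip the assertion:

```python
def perf_matches_baseline(observed: float, baseline: float,
                          rel_tol: float = 0.01) -> bool:
    # Illustrative tolerance; the point is only that small shifts in the
    # measured value flip the test on either side of the reference.
    return abs(observed - baseline) <= rel_tol * baseline

assert perf_matches_baseline(0.52, 0.52)
assert not perf_matches_baseline(0.52, 0.53)  # the earlier failure
assert not perf_matches_baseline(0.54, 0.52)  # the failure above
```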
Force-pushed from 420c037 to 33e31d7
btw there's no control flow in the loop,
Thanks! And sorry for the friction :)
Thanks!
For stupid reasons, ops on int8 are 3 times slower than on int, and for another set of stupid reasons we are not using cudaMemset for `zero_`, so using `int8` buffer in `do_bench` makes it slow. Co-authored-by: Philippe Tillet <[email protected]>
Is this because of PyTorch casting from int8 -> int32 -> int8? I would have thought
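If someone wants to check the int8-vs-int gap locally, here is a rough micro-benchmark sketch (illustrative, not from this PR) that times `zero_()` on equally sized buffers of each dtype:

```python
import torch

def time_zero(dtype, nbytes=256 * 1024 * 1024, iters=10):
    # Time zero_() on a buffer of `nbytes` bytes with the given element type.
    elem = torch.empty(0, dtype=dtype).element_size()
    buf = torch.empty(nbytes // elem, dtype=dtype, device='cuda')
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    buf.zero_()  # warm-up
    start.record()
    for _ in range(iters):
        buf.zero_()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per zero_()

print("int8 :", time_zero(torch.int8))
print("int32:", time_zero(torch.int32))
```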
The parameter was introduced in #840, and it looks like it exists mainly to ease migration. In general there's no reason to use fast_flush=False, so let's remove it. Co-authored-by: Keren Zhou <[email protected]>
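After that removal, callers simply drop the argument; a minimal usage sketch (keyword defaults of `triton.testing.do_bench` vary slightly across versions):

```python
import torch
import triton

# Benchmark a simple elementwise op; there is no fast_flush argument anymore,
# and the fast torch.int flush buffer is the only behavior.
x = torch.randn(1 << 20, device='cuda')
ms = triton.testing.do_bench(lambda: x * 2)
print(f"elementwise multiply: {ms:.3f} ms")
```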