
Enable benchmarking dispatches with dynamic shapes #19518

Open
kuhar opened this issue Dec 18, 2024 · 1 comment
Assignees
kuhar

Labels
enhancement ➕ New feature or request, performance ⚡ Performance/optimization related work across the compiler and runtime, tuner

Comments


kuhar commented Dec 18, 2024

Request description

For dispatches with static shapes, --iree-hal-dump-executable-benchmarks-to generates MLIR files that can be compiled and benchmarked in isolation. We should enable something similar for dispatches with dynamic shapes, to be used both in manual perf work and with the tuner.

What component(s) does this issue relate to?

No response

Additional context

No response

kuhar added the enhancement ➕ New feature or request, performance ⚡ Performance/optimization related work across the compiler and runtime, and tuner labels on Dec 18, 2024
kuhar self-assigned this on Dec 18, 2024

kuhar commented Dec 18, 2024

Conversation log with @benvanik:

Ben Vanik — Today at 1:46 PM
there are a few options - I think the issue is that the current executable benchmark pass needs to die and be reworked - if rebuilt, I'd do it totally differently
in nearly all programs without data-dependent shapes (which no real models we have feature) we can just tree shake to leave the shape math
I do that on the tflite bindings to create their shape calculation function already, works well there

Jakub Kuderski — Today at 1:47 PM
E.g. get those from something like util.assume.int?

Ben Vanik — Today at 1:47 PM
that could be useful, but just from the program inputs

Jakub Kuderski — Today at 1:48 PM
ooh, I see what you mean

Ben Vanik — Today at 1:48 PM
if you pass tensor<42x1024xf32> as an input and the 42 is used to calculate all the shapes in the program, we can just leave that math
it's a different flow closer to "how fast does this run for the original problem size" than "how fast does this particular executable run for arbitrary sizes"
so may need others as well, but since most of our perf analysis starts with "here's my model and its input sizes" and then we slice out the executables to microbenchmark it'd probably make it easier
but yeah, generating spreads from the assume ops would be cool too - the pass could add those calculations for the benchmarking tool to use - my eventual goal was to have benchmarks driven by a custom module so we could have the compiler spit them out - but to start stripping everything but the dependent shape calculation math is easiest :P
the tflite WrapEntryPoints createShapeCalculationFunc just calls the original function, adds tensor.dim ops for the results, ignores the result values, and lets DCE/folding/etc strip everything - pretty simple but it works :P
similarly we can have the compiler create per-benchmark-function query functions that return a list of shapes to try, and the benchmark tool can call that to setup the benchmark parameters
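
To make the wrapper idea concrete, here is a minimal MLIR sketch in the spirit of the tflite approach described above; the entry point name, shapes, and dialect details are made up for illustration and are not taken from the actual WrapEntryPoints output:

```mlir
// Original entry point, shown as an external declaration here; in the real
// flow its body is present and gets inlined/folded away.
func.func private @main(tensor<?x1024xf32>) -> tensor<?x1024xf32>

// Hypothetical shape-calculation wrapper: call the original function, query
// the dims of the result, ignore the tensor contents, and return only the
// index values. DCE/folding can then strip everything that does not feed the
// returned dims, leaving just the shape math.
func.func @main_calculate_shapes(%arg0: tensor<?x1024xf32>) -> index {
  %c0 = arith.constant 0 : index
  %result = call @main(%arg0) : (tensor<?x1024xf32>) -> tensor<?x1024xf32>
  %d0 = tensor.dim %result, %c0 : tensor<?x1024xf32>
  return %d0 : index
}
```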

Ben Vanik — Today at 1:57 PM
as for instrumentation from running user programs we do something similar and just need a new set of trace point ops - the instrumentation pass creates a global that it accumulates information into and then the tools call the magic __query_instruments function that the pass inserts to get that information - the only instruments we have now are for the HAL but it's meant to be able to get anything we want back (it's a set of binary blobs that we determine the format of)
each instrument blob has some metadata produced by the compiler that gets embedded (e.g. iree/schemas/instruments/dispatch_def.fbs that describes each dispatch site) and then the compiler inserts code to produce the binary blob (e.g. iree/schemas/instruments/dispatch.h)
lots of fun things we can do with that :)
(it was all built to enable PGO - which capturing dispatch shapes effectively is - so it'd be neat to finally connect it all)
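
For the per-benchmark query functions mentioned above, here is a sketch of what the compiler could emit next to each benchmark entry point; the function name and the candidate sizes are placeholders, and the real interface (including how util.assume.int ranges would feed into it) is still to be designed:

```mlir
// Hypothetical query function: returns a set of candidate values for the
// dynamic dimension so the benchmark tool can iterate over them when setting
// up runs. The values here are placeholders, not derived from anything.
func.func @main_dispatch_0_query_shapes() -> tensor<3xindex> {
  %sizes = arith.constant dense<[128, 1024, 4096]> : tensor<3xindex>
  return %sizes : tensor<3xindex>
}
```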
