Skip naive conv testing to speed up #3383
Comments
Here's a snippet from the ufdb in question - I'm not 100% sure, but I think this shows that some of those ConvDirectNaive kernels take a lot of time. Attached: `HIP.3_2_0.ufdb.txt`
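(A hedged aside for anyone reproducing this: the user find-db is a plain-text file, so naive-conv entries can be inspected directly. The path below is the typical default location and may differ per install.)

```bash
# Assumed default location of MIOpen's per-user find-db; adjust the path
# if your setup stores it elsewhere.
grep -i "ConvDirectNaive" ~/.config/miopen/*.ufdb.txt
```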
Hi @RobQuistNL. An internal ticket has been created to assist with your issue. Thanks!
Hi @RobQuistNL, can you please provide more info on your hardware and software versions (ROCm version and OS version)? Thanks.
Hey @huanrwan-amd:
git clone --recursive https://github.com/ROCm/flash-attention /tmp/flash-attention
cd /tmp/flash-attention; export GPU_ARCHS="gfx90a"; pip3 install .
Hi @RobQuistNL, thanks for the info. This issue is more of a feature enhancement. I will contact the internal team first.
Hi,
When running various models with various inputs, it seems a lot of the time in the initial runs is spent benchmarking potential kernels, including the naive ones (e.g. `naive_conv_nonpacked_fwd_nchw_float_double_float`). The solution that usually comes up is not the naive one but one of the other kernels. Running with `MIOPEN_DEBUG_CONV_DIRECT=0` significantly speeds up initial runs of said models with varying resolutions.
Would it be an option to do this testing/benchmarking dynamically, without excluding the naive kernels completely? The naive kernel would be the least preferred, and if another kernel is found it would be a safe bet to say the other implementation is faster (so the testing of the naive kernel itself could be skipped altogether).
If it's not desired behaviour, maybe this could be added behind a feature flag.
I'm quite sure that people running this without knowing about it would experience major speedups in initial runs (the test case here is various VAE models being run).
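As a rough illustration of the workaround mentioned above, a minimal sketch (the script name is a placeholder; this assumes a PyTorch model run through MIOpen on ROCm):

```bash
# Sketch of the workaround described above: disable MIOpen's direct
# convolution solvers (including the naive ones) so the initial tuning
# runs skip benchmarking them. "run_vae.py" is a hypothetical script name.
MIOPEN_DEBUG_CONV_DIRECT=0 python3 run_vae.py

# Or exported for the whole session:
export MIOPEN_DEBUG_CONV_DIRECT=0
python3 run_vae.py
```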