[TOPI] Update softmax compute and CPU schedule #3680
Conversation
@kevinthesun @vinx13 can you please review and add any other reviewers you think are necessary? I am currently modifying log_softmax, and it seems worthwhile to create a new generic schedule for it, since the inputs to tvm.compute are now different for softmax and log_softmax. What do you think?
Thank you @soiferj, can you check the CI problem?
Yeah, I'm taking a look at the CI failure now. It seems to be an issue in the CUDA schedule. I will work on it. |
The CI issue is fixed. |
lgtm
@kevinthesun feel free to merge the PR given you are managing it
Thank you for contributing!
* Update Softmax compute and CPU schedule
* Add C++ compute
* Fix schedule
* Update CUDA and OpenGL schedules
* Fix log_softmax
* Fix hls and opengl schedules
* Fix CUDA schedule
Another suggestion - https://discuss.tvm.ai/t/softmax-sequence-of-relay-ops/5686
This change improves performance for softmax by simplifying the computation and writing a schedule that supports better parallelization.
Compute: Currently, `exp(input - max)` is computed twice: once in the `_compute_expsum` stage and once in the `_normalize` stage. This change adds an extra stage to compute this tensor once; it is then re-used in the `_compute_expsum` and `_normalize` stages.

Schedule: Currently, the schedule only parallelizes the `_normalize` stage of the computation. This change puts all stages of computation under a common root and parallelizes the outer dimensions (sketched below).

The following results are with a tensor of shape `(1,12,128,128)` and `axis=-1`. This simulates the softmax in BERT base. The CPU is an Intel Xeon E5-2650, and the Relay target string is `llvm -mcpu=core-avx2`.
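For illustration, here is a minimal sketch of the restructured compute, written against TVM's `te` tensor-expression API on a simplified 2-D input; the stage names and the 2-D shape are illustrative only, not the actual TOPI code. The key point is that `exp(x - max)` becomes its own stage and is consumed by both the sum and the normalization stages instead of being computed twice.

```python
import tvm
from tvm import te

def softmax_2d_sketch(x):
    """Simplified 2-D softmax over the last axis, with exp(x - max) as its own stage."""
    m, n = x.shape
    k1 = te.reduce_axis((0, n), name="k1")
    k2 = te.reduce_axis((0, n), name="k2")

    # Stage 1: row-wise max, for numerical stability.
    max_elem = te.compute((m,), lambda i: te.max(x[i, k1], axis=k1), name="max_elem")
    # Stage 2: exp(x - max), computed once and shared by the two consumers below.
    exp = te.compute((m, n), lambda i, j: te.exp(x[i, j] - max_elem[i]), name="exp")
    # Stage 3: row-wise sum over the shared exp tensor (previously exp was recomputed here).
    expsum = te.compute((m,), lambda i: te.sum(exp[i, k2], axis=k2), name="expsum")
    # Stage 4: normalize, re-using the same exp tensor (previously recomputed here as well).
    return te.compute((m, n), lambda i, j: exp[i, j] / expsum[i], name="softmax")
```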
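Similarly, a minimal sketch of the CPU schedule idea on the same simplified compute: every producer stage is computed at the outer axis of the final stage, so the whole softmax runs under one parallel loop instead of only the `_normalize` stage being parallelized. How the intermediate stages are looked up here is an assumption made for the sketch; the real TOPI schedule identifies them differently.

```python
# Build and schedule the sketch above (assumes softmax_2d_sketch from the previous block).
x = te.placeholder((128, 128), name="x")
out = softmax_2d_sketch(x)
s = te.create_schedule(out.op)

# Recover the intermediate stages from the dataflow graph.
exp, expsum = out.op.input_tensors      # out = exp[i, j] / expsum[i]
max_elem = exp.op.input_tensors[1]      # exp = te.exp(x[i, j] - max_elem[i])

# Parallelize the outer axis of the final stage and compute every earlier
# stage at that axis, so all stages share a common root loop.
outer = out.op.axis[0]
s[out].parallel(outer)
for t in (exp, expsum, max_elem):
    s[t].compute_at(s[out], outer)

# Inspect the lowered IR to confirm a single parallel outer loop.
print(tvm.lower(s, [x, out], simple_mode=True))
```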