[AutoTVM] Add batch_matmul to tunable operations #4242
Conversation
@soiferj @icemelon9, can you guys take a look at this PR?
@Hzfengsy can you also take a look?
@jwfromm Thanks for adding this. Could you add some latency measure numbers, e.g., on c5.9xlarge?
topi/python/topi/x86/batch_matmul.py
# create tuning space
cfg.define_split("tile_y", M, num_outputs=2)
cfg.define_split("tile_x", N, num_outputs=2)
cfg.define_split("tile_k", K, num_outputs=2)
if cfg.is_fallback:
    _default_batch_matmul_nopack_config(cfg, M, N, K)
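For readers unfamiliar with AutoTVM, each `cfg.define_split` call above adds a knob that enumerates ways to factor one loop extent into tiles. The following is a minimal pure-Python sketch of that idea (illustrative only, not TVM's actual implementation):

```python
def split_candidates(extent, num_outputs=2):
    """Enumerate all ways to factor `extent` into `num_outputs` tile sizes.

    Illustrative sketch: AutoTVM's cfg.define_split builds a similar space
    of (outer, inner, ...) factor tuples for a loop of the given extent,
    which the tuner then searches over.
    """
    if num_outputs == 1:
        return [(extent,)]
    candidates = []
    for factor in range(1, extent + 1):
        if extent % factor == 0:
            for rest in split_candidates(extent // factor, num_outputs - 1):
                candidates.append((factor,) + rest)
    return candidates

# For M = 12 with num_outputs=2, the space contains the divisor pairs
# (1, 12), (2, 6), (3, 4), (4, 3), (6, 2), (12, 1).
print(split_candidates(12))
```

The tuner measures a subset of these candidates and records the best; `cfg.is_fallback` is true when no tuned record exists, in which case the default config is used instead.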
This part should belong to the schedule instead of the declaration. I suggest moving them to the schedule function like other ops.
This is actually extremely similar to the topi dense declaration in x86, as it's based directly on it. I would argue that the functional similarity between dense and batch_matmul encourages us to keep the syntax as close as possible to make transferring optimizations simple. If you feel strongly that configuration declarations should be in the schedule, I'd be happy to move both the batch_matmul and the dense declarations.
I personally don't think that's the best practice, because it would be tedious when we want to reuse the compute function on a different target (e.g., CUDA). It would also be confusing when someone tries to improve the schedule in the future, so I think it would be great to change both dense and batch_matmul. On the other hand, I would also be happy to hear others' opinions.
cc @icemelon9
I think it would be better to move both to be in the schedule.
+1 for moving this into the schedule.
The latest push moves the configuration declaration in both batch_matmul and dense to the scheduling function. However, note that in the dense declaration, some splits are actually used in the computation (tile_k in dense_no_pack, for example) and so cannot be moved. This means that the declarations aren't all located in the same place. Do you guys prefer it this way, or should we leave dense alone and only change batch_matmul?
@comaniac I agree that's the best way to proceed. I reverted the changes to dense in the latest commit.
@comaniac In certain cases, the config space must be defined in the compute, because the compute needs to use it to define an intermediate compute stage.
@icemelon9, autotuning seems to yield speedups of around 20% over the base configuration on a 32-core CPU. Again, this template is very simple and can almost certainly be improved; this PR is just a starting point to make that improvement easier.
Thanks @jwfromm
* Batch matmul tuning running but with errors.
* Default x86 schedule as good as before.
* Code Cleanup
* Remove unused argument.
* improved template documentation.
* Silly lint fix
* Removed leftover comment.
* Moved cfg declaration to schedule for batch_matmul
* Moved x86 dense cfg declaration to schedule.
* lint fix
* Removed duplicate cfg declaration in dense.
* Reverted changes to dense.
The rising popularity of BERT models suggests that batch matmul is going to become an increasingly important op for TVM to have first-class support for. This PR adds batch_matmul to the list of operations that AutoTVM can tune and templatizes the x86 schedule. Although the templatization and backends covered in this PR are very limited, it serves as a starting point to build on. The x86 fallback configuration is identical to the static schedule in the master branch.
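To illustrate what a fallback configuration does (choosing fixed, reasonable tile sizes when no tuning log is available), here is a hedged pure-Python sketch; the helper name and heuristic are illustrative and are not TVM's actual fallback code:

```python
def default_tile(extent, cap=16):
    """Pick the largest divisor of `extent` not exceeding `cap`.

    Illustrative heuristic only: fallback configs in TOPI-style schedules
    typically fix tile factors to sensible static values (here, the largest
    divisor up to `cap`) so untuned workloads still get a workable schedule.
    Returns (outer, inner) such that outer * inner == extent.
    """
    best = 1
    for factor in range(1, min(extent, cap) + 1):
        if extent % factor == 0:
            best = factor
    return extent // best, best

# e.g. for a dimension of 1024 with cap=16, this yields outer=64, inner=16.
print(default_tile(1024))
```

A tuned configuration replaces such static choices with the best factors found by measurement, which is where the reported ~20% speedup over the fallback comes from.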