
[AutoTVM] Add batch_matmul to tunable operations #4242

Merged
merged 12 commits into apache:master on Nov 7, 2019

Conversation

jwfromm (Contributor) commented Nov 1, 2019

The rising popularity of BERT models suggests that batch matmul is going to become an increasingly important op for TVM to support as a first-class citizen. This PR adds batch_matmul to the list of operations that AutoTVM can tune and templatizes the x86 schedule. Although the templatization and the backends covered in this PR are very limited, it serves as a starting point to build on. The x86 fallback configuration is identical to the static schedule in the master branch.
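[Editor's note] For readers unfamiliar with AutoTVM templatization, here is a minimal standalone sketch of the idea, with hypothetical names and shapes rather than the PR's actual code: the fixed tile sizes of a static schedule become define_split knobs that the tuner can search over.

```python
import tvm
from tvm import autotvm

@autotvm.template
def batch_matmul_x86(B, M, K, N):
    # C[b, y, x] = sum_k A[b, y, k] * W[b, x, k], following topi's
    # batch_matmul convention (second operand stored transposed)
    A = tvm.placeholder((B, M, K), name="A")
    W = tvm.placeholder((B, N, K), name="W")
    k = tvm.reduce_axis((0, K), name="k")
    C = tvm.compute(
        (B, M, N),
        lambda b, y, x: tvm.sum(A[b, y, k] * W[b, x, k], axis=k),
        name="C",
    )

    s = tvm.create_schedule(C.op)
    cfg = autotvm.get_config()

    # what used to be hard-coded tile sizes are now searchable knobs
    cfg.define_split("tile_y", M, num_outputs=2)
    cfg.define_split("tile_x", N, num_outputs=2)
    cfg.define_split("tile_k", K, num_outputs=2)

    b, y, x = s[C].op.axis
    kaxis = s[C].op.reduce_axis[0]
    yo, yi = cfg["tile_y"].apply(s, C, y)
    xo, xi = cfg["tile_x"].apply(s, C, x)
    ko, ki = cfg["tile_k"].apply(s, C, kaxis)
    s[C].reorder(b, yo, xo, ko, ki, yi, xi)
    s[C].parallel(b)
    return s, [A, W, C]
```

Each define_split knob enumerates factorizations of its axis length, so the search space here is the cross product of the tile_y, tile_x, and tile_k factor pairs.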

jwfromm (Contributor, Author) commented Nov 1, 2019

@soiferj @icemelon9, can you guys take a look at this PR?

tqchen (Member) commented Nov 1, 2019

@Hzfengsy can you also take a look?

icemelon (Member) commented Nov 1, 2019

@jwfromm Thanks for adding this. Could you add some latency measurement numbers, e.g., on c5.9xlarge?

Comment on lines 59 to 64:

```python
# create tuning space
cfg.define_split("tile_y", M, num_outputs=2)
cfg.define_split("tile_x", N, num_outputs=2)
cfg.define_split("tile_k", K, num_outputs=2)
if cfg.is_fallback:
    _default_batch_matmul_nopack_config(cfg, M, N, K)
```
Contributor:

This part belongs in the schedule rather than the compute declaration. I suggest moving it to the schedule function, as is done for other ops.

jwfromm (Contributor, Author):

This is actually extremely similar to the topi dense declaration in x86, as it's based directly on it. I would argue that the functional similarity between dense and batch_matmul encourages us to keep the syntax as close as possible to make transferring optimizations simple. If you feel strongly that configuration declarations should be in the schedule, I'd be happy to move both the batch_matmul and dense declarations.

Contributor:

I personally don't think that's the best practice, because it would be tedious when we want to reuse the compute function on a different target (e.g., CUDA). It would also be confusing when someone tries to improve the schedule in the future, so I think it would be great to change both dense and batch_matmul. On the other hand, I would also be happy to hear others' opinions.

cc @icemelon9

Member:

I think it would be better to move both into the schedule.

Contributor:

+1 for moving this into the schedule.

jwfromm (Contributor, Author):

The latest push moves the configuration declaration in both batch_matmul and dense to the scheduling function. Note, however, that in the dense declaration some splits are actually used in the computation (tile_k in dense_nopack, for example) and so cannot be moved. This means the declarations aren't all located in the same place. Do you prefer it this way, or should we leave dense alone and only change batch_matmul?

comaniac (Contributor):

In this case I personally prefer to leave dense as it is and file a separate PR to refactor the dense compute.
cc @yzhliu @icemelon9 @Laurawly

jwfromm (Contributor, Author):

@comaniac I agree that's the best way to proceed. I reverted the changes to dense in the latest commit.

Member:

@comaniac In certain cases, the config space must be defined in the compute because the compute needs it to define an intermediate compute stage.
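[Editor's note] To make the constraint above concrete, here is a hedged sketch with hypothetical names, modeled on the x86 dense_nopack pattern: the split factor is read back out of the config to size an intermediate tensor, so the knob has to exist before the compute is declared and cannot simply move to the schedule function.

```python
import tvm
from tvm import autotvm

@autotvm.template
def dense_nopack_like(M, N, K):
    # Hypothetical template. The "tile_k" knob shapes the intermediate
    # tensor CC below, so it must be defined during compute declaration.
    data = tvm.placeholder((M, K), name="data")
    weight = tvm.placeholder((N, K), name="weight")

    cfg = autotvm.get_config()
    cfg.define_split("tile_k", K, num_outputs=2)
    vec = cfg["tile_k"].size[-1]  # inner split factor, consumed right here

    # intermediate stage whose last dimension is the split factor itself
    k = tvm.reduce_axis((0, K // vec), "k")
    CC = tvm.compute(
        (M, N, vec),
        lambda z, y, x: tvm.sum(
            data[z, k * vec + x] * weight[y, k * vec + x], axis=k
        ),
        name="CC",
    )
    kk = tvm.reduce_axis((0, vec), "kk")
    C = tvm.compute((M, N), lambda y, x: tvm.sum(CC[y, x, kk], axis=kk), name="C")

    s = tvm.create_schedule(C.op)
    return s, [data, weight, C]
```

A knob like tile_y in batch_matmul, by contrast, is only ever applied to schedule axes, which is why it can move to the schedule function without issue.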

yzhliu added the status: need update label Nov 5, 2019
jwfromm (Contributor, Author) commented Nov 6, 2019

@icemelon9, autotuning yields speedups of around 20% over the base configuration on a 32-core CPU. Again, this template is very simple and can almost certainly be improved; this PR is just a starting point to make that improvement easier.
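[Editor's note] For context on how a number like this is produced, here is a minimal tuning-loop sketch. It reuses the hypothetical batch_matmul_x86 template sketched earlier in this thread; the shapes and trial count are illustrative, not the benchmark settings.

```python
import tvm
from tvm import autotvm

# hypothetical BERT-like shapes, not the shapes behind the number above
task = autotvm.task.create(batch_matmul_x86, args=(12, 128, 64, 128), target="llvm")

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=3),
)

# search the split space, logging every measured config
tuner = autotvm.tuner.XGBTuner(task)
tuner.tune(
    n_trial=200,
    measure_option=measure_option,
    callbacks=[autotvm.callback.log_to_file("batch_matmul.log")],
)

# compile with the best config found during tuning
with autotvm.apply_history_best("batch_matmul.log"):
    with tvm.target.create("llvm"):
        s, args = batch_matmul_x86(12, 128, 64, 128)
        func = tvm.build(s, args)
```

The speedup quoted above is the ratio between the best tuned configuration and the fallback configuration, which matches the old static schedule.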

@icemelon icemelon merged commit 14a5a35 into apache:master Nov 7, 2019
icemelon (Member) commented Nov 7, 2019

Thanks @jwfromm

icemelon added the status: accepted label and removed the status: need review and status: need update labels Nov 7, 2019
zxy844288792 pushed a commit to neo-ai/tvm that referenced this pull request Nov 13, 2019
* Batch matmul tuning running but with errors.

* Default x86 schedule as good as before.

* Code Cleanup

* Remove unused argument.

* improved template documentation.

* Silly lint fix

* Removed leftover comment.

* Moved cfg declaration to schedule for batch_matmul

* Moved x86 dense cfg declaration to schedule.

* lint fix

* Removed duplicate cfg declaration in dense.

* Reverted changes to dense.
@jwfromm jwfromm deleted the auto_batch_matmul branch January 11, 2020 22:53