[Bug] Missing broadcast_to before batch_matmul for CuBLAS #7730
I see, this is the same issue raised by @csullivan in #6616 (review). What was the solution to this problem? @jwfromm @csullivan
Thanks @comaniac @masahi. Yes, the problem is that different targets, and target-specific TOPI implementations, can support different optimizations. In the case of the BLAS libraries supported for a target, implicit broadcast is not supported. One option that comes to mind is to add a shape legalization pass that inserts the broadcast when a target has specific attributes (e.g. libs=cublas/rocblas, etc.). However, this isn't sufficient: depending on the op strategy priorities or the applied tuning configs, it's possible that the BLAS library implementation won't be used. A better option could be to make use of #7518 and do the shape legalization after the primitive functions have been lowered to TIR and can be inspected. We could also disable implicit broadcast, but that can increase memory use (from folding the constant broadcasts), which we've seen overflow device memory for larger batch sizes.
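To make the legalization idea concrete, below is a minimal, hypothetical sketch (not an existing TVM pass) that rewrites `nn.batch_matmul` calls with mismatched batch dimensions to insert an explicit `broadcast_to`. It assumes static shapes and that `InferType` has already been run; a real pass would additionally check that the target carries `-libs=cublas`/`rocblas` before firing.

```python
from tvm import relay
from tvm.relay.expr_functor import ExprMutator


class ExplicitBatchBroadcast(ExprMutator):
    """Hypothetical rewrite: batch_matmul(x, y) with mismatched batch dims
    becomes batch_matmul(broadcast_to(x), broadcast_to(y))."""

    def visit_call(self, call):
        new_call = super().visit_call(call)
        if call.op != relay.op.get("nn.batch_matmul"):
            return new_call
        # Read shapes from the original args' checked types (requires InferType).
        xs = [int(d) for d in call.args[0].checked_type.shape]
        ys = [int(d) for d in call.args[1].checked_type.shape]
        if xs[0] == ys[0]:
            return new_call  # batch dims already match; nothing to do
        batch = max(xs[0], ys[0])
        x, y = new_call.args
        if xs[0] != batch:
            x = relay.broadcast_to(x, (batch, xs[1], xs[2]))
        if ys[0] != batch:
            y = relay.broadcast_to(y, (batch, ys[1], ys[2]))
        return relay.nn.batch_matmul(x, y)


# Hypothetical usage on a module whose types are already inferred:
# mod = relay.transform.InferType()(mod)
# mod["main"] = ExplicitBatchBroadcast().visit(mod["main"])
```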
Another direction I can think of is adding broadcast support to the CuBLAS batch_matmul implementation, so that the batch_matmul op has unified behavior in Relay and we don't need to change anything else. Do you think that's reasonable and doable?
Reasonable and doable for the short term. The downside is that it only fixes the problem for one target at a time. We would also need to add broadcast support to RocBLAS and CBLAS/MKL to avoid the issue for those targets.
Gentle ping @comaniac to see if you got a chance to follow up on this issue.
While @csullivan proposed a long-term solution to resolve the implementation differences between targets, this issue on CUDA has been worked around in the PyTorch frontend in the PR mentioned above. Specifically, if either of the two inputs to matmul is 2D, the PyTorch frontend now reshapes the 3D tensor to 2D and uses …
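For illustration, here is a hedged sketch of that kind of frontend rewrite (the shapes and variable names are made up, not taken from the PR): when one matmul input is 2D, the 3D input can be flattened to 2D and fed to `nn.dense`, so no batch broadcast is needed at all.

```python
from tvm import relay

# Illustrative shapes only; the real frontend reads them from the graph.
b, m, k, n = 8, 32, 64, 16
x = relay.var("x", shape=(b, m, k))    # 3D input
y = relay.var("y", shape=(k, n))       # 2D input

x2d = relay.reshape(x, (b * m, k))     # collapse batch into rows: (b*m, k)
w = relay.transpose(y, (1, 0))         # nn.dense expects the weight as (n, k)
out2d = relay.nn.dense(x2d, w)         # (b*m, n)
out = relay.reshape(out2d, (b, m, n))  # restore the batch dimension
```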
The PR #7348 removes `broadcast_to` before batch_matmul because batch_matmul already supports implicit broadcasting. However, the CuBLAS implementation wasn't changed accordingly, which results in the failure of the case below. I guess we need to either add the `broadcast_to` back or support implicit broadcasting in the CuBLAS implementation.
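The original snippet is not preserved in this copy of the issue; the sketch below is a hypothetical reproduction of the kind of case described, with assumed shapes and the assumed `-libs=cublas` target flag: one operand has batch dimension 1 and relies on batch_matmul's implicit broadcast, which the default CUDA schedule handles but the cuBLAS offload does not.

```python
import tvm
from tvm import relay

# Batch dims differ (1 vs. 8); Relay's batch_matmul broadcasts implicitly.
# Note that batch_matmul takes y as (batch, N, K), i.e. already transposed.
x = relay.var("x", shape=(1, 32, 64), dtype="float32")
y = relay.var("y", shape=(8, 16, 64), dtype="float32")
func = relay.Function([x, y], relay.nn.batch_matmul(x, y))
mod = tvm.IRModule.from_expr(func)

# Offloading to cuBLAS is where the mismatch surfaces, per the report above.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda -libs=cublas")
```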
cc @masahi @jwfromm