LU decomposition runtime is dominated by lu_pivots_to_permutation on GPU #5880
Comments
Thanks for the report!
Based on some comments from @hawkinsp, I think the issue is that on GPU each iteration of the loop gets turned into a separate kernel launch, which is a current limitation of XLA:GPU. Our options for improving this include:
The first fix seems like the right one for the long term, but I don't know when it'll happen. The last one is more of a mitigation, but it seems like quite a good one, especially since we can just write the …
Rewriting as a … (attachment: 1614762065673455.module_0000.before_optimizations.txt)
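For context, a minimal sketch of what the unrolling mitigation could look like, assuming LAPACK-style swaps with 0-based pivot indices; the function name is illustrative, not JAX's actual internal API:

```python
import jax.numpy as jnp

def pivots_to_permutation_unrolled(pivots, n):
    # Start from the identity permutation and apply the row swaps recorded
    # in `pivots` one at a time (entry i means "row i was swapped with row
    # pivots[i]", assumed 0-based here for simplicity).
    permutation = jnp.arange(n, dtype=jnp.int32)
    # A static Python loop: the trip count is known at trace time, so XLA
    # sees a straight-line sequence of gathers/scatters rather than a While
    # loop that launches one tiny kernel per iteration on GPU.
    for i in range(pivots.shape[0]):
        j = pivots[i]
        x, y = permutation[i], permutation[j]
        permutation = permutation.at[i].set(y).at[j].set(x)
    return permutation
```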
Wow interesting. Thanks so much for investigating the unrolling option. @hawkinsp any thoughts on what XLA:GPU could do?
I think we're going to have to give in and hand-write a kernel for this operation until such time as XLA/GPU improves. It's sort of a worst case for XLA on GPU: loopy code, very low compute intensity in the loop.
TF has implemented a kernel for this very purpose. |
The actual decomposition is handled by cusolver. However, the pivots returned by `getrf` are transformed into a permutation via `lu_pivots_to_permutation`. For an `(n, n)` matrix this transformation is implemented as a loop with `n` iterations. Even though the amount of data involved is small (an `int32` vector of length `n`), this loop performs poorly and its runtime far exceeds that of `getrf`.

Trace for a 75x75 double-precision matrix on a V100:
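To make the shape of that loop concrete, here is a hedged sketch (not JAX's actual implementation) of a pivots-to-permutation transform written with `lax.fori_loop`, assuming 0-based pivots; each of the `n` strictly sequential iterations currently ends up as its own small kernel launch on GPU:

```python
import jax
import jax.numpy as jnp

def pivots_to_permutation(pivots, n):
    # `pivots` is the int32 vector of length n returned by getrf (assumed
    # 0-based here); entry i means "row i was swapped with row pivots[i]".
    permutation = jnp.arange(n, dtype=jnp.int32)

    def body(i, permutation):
        j = pivots[i]
        x, y = permutation[i], permutation[j]
        return permutation.at[i].set(y).at[j].set(x)

    # n data-dependent, sequential iterations over a tiny vector: very low
    # compute intensity, which is the worst case described in the comments.
    return jax.lax.fori_loop(0, pivots.shape[0], body, permutation)
```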
Batching does not improve the situation much. Trace for a batch of 64 double-precision matrices with shape (75, 75):
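For illustration, batching the `pivots_to_permutation` sketch above with `jax.vmap` (shapes chosen to match the trace; nothing here is JAX's internal batching rule):

```python
import jax
import jax.numpy as jnp

# vmap maps the sketch over a leading batch axis, turning the per-row indexing
# into batched gathers/scatters, but the n sequential loop iterations remain,
# so batching alone does not remove the launch-per-iteration overhead.
batched_pivots_to_permutation = jax.vmap(pivots_to_permutation, in_axes=(0, None))

pivots = jnp.zeros((64, 75), dtype=jnp.int32)          # placeholder pivots for 64 (75, 75) matrices
perms = batched_pivots_to_permutation(pivots, 75)      # shape (64, 75)
```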