LU decomposition runtime is dominated by lu_pivots_to_permutation on GPU #5880
Comments
Thanks for the report!
Based on some comments from @hawkinsp, I think the issue is that on GPU each iteration of the loop gets turned into a separate kernel launch, which is a current limitation of XLA:GPU. Our options for improving this include:
The first fix seems like the right one for the long term, but I don't know when it'll happen. The last one is more of a mitigation, but it seems like quite a good one, especially since we can just write the …
Rewriting as a … (attachment: 1614762065673455.module_0000.before_optimizations.txt)
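For context, a minimal sketch of what the unrolling mitigation could look like, assuming LAPACK-style swaps with 0-based pivot indices; the function name is illustrative, not JAX's actual internal API:

```python
import jax.numpy as jnp

def pivots_to_permutation_unrolled(pivots, n):
    # Start from the identity permutation and apply the row swaps recorded
    # in `pivots` one at a time (entry i means "row i was swapped with row
    # pivots[i]", assumed 0-based here for simplicity).
    permutation = jnp.arange(n, dtype=jnp.int32)
    # A static Python loop: the trip count is known at trace time, so XLA
    # sees a straight-line sequence of gathers/scatters rather than a While
    # loop that launches one tiny kernel per iteration on GPU.
    for i in range(pivots.shape[0]):
        j = pivots[i]
        x, y = permutation[i], permutation[j]
        permutation = permutation.at[i].set(y).at[j].set(x)
    return permutation
```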
Wow interesting. Thanks so much for investigating the unrolling option. @hawkinsp any thoughts on what XLA:GPU could do?
I think we're going to have to give in and hand-write a kernel for this operation until such time as XLA/GPU improves. It's sort of a worst case for XLA on GPU: loopy code, very low compute intensity in the loop.
TF has implemented a kernel for this very purpose. |
The actual decomposition is handled by cusolver. However, the pivots returned by `getrf` are transformed into a permutation via `lu_pivots_to_permutation`. For an `(n, n)` matrix this transformation is implemented as a loop with `n` iterations. Even though the amount of data involved is small (an `int32` vector of length `n`), this loop performs poorly and its runtime far exceeds that of `getrf`.

Trace for a 75x75 double-precision matrix on a V100:
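To make the shape of that loop concrete, here is a hedged sketch (not JAX's actual implementation) of a pivots-to-permutation transform written with `lax.fori_loop`, assuming 0-based pivots; each of the `n` strictly sequential iterations currently ends up as its own small kernel launch on GPU:

```python
import jax
import jax.numpy as jnp

def pivots_to_permutation(pivots, n):
    # `pivots` is the int32 vector of length n returned by getrf (assumed
    # 0-based here); entry i means "row i was swapped with row pivots[i]".
    permutation = jnp.arange(n, dtype=jnp.int32)

    def body(i, permutation):
        j = pivots[i]
        x, y = permutation[i], permutation[j]
        return permutation.at[i].set(y).at[j].set(x)

    # n data-dependent, sequential iterations over a tiny vector: very low
    # compute intensity, which is the worst case described in the comments.
    return jax.lax.fori_loop(0, pivots.shape[0], body, permutation)
```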
Batching does not improve the situation much. Trace for a batch of 64 double-precision matrices with shape (75, 75):
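For illustration, batching the `pivots_to_permutation` sketch above with `jax.vmap` (shapes chosen to match the trace; nothing here is JAX's internal batching rule):

```python
import jax
import jax.numpy as jnp

# vmap maps the sketch over a leading batch axis, turning the per-row indexing
# into batched gathers/scatters, but the n sequential loop iterations remain,
# so batching alone does not remove the launch-per-iteration overhead.
batched_pivots_to_permutation = jax.vmap(pivots_to_permutation, in_axes=(0, None))

pivots = jnp.zeros((64, 75), dtype=jnp.int32)          # placeholder pivots for 64 (75, 75) matrices
perms = batched_pivots_to_permutation(pivots, 75)      # shape (64, 75)
```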