Loop ordering: loop over the larger, non-broadcast dimension wherever possible. Ideally, search exhaustively for the best loop ordering based on the two input tensors.
The number of loads for a given tensor $T$ with a loop order $\pi$, written $L_T(\pi)$, is defined in terms of the following quantities:

- $D_\pi[j]$ is the extent of the output in the $j$-th loop (according to ordering $\pi$).
- $b_T(d)$ is 1 if tensor $T$ is non-broadcast in dimension $d$ (i.e. its size is greater than 1) and 0 if it is broadcast (i.e. its size is 1).

The initial “1” in $L_T(\pi)$ counts the load for the very first output element.
For two tensors (say A and B), the total cost is given by:
$$
L_{\text{total}}(\pi) = L_A(\pi) + L_B(\pi)
$$
The following pseudocode performs an exhaustive search (when the rank is below a specified threshold) to select the loop order that minimizes the total number of loads.
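As a rough illustration of the search described above, here is a minimal Python sketch. It does not use the closed-form expression for $L_T(\pi)$; instead it assumes a simple cost model in which a tensor is reloaded whenever its own index differs between two consecutive loop iterations, plus one load for the first output element. The names `broadcast_shape`, `count_loads`, `best_loop_order`, the threshold value `MAX_SEARCH_RANK = 6`, and the fallback to the natural dimension order are all hypothetical choices made for this example.

```python
from itertools import permutations, product

# Hypothetical threshold: above this rank, skip the exhaustive search.
MAX_SEARCH_RANK = 6


def broadcast_shape(shape_a, shape_b):
    """Output extents for two same-rank shapes under size-1 broadcasting."""
    out = []
    for a, b in zip(shape_a, shape_b):
        if a != b and 1 not in (a, b):
            raise ValueError(f"shapes {shape_a} and {shape_b} do not broadcast")
        out.append(max(a, b))
    return tuple(out)


def count_loads(shape, out_shape, order):
    """Count loads of one tensor by walking the loop nest.

    `order` lists dimensions from the outermost loop to the innermost.
    Assumed cost model: a load happens whenever the tensor's own index
    changes between consecutive iterations (the first iteration always loads).
    """
    loads = 0
    previous = None
    # product() varies its last iterable fastest, i.e. the innermost loop.
    for idx in product(*(range(out_shape[d]) for d in order)):
        # Broadcast dimensions (size 1) always read position 0.
        current = tuple(i if shape[d] > 1 else 0 for d, i in zip(order, idx))
        if current != previous:
            loads += 1
            previous = current
    return loads


def best_loop_order(shape_a, shape_b, max_rank=MAX_SEARCH_RANK):
    """Return (order, total_loads) minimizing loads of A plus loads of B."""
    out_shape = broadcast_shape(shape_a, shape_b)
    rank = len(out_shape)
    if rank > max_rank:
        # Too many permutations: fall back to the natural order, cost unknown.
        return tuple(range(rank)), None

    def total(order):
        return (count_loads(shape_a, out_shape, order)
                + count_loads(shape_b, out_shape, order))

    best = min(permutations(range(rank)), key=total)
    return best, total(best)


if __name__ == "__main__":
    # A is full, B is broadcast along dimension 0.
    print(best_loop_order((2, 3), (1, 3)))
```

In this example the search picks the order `(1, 0)`: placing the dimension where B is broadcast in the innermost loop lets each element of B be loaded once and reused. Walking the full loop nest costs time proportional to the output size for every permutation, so a real implementation would likely evaluate a closed-form count instead.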
Use the reader to read both tensors instead of splitting the reads between the writer and the reader. This way, compute does not have to wait for the writer to finish writing out the first tile AND read in the next tile.
Use the new llk broadcast APIs to broadcast both tensors. This should take about 160 cycles in the best case, and maybe 300 in the worst case if we're broadcasting both. In comparison, broadcasting in RISC takes thousands of cycles. Alternatively, using the NOC APIs also works and can cut cycles significantly.