
Eltwise perf issues #18472

Open
3 tasks
sjameelTT opened this issue Feb 28, 2025 · 0 comments

sjameelTT commented Feb 28, 2025

  • Loop ordering: loop over the non-broadcast, larger dimension wherever possible. Ideally, search exhaustively for the best loop ordering given the two input tensors.

The number of loads for a given tensor $T$ under a loop order $\pi$ is given by:

$$ L_T(\pi) = 1 + \sum_{j=0}^{N-1} \left( \prod_{k=0}^{j-1} D_{\pi[k]} \right) \cdot b_T(\pi[j]) \cdot \Bigl( D_{\pi[j]} - 1 \Bigr) $$

where:

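  • $D_{\pi[j]}$ is the extent of the output in the $j$-th loop according to ordering $\pi$, with $j = 0$ the outermost loop (the product over $k < j$ counts the iterations of the enclosing loops, and is 1 for $j = 0$).
  • $b_T(d)$ is 1 if tensor $T$ is non-broadcast in dimension $d$ (i.e. its size is greater than 1) and 0 if it is broadcast (i.e. its size is 1).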

The initial “1” counts the load for the very first output element.

For two tensors (say A and B), the total cost is given by:

$$ L_{\text{total}}(\pi) = L_A(\pi) + L_B(\pi) $$
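
As a worked illustration of the formula, take hypothetical shapes (chosen here for the example, not taken from any real workload): $A$ has shape $[4, 8]$ and $B$ has shape $[1, 8]$, so the output is $[4, 8]$ and $B$ is broadcast along dimension 0. With dimension 0 as the outer loop, $\pi = (0, 1)$:

$$ L_A = 1 + 1 \cdot 1 \cdot (4 - 1) + 4 \cdot 1 \cdot (8 - 1) = 32, \qquad L_B = 1 + 1 \cdot 0 \cdot (4 - 1) + 4 \cdot 1 \cdot (8 - 1) = 29, \qquad L_{\text{total}} = 61 $$

With dimension 1 as the outer loop, $\pi = (1, 0)$:

$$ L_A = 1 + 1 \cdot 1 \cdot (8 - 1) + 8 \cdot 1 \cdot (4 - 1) = 32, \qquad L_B = 1 + 1 \cdot 1 \cdot (8 - 1) + 8 \cdot 0 \cdot (4 - 1) = 8, \qquad L_{\text{total}} = 40 $$

Under this cost model, placing the broadcast dimension of $B$ innermost drops the total from 61 to 40 loads, while the non-broadcast tensor $A$ costs one load per output element (32) either way.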

The following pseudocode performs an exhaustive search (when the rank is below a specified threshold) to select the loop order that minimizes the total number of loads.
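
A minimal Python sketch of such a search, transcribing the load formula above literally. The function names (`loads`, `best_loop_order`), the `rank_threshold` default, and the identity-order fallback for ranks at or above the threshold are all assumptions made for illustration. It also assumes both shapes are already padded to the same rank.

```python
from itertools import permutations


def loads(out_shape, t_shape, order):
    """Number of loads L_T(pi) for one tensor under loop order `order`.

    `order[0]` is the outermost loop. A dimension counts as broadcast
    for this tensor when its size in `t_shape` is 1.
    """
    total = 1        # load for the very first output element
    outer_iters = 1  # product of the extents of the enclosing loops
    for dim in order:
        non_broadcast = 1 if t_shape[dim] > 1 else 0
        total += outer_iters * non_broadcast * (out_shape[dim] - 1)
        outer_iters *= out_shape[dim]
    return total


def best_loop_order(a_shape, b_shape, rank_threshold=8):
    """Return (order, total_loads) minimizing L_A + L_B.

    Searches all permutations when the rank is below `rank_threshold`
    (an assumed default); otherwise falls back to the identity order,
    which is a placeholder choice, not a recommendation.
    """
    if len(a_shape) != len(b_shape):
        raise ValueError("shapes must have the same rank")
    out_shape = tuple(max(a, b) for a, b in zip(a_shape, b_shape))
    rank = len(out_shape)

    def total_cost(order):
        return (loads(out_shape, a_shape, order)
                + loads(out_shape, b_shape, order))

    if rank >= rank_threshold:
        identity = tuple(range(rank))
        return identity, total_cost(identity)

    best = min(permutations(range(rank)), key=total_cost)
    return best, total_cost(best)
```

For example, `best_loop_order((4, 8), (1, 8))` evaluates both orderings of a rank-2 problem and picks the one with dimension 1 outermost. The search is factorial in the rank, which is why it is gated by a threshold.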

  • Use the reader to read both tensors instead of splitting the reads between the writer and the reader. This way the compute is not waiting on the writer to finish writing out the first tile AND read in the next tile.
  • Use the new LLK broadcast APIs to broadcast both tensors. This should take about 160 cycles in the best case, and maybe 300 in the worst case if we're broadcasting both. Broadcasting in RISC takes thousands of cycles in comparison. Alternatively, using the NOC APIs also works and can cut cycles significantly.