Commit 8541e13 (parent 5adad1e): 1 changed file with 10 additions and 6 deletions.
@andrewjradcliffe
I'm not sure what you mean when saying single-pass extrema isn't worth it. Single pass has a significant advantage starting from N = 64 on my computer, and is about twice as fast for larger values of N when the problem becomes memory-bandwidth dominated.

Definitions:
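A minimal sketch of what two-pass versus single-pass extrema kernels might look like with LoopVectorization's @turbo (the names extrema_twopass and extrema_singlepass are illustrative, not the benchmark's actual definitions):

```julia
using LoopVectorization

# Two-pass: stream the vector through the cache twice, one reduction per pass.
function extrema_twopass(x::AbstractVector{T}) where {T}
    mn = typemax(T)
    @turbo for i in eachindex(x)
        mn = min(mn, x[i])
    end
    mx = typemin(T)
    @turbo for i in eachindex(x)
        mx = max(mx, x[i])
    end
    (mn, mx)
end

# Single-pass: carry both reductions while streaming the vector once.
function extrema_singlepass(x::AbstractVector{T}) where {T}
    mn = typemax(T)
    mx = typemin(T)
    @turbo for i in eachindex(x)
        mn = min(mn, x[i])
        mx = max(mx, x[i])
    end
    (mn, mx)
end
```

Timed over a range of N (e.g. with BenchmarkTools), the gap should widen as the problem becomes memory-bandwidth dominated, as described above.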
Quite right you are, sir -- the single-pass is superior (as it should be: streaming the same array through the cache 2x should cost approximately double, as your benchmark series shows). The "not really worth it" remark actually referred to the effort involved when I wrote that comment (I was more troubled by the reduction-on-first-dimension error). Looking back, I see the problem is that I had been thinking of findmin/findmax, and not specifically just minimum/maximum. The reduction-on-first-dimension error arises from the booleans used to select the min/max (which are totally unnecessary >_<). I will have to rectify this ASAP.
Then, I also think of the collect(zip(A_min, A_max)) call, and realize that one can probably optimize further by wrapping only the inner reduction loop block in @turbo, taking the mn, mx and storing them directly in an Array{Tuple{T,T},N} -- this eliminates the wasteful memory allocation as well (and whatever cost that incurs). I had noticed a while ago that LV doesn't handle tuples, otherwise it would be possible to just continue wrapping both outer and inner blocks in @turbo, hence the current use of zip.
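A minimal sketch of that idea, assuming a matrix reduced over its first dimension (the name extrema_firstdim and the exact loop structure are illustrative, not this PR's actual code):

```julia
using LoopVectorization

# Only the inner reduction loop is wrapped in @turbo; the outer loop stays
# plain Julia so the (mn, mx) tuple can be written straight into the output,
# with no separate A_min/A_max arrays and no collect(zip(...)).
function extrema_firstdim(A::AbstractMatrix{T}) where {T}
    out = Vector{Tuple{T,T}}(undef, size(A, 2))  # i.e. Array{Tuple{T,T},1}
    for j in axes(A, 2)
        mn = typemax(T)
        mx = typemin(T)
        @turbo for i in axes(A, 1)  # contiguous, vectorized reduction
            mn = min(mn, A[i, j])
            mx = max(mx, A[i, j])
        end
        out[j] = (mn, mx)
    end
    out
end
```

Because the tuples only appear in the outer, non-@turbo loop, this sidesteps LV's lack of tuple support while still avoiding the intermediate allocations.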
@chriselrod
Incidentally, benchmarking reveals that non-zip is faster only when the array is very small (length(A) < 100), and/or when the reduction is taking place across a large chunk of the array. If the dimensions are of equal size, as in the examples above, this typically requires reducing over more than ndims(A) ÷ 2 of the dimensions. As one would expect, there is a dependence on memory-traversal order, i.e. which dimensions are being reduced over, as they must appear in the innermost loop. This is easier to elicit with equal-size dimensions. For unequal-size dimensions, cost modeling is needed to determine the optimal action -- out of scope for this little note. In any case, making all the loops available to LoopVectorization will yield superior performance in most cases, despite the need to zip the result.
An aside: when the reduction occurs over all dimensions, there is clearly no penalty to the non-zip method. The zip is an unfortunate side effect, but it goes unused in such a case.
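For contrast, a minimal sketch of the zip-style kernel discussed above, with both loops inside @turbo so LoopVectorization can reorder and block them (again illustrative names, same first-dimension reduction as before):

```julia
using LoopVectorization

# All loops are visible to LoopVectorization, at the cost of materializing
# A_min and A_max and zipping them into the final result afterwards.
function extrema_zip(A::AbstractMatrix{T}) where {T}
    A_min = fill(typemax(T), size(A, 2))
    A_max = fill(typemin(T), size(A, 2))
    @turbo for j in axes(A, 2), i in axes(A, 1)
        A_min[j] = min(A_min[j], A[i, j])
        A_max[j] = max(A_max[j], A[i, j])
    end
    collect(zip(A_min, A_max))
end
```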
For reference -- your benchmark repeated on the following machine. Same breakpoint at which single-pass exceeds two-pass.