Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression from findall depwarn suggestion #26183

Closed
sbromberger opened this issue Feb 23, 2018 · 6 comments
Closed

Performance regression from findall depwarn suggestion #26183

sbromberger opened this issue Feb 23, 2018 · 6 comments
Labels
performance Must go faster search & find The find* family of functions

Comments

@sbromberger
Copy link
Contributor

sbromberger commented Feb 23, 2018

Prompted by this deprecation warning:

 Warning: In the future `findall(A)` will only work on boolean collections. Use `findall(x->x!=0, A)` instead.

I decided to do a (n admittedly worst-case) test: A = rand(10_000, 10_000);

julia> @time u = findall(x->x != 0, A);            # this assumes 0 = zero(eltype(A))
  8.606016 seconds (65.73 k allocations: 260.716 MiB, 2.96% gc time)

julia> @time v = findall(!iszero, A);
 12.464981 seconds (40 allocations: 1.531 GiB, 4.27% gc time)

julia> @time w = findall((!iszero).(A));
  1.531174 seconds (23 allocations: 1.502 GiB, 12.50% gc time)

julia> u == v == w
true

Had I taken the recommended approach, I would've run into a pretty bad performance regression. As it stands, the best (time) performing approach is among the worst in memory usage; the recommended approach with the anonymous function is bad in time but best in memory, and the negative function is worst in both.

@mbauman
Copy link
Sponsor Member

mbauman commented Feb 23, 2018

The difference here is that findall(::BitArray) counts the true values before constructing the output. findall(f, A) is extremely simple and naive — it just grows on demand and reallocates as necessary:

findall(testf::Function, A) = collect(first(p) for p in pairs(A) if testf(last(p)))

We may want to add some heuristics here — perhaps a sizehint! based upon the density of the beginning of the array or some such. Or maybe add a counting pass like the old find did.

@mbauman mbauman added the performance Must go faster label Feb 23, 2018
@nalimilan
Copy link
Member

nalimilan commented Feb 23, 2018

It's very hard to decide in full generality what the density of true values is going to be. It could make sense to add a sizehint argument, but trying to guess sounds doomed to fail: true and false values may well be grouped, in which case guessing based on the first results would be completely off.

Adding a counting pass could be efficient for functions known to be cheap, like equalto, but doing that in general could be counter-productive. Maybe we could change the deprecation to recommend equalto, and add a specialized method for ::EqualTo, or maybe just EqualTo{T} for some bits types T (since even a comparison could be expensive for some types, e.g. collections). But strictly speaking the deprecation should use == since -0.0 was considered as true (which is probably why #23812 used == in the first place).

There are also some chances that the generic implementation could be made faster by improving push! (#24909).

I'm surprised by the timings you posted. On 0.7 the two first approaches take about the same time (though the number of allocations is lower with iszero for a weird reason and the total allocated size is similar), while using broadcasting first is about twice faster. EDIT: on 0.6 the two first approaches are also equally fast, and the broadcast one is 50% slower.

@nalimilan nalimilan added the search & find The find* family of functions label Feb 23, 2018
@StefanKarpinski
Copy link
Sponsor Member

If you sample a random set of points in the matrix then you do actually get an accurate estimate regardless of matrix structure. Doing that would require evaluating points out of order, however.

@JeffBezanson JeffBezanson changed the title Performance regression based on depwarn suggestion Performance regression from findall depwarn suggestion Mar 13, 2018
@laborg
Copy link
Contributor

laborg commented Feb 13, 2022

I don't see the reported performance problem anymore, but there seems to be a regression from 1.7-> master:

# on Julia 1.7.1 
julia> A = rand(10_000, 10_000);

julia> @btime findall(x->x != 0, A);
  476.016 ms (8 allocations: 1.50 GiB)

julia> @btime findall(!iszero, A);
  470.884 ms (8 allocations: 1.50 GiB)

julia> @btime findall((!iszero).(A));
  469.952 ms (9 allocations: 1.50 GiB)

and

#master Version 1.8.0-DEV.1526 (2022-02-13)
julia> A = rand(10_000, 10_000);

julia> @btime findall(x->x != 0, A);
  505.472 ms (8 allocations: 1.50 GiB)

julia> @btime findall(!iszero, A);
  499.844 ms (8 allocations: 1.50 GiB)

julia> @btime findall((!iszero).(A));
  497.787 ms (9 allocations: 1.50 GiB)

@vtjnash
Copy link
Sponsor Member

vtjnash commented Feb 13, 2022

Duplicate of #42187

@vtjnash vtjnash marked this as a duplicate of #42187 Feb 13, 2022
@vtjnash vtjnash closed this as completed Feb 13, 2022
@vtjnash
Copy link
Sponsor Member

vtjnash commented Feb 13, 2022

I realize this is the older issue, but this function has changed several time since, so I think that is the more accurate one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster search & find The find* family of functions
Projects
None yet
Development

No branches or pull requests

6 participants