Change find() to return the same index type as pairs() #24774

nalimilan · 2017-11-25T16:02:34Z

This does not change anything for AbstractVectors and general iterables,
which continue to use linear indices. For other AbstractArrays, return
CartesianIndexes (rather than linear indices). For Dict, return
keys (previously not supported at all).

Relying on collect to choose the return element type allows supporting
any definition of pairs, including that for Dict, which creates a standard
Generator for which eltype returns Any.

Fixes #20684.
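
A minimal sketch of the behavior described above, assuming Julia 1.x, where this function is spelled `findall` rather than `find`. The variable names are illustrative only.

```julia
# Assumes Julia 1.x: `findall` is the later name of the `find` discussed here.
v = [1, 2, 3, 4]
A = [1 2; 3 4]
d = Dict(:a => 1, :b => 2, :c => 3)

vec_idx  = findall(isodd, v)   # vector: linear (Int) indices
mat_idx  = findall(isodd, A)   # matrix: CartesianIndex{2} values
dict_idx = findall(isodd, d)   # Dict: the matching keys (iteration order is unspecified)
```

The result can be used directly for indexing in each case, e.g. `A[mat_idx]` or `d[first(dict_idx)]`.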

nalimilan · 2017-11-25T16:05:31Z

base/array.jl

+
+ julia> find(isodd, A)
+ 2-element Array{CartesianIndex{2},1}:
+  CartesianIndex(1, 1)


A nice feature is that with #24651 the CartesianIndex part won't be repeated for each element, making the returned object quite natural.

nalimilan · 2017-11-25T16:09:18Z

base/array.jl

@@ -1789,22 +1802,10 @@ julia> find(falses(3))
 ```
 """
 function find(A)
-    nnzA = count(t -> t != 0, A)


Due to issues with the inference of the returned eltype (see commit message), I've taken a radical approach and simply used collect instead of the custom loops. This means we no longer compute the length of the result before filling it. Benchmarks will be needed to check which approach is best, but at first sight it isn't obvious to me that doing two passes over the data is a good tradeoff. Is it?

Historically it was worth it, but might be worth benchmarking again. See also https://discourse.julialang.org/t/push-and-interfacing-to-the-runtime-library/7461, which suggests that two passes might still be faster (growing an array with push! is 3x slower than growing it with setindex!).
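
The two strategies being compared can be sketched as follows. Both function names are hypothetical and neither is the code from this PR; the sketch assumes Julia 1.x and only covers vectors.

```julia
# Hypothetical helpers illustrating the tradeoff, not the PR's implementation.

# Two passes: count the matches, allocate once, then fill by index.
function find_two_pass(pred, v::AbstractVector)
    n = count(pred, v)
    out = Vector{Int}(undef, n)
    k = 0
    for (i, x) in pairs(v)
        if pred(x)
            out[k += 1] = i
        end
    end
    return out
end

# One pass: grow the result with push! as matches are found.
function find_one_pass(pred, v::AbstractVector)
    out = Int[]
    for (i, x) in pairs(v)
        pred(x) && push!(out, i)
    end
    return out
end
```

Both return the same indices; which one is faster depends on the Julia version, the element type and the density of matches, so it has to be measured rather than assumed.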

nalimilan · 2017-11-25T16:11:45Z

base/sparse/sparsematrix.jl

@@ -1264,7 +1264,7 @@ function find(p::Function, S::SparseMatrixCSC)
    end
    sz = size(S)
    I, J = _findn(p, S)
-    return sub2ind(sz, I, J)
+    return CartesianIndex.(I, J)


With the new behavior of find, findn could probably be removed as it gives almost the same information (struct of arrays vs. array of structs).

This method could also be optimized a bit by not allocating I and J first (but it already does less work than before).
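
To make the "struct of arrays vs. array of structs" remark concrete, here is a small sketch assuming Julia 1.x (`findall` in place of `find`). It uses a dense matrix for simplicity; the variable names are illustrative.

```julia
A = [0 2; 3 0]

# Array of structs: one CartesianIndex per nonzero entry.
idx = findall(!iszero, A)

# Struct of arrays: separate row and column vectors, as findn used to return.
I = [i[1] for i in idx]
J = [i[2] for i in idx]

# Going back, as in the diff above.
rebuilt = CartesianIndex.(I, J)
```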

nalimilan · 2017-11-25T16:21:21Z

This was supposed to be an RFC, but it works so well with the new pairs/keys framework, and it breaks no tests, so it may actually be a real PR. The main question is of course performance (but currently BaseBenchmarks doesn't seem to test find systematically).

The extension of this change to findnext and findprev appears to be more difficult. Ideally, we would need to be able to iterate over keys(A) by calling next(keys(A), i), with i the index passed by the user. But I'm not sure we can assume that the iterator state is the index. Cf. @JeffBezanson's proposal to use Iterators.rest for findeach in the Search & Find Julep.

nalimilan · 2017-12-04T15:42:10Z

Any comments?

timholy · 2017-12-04T16:00:01Z

> The extension of this change to findnext and findprev appears to be more difficult. Ideally, we would need to be able to iterate over keys(A) by calling next(keys(A), i), with i the index passed by the user. But I'm not sure we can assume that the iterator state is the index.

Where does i come from? If there's no place where anything other than the iterator state gets back to the user, perhaps it would be safe.

nalimilan · 2017-12-04T16:05:34Z

> Where does i come from? If there's no place where anything other than the iterator state gets back to the user, perhaps it would be safe.

It's passed by the user, and it's currently documented to be a linear index. IIUC that's because it's supposed to be possible to pass the result of the previous findnext iteration, after adding 1 or calling nextind, so that you can get the next match.

timholy

Looks like this is moving in the right direction, thanks!

timholy · 2017-12-04T16:02:27Z

base/array.jl

+find(testf::Function, A) = collect(first(p) for p in _pairs(A) if testf(last(p)))
+
+_pairs(A::Union{AbstractArray, Associative}) = pairs(A)
+_pairs(iter) = zip(OneTo(typemax(Int)), iter)  # safe for objects that don't implement length


Can't you check the iterator trait here?

What do you mean? We don't guarantee that iterators implement pairs currently.

This should probably use countfrom(1).
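
A sketch of the fallback with the `countfrom(1)` suggestion applied, assuming Julia 1.x (where `Associative` is called `AbstractDict`). The names `_pairs_sketch` and `find_sketch` are hypothetical and this is not the merged code.

```julia
# Indexable collections: use their own keys.
_pairs_sketch(A::Union{AbstractArray, AbstractDict}) = pairs(A)
# Anything else: number the elements 1, 2, 3, ... without needing `length`.
_pairs_sketch(iter) = zip(Iterators.countfrom(1), iter)

find_sketch(testf, A) =
    collect(first(p) for p in _pairs_sketch(A) if testf(last(p)))
```

`zip` stops at the shorter of its arguments, so the unbounded counter is safe for finite iterators of unknown length.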

timholy · 2017-12-04T16:04:41Z

base/array.jl

@@ -1789,22 +1802,10 @@ julia> find(falses(3))
 ```
 """
 function find(A)
-    nnzA = count(t -> t != 0, A)


See also https://discourse.julialang.org/t/half-vectorization/7399/3, which benchmarks some conditional comprehensions with somewhat alarming results. I think you really need to benchmark this change before we can decide.

timholy · 2017-12-04T16:15:03Z

So if findfirst passes back an iterator state and the user always uses nextind, it would be safe, right? Is there any case where the meaning of i is potentially confuse-worthy? (Strings, presumably?)

nalimilan · 2017-12-04T16:23:01Z

> So if findfirst passes back an iterator state and the user always uses nextind, it would be safe, right?

It would be safe, but that iterator state needs to also be an index, since the whole point of findnext is to be able to index the collection with the result. So we would have to require that keys(c) uses indices as its iterator state for all collections. Maybe that's a reasonable requirement, I'm not sure.

> Is there any case where the meaning of i is potentially confuse-worthy? (Strings, presumably?)

I'm not sure in what cases ambiguities could be a problem here. Can you develop?

vtjnash · 2017-12-04T21:44:53Z

> But I'm not sure we can assume that the iterator state is the index
>
> require that keys(c) uses indices as its iterator state for all collections

I think Associative might be the obvious counter-example, since in that case the state is an integer, but the keys are anything. But we might also want to just make it into a tautology, such that findnext is also just defined to work only for containers where the indices are equivalent to the keys.

nalimilan · 2017-12-04T22:10:27Z

> I think Associative might be the obvious counter-example, since in that case the state is an integer, but the keys are anything.

Indeed. And I guess it would be really inefficient to use the key itself as the iterator state.

> But we might also want to just make it into a tautology, such that findnext is also just defined to work only for containers where the indices are equivalent to the keys.

That restriction is quite annoying considering that findnext would make sense e.g. for an ordered dict. Also the problem is not really "key" vs. "index", it's really "iterator state" vs. "key/index".
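
The "iterator state" vs. "key/index" distinction can be seen with a small sketch, assuming the Julia 1.x `iterate` protocol (which replaced `start`/`next`/`done`). What the state actually contains for a `Dict` is an implementation detail and is treated as opaque here.

```julia
d = Dict("x" => 10, "y" => 20)

# iterate returns (element, state); the state is a cursor owned by the iterator.
key, state = iterate(keys(d))

# The key can be used to index the Dict ...
value = d[key]

# ... whereas the state is only meaningful when handed back to iterate.
next = iterate(keys(d), state)
```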

It seems that the only general solution would be to use something like findeach (which @JeffBezanson proposed in the Search & Find Julep), but in his proposal iterating over that object would only return the value and state, and not the index. We could have it return ((index, value), state) if we really wanted. Anyway that's post-1.0.

Overall, I guess you're right that we should state that findnext and findprev only work for collections for which nextind works, i.e. you call findnext, you can pass its return value to getindex and/or to nextind (or do + 1 in special cases where it's the same), and then call findnext again. Everything else cannot be supported.
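
The usage pattern described here (call findnext, use the result with getindex and nextind, call findnext again) might look like the following sketch. It assumes Julia 1.x, where findnext returns `nothing` when there is no further match; `all_matches` is a hypothetical helper name.

```julia
# Collect every index at which `pred` holds, stepping with nextind.
function all_matches(pred, s::AbstractString)
    found = Int[]
    i = findnext(pred, s, firstindex(s))
    while i !== nothing
        push!(found, i)                        # i is a valid index: s[i] works
        i = findnext(pred, s, nextind(s, i))   # resume after the match
    end
    return found
end
```

A string is used because `i + 1` is not always a valid index there, which is exactly the case where nextind matters.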

vtjnash · 2017-12-04T22:31:42Z

> but in his proposal iterating over that object would only return the value and state, and not the index

I think at that point it might be equivalent to Iterators.Filter? e.g.:
(g(first(pair)) for pair in pairs(assoc) if filter(last(pair)))

where filter might be something like Base.EqualTo(0)

timholy · 2017-12-05T01:09:03Z

EDIT: sorry, should have refreshed my browser window before posting this

There are iterable containers that may not implement indexing. For example, consider a run-length encoded container. You could imagine choosing to not implement getindex (which would have O(logN) performance rather than O(1) performance), and yet have findnext be a sensible operation.

For an AbstractArray, it would of course be better to return the index than the iterator state. But maybe for anything else it should be the iterator state? For a general iterable container, how do you implement findnext if you can't be sure that what gets passed in is the iterator state?

> I'm not sure in what cases ambiguities could be a problem here. Can you develop?

Sure. enumerate is a good example, where what gets returned is a counter which might not be a valid index for vectors with indexing that doesn't start at 1. If that doesn't ever happen here, I think you're reasonably safe.

nalimilan · 2017-12-05T11:13:31Z

> I think at that point it might be equivalent to Iterators.Filter? e.g.:
> (g(first(pair)) for pair in pairs(assoc) if filter(last(pair)))
>
> where filter might be something like Base.EqualTo(0)

@vtjnash I'm not sure what you mean, isn't that mostly the same as what the PR does? What is g?

@timholy Yeah, for general iterables findnext could return the iterator state, and it would only be useful to pass to nextind and then findnext again. But that kind of use would be more suited to findeach, since the iterator state cannot be used for anything else than that (as opposed to arrays where you may want to do many useful things with the returned index).

nalimilan · 2017-12-07T18:34:14Z

@nanosoldier runbenchmarks(ALL, vs=":master")

nanosoldier · 2017-12-07T19:05:28Z

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`sudo cset shield -e su nanosoldier -- -c ./benchscript.sh`, ProcessExited(1)) [1]

Logs and partial data can be found here
cc @ararslan

nalimilan · 2017-12-08T22:10:01Z

Let's try again, just in case: @nanosoldier runbenchmarks(ALL, vs=":master")

nanosoldier · 2017-12-08T22:16:59Z

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: InterruptException:

Logs and partial data can be found here
cc @ararslan

ararslan · 2017-12-08T22:19:17Z

The InterruptException was me. There was a server error so I restarted it. Oddly enough the server seemed to have survived through the error; that's actually the first time I've ever seen Nanosoldier publicly announce that it was interrupted.

@nanosoldier runbenchmarks(ALL, vs=":master")

nanosoldier · 2017-12-08T22:58:16Z

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`sudo cset shield -e su nanosoldier -- -c ./benchscript.sh`, ProcessExited(1)) [1]

Logs and partial data can be found here
cc @ararslan

ararslan · 2017-12-08T23:15:23Z

ERROR: LoadError: MethodError: no method matching isless(::CartesianIndex{2}, ::Int64)
Closest candidates are:
  isless(!Matched::Missing, ::Any) at missing.jl:47
  isless(!Matched::AbstractFloat, ::Real) at operators.jl:126
  isless(!Matched::Real, ::Real) at operators.jl:302
  ...
Stacktrace:
 [1] <(::CartesianIndex{2}, ::Int64) at ./operators.jl:227
 [2] getindex(::SparseMatrixCSC{Float64,Int64}, ::Array{CartesianIndex{2},1}, ::Array{CartesianIndex{2},1}) at ./sparse/sparsematrix.jl:2235
 [3] perf_sparse_fem(::Int64) at /home/nanosoldier/.julia/v0.7/BaseBenchmarks/src/problem/SparseFEM.jl:27
 [4] ##core#9020() at /home/nanosoldier/.julia/v0.7/BenchmarkTools/src/execution.jl:312
 [5] ##sample#9021(::BenchmarkTools.Parameters) at /home/nanosoldier/.julia/v0.7/BenchmarkTools/src/execution.jl:318
...

I'm pretty sure that's because of

> For other AbstractArrays, return CartesianIndexes (rather than linear indices)

which seems pretty breaking in general.

nalimilan · 2017-12-09T11:59:05Z

OK. It's funny that tests don't cover it, but that benchmarks do. So we have to decide whether that's the right thing to do before possibly adapting the benchmarks.

nalimilan · 2018-01-06T18:46:55Z

I agree we should merge this soon so that we can make progress on related find issues. However I've realized that the new functions are type-unstable due to an inference issue (see #25433). Not sure whether we should hold this off until it's fixed or not. It would be nice to find a workaround.

I've also made a PR to fix BaseBenchmarks (JuliaCI/BaseBenchmarks.jl#157), and another one to make it possible to use LinearIndices with Compat (JuliaLang/Compat.jl#446), to get the previous behavior. We could also imagine adding a simpler/more efficient method, e.g. find(Int, pred, a).

Regarding the public API, I think the behavior could be refined a bit after merging the PR. Currently it uses pairs for AbstractArray and AbstractDict, and linear indices elsewhere. This is debatable for a few other types:

- Linear indices (i.e. number of codepoints) for String make intuitive sense, but they cannot be used to index into the string. OTOH returning the actual string index could be weird.
- Linear indices for NamedTuples could be replaced with the names of matching entries.
- In general, for custom types, it would be simpler to decide that we use pairs/keys if it's defined, falling back on linear indices when it's not. But I'm not sure it's OK to use method_exists for that given the inference issues it creates.

nalimilan · 2018-01-06T18:54:27Z

NEWS.md

+    `AbstractDict` objects ([#24774]). In particular, this means that it returns
+    `CartesianIndex` objects for matrices and higher-dimensional arrays instead of
+    always returning linear indices as it was previously the case. Use
+    `Int[LinearIndices(size(a))[i] for i in find(f, a)]` to compute linear indices


@timholy Is there any reason why you cannot index LinearIndices objects with an array of cartesian indices? That would provide a much more convenient syntax.

Though as I noted in my last comment it would make sense to provide a find method so that no temporary allocation is needed.

If I understand your question correctly, that's possible because

julia> LinearIndices <: AbstractArray
true

Here's a demo of what I mean:

julia> linear = LinearIndices(1:3, 1:5)
LinearIndices{2,Tuple{UnitRange{Int64},UnitRange{Int64}}} with indices 1:3×1:5:
 1  4  7  10  13
 2  5  8  11  14
 3  6  9  12  15

julia> linear[3]
3

julia> linear[3,2]
6

julia> linear[[CartesianIndex(3,2), CartesianIndex(2,3)]]
2-element Array{Int64,1}:
 6
 8

Hmm, I wonder why I thought it didn't work... So everything is fine.

While I'm at it, should we also support linear[[1, 2]]? That would be convenient for backward compatibility, since you would be able to apply this operation to indices returned by find, and it would be a no-op on 0.6. It would also be more consistent.

We have that already, too 😄. But it currently fails because we seem to be missing a size method for LinearIndices.

See #25541.
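
Putting this thread together, recovering linear indices from the new return value might look like the sketch below. It assumes Julia 1.x, where `find` is spelled `findall` and `LinearIndices` accepts an array directly; the variable names are illustrative.

```julia
a = [1 2 3; 4 5 6]

cart = findall(iseven, a)        # CartesianIndex{2} values
lin  = LinearIndices(a)[cart]    # index LinearIndices with the whole vector at once
```

Either form selects the same elements, so `a[cart]` and `a[lin]` agree.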

JeffBezanson · 2018-01-06T19:26:44Z

For Strings at least I definitely think it should return real usable string indices (not codepoint number).
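
A short sketch of why codepoint numbers and string indices differ, assuming Julia 1.x (`findall` in place of `find`). The example string is illustrative.

```julia
s = "αβγ"   # each of these characters takes two bytes in UTF-8

counters = [i for (i, _) in enumerate(s)]   # codepoint numbers
indices  = collect(keys(s))                  # real string indices
matches  = findall(isequal('γ'), s)          # uses the same indices as keys(s)
```

Here the codepoint number 2 is not even a valid index into `s`, while every element of `matches` can be passed to `getindex`.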

nalimilan · 2018-01-07T15:49:04Z

OK, I've added a commit so that the same index type as keys returns is used for AbstractString, Tuple and NamedTuple. That sounds more consistent to me, and I think we should introduce a trait to distinguish indexable collections, for which we should always use keys/pairs.

Thanks to @JeffBezanson's fix to type stability, I think the PR is ready. However, if we want to avoid breaking Nanosoldier, JuliaCI/BaseBenchmarks.jl#157 should be merged first.

ararslan · 2018-01-10T01:49:36Z

@nanosoldier runbenchmarks(ALL, vs=":master")

nanosoldier · 2018-01-10T07:15:55Z

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

nalimilan · 2018-01-10T13:24:15Z

The Nanosoldier report appears to contain a lot of noise (with the usual suspects). It's surprising to see that some find benchmarks on Array now use less memory. OTOH find on generators uses less memory but is somewhat slower (by up to 36%), with some hard-to-explain differences depending on the element types. Finally, arrays and generators of Bool are much slower (up to 4 times), probably because we no longer compute the length first to preallocate the result and call push! instead.

I'm merging despite these regressions so that we can finish the search & find API changes, but I have filed an issue to remember that we need to do something (e.g. add specialized AbstractArray{Bool} methods which preallocate, if push! cannot be made fast enough).

nalimilan · 2018-01-10T13:37:18Z

I should have noted that the next step is now to change findfirst and findlast to return the same index type as find. findnext and findprev should also accept these indices, but this can be added without breaking the API since the input index type should determine the output index type.
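
For reference, a sketch of what this next step looks like from the user's side, assuming Julia 1.x, where findfirst and findlast return the collection's own index type and `nothing` when there is no match. The variable names are illustrative.

```julia
A = [2 3; 4 5]

first_odd = findfirst(isodd, A)         # CartesianIndex for a matrix
last_odd  = findlast(isodd, A)
vec_odd   = findfirst(isodd, [2, 3, 4]) # Int for a vector
none      = findfirst(isodd, [2, 4])    # no match
```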

nalimilan added the search & find The find* family of functions label Nov 25, 2017

nalimilan force-pushed the nl/find branch from 58e589c to 0a8fbcb Compare November 25, 2017 23:00

timholy reviewed Dec 4, 2017

This was referenced Dec 5, 2017

Renaming findn and findnz? #24910

Closed

Add find benchmarks JuliaCI/BaseBenchmarks.jl#147

Merged

CartesianIndex version of find/findnz? #20684

Closed

Unifying search & find functions #10593

Closed

nalimilan requested a review from JeffBezanson December 21, 2017 19:59

nalimilan added the triage This should be discussed on a triage call label Jan 5, 2018

JeffBezanson approved these changes Jan 5, 2018

nalimilan force-pushed the nl/find branch 3 times, most recently from 0c4f930 to 2faeb23 Compare January 6, 2018 14:23

This was referenced Jan 6, 2018

Add CartesianIndices and LinearIndices JuliaLang/Compat.jl#446

Merged

Prepare SparseFEM benchmark to changes to find() on Julia master JuliaCI/BaseBenchmarks.jl#157

Merged

Inference issue with collect() on generators with filter #25433

Closed

nalimilan force-pushed the nl/find branch from 2faeb23 to 04232d0 Compare January 6, 2018 18:53

nalimilan mentioned this pull request Jan 7, 2018

Broadcast had one job (e.g. broadcasting over iterators and generator) #18618

Closed

nalimilan added 2 commits January 7, 2018 16:32

Add NEWS entry

dee7afd

nalimilan force-pushed the nl/find branch from 04232d0 to bfcd03c Compare January 7, 2018 15:44

Also return indices from keys() for AbstractString, Tuple and NamedTuple, and add tests

b286f17

nalimilan force-pushed the nl/find branch from bfcd03c to b286f17 Compare January 7, 2018 16:31

nalimilan mentioned this pull request Jan 9, 2018

require explicit predicates in find functions #23812

Merged

Merge branch 'master' into nl/find

8e605fb

nalimilan merged commit 2da9ddb into master Jan 10, 2018

nalimilan deleted the nl/find branch January 10, 2018 13:28

nalimilan mentioned this pull request Jan 10, 2018

Fix performance regression of some find() methods #25489

Closed

nalimilan mentioned this pull request Jan 10, 2018

Change sentinel in find(first|next|prev|last) to nothing #25472

Merged

JeffBezanson removed the triage This should be discussed on a triage call label Jan 10, 2018

nalimilan mentioned this pull request Jan 12, 2018

Deprecate findn(x) in favor of find(!iszero, x), which now returns cartesian indices #25532

Merged

nalimilan mentioned this pull request Feb 6, 2018

Add optimized findall(::AbstractArray{Bool}) method #25879

Merged
