Add specifications for functions for searching arrays #23

Merged: 13 commits into master from searching on Sep 13, 2020

Conversation

@kgryte (Contributor) commented Aug 26, 2020

This PR

  • adds specifications for functions for searching arrays.
  • is derived from comparing API signatures across array libraries.

Notes

  • NumPy includes additional search functions (e.g., nanargmax, nanargmin, argwhere, flatnonzero, extract, searchsorted); however, these functions are either not widely implemented by the other analyzed array libraries (fewer than half implement them) or are out of scope (the nan* variants) and thus were not included in this initial specification. Should additional search functions prove necessary, they can be proposed in follow-ups.

Questions (and notes)

  • argmax

    • Some discrepancy with max in that multiple axes cannot be specified (CuPy supports this, but no one else does). Commonly, when computing the maximum value, array libraries allow users to reduce over multiple axes when performing the statistical "reduction". However, apart from CuPy, most of the analyzed array libraries do not seem to allow the same when searching for the indices of the maximum values. Should this be reconciled?
    • Torch and CuPy both support keepdims, which we support here to ensure broadcast compatibility, as we do with statistical functions (see the sketch after this list).
    • Similar to max et al, should we also support an out keyword argument? This is supported by NumPy et al, but not Torch and TensorFlow.
  • argmin

    • Same concerns/notes as for argmax.
  • nonzero

    • How should zero-dimensional arrays be handled? NumPy requires at least 1d, while Torch supports handling of zero-dimensional arrays as if one-dimensional.
    • Torch has an as_tuple keyword argument to either return a tuple of arrays or a single multi-dimensional array. It is not clear why most other array libraries insist on a tuple rather than an ndarray (maybe because of NumPy?).
  • where

    • The current spec requires "truthiness" rather than a strict array of booleans. This follows NumPy's behavior, where non-boolean conditions are permitted and act as expected (e.g., 0 is falsy while non-zero values are truthy). Should we be stricter here?
    • Should we support an out keyword argument? Would seem consistent with other (element-wise) functions.
    • The current spec proposes a ternary function (requires 3 args), omitting the trivial case of just providing a condition argument, as array library docs consistently point to alternative approaches (e.g., np.asarray(condition).nonzero()).
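
A minimal sketch of the broadcast-compatibility point above, using Torch (where the keyword is spelled keepdim) to index the maxima back out of the original array:

import torch

x = torch.arange(12.).reshape(3, 4)

# keepdim=True retains the reduced axis with size 1, so the indices
# broadcast against, and can gather from, the original array:
idx = torch.argmax(x, dim=1, keepdim=True)  # shape (3, 1)
torch.gather(x, 1, idx)                     # shape (3, 1): the row maxima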

@rgommers (Member)

  • Some discrepancy with max in that multiple axes cannot be specified (CuPy supports this, but no one else does). Commonly, when computing the maximum value, array libraries allow users to reduce over multiple axes when performing the statistical "reduction". However, apart from CuPy, most of the analyzed array libraries do not seem to allow the same when searching for the indices of the maximum values. Should this be reconciled?

  • Torch and CuPy both support keepdims, which we support here to ensure broadcast compatibility, as we do with statistical functions.

There's a NumPy discussion with agreement that it's a good idea, just work left to implement it: numpy/numpy#8710. And since numpy.argmax already has a thin Python layer, this won't be much work. I imagine it's fairly similar for other libraries, so it shouldn't impose much of a burden to add this.

  • Similar to max et al, should we also support an out keyword argument? This is supported by NumPy et al, but not Torch and TensorFlow.

Here's the PyTorch feature request: pytorch/pytorch#32570. Seems to be a bit of work, but should be doable.

With both of the above decisions we'd be deviating from choosing the minimal common set; however, there's also something to be said for consistency between functions. E.g., max/argmax having the same keywords, and keepdims and out being present for all of these reduction-like functions, would make a lot of sense. So I'm +0.5 on adding both.
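
For concreteness, a minimal sketch of the out behavior under discussion, using NumPy's existing argmax (which already accepts out):

import numpy as np

x = np.arange(12).reshape(3, 4)

# `out` must have the result's shape and an integer index dtype; the
# indices are written into it, and the same array is returned.
out = np.empty(4, dtype=np.intp)
np.argmax(x, axis=0, out=out)  # out is now array([2, 2, 2, 2])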

@rgommers (Member)

How should zero-dimensional arrays be handled? NumPy requires at least 1d, while Torch supports handling of zero-dimensional arrays as if one-dimensional.

The numpy.nonzero docstring has a deprecation note in it regarding that stuff, and the PyTorch behavior is also deprecated:

In [10]: import torch                                                         

In [11]: t = torch.tensor(1)                                                  

In [12]: t.shape                                                              
Out[12]: torch.Size([])

In [13]: torch.nonzero(t)                                                     
...  UserWarning: This overload of nonzero is deprecated:
...

I suggest saying 0-D input should raise an exception here. It clearly doesn't make too much sense.

Torch has an as_tuple keyword argument to either return a tuple of arrays or a single multi-dimensional array. It is not clear why most other array libraries insist on a tuple rather than an ndarray (maybe because of NumPy?).

I think the as_tuple=False behavior makes more sense - there's no memory saving or other advantage of the tuple form as far as I can tell. Besides the tuple being a little more clumsy, there's not much difference though, so it's also weird to have two forms. Hence, since the tuple is what most libraries already return, I'd tend to go with the tuple return only.
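
To make the two forms concrete, a small Torch sketch:

import torch

t = torch.tensor([[0, 4],
                  [7, 0]])

torch.nonzero(t)                    # tensor([[0, 1], [1, 0]]) - one 2-D tensor of coordinates
torch.nonzero(t, as_tuple=True)     # (tensor([0, 1]), tensor([1, 0])) - NumPy-style tuple
t[torch.nonzero(t, as_tuple=True)]  # tensor([4, 7]) - the tuple works directly as an index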

@rgommers (Member)

  • The current spec requires "truthiness" rather than a strict array of booleans. This follows NumPy's behavior, where non-boolean conditions are permitted and act as expected (e.g., 0 is falsy while non-zero values are truthy). Should we be stricter here?

+1 for accepting only boolean arrays. The note in the numpy.where docstring recommends doing that as the more correct option, and it's easy to do:

    When only `condition` is provided, this function is a shorthand for
    ``np.asarray(condition).nonzero()``. Using `nonzero` directly should be
    preferred, as it behaves correctly for subclasses. The rest of this
    documentation covers only the case where all three arguments are
    provided.
  • Should we support an out keyword argument? Would seem consistent with other (element-wise) functions.

I'd say where isn't really a normal element-wise function, so I'd leave it out to not make life difficult for implementers here.

  • The current spec proposes a ternary function (requires 3 args), omitting the trivial case of just providing a condition argument, as array library docs consistently point to alternative approaches (e.g., np.asarray(condition).nonzero()).

+1, the 1-arg form doesn't make any sense.
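
For illustration, a minimal sketch of the ternary form the spec settles on, with a strictly boolean condition:

import numpy as np

cond = np.array([True, False, True, False])
x = np.array([1, 2, 3, 4])
y = np.array([-1, -2, -3, -4])

# Elements are taken from x where cond is True and from y elsewhere.
np.where(cond, x, y)  # array([ 1, -2,  3, -4])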

@kgryte (Contributor, Author) commented Aug 26, 2020

@rgommers Thanks for the initial review, responses, and links to relevant discussions!

Re: argmax/argmin keepdims. Agreed. My conclusion as well was that its inclusion in the spec should not impose an undue burden.

Re: argmax/argmin out keyword. Agreed. My preference is for keeping related APIs as consistent as possible to reduce the need for separate mental models when using, e.g., max and argmax.

...which leads me back to the question:

should argmax and argmin accept a tuple of ints for the axis keyword?

Re: nonzero and zero-dimensional arrays. Agreed that this should be disallowed in the spec. Updated the spec accordingly.

Re: nonzero and returning a tuple. If anything, returning a tuple is less memory efficient and adds a layer of indirection, which is why I find it a bit odd, but perhaps there is a design reason for this. Perhaps there is room for some NumPy mailing list archaeology.

Re: where and boolean arrays. Updated.

@rgommers (Member)

This looks about ready to merge; I just need to re-review to see that all discussion points were addressed.

@rgommers (Member)

Re: nonzero and returning a tuple. If anything, returning a tuple is less memory efficient and adds a layer of indirection, which is why I find it a bit odd, but perhaps there is a design reason for this.

The reason apparently was that it's easier for advanced indexing (you can then simply do x[tuple_idx]; with the non-tuple form it'd be more annoying).

rgommers merged commit 05026cd into master and deleted the searching branch on September 13, 2020.
@rgommers (Member)

Double checked everything, LGTM. Merged, thanks @kgryte

@asmeurer (Member)

The tuple form for nonzero allows boolean array indices to be equivalent to nonzero (except for scalar booleans), so x[boolean_array] == x[nonzero(boolean_array)] (for multi-dimensional arrays, the per-axis index arrays in the tuple are unpacked as separate indices).
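
A quick sketch of that equivalence:

import numpy as np

x = np.arange(6).reshape(2, 3)
mask = (x % 2) == 0

# Boolean indexing matches indexing with nonzero's tuple of per-axis indices:
np.array_equal(x[mask], x[np.nonzero(mask)])  # True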

@rgommers (Member)

A little more archeology: http://numpy-discussion.10968.n7.nabble.com/just-curious-why-does-numpy-where-return-tuples-td14182.html#a14183 (2008):

I assume that you are talking about where()'s single-argument form
which is equivalent to nonzero() (which I recommend using instead).
The reason that it returns a tuple is that the result is intended to
be usable as a "fancy" index. E.g.

In [4]: x = random.randint(0, 2, [3, 10])

In [5]: x
Out[5]:
array([[1, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [1, 1, 0, 0, 1, 1, 1, 0, 0, 1],
       [1, 0, 0, 1, 0, 1, 0, 1, 0, 1]])

In [6]: nonzero(x)
Out[6]:
(array([0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]),
 array([0, 6, 7, 0, 1, 4, 5, 6, 9, 0, 3, 5, 7, 9]))

In [7]: x[nonzero(x)]
Out[7]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In Python, x[i,j] is implicitly x[(i,j)]. In order to support both
x[i,j] and x[some_array] (i.e., some_array is an array indexing on the
first axis only), we differentiate the inputs by type.

In [8]: x[array(nonzero(x))]
---------------------------------------------------------------------------
IndexError

@kgryte (Contributor, Author) commented Jun 11, 2021

@rgommers Thanks for digging this up. :)

@jakirkham (Member)

Did we add argwhere? I didn't see it when searching through the PR, but could have easily missed it.

@kgryte (Contributor, Author) commented Jun 11, 2021

@jakirkham argwhere is not currently in the spec.

@jakirkham (Member)

Ah, OK. It would be useful to have argwhere.

Since these are functions that can have unknown array lengths in their results (in Dask or other lazy computation libraries, we don't know how many non-zeros there are until computed), it is helpful to have all of the indices in a single array, as argwhere provides. A single array lets us more easily work with that unknown length, since it is shared by all indices. By contrast, something like nonzero has a separate unknown length for each index array, so it may not be obvious that the pieces can be used together (say, in an addition, for example).
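
A minimal sketch of that point (assuming Dask's dask.array.argwhere and dask.array.nonzero):

import numpy as np
import dask.array as da

x = da.from_array(np.array([0, 4, 7, 0]), chunks=2)

# argwhere yields one array with a single shared unknown length ...
da.argwhere(x).shape  # (nan, 1)
# ... whereas nonzero yields a tuple of arrays, each with its own
# unknown (nan) length.
da.nonzero(x)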

@rgommers (Member)

Argh, why do we have both nonzero and argwhere! They're almost identical. I actually like argwhere better, but because of the name it's used far less than nonzero I believe.

PyTorch has an unbind method that lets you efficiently go from a 2-D tensor to a tuple of 1-D tensors. Applied to the output of argwhere, that would be an O(1) way to make nonzero unnecessary. And it would resolve the pain point PyTorch is having with its incompatible nonzero definition.

This may be worth considering. The name is a little annoying though, especially because argwhere isn't a counterpart to where, but rather the same thing as the discouraged one-argument form of where.

>>> x = np.array([0, 4, 7, 0])
>>> np.nonzero(x)
(array([1, 2]),)
>>> np.where(x)
(array([1, 2]),)
>>> np.argwhere(x)
array([[1],
       [2]])
>>> x[np.nonzero(x)]
array([4, 7])
>>> x[np.where(x)]
array([4, 7])
>>> x[np.argwhere(x)]  # this needs a squeeze() to roundtrip for 1-D input
array([[4],
       [7]])
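
A quick sketch of the unbind route described above:

import torch

x = torch.tensor([0, 4, 7, 0])

idx = torch.nonzero(x)         # tensor([[1], [2]]) - the argwhere-style 2-D form
tuple_idx = idx.unbind(dim=1)  # (tensor([1, 2]),) - O(1) views, the nonzero-style tuple
x[tuple_idx]                   # tensor([4, 7])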

jakirkham mentioned this pull request on Jun 6, 2022.
@jakirkham (Member)

Apologies for the lack of follow-up here. I have tried to articulate why argwhere is useful in a new issue (#449). It's probably best in the end to have this discussion there, given that it is a proposed spec addition and it will be more discoverable that way.

rgommers added the label topic: Dynamic Shapes (data-dependent shapes) on Mar 9, 2023.