Add specifications for functions for searching arrays #23

Merged: 13 commits into master from searching on Sep 13, 2020

Conversation

@kgryte (Contributor) commented Aug 26, 2020

This PR

  • adds specifications for functions for searching arrays.
  • is derived from comparing API signatures across array libraries.

Notes

  • NumPy includes additional search functions (e.g., nanargmax, nanargmin, argwhere, flatnonzero, extract, searchsorted); however, these functions are either not widely implemented by the other analyzed array libraries (fewer than half implement them) or are out of scope (the nan* variants) and thus were not included in this initial specification. Should additional search functions prove necessary, they can be proposed in follow-ups.

Questions (and notes)

  • argmax

    • Some discrepancy with max in that multiple axes cannot be specified (CuPy supports this, but no one else does). Commonly, when computing the maximum value, array libraries allow users to reduce over multiple axes when performing the statistical "reduction". However, apart from CuPy, most of the analyzed array libraries do not seem to allow the same when searching for the indices of the maximum values. Should this be reconciled?
    • Torch and CuPy both support keepdims, which we support here to ensure broadcast compatibility, as we do with statistical functions (see the sketch after this list).
    • Similar to max et al, should we also support an out keyword argument? This is supported by NumPy et al, but not Torch and TensorFlow.
  • argmin

    • Same concerns/notes as for argmax.
  • nonzero

    • How should zero-dimensional arrays be handled? NumPy requires at least 1d, while Torch supports handling of zero-dimensional arrays as if one-dimensional.
    • Torch has an as_tuple keyword argument to either return a tuple of arrays or a single multi-dimensional array. It is not clear why most other array libraries insist on a tuple rather than an ndarray (maybe because of NumPy?).
  • where

    • The current spec requires "truthiness" rather than a strict array of booleans. This follows NumPy's behavior, where non-boolean conditions are permitted and act as expected (e.g., 0 is falsy while non-zero values are truthy). Should we be stricter here?
    • Should we support an out keyword argument? Would seem consistent with other (element-wise) functions.
    • The current spec proposes a ternary function (requires 3 args), omitting the trivial case of just providing a condition argument, as array library docs consistently point to alternative approaches (e.g., np.asarray(condition).nonzero()).
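
A minimal sketch of the broadcast-compatibility point above, using Torch (where the keyword is spelled keepdim) to index the maxima back out of the original array:

import torch

x = torch.arange(12.).reshape(3, 4)

# keepdim=True retains the reduced axis with size 1, so the indices
# broadcast against, and can gather from, the original array:
idx = torch.argmax(x, dim=1, keepdim=True)  # shape (3, 1)
torch.gather(x, 1, idx)                     # shape (3, 1): the row maxima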

@rgommers (Member)

  • Some discrepancy with max in that multiple axes cannot be specified (CuPy supports this, but no one else does). Commonly, when computing the maximum value, array libraries allow users to reduce over multiple axes when performing the statistical "reduction". However, apart from CuPy, most of the analyzed array libraries do not seem to allow the same when searching for the indices of the maximum values. Should this be reconciled?

  • Torch and CuPy both support keepdims, which we support here to ensure broadcast compatibility, as we do with statistical functions.

There's a NumPy discussion with agreement that it's a good idea, just work left to implement it: numpy/numpy#8710. And since numpy.argmax already has a thin Python layer, this won't be much work. I imagine it's fairly similar for other libraries, so it shouldn't impose much of a burden to add this.

  • Similar to max et al, should we also support an out keyword argument? This is supported by NumPy et al, but not Torch and TensorFlow.

Here's the PyTorch feature request: pytorch/pytorch#32570. Seems to be a bit of work, but should be doable.

With both of the above decisions we'd be deviating from choosing the minimal common set; however, there's also something to be said for consistency between functions. E.g., max/argmax having the same keywords, and keepdims and out being present for all of these reduction-like functions, would make a lot of sense. So I'm +0.5 on adding both.
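
For concreteness, a minimal sketch of the out behavior under discussion, using NumPy's existing argmax (which already accepts out):

import numpy as np

x = np.arange(12).reshape(3, 4)

# `out` must have the result's shape and an integer index dtype; the
# indices are written into it, and the same array is returned.
out = np.empty(4, dtype=np.intp)
np.argmax(x, axis=0, out=out)  # out is now array([2, 2, 2, 2])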

@rgommers (Member)

How should zero-dimensional arrays be handled? NumPy requires at least 1d, while Torch supports handling of zero-dimensional arrays as if one-dimensional.

The numpy.nonzero docstring has a deprecation note in it regarding that stuff, and the PyTorch behavior is also deprecated:

In [10]: import torch                                                         

In [11]: t = torch.tensor(1)                                                  

In [12]: t.shape                                                              
Out[12]: torch.Size([])

In [13]: torch.nonzero(t)                                                     
...  UserWarning: This overload of nonzero is deprecated:
...

I suggest saying 0-D input should raise an exception here. It clearly doesn't make too much sense.

Torch has an as_tuple keyword argument to either return a tuple of arrays or a single multi-dimensional array. It is not clear why most other array libraries insist on a tuple rather than an ndarray (maybe because of NumPy?).

I think the as_tuple=False behavior makes more sense - there's no memory saving or other advantage of the tuple form as far as I can tell. Besides the tuple being a little more clumsy, there's not much difference though, so it's also weird to have two forms. Hence, since the tuple is what most libraries already return, I'd tend to go with the tuple return only.
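
To make the two forms concrete, a small Torch sketch:

import torch

t = torch.tensor([[0, 4],
                  [7, 0]])

torch.nonzero(t)                    # tensor([[0, 1], [1, 0]]) - one 2-D tensor of coordinates
torch.nonzero(t, as_tuple=True)     # (tensor([0, 1]), tensor([1, 0])) - NumPy-style tuple
t[torch.nonzero(t, as_tuple=True)]  # tensor([4, 7]) - the tuple works directly as an index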

@rgommers (Member)

  • The current spec requires "truthiness" rather than a strict array of booleans. This follows NumPy's behavior, where non-boolean conditions are permitted and act as expected (e.g., 0 is falsy while non-zero values are truthy). Should we be stricter here?

+1 for accepting only boolean arrays. The note in the numpy.where docstring recommends doing that as the more correct option, and it's easy to do:

    When only `condition` is provided, this function is a shorthand for
    ``np.asarray(condition).nonzero()``. Using `nonzero` directly should be
    preferred, as it behaves correctly for subclasses. The rest of this
    documentation covers only the case where all three arguments are
    provided.
  • Should we support an out keyword argument? Would seem consistent with other (element-wise) functions.

I'd say where isn't really a normal element-wise function, so I'd leave it out to not make life difficult for implementers here.

  • The current spec proposes a ternary function (requires 3 args), omitting the trivial case of just providing a condition argument, as array library docs consistently point to alternative approaches (e.g., np.asarray(condition).nonzero()).

+1, the 1-arg form doesn't make any sense.
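
For illustration, a minimal sketch of the ternary form the spec settles on, with a strictly boolean condition:

import numpy as np

cond = np.array([True, False, True, False])
x = np.array([1, 2, 3, 4])
y = np.array([-1, -2, -3, -4])

# Elements are taken from x where cond is True and from y elsewhere.
np.where(cond, x, y)  # array([ 1, -2,  3, -4])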

@kgryte (Contributor, Author) commented Aug 26, 2020

@rgommers Thanks for the initial review, responses, and links to relevant discussions!

Re: argmax/argmin keepdims. Agreed. My conclusion as well was that its inclusion in the spec should not impose an undue burden.

Re: argmax/argmin out keyword. Agreed. My preference is for keeping related APIs as consistent as possible to reduce the need for separate mental models when using, e.g., max and argmax.

...which leads me back to the question:

should argmax and argmin accept a tuple of ints for the axis keyword?

Re: nonzero and zero-dimensional arrays. Agreed that this should be disallowed in the spec. Updated the spec accordingly.

Re: nonzero and returning a tuple. If anything, returning a tuple is less memory efficient and adds a layer of indirection, which is why I find it a bit odd, but perhaps there is a design reason for this. Perhaps there is room for some NumPy mailing list archaeology.

Re: where and boolean arrays. Updated.

@rgommers (Member)

This looks about ready to merge; I just need to re-review to see that all discussion points were addressed.

@rgommers (Member)

Re: nonzero and returning a tuple. If anything, returning a tuple is less memory efficient and adds a layer of indirection, which is why I find it a bit odd, but perhaps there is a design reason for this.

The reason apparently was that it's easier for advanced indexing (you can then simply do x[tuple_idx]; with the non-tuple form it'd be more annoying).

rgommers merged commit 05026cd into master and deleted the searching branch on September 13, 2020.
@rgommers (Member)

Double checked everything, LGTM. Merged, thanks @kgryte

@asmeurer (Member)

The tuple form for nonzero allows boolean array indices to be equivalent to nonzero (except for scalar booleans), so x[boolean_array] == x[nonzero(boolean_array)] (for multi-dimensional arrays, the per-axis index arrays in the tuple are unpacked as separate indices).
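
A quick sketch of that equivalence:

import numpy as np

x = np.arange(6).reshape(2, 3)
mask = (x % 2) == 0

# Boolean indexing matches indexing with nonzero's tuple of per-axis indices:
np.array_equal(x[mask], x[np.nonzero(mask)])  # True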

@rgommers (Member)

A little more archeology: http://numpy-discussion.10968.n7.nabble.com/just-curious-why-does-numpy-where-return-tuples-td14182.html#a14183 (2008):

I assume that you are talking about where()'s single-argument form
which is equivalent to nonzero() (which I recommend using instead).
The reason that it returns a tuple is that the result is intended to
be usable as a "fancy" index. E.g.

In [4]: x = random.randint(0, 2, [3, 10])

In [5]: x
Out[5]:
array([[1, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [1, 1, 0, 0, 1, 1, 1, 0, 0, 1],
       [1, 0, 0, 1, 0, 1, 0, 1, 0, 1]])

In [6]: nonzero(x)
Out[6]:
(array([0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]),
 array([0, 6, 7, 0, 1, 4, 5, 6, 9, 0, 3, 5, 7, 9]))

In [7]: x[nonzero(x)]
Out[7]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In Python, x[i,j] is implicitly x[(i,j)]. In order to support both
x[i,j] and x[some_array] (i.e., some_array is an array indexing on the
first axis only), we differentiate the inputs by type.

In [8]: x[array(nonzero(x))]
---------------------------------------------------------------------------
IndexError

@kgryte (Contributor, Author) commented Jun 11, 2021

@rgommers Thanks for digging this up. :)

@jakirkham (Member)

Did we add argwhere? I didn't see it when searching through the PR, but could have easily missed it.

@kgryte (Contributor, Author) commented Jun 11, 2021

@jakirkham argwhere is not currently in the spec.

@jakirkham (Member)

Ah, OK. It would be useful to have argwhere.

Since these are functions that can have unknown array lengths in their results (in Dask or other lazy computation libraries, we don't know how many non-zeros there are until computed), it is helpful to have all of the indices in a single array, as argwhere provides. A single array lets us more easily work with that unknown length, since it is shared by all indices. By contrast, something like nonzero has a separate unknown length for each index array, so it may not be obvious that the pieces can be used together (say, in an addition, for example).
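
A minimal sketch of that point (assuming Dask's dask.array.argwhere and dask.array.nonzero):

import numpy as np
import dask.array as da

x = da.from_array(np.array([0, 4, 7, 0]), chunks=2)

# argwhere yields one array with a single shared unknown length ...
da.argwhere(x).shape  # (nan, 1)
# ... whereas nonzero yields a tuple of arrays, each with its own
# unknown (nan) length.
da.nonzero(x)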

@rgommers (Member)

Argh, why do we have both nonzero and argwhere! They're almost identical. I actually like argwhere better, but because of the name it's used far less than nonzero I believe.

PyTorch has an unbind method that lets you efficiently go from a 2-D tensor to a tuple of 1-D tensors. Applied to the output of argwhere, that would be an O(1) way to make nonzero unnecessary. And it would resolve the pain point PyTorch is having with its incompatible nonzero definition.

This may be worth considering. The name is a little annoying though, especially because argwhere isn't a counterpart to where, but rather the same thing as the discouraged one-argument form of where.

>>> x = np.array([0, 4, 7, 0])
>>> np.nonzero(x)
(array([1, 2]),)
>>> np.where(x)
(array([1, 2]),)
>>> np.argwhere(x)
array([[1],
       [2]])
>>> x[np.nonzero(x)]
array([4, 7])
>>> x[np.where(x)]
array([4, 7])
>>> x[np.argwhere(x)]  # this needs a squeeze() to roundtrip for 1-D input
array([[4],
       [7]])
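
A quick sketch of the unbind route described above:

import torch

x = torch.tensor([0, 4, 7, 0])

idx = torch.nonzero(x)         # tensor([[1], [2]]) - the argwhere-style 2-D form
tuple_idx = idx.unbind(dim=1)  # (tensor([1, 2]),) - O(1) views, the nonzero-style tuple
x[tuple_idx]                   # tensor([4, 7])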

jakirkham mentioned this pull request on Jun 6, 2022.
@jakirkham (Member)

Apologies for the lack of follow-up here. I have tried to articulate why argwhere is useful in a new issue (#449). It's probably best in the end to have this discussion there, given that it is a proposed spec addition and it will be more discoverable that way.

rgommers added the label topic: Dynamic Shapes (data-dependent shapes) on Mar 9, 2023.