Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index with booleans #2994

Merged
merged 35 commits into from
Jul 28, 2023
Merged

Index with booleans #2994

merged 35 commits into from
Jul 28, 2023

Conversation

dvd101x
Copy link
Collaborator

@dvd101x dvd101x commented Jul 9, 2023

Hi, regarding #2735 here is a way to do indexing with an array of booleans.

new capabilities

a = [1, 2, 3, 4];
a[[true, false, true, false]]   # yields [1, 3]
a[a > 2]                        # yields [3, 4]

b = [1, 2, 4, 8]; 
a[b < 4]                        # yields [1, 2]

a[a > 2 and a < 4]              # yields 3
a[a > 2 and b < 8]              # yields 3
a[a > 2 and b < 3]              # yields []

A = [1, 2; 3, 4];
A[[false, true], 2]             # yields 4

Main changes:

index

Checks if it's getting an array of only booleans and converts such arguments to a corresponding vector of numeric indices (comparable to octave and numpy)

  • currently index([true, false, true]) is like index([1, 0, 1])
  • the proposal is that index([true, false, true]) works like:
    • index([0, 2])
    • index([1, 3]) in the parser

subset

Can use an empty index (comparable to octave and numpy)

  • subset(value, index) returns an empty value of the same type if it has an empty index
  • subset(value, index, replacement), returns the value unmodified if it has an empty index

Note:
it doesn't check if the size of the array of booleans matches the dimension of the value in subset (it's similar to octave, but unlike numpy).

@gwhitney
Copy link
Collaborator

Thanks so much for the PR submission!

Some comments/questions, not meant to be exhaustive at this point:

(a) I thought we decided in the discussion in #2735 that if A = [1, 2, 3, 4] then A[[true, false, true]] is a dimension error, but the comment above seems to say this PR does not complain and returns [1,3]. Why not adhere to the recommended design?

(b) Reviewing this PR revealed the pre-existing oddity that if B = [1,2;3,4] then (in the parser, easier to test) B[2,2] = 4 and B[2,[2,1]] = [[4,3]] (both pretty much as expected) but B[2,[2]] = 4 . I would expect the latter to produce [[4]]. (Or quite possibly B(2,[2,1]) to produce [4,3] and B(2,[2]) to produce [4], while B([2],[2,1]) produced [[4,3]].) The current behavior seems to prevent certain dimensionality invariants (based on the dimensions of the matrix being indexed and the shape of the indices) from holding true. I bring these up because they produce the very surprising behavior that B([false, true], [true, true])) is [[3,4]] while B([[false,true]],[[true, false]]) is 3. It seems like a problem to me that the type of the result differs for indices that are identical in type and shape and differ only in the value of a single boolean inside the indices. Hence, should I file the indexing oddity of B[2,[2,1]] vs B[2,[2]] as an issue? And should the boolean indexing wait until that newly surfaced issue is resolved? Or another possibility is to leave B(2,[2]) alone but still have B([false,true],[false,true]) return [[4]]. Thoughts, @josdejong ?

@dvd101x
Copy link
Collaborator Author

dvd101x commented Jul 10, 2023

Thanks for the review Glen!

(a)

I was testing different methods to accomplish this but in the end I couldn't make it work thinking it would introduce breaking changes. The issue I couldn't resolve is.

  • subset hast the information of the size of the matrix and receives index (not the array of booleans). So it doesn't know what was the size of the array of booleans that generated the index.
  • index is the one that receives the array (and checks if it contains only booleans) but it doesn't know what is the size of the matrix it will be used in.

In the end I thought to present this PR that is comparable to what Octave does even though is not as agreed

(b)

I think this behavior is related to #2344, it might make some cases simpler but in this case it's very odd. I think that behavior is as intended but let's wait for Jos's comments.

In the context of using logic inside of index it's not that odd.

A = [10, 20, 30, 40];
A[A > 20] # yields [30, 40]
A[A < 20] # yields 10

Maybe a proposal could be that if the index comes from an array, then the result should match the source regardless if it has singleton dimensions.

A = [1, 2; 3, 4]
A[1, 2]          # shall return number 2
A[[1], 2]        # shall return [2]
A[1, [2]]        # shall return [2]
A[[1], [2]]      # shall return [2]

I think these issues could be addressed independently

Just as an example this is what the parser does with an array of booleans.

B = [10, 20, 30];
B[[true]]             # yields 10
B[[true, true]]       # yields [10, 10]
B[[true, true, true]] # yields [10, 10, 10]
B[B > 5]              # yields [10, 10, 10]

@gwhitney
Copy link
Collaborator

On (a), I guessed as much, but it seems like it's a case of implementation pushing behavior around. If the behavior desired is the one agreed on in #2735, then it seems like we need to find an implementation that supports that. For example, it could be that index objects are enhanced to allow vectors of booleans within them, and the "index" function just passes such specifiers through, so that subset can have all of the information and deal with it. There may be other possible implementations that would yield the desired behavior, just brainstorming.

On (b),

In the context of using logic inside of index it's not that odd.

A = [10, 20, 30, 40];
A[A > 20] # yields [30, 40]
A[A < 20] # yields 10

On the contrary, I find this extremely odd: depending on what the exact condition is, the result might be a vector or a number? That seems like it will be painful, since I will have to then write the expressions that use the result to handle either. It would seem much preferable that whatever the condition is, it should produce an array/vector as a result, that might have just one (or zero) entries.

Maybe a proposal could be that if the index comes from an array, then the result should match the source regardless if it has singleton dimensions.

A = [1, 2; 3, 4]
A[1, 2]          # shall return number 2
A[[1], 2]        # shall return [2]
A[1, [2]]        # shall return [2]
A[[1], [2]]      # shall return [2]

I think these issues could be addressed independently

I agree that the existing shape-of-indexing-results oddity is the underlying problem, but if it were up to me, I'd first resolve that concern (in a separate issue/PR) before merging this boolean-indexing stuff, because of the very counterintuitive results it produces in the Boolean case. There are various options on how to modify the shapes of indexing results, so I would also suggest first an issue or a discussion to hash out the desired behavior, before a PR addressing the point. But let's see how @josdejong wants to go -- another position might be to leave the shape of results of numeric indexing alone, and only fix the vector-vs-number inconsistency for Boolean indexing. That's not what I would do, but it's certainly possible to implement.

@dvd101x
Copy link
Collaborator Author

dvd101x commented Jul 11, 2023

Fair points

I agree that the flexibility of showing either a number or an array depending on the case is odd for a program.

I was thinking that in a different context wouldn't be that odd. Let's imagine a stack of paper in a box and I try to get them according to an index. If many pages are found they are returned in a manilla envelope, but when only one page is found there is no envelope, I might be glad to see the found page rather than complain about the missing envelope.

both a) and b) are consistent with the way numpy indexes, which would be really nice to have in mathjs, but might need more core changes.

In fact the other proposed change that an empty index gets you an empty array makes it even more odd that:

A = [10, 20, 30, 40];
A[A > 20] # yields [30, 40]
A[A < 20] # yields 10
A[A > 50] # yields [ ]

As if a set of many pages is presented in an envelope, a single page is without the envelope and no pages found gets you only the empty envelope.

So I mostly agree with you, at some point I thought about changing the behavior of index further as you mentioned but thought it might be unwelcome as it wasn't agreed upon. In fact I wasn't sure about returning an empty index (currently it just throws an error)

I see some partial functionality that can be used in the meantime something better comes along, but as you mentioned, let's wait for Jos to weigh in.

As a reference, the current implementation of broadcasting is missing some optimization as making something like numpy: nditer would require some core changes in the matrix algorithm suite. Currently it is using the mathjs functions to make copies of matrices (requiring more memory) but it's actually working in the meantime an optimized solution can be implemented.

@josdejong
Copy link
Owner

Thanks David!

The logic behind the behavior of the function subset is: when the output contains a single, scalar value, unwrap it from the matrix and return the (numeric) value. When the output contains multiple values, return a matrix. I'm just happy with the current behavior, it is very practical when getting/setting individual values in a matrix that you don't have to worry about wrapping/unwrapping the value correctly. Also, I do like this proposal in #2344 to be able to disable this "smart unwrapping" feature if you need predictability.

I do not really have a strong opinion about how A[[1], 2] and A[1, [2]] should behave. I think in practice you will not enter a nested array with just one value, so it is a bit of a theoretic case to me. It makes sense to let the output be an array then, instead of the current behavior of returning an unwrapped number.

For the new filtering behavior it makes sense to me to always return a Matrix/Array, and not unwrap the output when it contains only a single value.

@dvd101x
Copy link
Collaborator Author

dvd101x commented Jul 12, 2023

Hi Jos, thanks for the review

Only as a reference, It works like this by default:

A = [10, 20, 30, 40];
A[A > 20] # yields [30, 40]
A[A < 20] # yields 10
A[A > 50] # yields [ ]

And now it includes this behavior with math.config({ predictable: true}) according to #2344

A = [10, 20, 30, 40];
A[A > 20] # yields [30, 40]
A[A < 20] # yields [10]
A[A > 50] # yields [ ]

@josdejong
Copy link
Owner

Thanks, that is clear.

Whilst there is a clear use case to let A[0] return a number 10 (and I would like to keep that behavior like it is), I do not see a use case for the filtering functionality returning a number, since a filtering operation in general has an array as input and an array with an arbitrary number of items as output. So I would like A[A < 20] to return [10] rather than 10 (always predictable). How do you think about that? Does that make sense to you? Or do you see use cases for that?

@dvd101x
Copy link
Collaborator Author

dvd101x commented Jul 13, 2023

Here is a possible case for that.

A = [1, 2, 3];
B = [4, 5, 6];
A[B == 5] # yields 2 or [2]

For me it's the same logic as with regular indexing. If the index is a calculated variable, it's uncertain if the index will contain one or many numbers.

Of course B could have various values equal to 5, so it's not predictable if the indexing with booleans will return one or many numbers.

In the references I know, it usually returns an array (the same as regular indexing in those references). So I don't know if it will also be confusing if the behavior of indexing is different if it comes from an array of booleans.

@josdejong
Copy link
Owner

josdejong commented Jul 13, 2023

Hm, yeah that is true, if you have a variable in A[x], the output depends on whether x holds a number or a range for example. So, yeah, there is some logic to it.

I just find it odd to have something like A[A < 20] or A[B == 5] return a number. Would you want that in practice?

@dvd101x
Copy link
Collaborator Author

dvd101x commented Jul 13, 2023

Personally I wouldn't. I would prefer a predictable way as it can be used for other steps.

I will look into a way for an array of booleans to always return predictable outcomes regarding of config. If there is some confusion later regarding regular indexing we can take a look into it.

I think this is an in between state at some point it will need to be able to do something like.

A = 1:4;
A[A > 2] = 6;
A          # yields [1, 2, 6, 6]

Currently this throws a dimensional error but we will need to take a look at some point regarding broadcasting the 6.

A[3:4] = 6

@dvd101x
Copy link
Collaborator Author

dvd101x commented Jul 14, 2023

In this version:

  • Index class now
    • Converts an array of booleans to an array of numbers
    • Has a property to indicate the size of the array of booleans
  • Return predictable subsets if indexing with an array of booleans
    • for other cases returns predictable subsets if config.predictable
  • Subsets now validates if the array of booleans had an equal size to the Array|Matrix and throws a dimensional error otherwise.

This behavior of returning a number when indexing with a single number or an array with a single element when indexing with an array of booleans is equivalent to numpy's way of indexing.

One difference is that indexing with booleans returns the same number of dimensions as the source but numpy reduces the other dimensions.

@dvd101x
Copy link
Collaborator Author

dvd101x commented Jul 20, 2023

Hi Glen and Jos,

I'm really glad you are liking the results so far. Thanks for the comments and review!

I agree these are a lot of changes at once, I wasn't planning for this. So thanks for the effort of reviewing this many changes at once and being so understanding of this whole process.

@dvd101x
Copy link
Collaborator Author

dvd101x commented Jul 22, 2023

Hi, I think this latest commit covers most of the issues. Please let me know if something is missing.

@josdejong
Copy link
Owner

Thanks David for adding even more tests.

Last question: do we keep the config.predictable in this PR or not? See #2994 (comment). I see you marked the discussion as resolved, does that mean that you do want to keep it in this PR (so we'll schedule this PR for a new major version)?

@dvd101x
Copy link
Collaborator Author

dvd101x commented Jul 24, 2023

Sorry my bad, I was certain I removed that as soon as you asked, either I messed something up with my setup or removed some of that and forgot about other places

I think it's better now.

docs/datatypes/matrices.md Outdated Show resolved Hide resolved
docs/datatypes/matrices.md Outdated Show resolved Hide resolved
src/function/matrix/subset.js Outdated Show resolved Hide resolved
src/function/matrix/subset.js Show resolved Hide resolved
src/type/matrix/MatrixIndex.js Show resolved Hide resolved
@gwhitney
Copy link
Collaborator

OK I took a look through, specific comments above. In general, I think this now leaves prior numeric/numeric array indexing alone, which I think is appropriate, and has the agreed-upon behavior for boolean indexing. So I have no objections to its merge. I will comment a bit on #2344

@gwhitney
Copy link
Collaborator

Good catch on this latest fix to your PR. Should you add a test that would have exercised the problem?

@dvd101x
Copy link
Collaborator Author

dvd101x commented Jul 25, 2023

Thanks, Glen

Ok, I will include a test where an array to broadcast to a size that is equal to the array will broadcast correctly (and not skip the broadcast process). I was concerned it might be too specific.

@gwhitney
Copy link
Collaborator

I was concerned it might be too specific.

Thanks! IMHO no test for something that actually went awry at any point is too specific.

@dvd101x
Copy link
Collaborator Author

dvd101x commented Jul 26, 2023

Makes sense, I'll keep that in mind, thanks!

@josdejong
Copy link
Owner

Thanks for reviewing Glen.

@dvd101x there is still one open comment from Glen about code duplication. Can you reply on that? Besides that I think the PR is ready to be merged :).

@dvd101x
Copy link
Collaborator Author

dvd101x commented Jul 27, 2023

Thanks Jos,

I think Glen's comment about code duplication is addressed with the latest commit.

If no other topic is unresolved, I think this is ready to be merged.

@gwhitney
Copy link
Collaborator

Yup, that's the sort of thing I had in mind, thanks. I have no objections to this being merged.

@josdejong
Copy link
Owner

Awesome 🎉. Thanks for your patience David! Going to merge your PR now.

@josdejong josdejong merged commit 49c793b into josdejong:develop Jul 28, 2023
@dvd101x dvd101x deleted the index-with-booleans branch July 28, 2023 17:45
@josdejong
Copy link
Owner

Published now in v11.10.0, thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants