
APL indexing #15431

Merged: 4 commits merged into master from mb/apl on May 1, 2016
Conversation

@mbauman (Member) commented Mar 10, 2016

The dimensionality of indexing A[I_1, I_2, …, I_N] is now sum(map(ndims, (I_1, I_2, …, I_N))).

This now brings SubArrays to full feature parity with builtin arrays with respect to indexing semantics.
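
A minimal sketch of the new rule (the arrays and index values below are purely illustrative, not taken from the PR):

julia> A = rand(3, 4, 5);

julia> I_1 = [1 2; 3 1];        # ndims(I_1) == 2

julia> I_2 = 1:2;               # ndims(I_2) == 1

julia> I_3 = 5;                 # scalar index, ndims == 0

julia> size(A[I_1, I_2, I_3])   # 2 + 1 + 0 == 3 dimensions
(2, 2, 2)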

@mbauman (Member, Author) commented Mar 10, 2016

Looks like we'll need to be more judicious in what minimal subset of indices we use in the non-full subarray tests… they're timing out the builds.

Edit: the abridged tests are about 4x longer. The unabridged ones are still running.

``A[I_1[i_1], I_2[i_2], ..., I_n[i_n]]``. If ``I_1`` is changed to a
two-dimensional matrix, then ``X`` becomes an ``n+1``-dimensional array of
shape ``(size(I_1, 1), size(I_1, 2), length(I_2), ..., length(I_n))``. The
matrix adds a dimension. The location ``(i_1, i_2, i_3, ..., i_n+1)`` contains
Contributor (review comment): does the last index need to be i_{n+1}?

@mbauman (Member, Author): Sure, that's clearer.
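
For reference, a quick illustration of the shape described in that doc passage, under the behavior this PR introduces (arrays chosen only for illustration):

julia> A = rand(5, 5);

julia> I_1 = [1 2; 3 4];        # two-dimensional first index

julia> I_2 = [1, 3, 5];

julia> size(A[I_1, I_2])        # (size(I_1, 1), size(I_1, 2), length(I_2))
(2, 2, 3)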

@mschauer (Contributor) commented:

In the assignment doc it says "If X is an array, it must have the same number of elements as the product of the lengths of the indices". Isn't this a bit misleading, because setindex_shape_check (base/operators.jl#L351) demands that both shapes are the same after removing all singleton dimensions?

@mbauman (Member, Author) commented Mar 10, 2016

That's not quite it, either. It allows, for example, A[:,:] = 1:length(A). The previous wording is too strict, whereas mine is too lax, but neither accurately describes the behavior. Suggestions are welcome.
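
A hedged sketch of the two cases being contrasted (arrays chosen only for illustration; the second assignment is expected to throw a DimensionMismatch, message elided):

julia> A = zeros(2, 3);

julia> A[:, :] = 1:6;           # allowed: a 6-element source fills the 2×3 destination column-major

julia> A[:, :] = ones(3, 2)     # rejected: (2,3) and (3,2) differ even after dropping singleton dimensions
ERROR: DimensionMismatch(...)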

@mbauman (Member, Author) commented Mar 10, 2016

Actually, we may want to loosen this up in the context of this PR:

julia> A = rand(3,4); B = rand(3,4)
       A[[1 2; 3 1], 1] = B[[1 2; 3 1], 1]
ERROR: DimensionMismatch("tried to assign 2x2 array to 4x1 destination")

@timholy (Member) commented Mar 10, 2016

Really nice, @mbauman!

The subarray tests are already so long that any increase is distressing. These tests also seem to have gotten longer than they used to be (I'm seeing ~800s now; I think they used to be ~300s), though perhaps that's a consequence of the move to LLVM 3.7?

@nalimilan (Member) commented:

@timholy Maybe #15246?

@timholy (Member) commented Mar 10, 2016

Hmm, profiling suggests that this is unlikely, though a little more analysis might be justified. Most of the increase seems due to the addition of bounds checks, which triggers collection of a lot of backtraces. Rather than using @test_throws, perhaps we should implement a checkbounds(Bool, A, indexes) method and just test the boolean return value?
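
A rough sketch of what such a spot check might look like; the Bool-returning checkbounds method is the hypothetical part of the suggestion here:

using Base.Test

A = rand(3, 4)

# instead of @test_throws BoundsError A[0, 1], which collects a backtrace on every throw:
@test !checkbounds(Bool, A, 0, 1)   # out of bounds: returns false, no exception, no backtrace
@test  checkbounds(Bool, A, 3, 4)   # in bounds: returns true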

@timholy (Member) commented Mar 10, 2016

Now that I think about it, I don't like that solution. Is there a way we could suspend backtrace collection from an error, and just test whether it throws the error?

@mbauman (Member, Author) commented Mar 10, 2016

The bigger issue is that codegen time is exponential in the number of index types. For each new type that we test (in indexN* or oind), we double the test time. I think that by combining a few (e.g., by using the SubArray index to also test multidimensional behaviors), we can keep the coverage we need and keep the test time down. Let's see if this latest commit passes tests.

@timholy (Member) commented Mar 11, 2016

That's true, but it's also true that on my machine approximately half of test/subarray.jl's runtime is spent collecting backtraces.

@timholy (Member) commented Mar 11, 2016

Ah, but the "spot checks" should resolve that!

@mbauman mentioned this pull request Mar 25, 2016
@tkelman (Contributor) commented Apr 21, 2016

@mbauman does this need anything other than a rebase? would anyone be able to pick this up if Matt's too busy?

@mbauman (Member, Author) commented Apr 21, 2016

The major roadblock here is the time it takes to run the subarray tests. I'd like to test with 3 new index types: 0-, 2-, and 3-dimensional arrays. This results in an 8x increase in compilation time during testing. ReshapedArrays may add yet more types to the fray. I actually never managed to complete a JULIA_TESTFULL run on my wimpy laptop (IIRC it began swapping due to all the compiled methods, but even with unlimited RAM it'd probably take ~10-20 hours).

So we may need to pare down the tests a little further, but I don't see how without losing much-needed coverage. I can rebase and trigger a new travis run, but I'm not sure when I'll be able to dig into the test time issue.

@tkelman (Contributor) commented Apr 21, 2016

The compiler has been getting faster, so it's worth another try.

@mbauman force-pushed the mb/apl branch 2 times, most recently from 15dd1a3 to 0e663bc on April 21, 2016 at 21:23
@mbauman (Member, Author) commented Apr 22, 2016

Hey - this is looking great. Here are the CI logs for the subarray test:

Most recent master:

Win 32:   1221.47 seconds, maxrss  741.64 MB
Win 64:   1271.07 seconds, maxrss 1261.11 MB
Linux 32:  728.04 seconds, maxrss  962.89 MB
Linux 64:  505.72 seconds, maxrss 1602.45 MB

This PR:

Win 32:    829.14 seconds, maxrss  799.12 MB
Win 64:    934.77 seconds, maxrss 1212.62 MB (parallel test failed?)
Linux 32:  634.22 seconds, maxrss 1388.58 MB
Linux 64:  625.93 seconds, maxrss 1834.32 MB

@inline replace_colons(V, indexes, I) = replace_colons_dim(V, indexes, 1, I)
@inline replace_colons_dim(V, indexes, dim, I::Tuple{}) = ()
@inline replace_colons_dim(V, indexes, dim, I::Tuple{Colon}) = (1:trailingsize(indexes[1], dim)*prod(index_lengths_dim(V.parent, length(V.indexes)-length(indexes)+2, tail(indexes)...)),)
@inline replace_colons_dim(V, indexes, dim, I::Tuple{Colon, Vararg{Any}}) = (1:size(indexes[1], dim), replace_colons_dim(V, indexes, dim+1, tail(I))...)
Contributor (review comment): wrap some of the long lines here?

@mbauman (Member, Author):
Done. I'm looking forward to seeing this section deleted entirely — linear indexing is really just implicitly indexing into a reshaped view of the array, so now that we have reshaped arrays, we can do just that.
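
A small illustration of that equivalence (purely illustrative; reshape of an Array already shares the underlying data, and a ReshapedArray view would index the same way):

julia> A = rand(3, 4);

julia> A[7] == reshape(A, length(A))[7]   # linear index 7 is just index 7 into the length-12 reshape
true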

@timholy (Member) commented Apr 22, 2016

#15999 should help, too.

@mbauman force-pushed the mb/apl branch 2 times, most recently from d68315c to dea7ec8 on April 23, 2016 at 12:52
@mbauman (Member, Author) commented Apr 23, 2016

Alright, this is good to go! With #15999, all CI systems are under 500 seconds. Looks like Linux32 timed out trying to compile BLAS or something… not sure what happened there.

@inline index_shape_dim(A, dim, i::AbstractVector, I...) = (length(i), index_shape_dim(A, dim+1, I...)...)
@inline index_shape_dim{N}(A, dim, i::AbstractVector{CartesianIndex{N}}, I...) = (length(i), index_shape_dim(A, dim+N, I...)...)
@inline index_shape_dim(A, dim, i::AbstractArray, I...) = (size(i)..., index_shape_dim(A, dim+1, I...)...)
@inline index_shape_dim(A, dim, i::AbstractArray{Bool}, I...) = (sum(i), index_shape_dim(A, dim+1, I...)...)
Member (review comment): If I'm reading this right, does this mean all AbstractArray{Bool} act like they are vectors for the purpose of determining output dimensionality?

@mbauman (Member, Author): That's correct. They can't really do anything else.

@mbauman (Member, Author):

One potentially crazy interpretation would be that they should remove dimensions from the parent:

A = rand(1:10, 3,4,5)
A[bitrand(3,4), 1] # would select a vector of random elements from the first page of `A`

That's a separate "feature" though, I think.

Member (review comment):

I agree that's pretty much the only thing they can do. I'd simply suggest that we document this, since it's slightly at odds with the "sum of the dimensions of the indexes" rule.

One additional option is to support AbstractArray{Bool,N} if it's the only index, and otherwise require the index in each dimension to be an AbstractVector{Bool}.

@mbauman (Member, Author) commented Apr 23, 2016
Yeah, that's probably a reasonable restriction. As it stands, I have a hard time imagining a real use-case for indexing with a multidimensional logical array that is one of somewhere between 1 and N indices, unless we allow it to span multiple dimensions. One of the nice things about this APL indexing patch, though, is how it generalizes the first-index behavior and removes special cases like that.

On the other hand, I must say that I really value how restrictive Julia currently is with regards to logical indexing. I frequently hit Matlab bugs where vectorized comparisons of struct arrays (mask = [s.a] == x) will silently drop empty elements, and then it'll silently allow me to index a larger vector with it (e.g., [s(mask).b]). So artificial restrictions here can be very good.

@tkelman (Contributor) commented Apr 28, 2016

Sorry, I had been messing with the travis cache last time you pushed here. Restarted; should be fine now. Assuming it passes, is this okay to merge? Are we ready here?

@mbauman (Member, Author) commented Apr 28, 2016

Let's disallow indexing by multidimensional BitArrays unless they're the only index and match the size of the parent array. I think that can just be done as a bounds check to avoid contortions in dispatch.
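
A hedged sketch of what that restriction means in practice; the allowed case is ordinary logical indexing, and the rejected call is left as a comment since the exact error comes from whatever bounds check ends up implementing it:

julia> A = reshape(collect(1:12), 3, 4);

julia> A[A .> 6]                # sole logical index with the same size as A: still allowed
6-element Array{Int64,1}:
  7
  8
  9
 10
 11
 12

julia> # A[trues(2, 2), 1]      # multidimensional logical index alongside other indices: rejected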

@mbauman (Member, Author) commented Apr 29, 2016

Ok, assuming CI passes this is good to go.

@tkelman (Contributor) commented May 1, 2016

d'oh, needs another rebase. @timholy do you have any final comments here?

@timholy (Member) commented May 1, 2016

Just one: fantastic work, as usual.

Commits:

* Just do spot-checks on throwing bounds errors
* Be more judicious in the index types tested
* Run the subarray test first
* Allow a multidimensional logical array only as a single index that matches the size of the array it indexes into
@@ -15,11 +15,11 @@ Upon return, `tests` is a vector of fully-expanded test names, and
 """ ->
 function choosetests(choices = [])
     testnames = [
-        "linalg", "core", "inference", "keywordargs", "numbers", "printf",
-        "char", "string", "triplequote", "unicode",
+        "linalg", "subarray", "core", "inference", "keywordargs", "numbers",
Contributor (review comment): yay compiler getting faster

@tkelman merged commit 89775a5 into master on May 1, 2016
@tkelman deleted the mb/apl branch on May 1, 2016 at 18:02
@quinnj (Member) commented May 1, 2016

For the "array peasantry" among us, can we get a quick before and after example of the change here?

@mbauman (Member, Author) commented May 1, 2016

Sure! This generalizes a behavior that we already have:

julia> (0.1:0.1:1.0)[[1 2; 3 4]]
2×2 Array{Float64,2}:
 0.1  0.2
 0.3  0.4

This is indexing into a vector with a two-dimensional array, and this works on Julia 0.4. It uses the shape of the index array to determine where to put the elements of the vector. It's effectively reshape(A[vec(I)], size(I)). This PR extends this behavior to indexing with multiple indices.
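
That equivalence is easy to check directly (illustrative):

julia> A = 0.1:0.1:1.0; I = [1 2; 3 4];

julia> A[I] == reshape(A[vec(I)], size(I))
true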

I find this sort of operation extremely useful when you're extracting windowed sections of data. Take a look at this:

julia> A = 0.1:0.1:9.9
       A[find(round(A) .== A)' .+ (-2:2)]
5x9 Array{Float64,2}:
 0.8  1.8  2.8  3.8  4.8  5.8  6.8  7.8  8.8
 0.9  1.9  2.9  3.9  4.9  5.9  6.9  7.9  8.9
 1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
 1.1  2.1  3.1  4.1  5.1  6.1  7.1  8.1  9.1
 1.2  2.2  3.2  4.2  5.2  6.2  7.2  8.2  9.2

Here, we're extracting five indices about the whole numbers in the array — each column here represents a window about some repeated event. This also works on Julia 0.4.

But what happens if, instead of a vector, we have a matrix? This gets harder to display nicely in short meaningful chunks, but let's try this:

julia> A = [sin(-10:.1:10) cos(-10:.1:10)]
201×2 Array{Float64,2}:
  0.544021  -0.839072
  0.457536  -0.889191
  ⋮
 -0.457536  -0.889191
 -0.544021  -0.839072

julia> zero_crossings = find(diff(A[:,1] .>= 0))'
1×7 Array{Int64,2}:
 6  38  69  100  132  163  195

julia> A[zero_crossings .+ (-2:2), :]
5×7×2 Array{Float64,3}:
[:, :, 1] =
  0.271761   -0.21512     0.255541   -0.29552     0.239249   -0.279415    0.22289
  0.174327   -0.116549    0.157746   -0.198669    0.14112    -0.182163    0.124454
  0.0751511  -0.0168139   0.0583741  -0.0998334   0.0415807  -0.0830894   0.0247754
 -0.0247754   0.0830894  -0.0415807   0.0        -0.0583741   0.0168139  -0.0751511
 -0.124454    0.182163   -0.14112     0.0998334  -0.157746    0.116549   -0.174327

[:, :, 2] =
 -0.962365  0.976588  -0.966798  0.955336  -0.970958  0.96017   -0.974844
 -0.984688  0.993185  -0.98748   0.980067  -0.989992  0.983268  -0.992225
 -0.997172  0.999859  -0.998295  0.995004  -0.999135  0.996542  -0.999693
 -0.999693  0.996542  -0.999135  1.0       -0.998295  0.999859  -0.997172
 -0.992225  0.983268  -0.989992  0.995004  -0.98748   0.993185  -0.984688

Look at that! Now you have a snapshot into the coupled behavior between sin and cos in these two pages. You can see that the first page represents sin's zero-crossing behavior, and the second page is the coupled behavior of cos — it's near negative one when sin is decreasing and positive one when sin increases. We found the indices at which sin crossed zero, and then created a five-index window about those indices. By indexing into A with this two-dimensional matrix and selecting both columns with a :, we've added a dimension. These dimensions are meaningful: 5 element windows, 7 zero crossings, and 2 functions. You could just as easily select just one of the two columns, and it'd work just the same. This is the swanky new behavior.

For a somewhat less contrived example, take a look at the example on AxisArrays.jl's readme — that's where I initially implemented this sort of indexing. It's a little more complicated there since I use intervals to describe my windows in terms of the units on the axes, but the concept is exactly the same and those intervals just get reduced down to ranges (actually RangeMatrices) on their way to being used as an index.

In my line of research, I'm constantly taking windowed repetitions of multidimensional arrays like this. We often want to look at the behavior of some data signal at some event or time to create a windowed average (e.g., a spike-triggered average) or a peri-event histogram (e.g., a PSTH).
