Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should reshaping a SubArray produce another SubArray? #9874

Closed
johnmyleswhite opened this issue Jan 21, 2015 · 32 comments
Closed

Should reshaping a SubArray produce another SubArray? #9874

johnmyleswhite opened this issue Jan 21, 2015 · 32 comments
Labels
needs decision A decision on this change is needed

Comments

@johnmyleswhite
Copy link
Member

Right now, calling reshape on a SubArray allocates memory, which means that reshaping a slice causes you to lose your view into the original data. See the example below for the kinds of situations you might find yourself in.

julia> a = ones(9)
9-element Array{Float64,1}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0

julia> b = reshape(a, 3, 3)
3x3 Array{Float64,2}:
 1.0  1.0  1.0
 1.0  1.0  1.0
 1.0  1.0  1.0

julia> b[3, 3] = 9.0
9.0

julia> a
9-element Array{Float64,1}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 9.0

julia> c = slice(a, 4:9)
6-element SubArray{Float64,1,Array{Float64,1},(UnitRange{Int64},),1}:
 1.0
 1.0
 1.0
 1.0
 1.0
 9.0

julia> c[1] = 4.0
4.0

julia> a
9-element Array{Float64,1}:
 1.0
 1.0
 1.0
 4.0
 1.0
 1.0
 1.0
 1.0
 9.0

julia> d = reshape(c, 3, 2)
3x2 Array{Float64,2}:
 4.0  1.0
 1.0  1.0
 1.0  9.0

julia> d[2, 1] = 5.0
5.0

julia> a
9-element Array{Float64,1}:
 1.0
 1.0
 1.0
 4.0
 1.0
 1.0
 1.0
 1.0
 9.0
@johnmyleswhite johnmyleswhite added the needs decision A decision on this change is needed label Jan 21, 2015
@timholy
Copy link
Sponsor Member

timholy commented Jan 21, 2015

If that's how we want them to work, a rudimentary version is basically all ready-to-go, because our new SubArrays support a Vector{Int} as an index. Here's why that fact is relevant:

julia> A = reshape(1:16, 4, 4)
4x4 Array{Int64,2}:
 1  5   9  13
 2  6  10  14
 3  7  11  15
 4  8  12  16

julia> B = sub(A, 2:4, 3:4)
3x2 SubArray{Int64,2,Array{Int64,2},(UnitRange{Int64},UnitRange{Int64}),1}:
 10  14
 11  15
 12  16

julia> b = sub(A, [10,11,12,14,15,16])
6-element SubArray{Int64,1,Array{Int64,2},(Array{Int64,1},),0}:
 10
 11
 12
 14
 15
 16

julia> b[4] = 0
0

julia> A
4x4 Array{Int64,2}:
 1  5   9  13
 2  6  10   0
 3  7  11  15
 4  8  12  16

One big downside, though, is that you have to allocate a full-length vector just to express the "unwrapped" 2d indexing. It seems possible one could do better with some kind of smart multidimensional Range object, but I haven't given this any serious design thought yet.

@johnmyleswhite
Copy link
Member Author

That is a really helpful start, @timholy. But often we might want to reshape upwards by adding dimensions. So we want to go from a 1D structure to a 2D structure. That seems to require a call to reshape as far as I can tell with my limited understanding of sub.

@johnmyleswhite
Copy link
Member Author

Actually, I'm pretty confused by the b = sub(A, [10,11,12,14,15,16]) line. It seems like you can get the same effect just by using b = sub(A, 10:16) without allocating a full length vector.

@timholy
Copy link
Sponsor Member

timholy commented Jan 22, 2015

But often we might want to reshape upwards by adding dimensions

You can currently only do that in a very limited way: a = 1:10; s = sub(a, 2:4, 1). This makes a (trivial) 2d array out of a 1d array.

Actually, I'm pretty confused by the b = sub(A, [10,11,12,14,15,16]) line. It seems like you can get the same effect just by using b = sub(A, 10:16) without allocating a full length vector.

Notice I was deliberately omitting 13, which you can't do with the range.

@timholy
Copy link
Sponsor Member

timholy commented Jan 22, 2015

But adding dimensions could be implemented more generally with smart multidimensional range objects.

@johnmyleswhite
Copy link
Member Author

Thanks for clarifying the index point.

When you say multidimensional range objects, are you thinking of something like the punned meaning of 1:3:1:5 to indicate a range in two dimensions at once?

@timholy
Copy link
Sponsor Member

timholy commented Jan 22, 2015

Ideally, in terms of my example above we'd have an indexing object ind that returns 2,4 when accessed as ind[4]: b[4] -> A[2,4]. (In the 2d space spanned by (2:4,3:4)---the indexes used to create B---the 4th entry is (2,4) if you're incrementing along the first dimension.) So b would be effectively created as b = sub(A, ind) where ind is a magical 1d to 2d indexing object mapping 1:6 -> (2:4,3:4).

The problem is, I don't really know how to realistically make such objects work without calling div, and div is absolutely deadly for performance.

Going the opposite direction in a performant manner is much easier. Suppose you have a vector and you want to reshape it into a matrix:

a = 1:15
B = sub(a, Index(1:5,1:3))  # this would become the implementation of `reshape(a, 5, 3)`

Then B[3,2] could be trivially expanded to a[(2-1)*5+3] = 8.

More details at http://docs.julialang.org/en/latest/devdocs/subarrays/, especially the first two sections.

@simonster
Copy link
Member

The reshape docs say "an implementation for a particular type of array may choose whether the data is copied or shared." To me this implies that reshape of SubArrays should share data for SubArrays when that will be faster than constructing a new array (i.e., when we have fast linear indexing), but shouldn't when it won't be.

We could also possibly have a reshape_shared function (or something with a better name) that always shares data for any AbstractArray but possibly returns a view into the original array that uses linear indexing and thus may incur a performance penalty. reshape would be good for operations that only need to read from the array, whereas reshape_shared would be useful for in-place modification as described above.

@johnmyleswhite
Copy link
Member Author

"an implementation for a particular type of array may choose whether the data is copied or shared."

I think this makes reshape a little too difficult to reason about. Having to look up information about every specific method for a generic function to understand whether it introduces shared data seems problematic, especially when we go to good effort to make mutation so easy to predict from the name of a function alone.

For context on how these issues are coming up, I've been working with some folks that are new to Julia and are accustomed to building their entire infrastructure around higher-order tensors. They routinely takes slices (possibly with subsequent reshaping) of a linear packing of many distinct arrays. The linear packing is used because they send the entire packed data store into a black-box optimization suite that operates on vectors, but they simultaneously want to articulate the objective function they're optimizing in terms of many different objects -- all of which point, via SubArray-like objects, into the memory store of the grand linear packing. So they need to make sure that every operation they might perform can be done without memory copying.

@nalimilan
Copy link
Member

I agree with John. I'd rather people choose do an explicit action (like copying, or calling different function that copies) when they need performance, rather than forcing them to reason about whether the data is going to be copied or not. The fact that in Julia arrays sometimes share data is already relatively complex to handle (though incredibly powerful of course).

@StefanKarpinski
Copy link
Sponsor Member

I'm in agreement on this too – "Schrödinger's copy" is no good.

@milktrader
Copy link
Contributor

I'd like to keep tabs on this discussion but have nothing to add. And I don't know how to do so in github without posting a comment. Sighs. Sorry for the noise.

@johnmyleswhite
Copy link
Member Author

@milktrader, there's a notifications button the right-hand side that will let you do that. It's not very conspicuous, but it's near the bottom of the right-hand side panel.

@timholy
Copy link
Sponsor Member

timholy commented Jan 23, 2015

Relevant discussion of faster ways to do div: #8188 (comment). In the case of reshape the denominator is fixed, so the smart ind object could precompute the magic numbers so that usage is fast.

@milktrader
Copy link
Contributor

@johnmyleswhite ah, after all these years. Thanks!

@milktrader
Copy link
Contributor

Oh wait, I guess I was subscribed. I suppose I was looking for the participating flag. No way to do that that I can see. (goes back to his desk)

@timholy
Copy link
Sponsor Member

timholy commented Feb 25, 2015

I posted a rough draft of a solution in this gist. The basic idea is that we create new view type, a ReshapeArray. First, some simple demos:

julia> using Reshape

julia> A = myreshape(1:15, (3, 5))
3x5 Reshape.ReshapeArray{Int64,2,UnitRange{Int64},(Reshape.IndexMD{1,2},)}:
 1  4  7  10  13
 2  5  8  11  14
 3  6  9  12  15

julia> AA = copy(A)
3x5 Array{Int64,2}:
 1  4  7  10  13
 2  5  8  11  14
 3  6  9  12  15

julia> B = myreshape(AA, (15,))
15-element Reshape.ReshapeArray{Int64,1,Array{Int64,2},(Reshape.IndexMD{2,1},)}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15

julia> BB = myreshape(A, (15,))
15-element Reshape.ReshapeArray{Int64,1,UnitRange{Int64},(Colon,)}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15

julia> C = reshape(1:16, (2,4,2))
2x4x2 Array{Int64,3}:
[:, :, 1] =
 1  3  5  7
 2  4  6  8

[:, :, 2] =
  9  11  13  15
 10  12  14  16

julia> CR = myreshape(C, (4,4))  # note this doesn't split along the dimensions of C
4x4 Reshape.ReshapeArray{Int64,2,Array{Int64,3},(Reshape.IndexMD{3,2},)}:
 1  5   9  13
 2  6  10  14
 3  7  11  15
 4  8  12  16

julia> V = sub(CR, 2:3,2:4)
2x3 SubArray{Int64,2,Reshape.ReshapeArray{Int64,2,Array{Int64,3},(Reshape.IndexMD{3,2},)},(UnitRange{Int64},UnitRange{Int64}),0}:
 6  10  14
 7  11  15

julia> myreshape(V, (3,2))
3x2 Reshape.ReshapeArray{Int64,2,SubArray{Int64,2,Reshape.ReshapeArray{Int64,2,Array{Int64,3},(Reshape.IndexMD{3,2},)},(UnitRange{Int64},UnitRange{Int64}),0},(Reshape.IndexMD{2,2},)}:
  6  11
  7  14
 10  15

As the last example illustrates, the problem is that we'll get a ReshapeArray{SubArray{ReshapeArray{SubArray...}}}. I'm pretty sure there is no way around this: while there are simple index types that are closed under either sub or reshape (calling sub on a SubArray just yields another SubArray, and likewise for calling myreshape on a ReshapeArray), there isn't one that is closed under their combination.

The advantage of this approach is that we can guarantee that reshape always creates a view: AFAICT, you will never need to silently return a copy simply because the indexing got too complicated. This is different from NumPy, which is overall a very admirable implementation of multidimensional arrays. The downside is that there are likely to be performance hits for using highly-composed objects. My suspicion is that in practice these will not be nested very deeply (probably rarely deeper than SubArray{ReshapeArray{SubArray}}}). If users don't care whether it's a copy or a view, they can always get better performance by taking a copy if the nesting gets too deep.

@timholy
Copy link
Sponsor Member

timholy commented Feb 25, 2015

I guess the main point is: what are people's thoughts on this design? EDIT: I've done nothing to optimize it yet, and there are surely still bugs, but at least the general principle should be clear.

@johnmyleswhite
Copy link
Member Author

This certainly resolves one of my main concerns about Arrays. Thanks for working on this!

@nalimilan
Copy link
Member

It indeed sounds better to always return a view, possibly at the cost of performance in very complex cases. People should always be able to call copy() manually in cases where the performance is too bad.

@mbauman
Copy link
Sponsor Member

mbauman commented Feb 28, 2015

@timholy - I think you're not getting many comments here because it seems so sensible. I think this is a great direction. I think it'd be even better if it were performant enough to handle dense array reshapes, too (which is currently handled in the C array implementation). This way there'd be user-visible information about the view-like nature of the reshaped array.

Once we move to using views, an open question is if A[:] should return a ReshapeArray when A[1:end] would return a SubArray. Is that surprising? Or sensible?

@timholy
Copy link
Sponsor Member

timholy commented Feb 28, 2015

Thanks to all who commented. I'll continue with this direction and try to get something merged; I may not get to it for a couple of days, though.

@timholy
Copy link
Sponsor Member

timholy commented Feb 28, 2015

Once we move to using views, an open question is if A[:] should return a ReshapeArray when A[1:end] would return a SubArray. Is that surprising? Or sensible?

Good question. In fact, if A has more than one dimension, A[1:end] is going to have to return a SubArray{ReshapeArray{ParentArrayType}}, unless we turn 1:end into : using some parser magic. The bottom line is that converting to linear indexing will almost certainly be best achieved as a reshape, and the index selection done on top of that.

@Jutho
Copy link
Contributor

Jutho commented Mar 2, 2015

Couldn't the magic numbers for speeding up div be built in the definition of SubArray and computed when a SubArray is constructed; that way, all SubArrays would have (reasonably) fast linear indexing, and then Reshape(d?)Array could be a simple wrapper that just uses linear indexing into the parent array according to the formula i1 + newsize1*( i2 - 1 + newsize2*(i3-1+ ...)).

@timholy
Copy link
Sponsor Member

timholy commented Mar 2, 2015

I've thought about that. I suppose the issue will come down to how long they take to compute, and I haven't yet played with that code at all. There are folks who have tried using SubArrays to extract the columns of a 2xN matrix, and even with our new-and-improved SubArrays it doesn't work terribly well. (Presumably once tuples can be elided, it will be much better.)

@Jutho
Copy link
Contributor

Jutho commented Mar 2, 2015

That's indeed important to consider. Is that construction overhead that you are talking about in the example of the columns? Maybe a lightweight StridedView type that only works with a parent of type Array, without all the power of SubArray, is something to consider. I haven't studied the SubArray code so I don't know if there is much to be gained by specialising for simpler cases.

On a more general note, it is probably also not worth the effort trying to write the most general code that can provide reasonable efficiency for all possible combinations of indexing and slicing, reshaping and permuting for all possible parent array types, when people in the end just want great efficiency for strided views over an Array.

In that respect, I am not quite sure whether reshape over an arbitrary AbstractArray should provide a view. What behaviour is to be expected for reshape acting on a range, on the different special types of matrices, on sparse arrays, ... ?

@Jutho
Copy link
Contributor

Jutho commented Mar 2, 2015

Well, the StridedView proposal is essentially @lindahua 's ArrayViews.jl package, which I have never worked with myself, so I forgot about it for a minute there.

@timholy
Copy link
Sponsor Member

timholy commented Mar 2, 2015

A lot of it comes down to having to save a dims tuple. If you want really fast column views, you basically have to do this:

immutable ColVector{T,M<:AbstractMatrix} <: AbstractVector{T}
    parent::M
    col::Int
end
ColVector(A::AbstractMatrix, col::Integer) = ColVector{eltype(A), typeof(A)}(A, Int(col))

getindex(v::ColVector, i::Int) = v.parent[i,v.col]
size(v::ColVector) = (size(v.parent, 1),)
size(v::ColVector, d) = d == 1 ? size(v.parent, 1) : 1

Even this has its negatives; I think @mlubin noticed that even storing a reference to the parent array means this can't be elided, and so he was storing a pointer instead (which means you need to manually maintain a reference).

@johnmyleswhite
Copy link
Member Author

Even this has its negatives; I think @mlubin noticed that even storing a reference to the parent array means this can't be elided, and so he was storing a pointer instead (which means you need to manually maintain a reference).

This latter issue is problem for lots of other things, including Nullable objects that wrap around something that's not isbits.

Is there any hope we might improve on this at some point?

@simonster
Copy link
Member

Dup of #4211?

@johnmyleswhite
Copy link
Member Author

Possibly? It looks very similar and is possibly subsumed by #4211.

@JeffBezanson
Copy link
Sponsor Member

Closing as dup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs decision A decision on this change is needed
Projects
None yet
Development

No branches or pull requests

9 participants