Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand copy! to handle mixed nullability and views #97

Merged
merged 1 commit into from
Nov 15, 2017
Merged

Conversation

cjprybol
Copy link
Contributor

@cjprybol cjprybol commented Nov 1, 2017

replaces #92

@codecov
Copy link

codecov bot commented Nov 2, 2017

Codecov Report

Merging #97 into master will decrease coverage by 0.11%.
The diff coverage is 95.65%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #97      +/-   ##
==========================================
- Coverage   97.87%   97.76%   -0.12%     
==========================================
  Files           9        9              
  Lines         660      672      +12     
==========================================
+ Hits          646      657      +11     
- Misses         14       15       +1
Impacted Files Coverage Δ
src/subarray.jl 100% <100%> (ø) ⬆️
src/array.jl 97.27% <95.23%> (-0.3%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f8dde5b...373bac5. Read the comment docs.

src/array.jl Outdated
destinds, srcinds = linearindices(dest), linearindices(src)
(dstart ∈ destinds && dstart+n-1 ∈ destinds) || throw(BoundsError(dest, dstart:dstart+n-1))
(sstart ∈ srcinds && sstart+n-1 ∈ srcinds) || throw(BoundsError(src, sstart:sstart+n-1))
n == 0 && return dest
n < 0 && throw(ArgumentError(string("tried to copy n=", n, " elements, but n should be nonnegative")))

drefs = dest.refs
srefs = src.refs
srefs = isa(src, SubArray) ? [x.level for x in src] : src.refs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really inefficient. Why can't we access the refs of the parent?

src/array.jl Outdated
@@ -433,9 +434,15 @@ function copy!(dest::CategoricalArray{T, N}, dstart::Integer,
dest
end

copy!(dest::CategoricalArray{T, N}, src::CategoricalArray{T, N}) where {T,N} =
CA_OR_CA_VIEW = Union{CategoricalArray, SubArray{A,B,C,D} where
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type names should be in CamelCase. D isn't needed. Actually, you should even be able to use SubArray{<:Any, <:Any, <:CategoricalArray}.

src/array.jl Outdated
n::Integer=length(src)-sstart+1) where {T, N}
function copy!(dest::CategoricalArray{<:Union{T,Null}, N}, dstart::Integer,
src::S, sstart::Integer, n::Integer=length(src)-sstart+1) where
{T,N,S<:Union{CategoricalArray{<:Union{T,Null},N},SubArray{A,B,C,D}} where {A,B,C<:CategoricalArray{<:Union{T,Null},N},D}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to make this shorter by parameterizing CA_OR_CA_VIEW on the element type and using it here. Also avoid defining type parameters that are not used where possible, and note that the package currently uses spaces after commas even for type parameters.

src/array.jl Outdated
else
srefs = src.refs
end
spool = isa(src, SubArray) ? src.parent.pool : src.pool
Copy link
Contributor

@alyst alyst Nov 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good example why it might be good to have CategoricalArrays.pool(A::CategoricalArray) and CategoricalArrays.pool(A::SubArray{CategoricalArray}); the same for refs().
All type-dependent logic would be automatically handled by dispatch; less worries that type inference did its job right and such code blocks are efficiently compiled.
It would also be easy to extend to handle e.g. ReshapedArray{CategoricalArray}.

src/array.jl Outdated
sstart::Integer,
n::Integer=length(src)-sstart+1) where
{T, N, S <: Union{CategoricalArray{<:Union{T ,Null}, N},
SubArray{<:Any, <:Any, <:CategoricalArray{<:Union{T, Null}, N}}}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the 2nd tparam of SubArray is dimensionality, so it should match N of the destination, whereas the dimensionality of the parent should not be checked.

Copy link
Contributor

@alyst alyst Nov 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much this copy!() really depends on the explicit nullability? Could <:Union{T,Null} be simplified into T?
There's a use of T in levels!() below, but there you can use leveltype(dest).
So in principle the declaration could be simplified to dest::CategoricalArray{T,N}, src::Union{CategoricalArray{S, N}, SubArray{<:Any, N, <:CategoricalArray{S}}, where T===S or T === Union{S, Null} or S === Union{T, Null}, but it should be ok to do this check (leveltype(dest) == leveltype(src)) in the beginning of the function body.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding nullability, I think I had left a comment in the original PR. The problem is that you're allowed to copy from a nullable array into a non-nullable array. As for what's the best way of ensuring consistency, I'd say the most compact solution is the best one.

Actually it looks like the current method is too restrictive, as we should also preserve levels when the types of source and destination are different (e.g. different integer types). So we could as well make remove restrictions on the element types (but test that it works with different types, and that in case of conversion error we don't corrupt the destination).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjprybol Have you tried using CatArrSrc here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it looks like for nullability there's no other way than to do a run-time check for zeros in src if dest is non-nullable.

I'm not sure it's the most compact way, but since there's no easy way to express the requirements of src and dest category level types in the function signature, src could also be declared as src::AbstractArray{<:Union{CatValue, Null}, N}. There could be a weird case of src being neither CategoricalArray nor its wrapper, but then calls like leveltype(src) would fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you both for the helpful comments!

@cjprybol Have you tried using CatArrSrc here?

I originally had src::CatArrSrc (although it was a different name before) and I couldn't get the dispatch to work correctly. The element-type T in the CatArrSrc definition wasn't being recognized as the same element-type T of the dest inside of the copy! function. When I copied & the CatArrSrc definition into the copy! function parameters everything worked. However, it sounds like we're all in agreement that src and dest don't need to have the same element-type T because we may want to copy CategoricalArrays where the eltypes of src and dest are supertypes or subtypes of one another. I may be able to use src::CatArrSrc again as long as I can get the dimensionality parameterization to work. Working on testing that out now

@cjprybol
Copy link
Contributor Author

cjprybol commented Nov 7, 2017

Function parameterization updated to be a little cleaner. I again was unsuccessful trying to use CatArrSrc as a type parameter in copy!, but the ternary operations have been moved to functions. I'm not sure what the cleanest way to check the compatibility of element-types. We could put a check at the beginning as @alyst suggested and/or just let the function throw an error if it's unable to copy the elements successfully. Let me know what you think and if you have any requests for specific test combinations and edge cases!

@nalimilan
Copy link
Member

Have you tried with something like CatArrayOrSubArray{T, N} = Union{CategoricalArray{T, N}, SubArray{<:Any, N, <:CategoricalArray{T}}}?

Regarding the handling of different element types, it looks like you could just do convert(Vector{T}, levels(src)) before calling mergelevels, so that an error is raised early if the conversion is not possible. Other than that, everything should work (it would actually work without it, but in case conversion fails the array could be left in a corrupt state). For the particular case of nullable to non-nullable conversion, just add something like this before mutating the destination: if eltype(src) >: Null && !(eltype(src) >: Null) && !all(x -> x > 0, srefs); throw(NullException()); end. In both cases, these should be tested (both for working and failing cases).

src/extras.jl Outdated
@@ -122,3 +122,6 @@ Cut a numeric array into `ngroups` quantiles, determined using
cut(x::AbstractArray, ngroups::Integer;
labels::AbstractVector{U}=String[]) where {U<:AbstractString} =
cut(x, quantile(x, (1:ngroups-1)/ngroups); extend=true, labels=labels)

refs(A::CategoricalArray) = A.refs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better put this in array.jl, and the SubArray method in subarray.jl. Same for pool.

src/array.jl Outdated
@@ -412,7 +415,7 @@ function copy!(dest::CategoricalArray{T, N}, dstart::Integer,
if dstart == dstart == 1 && n == length(dest) == length(src)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By adding !isa(dest, SubArray) && ... here, you could relax the method signature to accept dest::SubArray. I don't think any other changes are needed, but testing is of course needed to check that.

@nalimilan
Copy link
Member

Any news? Would you prefer me to finish it?

@nalimilan
Copy link
Member

@cjprybol I've added a commit allowing for destination SubArray. Please tell me whether it looks OK.

@cjprybol
Copy link
Contributor Author

Yes, your changes look great! Very glad to see that you got it to work with the CatArrOrSub, and the error messages and extension to allow copying into a viewed dest are both big improvements. Out of curiosity, did you try getting the function to dispatch correctly when using src::CatArrOrSub{T, N} rather than src::CatArrOrSub{<:Any, N} as it is now? I tried dozens of similar parameter combinations where dest and src had the same T and got nowhere, still not sure why... but that was before we realized dest and src don't need to have the same eltype T and so it doesn't matter much now.

👍 I approve 👏

@nalimilan
Copy link
Member

Yes, your changes look great! Very glad to see that you got it to work with the CatArrOrSub, and the error messages and extension to allow copying into a viewed dest are both big improvements.

I'm still hesitant about whether copying into a SubArray should be allowed to change the ordered property of the parent or not. We already allow it to change the levels, but that would be a step further as e.g. it would prevent comparing two values from an unrelated part of the parent array with <. Anyway, that's really a corner case so not a big deal.

Out of curiosity, did you try getting the function to dispatch correctly when using src::CatArrOrSub{T, N} rather than src::CatArrOrSub{<:Any, N} as it is now? I tried dozens of similar parameter combinations where dest and src had the same T and got nowhere, still not sure why... but that was before we realized dest and src don't need to have the same eltype T and so it doesn't matter much now.

I haven't tried, since that seems redundant with the third parameter. But note that it needs to be CategoricalValue{T} for T to correspond to the first parameter to CategoricalArray.

- add refs and pool functions for accessing fields
- implement internal copy checks to avoid erroring mid-copy and
  corrupting the dest
- introduce tests for various combinations of src and dest types
  and expected failures
@nalimilan
Copy link
Member

Thanks for doing most of the work!

@nalimilan nalimilan deleted the cjp/copy2 branch November 15, 2017 20:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants