Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

join() doesn't preserve categorical levels order #1257

Closed
alyst opened this issue Oct 13, 2017 · 10 comments
Closed

join() doesn't preserve categorical levels order #1257

alyst opened this issue Oct 13, 2017 · 10 comments
Assignees

Comments

@alyst
Copy link
Contributor

alyst commented Oct 13, 2017

julia> using DataFrames
julia> A = DataFrame(a = [1,2,3], b = ["a", "b", "c"]);
julia> B = DataFrame(b = ["a", "b", "c"], c = levels!(categorical(["a", "b", "b"]), ["b", "a"]));
julia> levels(B[:c])
2-element Array{String,1}:
 "b"
 "a"

julia> C = join(A, B, on=:b)
3×3 DataFrames.DataFrame
│ Row │ a │ b │ c │
├─────┼───┼───┼───┤
│ 1   │ 1 │ a │ a │
│ 2   │ 2 │ b │ b │
│ 3   │ 3 │ c │ b │

julia> levels(C[:c])
2-element Array{String,1}:
 "a"
 "b"
@nalimilan
Copy link
Member

nalimilan commented Oct 13, 2017

Good catch. I think this happens because we call similar and fill the resulting array. It's kind of unfortunate since the old code (in 0.10.1) used to rely only on vcat, which already implements the required logic in CategoricalArrays. We changed that in JuliaData/DataTables.jl#17 and JuliaData/DataTables.jl#44 for performance reasons (see also your comment). After JuliaData/DataTables.jl#17, resize! was used rather than similar, but IIRC there were bugs (which may not have been related directly to this question) which were fixed in JuliaData/DataTables.jl#44. Using resize! isn't perfect either since vcat includes some logic to merge levels when concatenating two CategoricalArrays, which would have to be reimplemented.

If we can't find an efficient solution which doesn't rely on similar, we'll have to add a special fillcolumn! method for CategoricalArray columns which would call CategoricalArrays.mergelevels and set levels based on that. I don't like the idea of adding another special case, though.

Would you be willing to investigate this? I'm pretty swamped at the moment.

Cc: @cjprybol

@alyst
Copy link
Contributor Author

alyst commented Oct 13, 2017

To me copying the pool into the new array when doing similar(A::CategoricalArray) seems the most straightforward solution.
Similar problem would have to be solved for Query.jl (when it would properly handle categorical arrays), which grows the result row by row, so the order of the levels in the new vector may change.
IMO there are very few situations when similar() doesn't need to copy/share the pool (but theoretically it's also possible to add pool=:copy/:share/:empty to similar()).

@nalimilan
Copy link
Member

That would certainly work, but if somebody passes a CategoricalArray to a function which calls similar and then fills it with completely different data? Then you'll have completely unrelated levels floating around. I'd argue it isn't very common either to call similar just to fill the resulting vector with the same values as the original. A (well-known) problem is that similar isn't a very well defined operation, so we could certainly add keyword arguments to make the intended behavior more specific. But I don't see a way of making it work magically in all cases, so I'd rather add special code in packages which need to be aware of it anyway.

@nalimilan nalimilan added this to the 0.11 milestone Oct 13, 2017
@nalimilan
Copy link
Member

@cjprybol Do you think we could go back to calling vcat?

@cjprybol
Copy link
Contributor

Yes, if vcat already has the correct logic we should try switching back to that

@cjprybol
Copy link
Contributor

Easiest way to handle this looks to be just checking if the original column is a CategoricalArray, and if so, make sure to copy the levels.

julia> A = DataFrame(a = [1,2,3], b = ["a", "b", "c"]);

julia> B = DataFrame(b = ["a", "b", "c"], c = levels!(categorical(["a", "b", "b"]), ["b", "a"]));

julia> levels(B[:c])
2-element Array{String,1}:
 "b"
 "a"

julia> C = join(A, B, on=:b)
3×3 DataFrames.DataFrame
│ Row │ a │ b │ c │
├─────┼───┼───┼───┤
│ 11 │ a │ a │
│ 22 │ b │ b │
│ 33 │ c │ b │

julia> levels(C[:c])
2-element Array{String,1}:
 "b"
 "a"

I couldn't figure out a good way to use vcat since the current algorithm doesn't build columns in order, but rather creates a properly sized vector and then fills in value by value, and that vector filling isn't in order and may skip indices that will be filled in later (only for right outer join though)

@nalimilan
Copy link
Member

Yeah, but that's not a generic solution, so I'd have preferred using vcatif possible. Also, is there a way to handle concatenation of two CategoricalArray columns with different levels? We have a relatively complex logic to merge levels and find a consistent ordering in vcat.

@cjprybol
Copy link
Contributor

I don't see the connection between vcat and join as operations. Using join doesn't implicitly ensure that it's possible to vertically join vectors, and redesigning join to use vcat seems like a general regression to improve the support for a single case. I don't think we currently support proper level ordering in vcat for DataFrames (rather than columns) either because of the promote_vector_type issues that are still unresolved see vcat code. Generating a properly sized and typed Vector and filling it with values was the work-around we decided worked best to avoid doing things like returning Vector{CategoricalValue}, right?

@nalimilan
Copy link
Member

Good points. Looking at the code, I rediscovered that copy! for CategoricalArray also preserves levels, and it's closer to what we need than vcat. Unfortunately we use our custom fillcolumn! function instead of copy!. Do you think the code could be refactored a bit to use copy! instead? Sorry for asking again, after trying with vcat. ;-)

If that doesn't work, let's got with a special case for CategoricalArray. But then please have a look at the mergelevels stuff in copy! and use the same strategy for consistency. It took quite a bit of work at the time it was designed, but now I'm glad to see it's exactly what we need here.

@nalimilan
Copy link
Member

@cjprybol Do you think you'll be able to find the time to investigate using copy!? We need to fix this before tagging a release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants