Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure levels are maintained when joining df's with Categorical cols #1266

Merged
merged 1 commit into from
Nov 21, 2017

Conversation

cjprybol
Copy link
Contributor

Fixes #1257

cc @alyst

@coveralls
Copy link

coveralls commented Oct 18, 2017

Coverage Status

Coverage increased (+0.03%) to 72.599% when pulling ccd30da on cjp/joinlevels into 86c6145 on master.

@cjprybol
Copy link
Contributor Author

cjprybol commented Oct 26, 2017

Updated to include the mergelevels steps from copy! in CategoricalArrays. I think it's probably possible to refactor this to use copy! directly, but if we did that then we would need to add a few other steps that I think would ultimately make that less efficient than what we already have here. For a copy!-based approach we would need to copy, then (potentially) resize, then (potentially) promote the categorical eltype from T to Union{T, Null}, and then we'd still have to call our custom fillcolumn! function to make sure the indices from the original DataFrames get placed in the correct indices in the output DataFrame when the values in the resized column should have gaps inserted between them (to accept values from right outer join). Checking if the original columns were categorical, copying over the levels, and using mergelevels to handle the right outer join case where we need to bring in values from the right table that are missing from the left table seems like a sensible way to get the same result.

If the gaps that we insert to accept outer join values are always at the end, then I'm wrong here and copy! might allow us to get rid of fillcolumn!, but I don't think that's the case (see examples from before we fixed the ROJ bug here)

I've added a note to the docstring for join to clarify that the levels ordering of the left table takes precedence over the ordering of the right table

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'd really like to avoid special-casing CategoricalArray in DataFrames, but I trust you if you say that doesn't work. Maybe we could create an API to pseudo-concatenate two arrays, something like similar but taking two arrays instead of one. But for now it will be fine.

test/join.jl Outdated
@test levels(join(B, A, on=:b, kind = :semi)[:b]) == ["a", "b", "c"]
end

@testset "maintain Categorical levels ordering on join - only 1 is categorical" begin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Only LHS is categorical" (IIUC).

test/join.jl Outdated
@@ -325,4 +325,50 @@ module TestJoin
@test all(isa.(o(on).columns,
[CategoricalVector{Union{T, Null}} for T in (Int, Float64)]))
end

@testset "maintain Categorical levels ordering on join - non-`on` cols" begin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CategoricalArray

test/join.jl Outdated
A = DataFrame(a = [1, 2, 3], b = ["a", "b", "c"])
c = levels!(categorical(["a", "b", "b"]), ["b", "a"])
B = DataFrame(b = ["a", "b", "c"], c = c)
@test levels(join(A, B, on=:b, kind=:inner)[:c]) == ["b", "a"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to test isordered too, since that's easy to get wrong. While you're at it, testing the actual contents of the resulting column wouldn't hurt either, if that's not too much additional code.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.3%) to 72.819% when pulling de9a640 on cjp/joinlevels into 86c6145 on master.

@coveralls
Copy link

coveralls commented Oct 26, 2017

Coverage Status

Coverage increased (+0.2%) to 72.758% when pulling 1d92fa7 on cjp/joinlevels into 86c6145 on master.

@cjprybol
Copy link
Contributor Author

Ignore what I said before, a combination of permute! plus an off-by-one error I made during my first attempt at using copy! confused me into thinking something completely different was going on here. This now uses copy!.

I think I remember why we went with fillcolumn! instead of copy! the first time though. We make a copy of the original column when we subset/reorder and pass it as the source into the copy! function. One work around would be to view(col, all_orig_left/right_indices) rather than col[all_orig_left/right_indices] which would achieve the same goal and avoid the copy. We'd have to extend JuliaData/CategoricalArrays.jl#92 to accept SubArray{CategoricalArray}. Alternatively, we might be able to just permute the source column in-place (within the DataTableJoiner object).

@coveralls
Copy link

Coverage Status

Coverage increased (+0.3%) to 72.819% when pulling de9a640 on cjp/joinlevels into 86c6145 on master.

@nalimilan
Copy link
Member

Ah, great, that's really the most natural and generic API. Using view would also sound like the simplest solution. Can you extend JuliaData/CategoricalArrays.jl#92 to accept SubArray{CategoricalArray}?

@cjprybol
Copy link
Contributor Author

cjprybol commented Nov 2, 2017

This is working for me locally with JuliaData/CategoricalArrays.jl#97. We'll need to merge that, tag a new release and update this PR with a new minimum requirement on CategoricalArrays

@nalimilan
Copy link
Member

CI won't pass until we move to Missings. I'm preparing a PR for that.

@coveralls
Copy link

coveralls commented Nov 21, 2017

Coverage Status

Coverage decreased (-0.03%) to 72.926% when pulling c180053 on cjp/joinlevels into 2d29d2a on master.

@nalimilan nalimilan merged commit ff2729b into master Nov 21, 2017
@nalimilan nalimilan deleted the cjp/joinlevels branch November 21, 2017 13:14
@nalimilan
Copy link
Member

Thanks! I really like how you have been able to fix a bug while removing code.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants