This repository has been archived by the owner on May 5, 2019. It is now read-only.

Use whatever column-type you want #24

Closed

@cjprybol cjprybol commented Feb 27, 2017

This is an attempt at removing the automatic promotion to NullableArrays so that users have more type stability and control. I believe this is what the consensus of users wanted, based on the issue "DataFrames: what's inside?" posted by @quinnj at JuliaData/DataFrames.jl#1119. It is NOT an implementation of "Develop an official AbstractColumn interface", but I hope it can be a stepping stone.

example

using DataTables
dt = DataTable(A = 1:3, B = 2:4, C = 3:5)
map(typeof, dt.columns)
dt[:D] = NullableArray([4, 5, Nullable()])
map(typeof, dt.columns)
dt[:E] = 'c'
map(typeof, dt.columns)

current

julia> using DataTables
INFO: Recompiling stale cache file /Users/Cameron/.julia/lib/v0.5/DataTables.ji for module DataTables.

julia> using DataTables

julia> dt = DataTable(A = 1:3, B = 2:4, C = 3:5)
3×3 DataTables.DataTable
│ Row │ A │ B │ C │
├─────┼───┼───┼───┤
│ 1   │ 1 │ 2 │ 3 │
│ 2   │ 2 │ 3 │ 4 │
│ 3   │ 3 │ 4 │ 5 │

julia> map(typeof, dt.columns)
3-element Array{Any,1}:
 NullableArrays.NullableArray{Int64,1}
 NullableArrays.NullableArray{Int64,1}
 NullableArrays.NullableArray{Int64,1}

julia> dt[:D] = NullableArray([4, 5, Nullable()])
3-element NullableArrays.NullableArray{Int64,1}:
 4
 5
 #NULL

julia> map(typeof, dt.columns)
4-element Array{Any,1}:
 NullableArrays.NullableArray{Int64,1}
 NullableArrays.NullableArray{Int64,1}
 NullableArrays.NullableArray{Int64,1}
 NullableArrays.NullableArray{Int64,1}

julia> dt[:E] = 'c'
'c'

julia> map(typeof, dt.columns)
5-element Array{Any,1}:
 NullableArrays.NullableArray{Int64,1}
 NullableArrays.NullableArray{Int64,1}
 NullableArrays.NullableArray{Int64,1}
 NullableArrays.NullableArray{Int64,1}
 NullableArrays.NullableArray{Char,1}

This PR

julia> using DataTables

julia> dt = DataTable(A = 1:3, B = 2:4, C = 3:5)
3×3 DataTables.DataTable
│ Row │ A │ B │ C │
├─────┼───┼───┼───┤
│ 1   │ 1 │ 2 │ 3 │
│ 2   │ 2 │ 3 │ 4 │
│ 3   │ 3 │ 4 │ 5 │

julia> map(typeof, dt.columns)
3-element Array{DataType,1}:
 UnitRange{Int64}
 UnitRange{Int64}
 UnitRange{Int64}

julia> dt[:D] = NullableArray([4, 5, Nullable()])
3-element NullableArrays.NullableArray{Int64,1}:
 4
 5
 #NULL

julia> map(typeof, dt.columns)
4-element Array{DataType,1}:
 UnitRange{Int64}
 UnitRange{Int64}
 UnitRange{Int64}
 NullableArrays.NullableArray{Int64,1}

julia> dt[:E] = 'c'
'c'

julia> map(typeof, dt.columns)
5-element Array{DataType,1}:
 UnitRange{Int64}
 UnitRange{Int64}
 UnitRange{Int64}
 NullableArrays.NullableArray{Int64,1}
 Array{Char,1}

The first thing you'll notice is that you'll want to start using an explicit collect(range)

julia> using DataTables

julia> dt = DataTable(A = collect(1:3), B = collect(2:4), C = collect(3:5))
3×3 DataTables.DataTable
│ Row │ A │ B │ C │
├─────┼───┼───┼───┤
│ 1   │ 1 │ 2 │ 3 │
│ 2   │ 2 │ 3 │ 4 │
│ 3   │ 3 │ 4 │ 5 │

julia> map(typeof, dt.columns)
3-element Array{DataType,1}:
 Array{Int64,1}
 Array{Int64,1}
 Array{Int64,1}

I've added the functions nullify!() and nullify(), which convert all columns to NullableVectors, and denullify!() and denullify(), which unwrap all columns that are null-free.
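
As a rough illustration of the intended semantics (not the code in this PR), the sketch below mimics the two conversions on a single column. It assumes a current Julia release, where `Nullable` is no longer part of Base, and so uses `Union{T, Missing}` as a stand-in for a nullable column. The names `nullify_sketch` and `denullify_sketch` are hypothetical.

```julia
# Stand-in for nullify: widen the element type so the column can hold missing values.
nullify_sketch(col::AbstractVector) =
    convert(Vector{Union{eltype(col), Missing}}, col)

# Stand-in for denullify: narrow the element type only when no value is missing.
function denullify_sketch(col::AbstractVector)
    if Missing <: eltype(col) && !any(ismissing, col)
        return convert(Vector{nonmissingtype(eltype(col))}, col)
    end
    return col  # holds missing values (or was never widened): leave untouched
end

widened  = nullify_sketch(collect(1:3))                          # can now hold missing
narrowed = denullify_sketch(widened)                             # back to a plain Vector{Int}
with_gap = denullify_sketch(Union{Int, Missing}[4, 5, missing])  # stays widened
```

The point of the sketch is only the asymmetry: widening always succeeds, while narrowing is applied per column and skipped when a missing value is present.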

This no longer supports vcatting DataTables where the column names do not match or the column dimensions are different. I could add it back but it's inconsistent with base behavior and I'd rather put the responsibility on the user to rename columns or resize the data as an added check that they really want to vcat DataTables of mixed size/columns. If this is unpopular I'm happy to revert behavior.

julia> rand(2,2)
2×2 Array{Float64,2}:
 0.863193  0.0246376
 0.137138  0.892913

julia> rand(2,3)
2×3 Array{Float64,2}:
 0.560117  0.921307    0.479872
 0.238899  0.00126669  0.0588353

julia> vcat(rand(2,2), rand(2,3))
ERROR: ArgumentError: number of columns of each array must match (got (2,3))
 in typed_vcat(::Type{Float64}, ::Array{Float64,2}, ::Array{Float64,2}) at ./abstractarray.jl:1047
 in vcat(::Array{Float64,2}, ::Array{Float64,2}) at ./array.jl:725

I seem to have broken readtable and writetable but I wasn't inclined to figure out why, as a broken readtable/writetable is another reason to push forth on deprecating them for CSV.read/write #10.

This is currently passing all tests on my setup, but of note this includes the code from #17. This change broke the current groupby & join, and I was more familiar with #17 than with the current implementations. Apologies for the extra diff noise because of that code inclusion. If that pull request is rejected then I'll need to do some reworking.

julia> Pkg.test("DataTables")
INFO: Computing test dependencies for DataTables...
INFO: Installing Atom v0.5.9
INFO: Installing Blink v0.5.1
INFO: Installing CodeTools v0.4.3
INFO: Installing Codecs v0.2.0
INFO: Installing DataArrays v0.3.12
INFO: Installing HttpCommon v0.2.6
INFO: Installing HttpParser v0.2.0
INFO: Installing HttpServer v0.1.7
INFO: Installing LNR v0.0.2
INFO: Installing LaTeXStrings v0.2.0
INFO: Installing Lazy v0.11.5
INFO: Installing MbedTLS v0.4.3
INFO: Installing Mustache v0.1.3
INFO: Installing Mux v0.2.3
INFO: Installing RData v0.0.4
INFO: Installing RDatasets v0.2.0
INFO: Installing WebSockets v0.2.1
INFO: Building HttpParser
INFO: Building Homebrew
Already up-to-date.
INFO: Building MbedTLS
Using system libraries...
INFO: Testing DataTables
Running tests:
	PASSED: utils.jl
	PASSED: cat.jl
WARNING: using DataTables.sub in module TestData conflicts with an existing identifier.
	PASSED: data.jl
	PASSED: index.jl
	PASSED: datatable.jl
	PASSED: datatablerow.jl
	PASSED: io.jl
	PASSED: constructors.jl
	PASSED: conversions.jl
	PASSED: sort.jl
	PASSED: grouping.jl
	PASSED: join.jl
	PASSED: iteration.jl
	PASSED: duplicates.jl
	PASSED: show.jl
INFO: DataTables tests passed
INFO: Removing Atom v0.5.9
INFO: Removing Blink v0.5.1
INFO: Removing CodeTools v0.4.3
INFO: Removing Codecs v0.2.0
INFO: Removing DataArrays v0.3.12
INFO: Removing HttpCommon v0.2.6
INFO: Removing HttpParser v0.2.0
INFO: Removing HttpServer v0.1.7
INFO: Removing LNR v0.0.2
INFO: Removing LaTeXStrings v0.2.0
INFO: Removing Lazy v0.11.5
INFO: Removing MbedTLS v0.4.3
INFO: Removing Mustache v0.1.3
INFO: Removing Mux v0.2.3
INFO: Removing RData v0.0.4
INFO: Removing RDatasets v0.2.0
INFO: Removing WebSockets v0.2.1

Functions that can naturally create missing elements such as join(kind = :outer) and unstack autoconvert relevant columns to NullableArray.

All comments, questions, and requests for clarification are welcome. I'd also like to say upfront that this is a rather large change and there's a high probability that I've introduced some inefficiencies, changes that are unwanted and unwarranted, or may have completely missed the mark on the goals of JuliaData/DataFrames.jl#1119, so I recognize that this may receive heavy criticism if not outright rejection. In that case, I hope it serves as a useful discussion point for attempt 2 :)

@nalimilan
Member

Thanks! So I think we should first merge #17 and deprecate readtable/writetable, so that we can remove them in this PR. Can you open a PR for the latter?

The behaviour you describe makes sense to me, though we could imagine special-casing ranges so that you don't need to call collect. In general immutable columns don't sound too useful. Not sure. See also JuliaData/DataFrames.jl#882, which I'd like to update and merge in both DataFrames and DataTables.

@quinnj
Member

quinnj commented Feb 27, 2017

I think this is definitely something we want. I don't know if we really need to have the member field be columns::Vector{AbstractVector} over columns::Vector{Any}, since it doesn't really buy us anything. I'd leave it at Any for now so it's not overly restrictive. Otherwise, big 👍 from me!

@cjprybol
Contributor Author

cjprybol commented Feb 28, 2017

Thanks! So I think we should first merge #17

We need to decide on whether this should be moved to a future PR or addressed before merging and find an appropriate test for this if you'd like me to add one. After that and review approvals we should be good to go! Agreed that should be merged first.

and deprecate readtable/writetable, so that we can remove them in this PR. Can you open a PR for the latter?

Yes, I'd be happy to. cf. JuliaData/DataStreams.jl#27.

I think this is definitely something we want. I don't know if we really need to have the member field be columns::Vector{AbstractVector} over columns::Vector{Any}, since it doesn't really buy us anything. I'd leave it at Any for now so it's not overly restrictive. Otherwise, big 👍 from me!

Makes sense to me, I'll revert those edits. Thanks for the feedback!

dt
```

"""
Member

The preferred format for docstrings is

"""
    denullify!(dt::AbstractDataTable)

Convert `NullableArray` columns that do not contain null values to a non-`Nullable`
equivalent array type. The table `dt` is modified in place.

# Examples

```jldoctest
julia> dt = DataTable(A = NullableArray(1:3), B = [Nullable(i) for i = 1:3])
<the output as it would appear in the REPL>

julia> denullify!(dt)
<the output as it would appear in the REPL>
```
"""

The function signature is at the top, indented by four spaces rather than inside a triple-backquote markdown code block. Then the functionality is described beneath that using the imperative, e.g. "convert" rather than "converts." Sections are given using markdown section markers, i.e. multiples of #, and it's generally a good idea to use jldoctest for examples in docstrings.

end
return result
elseif (minlen == 0 && maxlen > 0) || any(x -> x != 0, mod(maxlen, lengths))
return error("Incompatible lengths of arguments")
Member

I'd probably make this a DimensionMismatch
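
A minimal sketch of what that suggestion could look like, assuming the surrounding code is validating the lengths of arguments that are to be recycled to a common length. The helper name `recycled_length` is hypothetical and the logic is a simplification of the branch shown above, not the PR's implementation.

```julia
# Return the common length that all arguments can be recycled to,
# or throw a DimensionMismatch if the lengths are incompatible.
function recycled_length(lengths::Vector{Int})
    isempty(lengths) && return 0
    minlen, maxlen = extrema(lengths)
    minlen == maxlen && return maxlen            # all equal: nothing to recycle
    # A zero-length argument cannot be recycled, and every other length
    # must divide the longest one evenly.
    if minlen == 0 || any(len -> maxlen % len != 0, lengths)
        throw(DimensionMismatch("incompatible lengths of arguments: $(lengths)"))
    end
    return maxlen
end

ok = recycled_length([6, 2, 3])                  # 2 and 3 both divide 6
bad = try
    recycled_length([2, 3])                      # 2 does not divide 3
catch err
    err
end
```

Using `DimensionMismatch` rather than a generic `error` lets callers catch size problems specifically, and including the lengths in the message makes the failure easier to diagnose.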

@@ -3,9 +3,15 @@
##

# Like similar, but returns a nullable array
similar_nullable{T}(dv::AbstractArray{T}, dims::@compat(Union{Int, Tuple{Vararg{Int}}})) =
Member

No @compat for Unions since we don't support 0.3 here

Member

This commit just reverts changes made in the previous commits (from the other PR). Nothing changes overall. @cjprybol I think it would be better to focus on removing readtable first, since that's simple, and then you can rebase this PR so that only commits related to it are shown. Else it's really messy.

Member

Okay, thanks for the explanation and sorry for the noise.

@@ -157,4 +157,6 @@ end
##
##############################################################################

Base.map(f::Function, sdt::SubDataTable) = f(sdt) # TODO: deprecate
Member

Same deal as before--if it's not needed we might as well just dump it rather than deprecate it. But why add it if you ultimately want to deprecate it anyway?

Member

Same.

@cjprybol cjprybol changed the title Use whatever column-type you want [Pending #26] Use whatever column-type you want Mar 6, 2017
@cjprybol cjprybol changed the title [Pending #26] Use whatever column-type you want Use whatever column-type you want Mar 9, 2017
@cjprybol
Contributor Author

cjprybol commented Mar 9, 2017

I've commented out several tests that fail due to the new behavior. I've added a few new ones in the docstrings using jldoctest as suggested by @ararslan #24 (comment). I'll add these tests somewhere too now that I've written them up as examples.

Code

using DataTables
dt = DataTable(A = 1:3, B = 2:4, C = 3:5)
map(typeof, dt.columns)
dt[:D] = NullableArray([4, 5, Nullable()])
map(typeof, dt.columns)
dt[:E] = 'c'
map(typeof, dt.columns)
nullify!(dt)
map(typeof, dt.columns)
denullify!(dt)
map(typeof, dt.columns)

Output

julia> using DataTables

julia> dt = DataTable(A = 1:3, B = 2:4, C = 3:5)
3×3 DataTables.DataTable
│ Row │ A │ B │ C │
├─────┼───┼───┼───┤
│ 1   │ 1 │ 2 │ 3 │
│ 2   │ 2 │ 3 │ 4 │
│ 3   │ 3 │ 4 │ 5 │

julia> map(typeof, dt.columns)
3-element Array{Any,1}:
 Array{Int64,1}
 Array{Int64,1}
 Array{Int64,1}

julia> dt[:D] = NullableArray([4, 5, Nullable()])
3-element NullableArrays.NullableArray{Int64,1}:
 4
 5
 #NULL

julia> map(typeof, dt.columns)
4-element Array{Any,1}:
 Array{Int64,1}
 Array{Int64,1}
 Array{Int64,1}
 NullableArrays.NullableArray{Int64,1}

julia> dt[:E] = 'c'
'c'

julia> map(typeof, dt.columns)
5-element Array{Any,1}:
 Array{Int64,1}
 Array{Int64,1}
 Array{Int64,1}
 NullableArrays.NullableArray{Int64,1}
 Array{Char,1}

julia> nullify!(dt)
3×5 DataTables.DataTable
│ Row │ A │ B │ C │ D     │ E   │
├─────┼───┼───┼───┼───────┼─────┤
│ 1   │ 1 │ 2 │ 3 │ 4     │ 'c' │
│ 2   │ 2 │ 3 │ 4 │ 5     │ 'c' │
│ 3   │ 3 │ 4 │ 5 │ #NULL │ 'c' │

julia> map(typeof, dt.columns)
5-element Array{Any,1}:
 NullableArrays.NullableArray{Int64,1}
 NullableArrays.NullableArray{Int64,1}
 NullableArrays.NullableArray{Int64,1}
 NullableArrays.NullableArray{Int64,1}
 NullableArrays.NullableArray{Char,1}

julia> denullify!(dt)
3×5 DataTables.DataTable
│ Row │ A │ B │ C │ D     │ E   │
├─────┼───┼───┼───┼───────┼─────┤
│ 1   │ 1 │ 2 │ 3 │ 4     │ 'c' │
│ 2   │ 2 │ 3 │ 4 │ 5     │ 'c' │
│ 3   │ 3 │ 4 │ 5 │ #NULL │ 'c' │

julia> map(typeof, dt.columns)
5-element Array{Any,1}:
 Array{Int64,1}
 Array{Int64,1}
 Array{Int64,1}
 NullableArrays.NullableArray{Int64,1}
 Array{Char,1}

@@ -31,6 +31,10 @@ The following are normally implemented for AbstractDataTables:
* [`nonunique`](@ref) : indexes of duplicate rows
* [`unique!`](@ref) : remove duplicate rows
* `similar` : a DataTable with similar columns as `d`
* `denullify` : unwrap `Nullable` columns
* `denullify!` : unwrap `Nullable` columns in-place
* `nullify` : Convert all columns to NullableArrays
Member

No initial upper case.

test/io.jl Outdated
@@ -38,4 +40,41 @@ module TestIO
show(io, "text/html", dt)
@test length(String(take!(io))) < 10000

dt = DataTable(A = 1:26,
Member

Please open a separate PR for this one. Thanks!

@@ -45,7 +45,7 @@ function printtable(io::IO,
if !isnull(dt[j][i])
if ! (etypes[j] <: Real)
print(io, quotemark)
- x = isa(Nullable, dt[i, j]) ? get(dt[i, j]) : dt[i, j]
+ x = isa(Nullable, typeof(dt[i, j])) ? get(dt[i, j]) : dt[i, j]
Member

This doesn't sound correct AFAICT.
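
For context, `isa` takes the value first and the type second, so both versions of the line have the arguments in the wrong order. The sketch below shows the difference. It assumes a current Julia release, where `Nullable` is no longer in Base, and uses `Some` purely as a stand-in wrapper type; `something` plays the role of `get`.

```julia
x = Some(4)                     # wrapped value; `Some` is only a stand-in for `Nullable`

# Wrong order: asks whether the type `Some` is an instance of `Some{Int}`.
wrong = isa(Some, typeof(x))

# Right order: asks whether the value `x` is an instance of `Some`.
right = isa(x, Some)

# The intended unwrap-if-wrapped pattern, with the arguments in the right order.
unwrapped = isa(x, Some) ? something(x) : x
plain     = isa(7, Some) ? something(7) : 7
```

With the arguments reversed, the check never succeeds for an ordinary wrapped value, so the unwrapping branch would never be taken.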

end
return DataTable(columns, Index(cnames))
end

# Initialize from a Vector of Associatives (aka list of dicts)
Contributor Author

This seems like a strange constructor and so I'd opt to delete it outright rather than fix it, but I can add this back if desired

Member

Quite weird indeed...

test/data.jl Outdated
b = [:A,:B,:C][[1,1,1,2,3]],
v2 = randn(5)
)
dt2[1,:a] = Nullable()
Contributor Author

This PR currently does not try to handle the case of a user assigning a Nullable() to a typed, non-Nullable column. We could promote to NullableArray to handle this, but I'm not sure if autopromotion is better than an explicit conversion by the user.

Member

@nalimilan nalimilan Mar 10, 2017

Automatic promotion could be problematic, as it means you could get missing values in a column where you wouldn't have expected them. Probably better to require explicit conversion.
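
The trade-off can be illustrated by analogy with plain vectors. This is a sketch only: it assumes a current Julia release and uses `missing` with a `Union{Int, Missing}` element type as a stand-in for `Nullable()` and `NullableArray`, which are what the PR itself deals with.

```julia
col = [1, 2, 3]                       # typed column that cannot hold missing values

# Without automatic promotion, the assignment fails loudly rather than
# silently changing the column's type.
failed = try
    col[1] = missing
    false
catch err
    err isa MethodError
end

# Explicit opt-in: widen the element type first, then assign.
widened = convert(Vector{Union{Int, Missing}}, col)
widened[1] = missing
```

The explicit step makes it visible in user code that a column may now contain missing values, which is the property the comment above argues for.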

@@ -166,25 +161,9 @@ module TestDataTable
@test size(dt, 2) == 5
@test typeof(dt[:, 1]) == Vector{Float64}

#test_group("Other DataTable constructors")
dt = DataTable([@compat(Dict{Any,Any}(:a=>1, :b=>'c')),
Contributor Author

@nalimilan
Member

Could you rebase on master and move any unrelated changes to separate PRs? You shouldn't need to include your other PRs in this one, should you?

@cjprybol
Contributor Author

Definitely, I'll try rebasing tomorrow. Could you elaborate on what you mean by unrelated changes/other PRs? I've added recycling of values to this PR as a byproduct of consolidating the constructors. I consolidated the constructors because each had its own form of NullableArray promotion and its own checks on whether the passed arguments were valid (size, type, etc.). (I just now caught the Dict constructors doing their own NullableArray promotion.) With that said, I'd be happy to break this up into smaller chunks to make it easier to review for all involved (me included) if we can think of a logical way to do so. And if I have any unrelated edits left in this PR I will definitely remove them.

@@ -659,17 +663,6 @@ unique!(dt) # modifies dt
"""
(unique, unique!)

function nonuniquekey(dt::AbstractDataTable)
Member

Something which should have been removed in #17? Better do this in another PR then.

@@ -701,7 +694,7 @@ without(dt::AbstractDataTable, c::Any) = without(dt, index(dt)[c])

# catch-all to cover cases where indexing returns a DataTable and copy doesn't
Base.hcat(dt::AbstractDataTable, x) = hcat!(dt[:, :], x)
- Base.hcat(dt1::AbstractDataTable, dt2::AbstractDataTable) = hcat!(dt[:, :], dt2)
+ Base.hcat(dt1::AbstractDataTable, dt2::AbstractDataTable) = hcat!(dt1[:, :], dt2)
Member

Woops. But deserves another PR (and a test...).

if length(uniqueheaders) == 0
return DataTable()
elseif length(unique(map(length, uniqueheaders))) > 1
throw(ArgumentError("not all DataTables have the same number of columns."))
Member

Would be nice to print the actual number of columns. Same below, where you can use something like setdiff(union(allheaders...), intersect(allheaders...)) to print the names of non-matching columns. That's really really useful when you get this kind of error.
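
A sketch of the kind of error reporting suggested here, written against plain vectors of column names rather than DataTables. The function name `check_headers` is hypothetical. Unlike the `uniqueheaders` check in the diff, it compares names as sets, so it would not flag tables whose columns match but appear in a different order.

```julia
# Validate that all tables share the same column names, reporting
# the column counts or the offending names when they do not.
function check_headers(allheaders::Vector{Vector{Symbol}})
    isempty(allheaders) && return nothing
    ncols = map(length, allheaders)
    if length(unique(ncols)) > 1
        throw(ArgumentError("not all tables have the same number of columns (got $(ncols))"))
    end
    # Names that appear in some tables but not in all of them.
    mismatched = setdiff(union(allheaders...), intersect(allheaders...))
    if !isempty(mismatched)
        throw(ArgumentError("column names do not match; non-matching names: $(mismatched)"))
    end
    return nothing
end

message = try
    check_headers([[:a, :b], [:a, :c]])
    ""
catch err
    err.msg
end
```

Here `message` names `:b` and `:c` as the offending columns, which is the information a user needs to decide what to rename.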

end
allheaders = map(names, dts)
# don't vcat empty DataTables
indicestovcat = find(x -> length(x) > 0, allheaders)
Member

Maybe better name this "notempty" or something like that? The current name sounds like the indices are going to be concatenated.

elseif length(unique(map(length, uniqueheaders))) > 1
throw(ArgumentError("not all DataTables have the same number of columns."))
elseif length(uniqueheaders) > 1
throw(ArgumentError("Column headers do not match. Use `rename` or `names` to adjust header names"))
Member

"Column names" rather. Also, rename! and names! are more likely to be what people want (names does something completely different).

else
unwrapped = Array{eltype(eltype(A))}(size(A))
for i in eachindex(A)
unwrapped[i] = get(A[i])
Member

Use _unsafe_get since you know there are no nulls.

if isa(A, NullableArray)
return A.values
else
unwrapped = Array{eltype(eltype(A))}(size(A))
Member

Use similar like in NullableArrays.dropnull.

"""
function denullify(A::AbstractVector)
if (eltype(A) <: Nullable) && !any(x -> isnull(x), A)
if isa(A, NullableArray)
Member

Also needs a special case for NullableCategoricalArray. Actually I wonder whether this couldn't just call dropnull to avoid duplicating the logic. If a new custom type wants to work with DataTable, the current approach won't allow it to override the method. It would be better if it could just override dropnull.

return A.values
else
unwrapped = Array{eltype(eltype(A))}(size(A))
for i in eachindex(A)
Member

Add @inbounds. Also, you can just do for x in A.

"""
nullify!(dt::AbstractDataTable)

Convert all columns of `dt` to `NullableArrays`. The table `dt` is modified in place.
Member

Just say "nullable arrays", since it could use NullableCategoricalArray.

@nalimilan
Member

Could you elaborate on what you mean by unrelated changes/other PRs?

I was referring to the nonuniquekey and the hcat changes.
