-
Notifications
You must be signed in to change notification settings - Fork 11
Conversation
Thanks! So I think we should first merge #17 and deprecate The behaviour you describe makes sense to me, though we could imagine special-casing ranges so that you don't need to call |
I think this is definitely something we want. I don't know if we really need to have the member field be |
We need to decide on whether this should be moved to a future PR or addressed before merging and find an appropriate test for this if you'd like me to add one. After that and review approvals we should be good to go! Agreed that should be merged first.
Yes, I'd be happy to. cf. JuliaData/DataStreams.jl#27.
Makes sense to me, I'll revert those edits. Thanks for the feedback! |
dt | ||
``` | ||
|
||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The preferred format for docstrings is
"""
denullify!(dt::AbstractDataTable)
Convert `NullableArray` columns that do not contain null values to a non-`Nullable`
equivalent array type. The table `dt` is modified in place.
# Examples
```jldoctest
julia> dt = DataTable(A = NullableArray(1:3), B = [Nullable(i) for i = 1:3])
<the output as it would appear in the REPL>
julia> denullify!(dt)
<the output as it would appear in the REPL>
``
"""
The function signature is at the top, indented by four spaces rather than inside a triple-backquote markdown code block. Then the functionality is described beneath that using the imperative, e.g. "convert" rather than "converts." Sections are given using markdown section markers, i.e. multiples of #
, and it's generally a good idea to use jldoctest
for examples in docstrings.
src/datatable/datatable.jl
Outdated
end | ||
return result | ||
elseif (minlen == 0 && maxlen > 0) || any(x -> x != 0, mod(maxlen, lengths)) | ||
return error("Incompatible lengths of arguments") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably make this a DimensionMismatch
src/abstractdatatable/join.jl
Outdated
@@ -3,9 +3,15 @@ | |||
## | |||
|
|||
# Like similar, but returns a nullable array | |||
similar_nullable{T}(dv::AbstractArray{T}, dims::@compat(Union{Int, Tuple{Vararg{Int}}})) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No @compat
for Union
s since we don't support 0.3 here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This commit just reverts changes made in the previous commits (from the other PR). Nothing changes overall. @cjprybol I think it would be better to focus on removing readtable
first, since that's simple, and then you can rebase this PR so that only commits related to it are shown. Else it's really messy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, thanks for the explanation and sorry for the noise.
src/subdatatable/subdatatable.jl
Outdated
@@ -157,4 +157,6 @@ end | |||
## | |||
############################################################################## | |||
|
|||
Base.map(f::Function, sdt::SubDataTable) = f(sdt) # TODO: deprecate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same deal as before--if it's not needed we might as well just dump it rather than deprecate it. But why add it if you ultimately want to deprecate it anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same.
I've commented out several tests that fail due to the new behavior. I've added a few new ones in the docstrings using Codeusing DataTables
dt = DataTable(A = 1:3, B = 2:4, C = 3:5)
map(typeof, dt.columns)
dt[:D] = NullableArray([4, 5, Nullable()])
map(typeof, dt.columns)
dt[:E] = 'c'
map(typeof, dt.columns)
nullify!(dt)
map(typeof, dt.columns)
denullify!(dt)
map(typeof, dt.columns) Outputjulia> using DataTables
julia> dt = DataTable(A = 1:3, B = 2:4, C = 3:5)
3×3 DataTables.DataTable
│ Row │ A │ B │ C │
├─────┼───┼───┼───┤
│ 1 │ 1 │ 2 │ 3 │
│ 2 │ 2 │ 3 │ 4 │
│ 3 │ 3 │ 4 │ 5 │
julia> map(typeof, dt.columns)
3-element Array{Any,1}:
Array{Int64,1}
Array{Int64,1}
Array{Int64,1}
julia> dt[:D] = NullableArray([4, 5, Nullable()])
3-element NullableArrays.NullableArray{Int64,1}:
4
5
#NULL
julia> map(typeof, dt.columns)
4-element Array{Any,1}:
Array{Int64,1}
Array{Int64,1}
Array{Int64,1}
NullableArrays.NullableArray{Int64,1}
julia> dt[:E] = 'c'
'c'
julia> map(typeof, dt.columns)
5-element Array{Any,1}:
Array{Int64,1}
Array{Int64,1}
Array{Int64,1}
NullableArrays.NullableArray{Int64,1}
Array{Char,1}
julia> nullify!(dt)
3×5 DataTables.DataTable
│ Row │ A │ B │ C │ D │ E │
├─────┼───┼───┼───┼───────┼─────┤
│ 1 │ 1 │ 2 │ 3 │ 4 │ 'c' │
│ 2 │ 2 │ 3 │ 4 │ 5 │ 'c' │
│ 3 │ 3 │ 4 │ 5 │ #NULL │ 'c' │
julia> map(typeof, dt.columns)
5-element Array{Any,1}:
NullableArrays.NullableArray{Int64,1}
NullableArrays.NullableArray{Int64,1}
NullableArrays.NullableArray{Int64,1}
NullableArrays.NullableArray{Int64,1}
NullableArrays.NullableArray{Char,1}
julia> denullify!(dt)
3×5 DataTables.DataTable
│ Row │ A │ B │ C │ D │ E │
├─────┼───┼───┼───┼───────┼─────┤
│ 1 │ 1 │ 2 │ 3 │ 4 │ 'c' │
│ 2 │ 2 │ 3 │ 4 │ 5 │ 'c' │
│ 3 │ 3 │ 4 │ 5 │ #NULL │ 'c' │
julia> map(typeof, dt.columns)
5-element Array{Any,1}:
Array{Int64,1}
Array{Int64,1}
Array{Int64,1}
NullableArrays.NullableArray{Int64,1}
Array{Char,1} |
@@ -31,6 +31,10 @@ The following are normally implemented for AbstractDataTables: | |||
* [`nonunique`](@ref) : indexes of duplicate rows | |||
* [`unique!`](@ref) : remove duplicate rows | |||
* `similar` : a DataTable with similar columns as `d` | |||
* `denullify` : unwrap `Nullable` columns | |||
* `denullify!` : unwrap `Nullable` columns in-place | |||
* `nullify` : Convert all columns to NullableArrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No initial upper case.
test/io.jl
Outdated
@@ -38,4 +40,41 @@ module TestIO | |||
show(io, "text/html", dt) | |||
@test length(String(take!(io))) < 10000 | |||
|
|||
dt = DataTable(A = 1:26, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please open a separate PR for this one. Thanks!
src/abstractdatatable/io.jl
Outdated
@@ -45,7 +45,7 @@ function printtable(io::IO, | |||
if !isnull(dt[j][i]) | |||
if ! (etypes[j] <: Real) | |||
print(io, quotemark) | |||
x = isa(Nullable, dt[i, j]) ? get(dt[i, j]) : dt[i, j] | |||
x = isa(Nullable, typeof(dt[i, j])) ? get(dt[i, j]) : dt[i, j] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't sound correct AFAICT.
src/datatable/datatable.jl
Outdated
end | ||
return DataTable(columns, Index(cnames)) | ||
end | ||
|
||
# Initialize from a Vector of Associatives (aka list of dicts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a strange constructor and so I'd opt to delete it outright rather than fix it, but I can add this back if desired
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite weird indeed...
test/data.jl
Outdated
b = [:A,:B,:C][[1,1,1,2,3]], | ||
v2 = randn(5) | ||
) | ||
dt2[1,:a] = Nullable() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR currently does not try and handle the case of a user assigning a Nullable() to a typed, non-Nullable() column. We could promote to NullableArray to handle this, but I'm not sure if autopromotion is better than an explicit conversion by the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Automatic promotion could be problematic as it means you could get missing values in a column where you wouldn't have expected them. Probably better require explicit conversion.
test/datatable.jl
Outdated
@@ -166,25 +161,9 @@ module TestDataTable | |||
@test size(dt, 2) == 5 | |||
@test typeof(dt[:, 1]) == Vector{Float64} | |||
|
|||
#test_group("Other DataTable constructors") | |||
dt = DataTable([@compat(Dict{Any,Any}(:a=>1, :b=>'c')), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you rebase on master and move any unrelated changes to separate PRs? You shouldn't need to include your other PRs in this one, do you? |
Definitely, I'll try rebasing tomorrow. Could you elaborate on what you mean by unrelated changes/other PRs? I've added recycling of values to this PR as a byproduct of consolidating the constructors. I consolidated the constructors because they each had their own form of NullableArray promotion and for checking if the passed arguments were valid (size, type, etc.). (I just now caught the Dict constructors doing their own NullableArray promotion). With that said, I'd be happy to break this up into smaller chunks to make it easier to review for all involved (me included) if we can think of a logical way to do so. And if I have any unrelated edits left in this PR I will definitely remove them. |
@@ -659,17 +663,6 @@ unique!(dt) # modifies dt | |||
""" | |||
(unique, unique!) | |||
|
|||
function nonuniquekey(dt::AbstractDataTable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something which should have been removed in #17? Better do this in another PR then.
@@ -701,7 +694,7 @@ without(dt::AbstractDataTable, c::Any) = without(dt, index(dt)[c]) | |||
|
|||
# catch-all to cover cases where indexing returns a DataTable and copy doesn't | |||
Base.hcat(dt::AbstractDataTable, x) = hcat!(dt[:, :], x) | |||
Base.hcat(dt1::AbstractDataTable, dt2::AbstractDataTable) = hcat!(dt[:, :], dt2) | |||
Base.hcat(dt1::AbstractDataTable, dt2::AbstractDataTable) = hcat!(dt1[:, :], dt2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woops. But deserves another PR (and a test...).
if length(uniqueheaders) == 0 | ||
return DataTable() | ||
elseif length(unique(map(length, uniqueheaders))) > 1 | ||
throw(ArgumentError("not all DataTables have the same number of columns.")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to print the actual number of columns. Same below, where you can use something like setdiff(union(allheaders...), intersect(allheaders...))
to print the names of non-matching columns. That's really really useful when you get this kind of error.
end | ||
allheaders = map(names, dts) | ||
# don't vcat empty DataTables | ||
indicestovcat = find(x -> length(x) > 0, allheaders) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe better name this "notempty" or something like that? The current name sounds like the indices are going to be concatenated.
elseif length(unique(map(length, uniqueheaders))) > 1 | ||
throw(ArgumentError("not all DataTables have the same number of columns.")) | ||
elseif length(uniqueheaders) > 1 | ||
throw(ArgumentError("Column headers do not match. Use `rename` or `names` to adjust header names")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Column names" rather. Also, rename!
and names!
are more likely to be what people want (names
does something completely different).
else | ||
unwrapped = Array{eltype(eltype(A))}(size(A)) | ||
for i in eachindex(A) | ||
unwrapped[i] = get(A[i]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use _unsafe_get
since you know there are no nulls.
if isa(A, NullableArray) | ||
return A.values | ||
else | ||
unwrapped = Array{eltype(eltype(A))}(size(A)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use similar
like in NullableArrays.dropnull
.
""" | ||
function denullify(A::AbstractVector) | ||
if (eltype(A) <: Nullable) && !any(x -> isnull(x), A) | ||
if isa(A, NullableArray) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also needs a special case for NullableCategoricalArray
. Actually I wonder whether this couldn't just call dropnull
to avoid duplicating the logic. If a new custom type wants to work with DataTable
, the current approach won't allow it to override the method. It would be better if it could just override dropnull
.
return A.values | ||
else | ||
unwrapped = Array{eltype(eltype(A))}(size(A)) | ||
for i in eachindex(A) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add @inbounds
. Also you can just to for x in A
.
""" | ||
nullify!(dt::AbstractDataTable) | ||
|
||
Convert all columns of `dt` to `NullableArrays`. The table `dt` is modified in place. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just say "nullable arrays", since it could use NullableCategoricalArray
.
I was referring to the |
This is a go at removing the automatic promotion to
NullableArray
s so that users have more type stability/control. I believe this was what the consensus of users wanted based on the DataFrames: what's inside? Issue posted by @quinnj at JuliaData/DataFrames.jl#1119, although it IS NOT an implementation ofDevelop an official AbstractColumn interface
but I hope it can be a stepping stone?example
current
This PR
The first thing you'll notice is that you'll want to start using an explicit
collect(range)
I've added the functions
nullify!(), nullify()
that convert all columns toNullableVector
s anddenullify!(), denullify()
that upwrap all columns that are null-free.This no longer supports
vcat
ting DataTables where the column names do not match or the column dimensions are different. I could add it back but it's inconsistent with base behavior and I'd rather put the responsibility on the user to rename columns or resize the data as an added check that they really want tovcat
DataTables of mixed size/columns. If this is unpopular I'm happy to revert behavior.I seem to have broken
readtable
andwritetable
but I wasn't inclined to figure out why, as a brokenreadtable/writetable
is another reason to push forth on deprecating them forCSV.read/write
#10.This is currently passing all tests on my setup, but of note this includes the code from #17. This change broke the current
groupby
&join
and I was more familiar #17 than the current implementations. Apologies for the extradiff
noise because of that code inclusion. If that pull request is rejected than I'll need to do some reworking.Functions that can naturally create missing elements such as
join(kind = :outer)
andunstack
autoconvert relevant columns toNullableArray
.All comments, questions, and requests for clarification are welcome. I'd also like to say upfront that this is a rather large change and there's a high probability that I've introduced some inefficiencies, changes that are unwanted and unwarranted, or may have completely missed the mark on the goals of JuliaData/DataFrames.jl#1119, so I recognize that this may receive heavy criticism if not outright rejection. In that case, I hope it serves as a useful discussion point for attempt 2 :)