add view to filter, sort, dropmissing, and unique #2386

Merged: 19 commits, Sep 9, 2020
Changes from 1 commit
45 changes: 27 additions & 18 deletions src/abstractdataframe/abstractdataframe.jl
@@ -749,18 +749,19 @@ completecases(df::AbstractDataFrame, cols::MultiColumnIndex) =
"""
dropmissing(df::AbstractDataFrame, cols=:; view::Bool=false, disallowmissing::Bool=!view)

Return a copy of data frame `df` excluding rows with missing values.
Return a data frame excluding rows with missing values in `df`.

If `cols` is provided, only missing values in the corresponding columns are considered.
`cols` can be any column selector ($COLUMNINDEX_STR; $MULTICOLUMNINDEX_STR).

If `view=false` a freshly allocated `DataFrame` is returned.
If `view=true` then a view into `df` is returned. In this case
`disallowmissing` must be `false`.

If `disallowmissing` is `true` (the default when `view` is `false`)
then columns specified in `cols` will be converted so as not to allow for missing
values using [`disallowmissing!`](@ref).

If `view=true` then a view into `df` is returned instead. In this case
`disallowmissing` must be `false`.

See also: [`completecases`](@ref) and [`dropmissing!`](@ref).

# Examples
@@ -816,13 +817,17 @@ julia> dropmissing(df, [:x, :y])
function dropmissing(df::AbstractDataFrame,
cols::Union{ColumnIndex, MultiColumnIndex}=:;
view::Bool=false, disallowmissing::Bool=!view)
view && disallowmissing &&
throw(ArgumentError("disallowmissing=true is incompatible with view=true"))
rowidxs = completecases(df, cols)
view && return Base.view(df, rowidxs, :)
newdf = df[rowidxs, :]
disallowmissing && disallowmissing!(newdf, cols)
return newdf
if view
if disallowmissing
throw(ArgumentError("disallowmissing=true is incompatible with view=true"))
end
return Base.view(df, rowidxs, :)
else
newdf = df[rowidxs, :]
disallowmissing && disallowmissing!(newdf, cols)
return newdf
end
end
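
The behavior implemented by this hunk can be illustrated with a minimal sketch, assuming a DataFrames.jl release that includes this change. The sample data and variable names are made up for illustration and are not part of the PR.

```julia
using DataFrames

# Illustrative data: only the first row has no missing values.
df = DataFrame(x = [1, missing, 3], y = ["a", "b", missing])

# Default (view=false): a freshly allocated DataFrame whose columns
# no longer allow missing, because disallowmissing defaults to !view.
copied = dropmissing(df)

# view=true: a SubDataFrame over df; column element types are unchanged.
v = dropmissing(df, view=true)

# Writing through the view modifies the parent; the copy is unaffected.
v[1, :x] = 10

# Asking for both view=true and disallowmissing=true is rejected.
err = try
    dropmissing(df, view=true, disallowmissing=true)
catch e
    e
end
```

The view avoids copying rows, which is why `disallowmissing` cannot be applied to it: converting column types would require allocating new columns.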

"""
@@ -898,7 +903,7 @@ end
filter(fun, df::AbstractDataFrame; view::Bool=false)
filter(cols => fun, df::AbstractDataFrame; view::Bool=false)

Return a copy of data frame `df` containing only rows for which `fun`
Return a data frame containing only rows from `df` for which `fun`
returns `true`.

If `cols` is not specified then the predicate `fun` is passed `DataFrameRow`s.
@@ -910,7 +915,8 @@ corresponding columns as separate positional arguments, unless `cols` is an
column duplicates are allowed if a vector of `Symbol`s, strings, or integers is
passed.

If `view=true` then a view into `df` is returned instead.
If `view=false` a freshly allocated `DataFrame` is returned.
If `view=true` then a view into `df` is returned.

Passing `cols` leads to a more efficient execution of the operation for large data frames.
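
As a rough illustration of the two calling forms and the `view` keyword described in this docstring, here is a small sketch. It assumes a DataFrames.jl release that includes this change; the data and variable names are invented for the example.

```julia
using DataFrames

# Illustrative data, not taken from the PR.
df = DataFrame(x = [1, 2, 3, 4], y = ["a", "b", "c", "d"])

# Without `cols`, the predicate receives a DataFrameRow.
by_row = filter(row -> row.x > 2, df)

# With `cols => fun`, the predicate receives the value of column :x.
by_col = filter(:x => >(2), df)

# view=true returns a SubDataFrame over df instead of a new DataFrame.
v = filter(:x => >(2), df, view=true)
```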

@@ -1208,13 +1214,16 @@ end
unique!(df::AbstractDataFrame)
unique!(df::AbstractDataFrame, cols)

Delete duplicate rows of data frame `df`, keeping only the first occurrence of unique rows.
When `cols` is specified, the returned `DataFrame` contains complete rows,
retaining in each case the first instance for which `df[cols]` is unique.
`cols` can be any column selector ($COLUMNINDEX_STR; $MULTICOLUMNINDEX_STR).
Delete duplicate rows of data frame `df`, keeping only the first occurrence of
unique rows. When `cols` is specified, the returned `DataFrame` contains
complete rows, retaining in each case the first instance for which `df[cols]` is
unique. `cols` can be any column selector ($COLUMNINDEX_STR;
$MULTICOLUMNINDEX_STR).

For `unique` if `view=false` a freshly allocated `DataFrame` is returned,
and if `view=true` then a view into `df` is returned.

`unique` returns a new data frame unless `view=true`, in which
case it returns a `SubDataFrame` view into `df`. `unique!` updates `df` in-place.
`unique!` updates `df` in-place and does not support the `view` keyword argument.

See also [`nonunique`](@ref).

58 changes: 26 additions & 32 deletions src/abstractdataframe/sort.jl
@@ -332,35 +332,12 @@ function Base.issorted(df::AbstractDataFrame, cols=[];
end
end

# sort and sortperm functions

for s in [:(Base.sort), :(Base.sortperm)]
@eval begin
function $s(df::AbstractDataFrame, cols=[];
alg=nothing, lt=isless, by=identity, rev=false, order=Forward)
if !(isa(by, Function) || eltype(by) <: Function)
msg = "'by' must be a Function or a vector of Functions. " *
" Perhaps you wanted 'cols'."
throw(ArgumentError(msg))
end
# exclude AbstractVector as in that case cols can contain order(...) clauses
if cols isa MultiColumnIndex && !(cols isa AbstractVector)
cols = index(df)[cols]
end
ord = ordering(df, cols, lt, by, rev, order)
_alg = Sort.defalg(df, ord; alg=alg, cols=cols)
return $s(df, _alg, ord)
end
end
end

"""
sort(df::AbstractDataFrame, cols;
alg::Union{Algorithm, Nothing}=nothing, lt=isless, by=identity,
rev::Bool=false, order::Ordering=Forward, view::Bool=false)

Return a copy of data frame `df` sorted by column(s) `cols`.
If `view=true` a `SubDataFrame` view into `df` is returned instead.
Return a data frame containing the rows of `df` sorted by column(s) `cols`.

`cols` can be any column selector ($COLUMNINDEX_STR; $MULTICOLUMNINDEX_STR).

@@ -370,6 +347,10 @@ on the type of the sorting columns and on the number of rows in `df`.
If `rev` is `true`, reverse sorting is performed. To enable reverse sorting
only for some columns, pass `order(c, rev=true)` in `cols`, with `c` the
corresponding column index (see example below).

If `view=false` a freshly allocated `DataFrame` is returned.
If `view=true` then a view into `df` is returned.

See [`sort!`](@ref) for a description of other keyword arguments.

# Examples
@@ -425,7 +406,11 @@ julia> sort(df, [:x, order(:y, rev=true)])
│ 4 │ 3 │ b │
```
"""
sort(::AbstractDataFrame, ::Any)
function sort(df::AbstractDataFrame, cols=[]; alg=nothing, lt=isless,
by=identity, rev=false, order=Forward, view::Bool=false)
rowidxs = sortperm(df, cols, alg=alg, lt=lt, by=by, rev=rev, order=order)
return view ? Base.view(df, rowidxs, :) : df[rowidxs, :]
end
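
The new `sort` method computes a row permutation with `sortperm` and then either indexes `df` or takes a view of it. A minimal sketch of the resulting behavior, assuming a DataFrames.jl release that includes this change (data and variable names are illustrative only):

```julia
using DataFrames

# Illustrative data, not taken from the PR.
df = DataFrame(x = [3, 1, 2], y = ["c", "a", "b"])

# The row permutation that both variants are built from.
perm = sortperm(df, :x)

# view=false (default): a new DataFrame, equivalent to df[perm, :].
sorted_copy = sort(df, :x)

# view=true: a SubDataFrame that reorders rows of df without copying them.
sorted_view = sort(df, :x, view=true)
```

Because the view shares storage with `df`, later changes to `df` are visible through `sorted_view` and may leave it no longer sorted.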

"""
sortperm(df::AbstractDataFrame, cols;
@@ -486,14 +471,23 @@ julia> sortperm(df, (:x, :y), rev=true)
1
```
"""
sortperm(::AbstractDataFrame, ::Any)

function Base.sort(df::AbstractDataFrame, a::Algorithm, o::Ordering; view::Bool=false)
rowidxs = sortperm(df, a, o)
return view ? Base.view(df, rowidxs, :) : df[rowidxs, :]
function sortperm(df::AbstractDataFrame, cols=[];
alg=nothing, lt=isless, by=identity, rev=false, order=Forward)
if !(isa(by, Function) || eltype(by) <: Function)
msg = "'by' must be a Function or a vector of Functions. " *
" Perhaps you wanted 'cols'."
throw(ArgumentError(msg))
end
# exclude AbstractVector as in that case cols can contain order(...) clauses
if cols isa MultiColumnIndex && !(cols isa AbstractVector)
cols = index(df)[cols]
end
ord = ordering(df, cols, lt, by, rev, order)
_alg = Sort.defalg(df, ord; alg=alg, cols=cols)
return _sortperm(df, _alg, ord)
end

Base.sortperm(df::AbstractDataFrame, a::Algorithm, o::Union{Perm,DFPerm}) =
_sortperm(df::AbstractDataFrame, a::Algorithm, o::Union{Perm,DFPerm}) =
sort!([1:size(df, 1);], a, o)
Base.sortperm(df::AbstractDataFrame, a::Algorithm, o::Ordering) =
_sortperm(df::AbstractDataFrame, a::Algorithm, o::Ordering) =
sortperm(df, a, DFPerm(o,df))