add view to filter, sort, dropmissing, and unique #2386
@nalimilan - no rush, but do you have an opinion on this? (This is the first step in deciding what we do about updating views later.)
Sounds like a good idea, and I don't see what else this could mean so it's quite safe to add. It's also nice to see that the implementation is quite clean.
Regarding type instability, ideally the compiler will use the fact that `view=true` is known at compile time (in particular as the default value for a keyword argument) to infer the type as `::DataFrame`. I checked that recently for `categorical`'s `compress=false` argument and it worked quite well. Can you check that it's the case? In some large functions, it could help to move some of the code into helpers, so that the top-level function in which `if view` appears can be inlined.
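To make the point about compile-time keyword values concrete, here is a minimal sketch. It uses hypothetical stand-in types (`Table`, `TableView`) and a made-up function `keep_even` rather than the real `DataFrame`, `SubDataFrame`, and `filter`. Whether inference narrows the result to one concrete type depends on the Julia version and on inlining, so the snippet only prints what inference reports and does not assume a particular answer.

```julia
# Hypothetical stand-ins for DataFrame and SubDataFrame.
struct Table
    rows::Vector{Int}
end

struct TableView
    parent::Table
    idx::Vector{Int}
end

# The branch on `view` yields a small union, Union{Table, TableView}.
@inline function keep_even(t::Table; view::Bool=false)
    idx = findall(iseven, t.rows)
    return view ? TableView(t, idx) : Table(t.rows[idx])
end

# Wrappers in which `view` is a literal, so constant propagation can apply.
copy_even(t::Table) = keep_even(t)
view_even(t::Table) = keep_even(t; view=true)

t = Table([1, 2, 3, 4])

# Inspect what inference reports; the result may differ between Julia versions.
println(Base.return_types(copy_even, (Table,)))
println(Base.return_types(view_even, (Table,)))
```

If the printed types are concrete, the `if view` branch was resolved at compile time; if a `Union` is printed, the caller has to handle both cases.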
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Actually for |
OK, that's not great, but I guess the compiler cannot do better in that case since there are many possible return types. For |
Regarding We actually have the same issue with |
It seems that |
OK, with
(that is - sometimes we have a proper inference and sometimes we do not unfortunately) |
I have added tests (and caught an old bug in |
@nalimilan - this should be good to have a look at. Thank you!
test/data.jl
@test fun(view(df, 1:2, 1:2)) isa DataFrame
@test fun(df, view=false) isa DataFrame
@test fun(view(df, 1:2, 1:2), view=false) isa DataFrame
@test fun(df, view=true) isa SubDataFrame
@test fun(view(df, 1:2, 1:2), view=true) isa SubDataFrame
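As a usage sketch of what these tests check, assuming a DataFrames.jl release that includes this PR's `view` keyword (the column name and values are made up for illustration):

```julia
using DataFrames

df = DataFrame(x = [3, 1, missing, 1])

# Default: a freshly allocated DataFrame.
copied = dropmissing(df)

# With view=true: a SubDataFrame that references rows of df.
viewed = dropmissing(df, view=true)

@assert copied isa DataFrame
@assert viewed isa SubDataFrame
@assert parent(viewed) === df

# The same keyword is accepted by filter, sort, and unique.
@assert filter(:x => x -> !ismissing(x) && x > 1, df, view=true) isa SubDataFrame
@assert sort(df, :x, view=true) isa SubDataFrame
@assert unique(df, view=true) isa SubDataFrame
```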
Also test returned value? Or is that covered elsewhere?
It would also be nice to use `@inferred` when `view=true/false` isn't specified to prevent any regression: it would be easy to remove one of the `@inline` annotations without realizing why they are here.
I have added the test and `@inferred` (I thought it would fail on Julia 1.0, but it passes - which is good)
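For readers unfamiliar with `@inferred`, the following is a small sketch of how it guards against inference regressions. The functions `stable` and `unstable` are hypothetical examples, not code from this PR.

```julia
using Test

# Return type depends only on the argument types: inferable.
stable(x::Int) = x + 1

# Return type depends on a runtime value: inferred as a Union of Int and String.
unstable(x::Int) = x > 0 ? x : "non-positive"

@test (@inferred stable(1)) == 2

# @inferred throws when the actual return type differs from the inferred one.
@test_throws ErrorException @inferred unstable(1)
```

A test of this shape fails as soon as a change makes the call no longer inferable, which is the regression the review comment is worried about.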
It's not just "sometimes", right? If you specify |
I had to do some refactoring of |
@inline function _filter_helper(df::AbstractDataFrame, f, cols...; view::Bool)
@inline function Base.filter(f, df::AbstractDataFrame; view::Bool=false)
rowidxs::BitVector = _filter_helper(f, eachrow(df))
It's weird that inference fails here. Anyway, maybe better put the type assertion directly in `_filter_helper` itself?
Actually I tried this and it was crashing my Julia earlier. But now it seems to work, so I probably had some weird problem earlier. I pushed the fix for this. Also I think it is not a problem to add this conversion and assertion, as it is a no-op and at least we signal what we intend to return.
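The two places where the annotation can live can be sketched as follows. The names `row_mask` and `count_kept` are hypothetical and only illustrate the pattern; they are not the PR's `_filter_helper`.

```julia
# Option 1: annotate the helper. The declared return type converts the
# result to a BitVector (a no-op if it already is one) and asserts the type.
function row_mask(pred, xs::AbstractVector)::BitVector
    return map(pred, xs)   # map yields a Vector{Bool} here, converted on return
end

# Option 2: annotate at the call site, as in the diff above. A typed local
# variable also performs a convert followed by a type assertion.
function count_kept(pred, xs::AbstractVector)
    rowidxs::BitVector = row_mask(pred, xs)
    return count(rowidxs)
end

mask = row_mask(>(0.5), [0.1, 0.7, 0.6])
kept = count_kept(>(0.5), [0.1, 0.7, 0.6])
```

In both forms the callers can rely on a `BitVector` even if inference of the helper body fails.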
OK. Do we have tests to catch an inference failure if somebody removed the assertion?
Now I remember why I did it the other way - it was because nightly fails this way, see e.g. https://travis-ci.org/github/JuliaData/DataFrames.jl/jobs/725236829 (I have not tested this in 100% detail to nail down the reason as I have a lot of classes this week 😩).
@JeffBezanson + @Keno : this is a regression on nightly vs Julia 1.5.1 - some problems with type inference. Please let me know if you need more information to investigate.
Use this branch on Julia nightly and run the following to reproduce (interestingly - a correct result is produced, but an error is thrown in the process):
julia> using DataFrames
julia> df = DataFrame(rand(10,1))
10×1 DataFrame
│ Row │ x1        │
│     │ Float64   │
├─────┼───────────┤
│ 1   │ 0.389567  │
│ 2   │ 0.0901768 │
│ 3   │ 0.179161  │
│ 4   │ 0.0274051 │
│ 5   │ 0.696103  │
│ 6   │ 0.321395  │
│ 7   │ 0.476313  │
│ 8   │ 0.628184  │
│ 9   │ 0.610729  │
│ 10  │ 0.048924  │
julia> filter(:x1 => >(0.5), df)
Internal error: encountered unexpected error in runtime:
BoundsError(a=Array{UInt64, (68,)}[0x0000000000000001, 0x0000000000000002, 0x0000000000000000, 0x0000000000000003, 0x0000000000000004, 0x0000000000000005, 0x0000000000000006, 0x0000000000000007, 0x0000000000000008, 0x0000000000000009, 0x000000000000000a, 0x000000000000000b, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x000000000000000c, 0x000000000000000d, 0x000000000000000e, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x000000000000000f, 0x0000000000000010, 0x0000000000000011, 0x0000000000000012, 0x0000000000000013, 0x0000000000000000, 0x0000000000000014, 0x0000000000000015, 0x0000000000000016, 0x0000000000000017, 0x0000000000000000, 0x0000000000000018, 0x0000000000000019, 0x000000000000001a, 0x000000000000001b, 0x0000000000000000, 0x000000000000001c, 0x000000000000001d, 0x000000000000001e, 0x000000000000001f, 0x0000000000000020, 0x0000000000000021, 0x0000000000000022, 0x0000000000000023], i=(69,))
jl_bounds_error_ints at /buildworker/worker/package_linux64/build/src/rtutils.c:183
getindex at ./array.jl:809 [inlined]
getindex at ./abstractarray.jl:1122 [inlined]
DFS at ./compiler/ssair/domtree.jl:197
SNCA at ./compiler/ssair/domtree.jl:269
construct_domtree at ./compiler/ssair/domtree.jl:121
run_passes at ./compiler/ssair/driver.jl:131
optimize at ./compiler/optimize.jl:172
typeinf at ./compiler/typeinfer.jl:32
typeinf_edge at ./compiler/typeinfer.jl:522
abstract_call_method at ./compiler/abstractinterpretation.jl:453
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:129
abstract_call_known at ./compiler/abstractinterpretation.jl:991
abstract_call at ./compiler/abstractinterpretation.jl:1014
abstract_call at ./compiler/abstractinterpretation.jl:998
abstract_eval_statement at ./compiler/abstractinterpretation.jl:1119
typeinf_local at ./compiler/abstractinterpretation.jl:1375
typeinf_nocycle at ./compiler/abstractinterpretation.jl:1431
typeinf at ./compiler/typeinfer.jl:12
typeinf_ext at ./compiler/typeinfer.jl:609
typeinf_ext_toplevel at ./compiler/typeinfer.jl:642
typeinf_ext_toplevel at ./compiler/typeinfer.jl:638
jfptr_typeinf_ext_toplevel_10082.clone_1 at /home/bkamins/Downloads/julia-latest-linux64/julia-371bfa89d4/lib/julia/sys.so (unknown line)
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1750 [inlined]
jl_type_infer at /buildworker/worker/package_linux64/build/src/gf.c:300
jl_generate_fptr at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:293
jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1890
jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1841 [inlined]
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2145 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2336
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1750 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:117
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:206
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:157 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:551
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:659
top-level scope at REPL[15]:1
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:838
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:788
jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/toplevel.c:881
eval at ./boot.jl:344
eval_user_input at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:134
repl_backend_loop at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:195
start_repl_backend at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:180
#run_repl#41 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:311
run_repl at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:299
#844 at ./client.jl:386
jfptr_YY.844_30029.clone_1 at /home/bkamins/Downloads/julia-latest-linux64/julia-371bfa89d4/lib/julia/sys.so (unknown line)
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1750 [inlined]
do_apply at /buildworker/worker/package_linux64/build/src/builtins.c:655
jl_f__apply_latest at /buildworker/worker/package_linux64/build/src/builtins.c:705
#invokelatest#2 at ./essentials.jl:718 [inlined]
invokelatest at ./essentials.jl:717 [inlined]
run_main_repl at ./client.jl:371
exec_options at ./client.jl:301
_start at ./client.jl:484
jfptr__start_22499.clone_1 at /home/bkamins/Downloads/julia-latest-linux64/julia-371bfa89d4/lib/julia/sys.so (unknown line)
jl_apply at /buildworker/worker/package_linux64/build/ui/../src/julia.h:1750 [inlined]
true_main at /buildworker/worker/package_linux64/build/ui/repl.c:106
main at /buildworker/worker/package_linux64/build/ui/repl.c:227
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at /home/bkamins/Downloads/julia-latest-linux64/julia-371bfa89d4/bin/julia (unknown line)
3×1 DataFrame
│ Row │ x1       │
│     │ Float64  │
├─────┼──────────┤
│ 1   │ 0.696103 │
│ 2   │ 0.628184 │
│ 3   │ 0.610729 │
> Do we have tests to catch an inference failure if somebody removed the assertion?

Yes - tests will catch this then.
Thank you for approving, but I think we should wait to see whether we get feedback from the core devs on the nightly issues.
@nalimilan - added NEWS.md entry. Can you have a quick look please before merging?
Thank you!
This is a non-breaking proposal to add an option to `filter`, `sort`, `dropmissing`, and `unique` to return a view instead of a copy. This is relevant for very large data frames on which these operations can easily eat up all memory. It is also faster than making a copy (of course, whether it is worth doing depends on what one wants to do with the data frame later).

The drawback is that we become type unstable, but this is a case where we have a small union (`DataFrame` or `SubDataFrame`), so the compiler should handle it. An alternative with `Val` seemed to be overkill (though it would be type stable then).

For now I have just written an implementation and proposed docstring updates. If we think this PR is OK I will add tests and update NEWS.md.
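For comparison, the `Val` alternative mentioned above could look roughly like the sketch below. It uses hypothetical stand-in types (`Frame`, `FrameView`) and a made-up function `pick_even`; it is not the API this PR adds, which uses a `view::Bool` keyword instead.

```julia
# Hypothetical stand-ins for DataFrame and SubDataFrame.
struct Frame
    rows::Vector{Int}
end

struct FrameView
    parent::Frame
    idx::Vector{Int}
end

# One method per Val, so each method has a single concrete return type.
pick_even(f::Frame, ::Val{false}) = Frame(filter(iseven, f.rows))
pick_even(f::Frame, ::Val{true}) = FrameView(f, findall(iseven, f.rows))

# Convenience method that defaults to copying.
pick_even(f::Frame) = pick_even(f, Val(false))

fr = Frame([1, 2, 3, 4])
copied = pick_even(fr)
viewed = pick_even(fr, Val(true))
```

The cost of this design is a less convenient call syntax (`Val(true)` instead of `view=true`), which is the trade-off the description refers to.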