Feature/Compatibility Request: Arrow and/or DuckDB support for get_age
#18
Comments
Sorry I have not had a chance to use {arrow} much, and certainly not in a few years. Pinging @jonkeane who may have a better idea.
Hmm, yeah, looking at that code, using a UDF with arrow would likely be pretty slow. UDFs in arrow operate one row at a time, so there would be a non-trivial amount of overhead doing that. If one refactors the code calculating fractional age to use any of the base arithmetic functions (a more or less full list of the functions that have been mapped is at https://arrow.apache.org/docs/r/reference/acero.html), which should be possible if I'm reading this code correctly, it could run natively in arrow's query execution engine, fully vectorized (and likely quite quickly).
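A minimal sketch of the kind of refactor suggested above: age computed from date components and plain arithmetic only, with no joins and no row-wise loop. Both helper names are hypothetical, this is not the package's actual `get_age` algorithm, and dividing by 365.25 is an approximation rather than an exact fractional age. Whether each step maps onto an arrow binding is an assumption that would need checking against the function list linked above.

```r
# Hypothetical helpers for illustration; not part of the package in this issue.

# Completed years between two Date vectors, using only component
# comparisons and integer arithmetic (vectorized over both inputs).
whole_years_sketch <- function(birth, ref) {
  b <- as.POSIXlt(birth)
  r <- as.POSIXlt(ref)
  # TRUE where the birthday has already occurred in the reference year
  had_birthday <- (r$mon > b$mon) | (r$mon == b$mon & r$mday >= b$mday)
  (r$year - b$year) - ifelse(had_birthday, 0L, 1L)
}

# Approximate fractional age: elapsed days divided by the mean year length.
# This is an approximation, not an exact leap-year-aware calculation.
approx_fractional_age <- function(birth, ref) {
  as.numeric(ref - birth, units = "days") / 365.25
}

birth <- as.Date(c("2000-03-15", "2000-03-15"))
ref   <- as.Date(c("2020-03-14", "2020-03-15"))

whole_years_sketch(birth, ref)
# 19 20

approx_fractional_age(as.Date("2000-01-01"), as.Date("2020-01-01"))
# 20
```

Because every step is a comparison, subtraction, or division on whole columns, the same logic is the sort of expression a query engine can evaluate without calling back into R.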
@jonkeane how easy is it to register new mappings? I reckon I could rewrite the I assume it wouldn't require acero to depend on {data.table} (i.e. akin to {dbplyr} which just statically analyzes things and makes replacements to map to the underlying {arrow} backend).
Wish I had more information but haven't tried it myself. Seemed relatively straightforward when I tried it with the only bottleneck being the use of
Thanks a ton for the FR @TPDeramus. It forced me to revisit the implementation of IINM the implementation now in #22 can easily be translated to other engines. Working on closing that loop now.
Filed apache/arrow#45098 on the {arrow} side to close the loop.
Hi Michael.

Thanks for sharing your utility functions, `get_age` in particular has come in extremely handy for some of the data I've been working with where age needs to be more precise than the year rounded down.

However, a lot of the data I'm working with happens to be part of VERY large datasets that need to be loaded then `mutated` in `arrow` or `duckdb` tables to work even remotely on a decent scale.

I believe it would be possible to do so using arrow by defining it as a function using `register_scalar_function`:
https://arrow.apache.org/docs/dev/r/reference/register_scalar_function.html

And I had some success with getting it to run this way, but I think the `data.table` and `foverlaps` requirements are inadvertently pulling it into `R` or running all columns at once in the `dplyr` call and slowing it down in some way.

Would you happen to have some experience with the `arrow` package to the degree that you might be able to provide some suggestions on how this could be done for `get_age()`?

This is hanging just like converting the data to a `tibble()` and running it in `R` because there's so much of it:

Thanks in advance!
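For readers unfamiliar with the approach described in the post, here is a hedged sketch of registering an R function as an arrow scalar UDF. It follows the call shape in the `register_scalar_function` documentation linked above. The function `approx_age`, the UDF name `approx_age_udf`, and the sample table are all hypothetical stand-ins, not the poster's actual code and not `get_age()` itself.

```r
library(arrow)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical stand-in for the real age calculation (an approximation).
approx_age <- function(birth, ref) {
  as.numeric(ref - birth, units = "days") / 365.25
}

# Register the R function under a name usable inside arrow dplyr queries.
# The first argument passed to the function is always `context`.
# auto_convert = TRUE hands the inputs over as R vectors and converts
# the returned vector to the declared output type.
register_scalar_function(
  "approx_age_udf",
  function(context, birth, ref) approx_age(birth, ref),
  in_type = schema(birth = date32(), ref = date32()),
  out_type = float64(),
  auto_convert = TRUE
)

# Small illustrative table; the real data would be far larger.
people <- arrow_table(
  birth = as.Date(c("2000-01-01", "1990-06-15")),
  ref   = as.Date(c("2020-01-01", "2020-06-15"))
)

ages <- people |>
  mutate(age = approx_age_udf(birth, ref)) |>
  collect()
```

A UDF registered this way still evaluates its body in R, which is consistent with the slowdown reported in the post; expressing the calculation with functions arrow already maps avoids that round trip.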