Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/Compatibility Request: Arrow and/or DuckDB support for get_age #18

Open
TPDeramus opened this issue Oct 23, 2024 · 6 comments · Fixed by #22
Open

Feature/Compatibility Request: Arrow and/or DuckDB support for get_age #18

TPDeramus opened this issue Oct 23, 2024 · 6 comments · Fixed by #22

Comments

@TPDeramus
Copy link

Hi Michael.

Thanks for sharing your utility functions, get_age in particular has come in extremely handy for some of the data I've been working with where age needs to be more precise than the year rounded down.

However, a lot of the data I'm working with happens to be part of VERY large datasets that need to be loaded then mutated in arrow or duckdb tables to work even remotely on a decent scale.

I believe it would be possible to do so using arrow by defining it as a function using register_scalar_function:
https://arrow.apache.org/docs/dev/r/reference/register_scalar_function.html

And I had some success with getting it to run this way, but I think the data.table and foverlaps requirements are inadvertently pulling it into R or running all columns at once in the dplyr call and slowing it down in some way.

Would you happen to have some experience with the arrow package to the degree that you might be able to provide some suggestions on to how this could be done for get_age()?

This is hanging just like converting the data to a tibble() and running it in R because there's so much of it:

register_scalar_function(
  "get_age_arrow",
  function(context, birthdays, ref_dates) {
  x <- data.table(bday <- unclass(birthdays),
                  #rem: how many days has it been since the lapse of the
                  #  most recent quadrennium since your birth?
                  rem = ((ref <- unclass(ref_dates)) - bday) %% 1461)
  #cycle_type: which of the four years following your birthday
  #  was the one that had 366 days? 
  x[ , cycle_type := 
       foverlaps(data.table(start = bdr <- bday %% 1461L, end = bdr),
                 #these intervals were calculated by hand;
                 #  e.g., 59 is Feb. 28, 1970. I made the judgment
                 #  call to say that those born on Feb. 29 don't
                 #  have their "birthday" until the following March 1st.
                 data.table(start = c(0L, 59L, 424L, 790L, 1155L), 
                            end = c(58L, 423L, 789L, 1154L, 1460L), 
                            val = c(3L, 2L, 1L, 4L, 3L),
                            key = "start,end"))$val]
  I4 <- diag(4L)[ , -4L] #for conciseness below
  #The `by` approach might seem a little abstruse for those
  #  not familiar with `data.table`; see the edit history
  #  for a more palatable version (which is also slightly slower)
  x[ , extra := 
       foverlaps(data.table(start = rem, end = rem),
                 data.table(start = st <- cumsum(c(0L, rep(365L, 3L) +
                                                     I4[.BY[[1L]],])),
                            end = c(st[-1L] - 1L, 1461L),
                            int_yrs = 0:3, key = "start,end")
       )[ , int_yrs + (i.start - start) / (end + 1L - start)], by = cycle_type]
  #grand finale -- 4 years for every quadrennium, plus the fraction:
  4L * ((ref - bday) %/% 1461L) + x$extra
},
  in_type = schema(birthdays = date32(), ref_dates = date32()),
  out_type = float64(),
  auto_convert = TRUE
)

Thanks in advance!

@MichaelChirico
Copy link
Owner

Sorry I have not had a chance to use {arrow} much, and certainly not in a few years. Pinging @jonkeane who may have a better idea.

@jonkeane
Copy link

jonkeane commented Dec 9, 2024

Hmm, yeah looking at that code using a UDF with arrow would likely be pretty slow. UDFs in arrow operate one row at a time, so there would be a non-trivial amount of overhead doing that.

If one refactors the code calculating fractional age to use any of the base arithmetic functions (a more or less full list of functions that have been mapped are at https://arrow.apache.org/docs/r/reference/acero.html), which should be possible if I'm reading this code correctly, it could run natively in arrow's query execution engine fully vectorized (and likely quite quickly).

@MichaelChirico
Copy link
Owner

@jonkeane how easy is it to register new mappings?

I reckon I could rewrite the foverlaps() steps to use fcase() instead, which works like dplyr::case_when() with a slightly different API. (fcase() didn't exist when I first wrote this function)

I assume it wouldn't require acero to depend on {data.table} (i.e. akin to {dbplyr} which just statically analyzes things and makes replacements to map to the underlying {arrow} backend).

@TPDeramus
Copy link
Author

Wish I had more information but haven't tried it myself.

Seemed relatively straightforward when I tried it with the only bottleneck being the use of data.table() on an extremely large dataset.

@MichaelChirico
Copy link
Owner

Thanks a ton for the FR @TPDeramus. It forced me to revisit the implementation of get_age() -- it's now (1) much simpler to read (2) a fair amount faster and (3) more correct!

IINM the implementation now in #22 can easily be translated to other engines. Working on closing that loop now.

@MichaelChirico
Copy link
Owner

Filed apache/arrow#45098 on the {arrow} side to close the loop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants