Feature/Compatibility Request: Arrow and/or DuckDB support for `get_age` #18

TPDeramus · 2024-10-23T21:04:04Z

Hi Michael.

Thanks for sharing your utility functions, get_age in particular has come in extremely handy for some of the data I've been working with where age needs to be more precise than the year rounded down.

However, a lot of the data I'm working with happens to be part of VERY large datasets that need to be loaded then mutated in arrow or duckdb tables to work even remotely on a decent scale.

I believe it would be possible to do so using arrow by defining it as a function using register_scalar_function:
https://arrow.apache.org/docs/dev/r/reference/register_scalar_function.html

And I had some success with getting it to run this way, but I think the data.table and foverlaps requirements are inadvertently pulling it into R or running all columns at once in the dplyr call and slowing it down in some way.

Would you happen to have some experience with the arrow package to the degree that you might be able to provide some suggestions on to how this could be done for get_age()?

This is hanging just like converting the data to a tibble() and running it in R because there's so much of it:

register_scalar_function(
  "get_age_arrow",
  function(context, birthdays, ref_dates) {
  x <- data.table(bday <- unclass(birthdays),
                  #rem: how many days has it been since the lapse of the
                  #  most recent quadrennium since your birth?
                  rem = ((ref <- unclass(ref_dates)) - bday) %% 1461)
  #cycle_type: which of the four years following your birthday
  #  was the one that had 366 days? 
  x[ , cycle_type := 
       foverlaps(data.table(start = bdr <- bday %% 1461L, end = bdr),
                 #these intervals were calculated by hand;
                 #  e.g., 59 is Feb. 28, 1970. I made the judgment
                 #  call to say that those born on Feb. 29 don't
                 #  have their "birthday" until the following March 1st.
                 data.table(start = c(0L, 59L, 424L, 790L, 1155L), 
                            end = c(58L, 423L, 789L, 1154L, 1460L), 
                            val = c(3L, 2L, 1L, 4L, 3L),
                            key = "start,end"))$val]
  I4 <- diag(4L)[ , -4L] #for conciseness below
  #The `by` approach might seem a little abstruse for those
  #  not familiar with `data.table`; see the edit history
  #  for a more palatable version (which is also slightly slower)
  x[ , extra := 
       foverlaps(data.table(start = rem, end = rem),
                 data.table(start = st <- cumsum(c(0L, rep(365L, 3L) +
                                                     I4[.BY[[1L]],])),
                            end = c(st[-1L] - 1L, 1461L),
                            int_yrs = 0:3, key = "start,end")
       )[ , int_yrs + (i.start - start) / (end + 1L - start)], by = cycle_type]
  #grand finale -- 4 years for every quadrennium, plus the fraction:
  4L * ((ref - bday) %/% 1461L) + x$extra
},
  in_type = schema(birthdays = date32(), ref_dates = date32()),
  out_type = float64(),
  auto_convert = TRUE
)

Thanks in advance!

The text was updated successfully, but these errors were encountered:

MichaelChirico · 2024-12-04T08:02:51Z

Sorry I have not had a chance to use {arrow} much, and certainly not in a few years. Pinging @jonkeane who may have a better idea.

jonkeane · 2024-12-09T04:00:10Z

Hmm, yeah looking at that code using a UDF with arrow would likely be pretty slow. UDFs in arrow operate one row at a time, so there would be a non-trivial amount of overhead doing that.

If one refactors the code calculating fractional age to use any of the base arithmetic functions (a more or less full list of functions that have been mapped are at https://arrow.apache.org/docs/r/reference/acero.html), which should be possible if I'm reading this code correctly, it could run natively in arrow's query execution engine fully vectorized (and likely quite quickly).

MichaelChirico · 2024-12-20T09:10:48Z

@jonkeane how easy is it to register new mappings?

I reckon I could rewrite the foverlaps() steps to use fcase() instead, which works like dplyr::case_when() with a slightly different API. (fcase() didn't exist when I first wrote this function)

I assume it wouldn't require acero to depend on {data.table} (i.e. akin to {dbplyr} which just statically analyzes things and makes replacements to map to the underlying {arrow} backend).

TPDeramus · 2024-12-20T13:37:03Z

Wish I had more information but haven't tried it myself.

Seemed relatively straightforward when I tried it with the only bottleneck being the use of data.table() on an extremely large dataset.

MichaelChirico · 2024-12-23T07:05:37Z

Thanks a ton for the FR @TPDeramus. It forced me to revisit the implementation of get_age() -- it's now (1) much simpler to read (2) a fair amount faster and (3) more correct!

IINM the implementation now in #22 can easily be translated to other engines. Working on closing that loop now.

MichaelChirico · 2024-12-23T07:23:14Z

Filed apache/arrow#45098 on the {arrow} side to close the loop.

This was referenced Dec 21, 2024

Fix unreadable tests using integral date representation #21

Merged

Use fcase() in get_age() #22

Merged

MichaelChirico mentioned this issue Dec 23, 2024

Support translations of simple data.table functions apache/arrow#45098

Open

MichaelChirico closed this as completed in #22 Dec 23, 2024

MichaelChirico reopened this Dec 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/Compatibility Request: Arrow and/or DuckDB support for `get_age` #18

Feature/Compatibility Request: Arrow and/or DuckDB support for `get_age` #18

TPDeramus commented Oct 23, 2024

MichaelChirico commented Dec 4, 2024

jonkeane commented Dec 9, 2024

MichaelChirico commented Dec 20, 2024

TPDeramus commented Dec 20, 2024

MichaelChirico commented Dec 23, 2024

MichaelChirico commented Dec 23, 2024

Feature/Compatibility Request: Arrow and/or DuckDB support for get_age #18

Feature/Compatibility Request: Arrow and/or DuckDB support for get_age #18

Comments

TPDeramus commented Oct 23, 2024

MichaelChirico commented Dec 4, 2024

jonkeane commented Dec 9, 2024

MichaelChirico commented Dec 20, 2024

TPDeramus commented Dec 20, 2024

MichaelChirico commented Dec 23, 2024

MichaelChirico commented Dec 23, 2024

Feature/Compatibility Request: Arrow and/or DuckDB support for `get_age` #18

Feature/Compatibility Request: Arrow and/or DuckDB support for `get_age` #18