Merged
87 commits
ec0661e
adds round_temporal to make_compute_options
djnavarro Jan 12, 2022
49726c9
adds minimal round_temporal() implementation
djnavarro Jan 12, 2022
adf697b
registers date_round as a dplyr binding
djnavarro Jan 13, 2022
594129f
support multiple option to round_temporal
djnavarro Jan 14, 2022
9b37f30
support integer multiples in round_date() unit argument
djnavarro Jan 14, 2022
a2c62dc
use base::is.na
djnavarro Jan 14, 2022
b8db692
adds floor_date() and ceiling_date() bindings
djnavarro Jan 14, 2022
52a4cca
adds failing tests
djnavarro Jan 14, 2022
91ba5df
enforces maximum multiple for seconds, minutes and hours
djnavarro Jan 14, 2022
f20f1d1
support lubridate syntax for rounding to fractional seconds
djnavarro Jan 14, 2022
0dedf9d
date rounding tests mirror lubridate prior to v1.6.0
djnavarro Jan 14, 2022
f548a8a
simplified datetime tests
djnavarro Jan 14, 2022
9bc9814
tidy notes
djnavarro Jan 14, 2022
1076419
moves test_df_v2 to top
djnavarro Jan 16, 2022
18ae05a
allows rounding unit to exceed 60sec, 60min, 24hr
djnavarro Jan 16, 2022
96f9ee8
close parenthesis
djnavarro Jan 16, 2022
2c907e4
spaces after "if"
djnavarro Jan 16, 2022
5173f48
restore lubridate defaults for unit thresholds
djnavarro Jan 16, 2022
bcf0f70
tests lubridate rounding thresholds more directly
djnavarro Jan 16, 2022
8298492
use testdate + 1 in test
djnavarro Jan 17, 2022
7fca3fb
binding arguments mirror lubridate arguments
djnavarro Jan 17, 2022
2fc520d
do not attempt to round to "week" units
djnavarro Jan 17, 2022
76dc55e
remove todo comment
djnavarro Jan 17, 2022
e2ac886
skip timezone dependent tests on windows
djnavarro Jan 17, 2022
825df2d
skips more timezone-dependent tests on windows sigh
djnavarro Jan 18, 2022
946262c
skips all date/time rounding tests on windows
djnavarro Jan 18, 2022
58b0913
sets change_on_boundary = FALSE for ceiling_time tests
djnavarro Mar 10, 2022
92f7978
change_on_boundary=TRUE works for special case of 1 day
djnavarro Mar 10, 2022
cea0c6b
support change_on_boundary=TRUE for datetimes below "month" units
djnavarro Mar 11, 2022
d6e7afc
minimal test for desired date behaviour for change_on_boundary = TRUE
djnavarro Mar 11, 2022
461a4ab
Add change_on_boundary to RoundTemporalOptions
rok Mar 17, 2022
aca003b
Update r/src/compute.cpp
djnavarro Mar 30, 2022
d18c0ca
support change_on_boundary
djnavarro Apr 6, 2022
3cf0069
removes unneeded code
djnavarro Apr 6, 2022
916bb0f
fix change on boundary bug
djnavarro Apr 6, 2022
3e67f75
updates tests
djnavarro Apr 6, 2022
f34d961
minimal support for round to week
djnavarro Apr 6, 2022
a187c26
attempt to fix linting issue
djnavarro Apr 6, 2022
f42f6e1
fix whitespace
djnavarro Apr 6, 2022
b8f14b4
extends ceiling_date tests
djnavarro Apr 6, 2022
b1e64c4
isolates strangeness with date arrays
djnavarro Apr 6, 2022
a04087d
change_on_boundary -> ceil_is_strictly_greater
rok Jun 23, 2022
6898e13
isolates last remaining issue with change_on_boundary to possible flo…
djnavarro Jun 27, 2022
b86e292
ceiling_date and floor_date support "week" units for week start 1 and 7
djnavarro Jun 27, 2022
ee70eab
week_start values 2:6 work for datetimes (but not dates)
djnavarro Jun 27, 2022
24981ef
leap_year calls is_leap_year directly
djnavarro Jun 28, 2022
5f56587
week_starts_monday, ceil_is_strictly_greater, and calendar_based_orig…
djnavarro Jun 28, 2022
f1c24f3
removes skip_on_windows
djnavarro Jun 28, 2022
cf65c1b
reorganise tests
djnavarro Jun 28, 2022
9fcd76f
bypass lubridate round-to-week bug on round_date tests
djnavarro Jun 28, 2022
23a2bde
shift_date32_to_week and shift_timestamp_to_week added for case when …
djnavarro Jun 28, 2022
cec9639
linting fix
djnavarro Jun 28, 2022
61da05b
fix another linting issue
djnavarro Jun 28, 2022
330adc3
even more linting fixes
djnavarro Jun 28, 2022
0b72a1c
groups datetime rounding functions together
djnavarro Jun 28, 2022
f631660
moves all round/floor/ceil-to-week to shift_temoporal_to_week
djnavarro Jun 28, 2022
85896b0
removes redundant binding
djnavarro Jun 28, 2022
438fe57
sigh
djnavarro Jun 28, 2022
b72119d
tidies date rounding functions
djnavarro Jun 29, 2022
0f9325b
removes unneeded skips and adds test comments
djnavarro Jun 29, 2022
9de5cfc
reorganises temporal rounding unit tests
djnavarro Jun 29, 2022
e9ee3e0
checks wider range of boundaries for ceiling_time
djnavarro Jun 29, 2022
eef44ca
restores failing tests for month/quarter/year
djnavarro Jun 29, 2022
dd0dd7e
temporarily removes all datetime rounding tests
djnavarro Jul 1, 2022
84f8602
restores datetime rounding tests
djnavarro Jul 1, 2022
0c37995
fixes linting issues
djnavarro Jul 1, 2022
a4be683
attaches timezone attribute correctly for year_of_dates test object
djnavarro Jul 1, 2022
f73d405
tests week/month/year rounding for Dates bypassing lubridate bug
djnavarro Jul 1, 2022
0ae85db
tests for rounding in local time
djnavarro Jul 1, 2022
3928dee
don't link directly to lubridate issue
djnavarro Jul 1, 2022
ae6b4eb
adds Pacific/Marquesas and Asia/Kathmandu timezones
djnavarro Jul 1, 2022
147f70b
adds more timezone tests
djnavarro Jul 1, 2022
fef7e1c
restricts timezone test cases on windows os
djnavarro Jul 1, 2022
b114276
removes trailing whitespace
djnavarro Jul 1, 2022
c6cf2b1
adds missing test: timezone check for round to week
djnavarro Jul 2, 2022
109e46d
fixes date strings test bug
djnavarro Jul 3, 2022
6310abb
bypasses esoteric lubridate timezone-local rounding bug and adds an i…
djnavarro Jul 4, 2022
a9b3144
adds internal consistency tests for timezone-local rounding to multip…
djnavarro Jul 4, 2022
16f989b
sigh. don't test subsecond rounding with test that isn't designed to …
djnavarro Jul 4, 2022
bbbbe00
Removing ARROW-16412 workarounds
rok Jul 13, 2022
9a402c7
incorporate stylistic comments
djnavarro Jul 20, 2022
dbf6a31
moves temporal rounding helpers to dplyr-datetime-helpers.R
djnavarro Jul 20, 2022
5962dcb
tidies style on helper functions
djnavarro Jul 20, 2022
14b74e0
more stylistic edits
djnavarro Jul 21, 2022
76b3c73
removes outdated comment regarding ARROW-16142
djnavarro Jul 21, 2022
13ae009
notes what we mean by an "easy" date
djnavarro Jul 21, 2022
075fcfc
simplifies second/multiple check
djnavarro Jul 21, 2022
158 changes: 158 additions & 0 deletions r/R/dplyr-datetime-helpers.R
@@ -417,3 +417,161 @@ build_strptime_exprs <- function(x, formats) {
)
)
}

# This function parses the "unit" argument to round_date, floor_date, and
# ceiling_date. The input x is a single string like "second", "3 seconds",
# "10 microseconds" or "2 secs" that specifies the size of the unit to
# which the temporal data should be rounded. The matching rules are designed
# to mirror lubridate exactly: the function extracts the numeric multiple
# from the start of the string (presumed to be 1 if no number is present)
# and selects the unit by looking at the first 3 characters only. This choice
# ensures that "secs", "second", "microsecs" etc. are all valid, but it is
# very permissive and would interpret "mickeys" as microseconds. The return
# value is a list with integer-valued components "multiple" and "unit",
# where "unit" is the zero-based position of the unit in known_units.
parse_period_unit <- function(x) {
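# the regexp also matches fractional multiples, but (following lubridate)
# only integer multiples of a known unit are supported; the one exception
# is fractions of a second, which are handled as a special case below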
match_info <- regexpr(
pattern = " *(?<multiple>[0-9.,]+)? *(?<unit>[^ \t\n]+)",
text = x[[1]],
perl = TRUE
)

capture_start <- attr(match_info, "capture.start")
capture_length <- attr(match_info, "capture.length")
capture_end <- capture_start + capture_length - 1L

str_unit <- substr(x, capture_start[[2]], capture_end[[2]])
str_multiple <- substr(x, capture_start[[1]], capture_end[[1]])

known_units <- c("nanosecond", "microsecond", "millisecond", "second",
"minute", "hour", "day", "week", "month", "quarter", "year")

# match the period unit
str_unit_start <- substr(str_unit, 1, 3)
unit <- as.integer(pmatch(str_unit_start, known_units)) - 1L

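if (any(is.na(unit))) {
# both the offending name and the list of known units go through the
# format string; extra sprintf arguments without a %s would be dropped
abort(
sprintf(
"Invalid period name: '%s'. Known units are %s",
str_unit,
oxford_paste(known_units, "and")
)
)
}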

# empty string in multiple interpreted as 1
if (capture_length[[1]] == 0) {
multiple <- 1L

# otherwise parse the multiple
} else {
multiple <- as.numeric(str_multiple)

# special cases: interpret fractions of 1 second as integer
# multiples of nanoseconds, microseconds, or milliseconds
# to mirror lubridate syntax; the branches are mutually exclusive
# so a rescaled multiple is never rescaled a second time
if (unit == 3L) {
if (multiple < 10^-6) {
unit <- 0L
multiple <- 10^9 * multiple
} else if (multiple < 10^-3) {
unit <- 1L
multiple <- 10^6 * multiple
} else if (multiple < 1) {
unit <- 2L
multiple <- 10^3 * multiple
}
}

multiple <- as.integer(multiple)
}

# more special cases: lubridate imposes sensible maximum
# values on the number of seconds, minutes and hours
if (unit == 3L && multiple > 60) {
abort("Rounding with second > 60 is not supported")
}
if (unit == 4L && multiple > 60) {
abort("Rounding with minute > 60 is not supported")
}
if (unit == 5L && multiple > 24) {
abort("Rounding with hour > 24 is not supported")
}

list(unit = unit, multiple = multiple)
}
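
The prefix-matching rule described in the comments of `parse_period_unit()` can be tried in isolation. The snippet below is a minimal base R sketch, not the package code: `unit_code` is a hypothetical helper introduced here for illustration, and it reproduces only the three-character `pmatch()` lookup, not the parsing of the multiple.

```r
# Same unit table as parse_period_unit(); codes are zero-based positions.
known_units <- c(
  "nanosecond", "microsecond", "millisecond", "second",
  "minute", "hour", "day", "week", "month", "quarter", "year"
)

# Hypothetical helper: look up a unit by its first three characters only.
unit_code <- function(s) {
  as.integer(pmatch(substr(s, 1, 3), known_units)) - 1L
}

unit_code("secs")      # 3 (second)
unit_code("weeks")     # 7 (week)
unit_code("mickeys")   # 1 (microsecond): the permissive case noted in the comments
unit_code("fortnight") # NA: no known unit starts with "for"
```

The zero-based codes are why the surrounding code compares `unit` against `3L` for seconds and `7L` for weeks.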

# This function handles round/ceil/floor when unit is week. The fn argument
# names the compute function to apply ("round_temporal", "floor_temporal",
# or "ceil_temporal"), x is the data argument to the rounding function,
# week_start is an integer indicating which day of the week is the start of
# the week (1 = Monday, 7 = Sunday), and options is the list returned by
# parse_period_unit. The C++ library natively handles Sunday and Monday so
# in those cases we pass the week_starts_monday option through. Other
# week_start values are handled here
shift_temporal_to_week <- function(fn, x, week_start, options) {
if (week_start == 7) { # Sunday
options$week_starts_monday <- FALSE
return(Expression$create(fn, x, options = options))
}

if (week_start == 1) { # Monday
options$week_starts_monday <- TRUE
return(Expression$create(fn, x, options = options))
}

# other cases use offset-from-Monday: to ensure type-stable output there
# are two separate helpers, one to handle date32 input and the other to
# handle timestamps
options$week_starts_monday <- TRUE
offset <- as.integer(week_start) - 1

is_date32 <- inherits(x, "Date") ||
(inherits(x, "Expression") && x$type_id() == Type$DATE32)

if (is_date32) {
shifted_date <- shift_date32_to_week(fn, x, offset, options = options)
} else {
shifted_date <- shift_timestamp_to_week(fn, x, offset, options = options)
}

shifted_date
}

# timestamp input should remain timestamp
shift_timestamp_to_week <- function(fn, x, offset, options) {
offset_seconds <- build_expr(
"cast",
Scalar$create(offset * 86400L, int64()),
options = cast_options(to_type = duration(unit = "s"))
)
shift_offset <- build_expr(fn, x - offset_seconds, options = options)

shift_offset + offset_seconds
}

# to avoid date32 types being cast to timestamp during the temporal
# arithmetic, the offset logic needs to use the count in days and
# use integer arithmetic: this feels inelegant, but it ensures that
# temporal rounding functions remain type stable
shift_date32_to_week <- function(fn, x, offset, options) {
# offset the date
offset <- Expression$scalar(Scalar$create(offset, int32()))
x_int <- build_expr("cast", x, options = cast_options(to_type = int32()))
x_int_offset <- x_int - offset
x_offset <- build_expr("cast", x_int_offset, options = cast_options(to_type = date32()))

# apply round/floor/ceil
shift_offset <- build_expr(fn, x_offset, options = options)

# undo offset and return
shift_int_offset <- build_expr("cast", shift_offset, options = cast_options(to_type = int32()))
shift_int <- shift_int_offset + offset

build_expr("cast", shift_int, options = cast_options(to_type = date32()))
}
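
The offset-from-Monday idea used by `shift_date32_to_week()` (subtract the offset, round, add the offset back, all in whole days) can be checked with plain `Date` values. This is a base R sketch under stated assumptions: `floor_to_monday` and `floor_to_week_start` are hypothetical names standing in for the `floor_temporal` kernel and the helper, and only the floor case is shown.

```r
# Hypothetical stand-in for floor_temporal with week_starts_monday = TRUE.
# Day 0 (1970-01-01) was a Thursday, so (days + 3) %% 7 counts days since Monday.
floor_to_monday <- function(d) {
  days <- as.integer(d)
  as.Date(days - (days + 3L) %% 7L, origin = "1970-01-01")
}

# Shift back by the offset from Monday, floor to Monday, then shift forward.
floor_to_week_start <- function(d, week_start) {
  offset <- as.integer(week_start) - 1L
  floor_to_monday(d - offset) + offset
}

d <- as.Date("2022-06-29")                    # a Wednesday
format(floor_to_week_start(d, week_start = 1)) # "2022-06-27" (Monday)
format(floor_to_week_start(d, week_start = 3)) # "2022-06-29" (Wednesday)
format(floor_to_week_start(d, week_start = 4)) # "2022-06-23" (Thursday)
```

Working in integer day counts mirrors the int32 casts in the helper, which exist so that date32 input is not promoted to a timestamp along the way.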
52 changes: 52 additions & 0 deletions r/R/dplyr-funcs-datetime.R
@@ -25,6 +25,7 @@ register_bindings_datetime <- function() {
register_bindings_duration_constructor()
register_bindings_duration_helpers()
register_bindings_datetime_parsers()
register_bindings_datetime_rounding()
}

register_bindings_datetime_utility <- function() {
@@ -622,4 +623,55 @@ register_bindings_datetime_parsers <- function() {

build_expr("assume_timezone", coalesce_output, options = list(timezone = tz))
})

}

register_bindings_datetime_rounding <- function() {
register_binding(
"round_date",
function(x,
unit = "second",
week_start = getOption("lubridate.week.start", 7)) {

opts <- parse_period_unit(unit)
if (opts$unit == 7L) { # weeks (unit = 7L) need to accommodate week_start
return(shift_temporal_to_week("round_temporal", x, week_start, options = opts))
}

Expression$create("round_temporal", x, options = opts)
})

register_binding(
"floor_date",
function(x,
unit = "second",
week_start = getOption("lubridate.week.start", 7)) {

opts <- parse_period_unit(unit)
if (opts$unit == 7L) { # weeks (unit = 7L) need to accommodate week_start
return(shift_temporal_to_week("floor_temporal", x, week_start, options = opts))
}

Expression$create("floor_temporal", x, options = opts)
})

register_binding(
"ceiling_date",
function(x,
unit = "second",
change_on_boundary = NULL,
week_start = getOption("lubridate.week.start", 7)) {
opts <- parse_period_unit(unit)
if (is.null(change_on_boundary)) {
change_on_boundary <- ifelse(call_binding("is.Date", x), TRUE, FALSE)
}
opts$ceil_is_strictly_greater <- change_on_boundary

if (opts$unit == 7L) { # weeks (unit = 7L) need to accommodate week_start
return(shift_temporal_to_week("ceil_temporal", x, week_start, options = opts))
}

Expression$create("ceil_temporal", x, options = opts)
})

}
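
In the `ceiling_date` binding, `change_on_boundary` is passed through as `ceil_is_strictly_greater`, and it defaults to `TRUE` for dates and `FALSE` otherwise. The effect of that flag can be sketched on plain numbers. The helper below is a hypothetical illustration in base R, not the Arrow kernel, and it assumes the input is a count of seconds.

```r
# Hypothetical helper: round x up to a multiple of `size`.
# When strictly_greater is TRUE, values already on a boundary move to the
# next boundary; otherwise they are returned unchanged.
ceil_to_multiple <- function(x, size, strictly_greater) {
  on_boundary <- x %% size == 0
  rounded_up <- ceiling(x / size) * size
  ifelse(on_boundary & strictly_greater, x + size, rounded_up)
}

day <- 86400  # seconds per day
ceil_to_multiple(2 * day, day, strictly_greater = FALSE)     # 172800
ceil_to_multiple(2 * day, day, strictly_greater = TRUE)      # 259200
ceil_to_multiple(2 * day + 1, day, strictly_greater = FALSE) # 259200
```

Only values that sit exactly on a boundary are affected; everything else rounds up the same way under either setting.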
29 changes: 29 additions & 0 deletions r/src/compute.cpp
@@ -519,6 +519,35 @@ std::shared_ptr<arrow::compute::FunctionOptions> make_compute_options(
return out;
}

if (func_name == "round_temporal" || func_name == "floor_temporal" ||
func_name == "ceil_temporal") {
using Options = arrow::compute::RoundTemporalOptions;

int64_t multiple = 1;
enum arrow::compute::CalendarUnit unit = arrow::compute::CalendarUnit::DAY;
bool week_starts_monday = true;
bool ceil_is_strictly_greater = true;
bool calendar_based_origin = true;

if (!Rf_isNull(options["multiple"])) {
multiple = cpp11::as_cpp<int64_t>(options["multiple"]);
}
if (!Rf_isNull(options["unit"])) {
unit = cpp11::as_cpp<enum arrow::compute::CalendarUnit>(options["unit"]);
}
if (!Rf_isNull(options["week_starts_monday"])) {
week_starts_monday = cpp11::as_cpp<bool>(options["week_starts_monday"]);
}
if (!Rf_isNull(options["ceil_is_strictly_greater"])) {
ceil_is_strictly_greater = cpp11::as_cpp<bool>(options["ceil_is_strictly_greater"]);
}
if (!Rf_isNull(options["calendar_based_origin"])) {
calendar_based_origin = cpp11::as_cpp<bool>(options["calendar_based_origin"]);
}
return std::make_shared<Options>(multiple, unit, week_starts_monday,
ceil_is_strictly_greater, calendar_based_origin);
}

if (func_name == "round_to_multiple") {
using Options = arrow::compute::RoundToMultipleOptions;
auto out = std::make_shared<Options>(Options::Defaults());