Extend and simplify API for calculation of range-based rolling window offsets #17807

Open · wence- wants to merge 5 commits into base: branch-25.04 from wence/fea/range-rolling-redux
Conversation

@wence- (Contributor) commented Jan 24, 2025

Description

In both cudf.pandas and cudf-polars we would like to be able to use the existing libcudf functionality for computing the preceding and following columns for range-based (grouped) rolling windows.

The current functionality was designed with Spark in mind and supports calculations with slightly different requirements from those of pandas and polars.

In this PR, we unify the construction of these offset columns so that a single code path satisfies all uses.

Specifically:

  1. We introduce a new window type: a BOUNDED_OPEN window. This is a range-based window where the endpoint of the range is not included. For example, with an offset of 2 and a current orderby value of 5, a closed preceding window covers values in [3, 5], while an open one covers (3, 5].
  2. We add support for negative values of the window range offset parameter. The existing code only supports non-negative values.

The proposed interface for describing the requested window type going forward is strongly typed, using std::variant. This removes the need for default-constructed scalars when specifying UNBOUNDED and CURRENT_ROW windows.
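As a rough sketch, the variant could look like the following. The names unbounded, current_row, and bounded_closed appear in the changed code below; the exact members and the bounded_open spelling here are assumptions, not the PR's definitions.

#include <variant>

struct unbounded {};    // window extends to the group boundary
struct current_row {};  // window covers rows that compare equal to the current row
struct bounded_closed { cudf::scalar const& delta; };  // endpoint at distance delta, included
struct bounded_open { cudf::scalar const& delta; };    // endpoint at distance delta, excluded
using range_window_type =
  std::variant<unbounded, current_row, bounded_closed, bounded_open>;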

Performance improvements

Spark permits nulls in the orderby column. In the grouped-rolling case, these nulls must be sorted to one end of each group. The current implementation finds the partition point in each group using thrust::for_each over the groups and thrust::partition_point to find the break. For low-cardinality grouped rolling aggregations this can be very slow, since we have only a single thread searching over each group. We replace this with a segmented sum with CUB to find the per-group null count (and hence the break).

The dispatched functor that computes the bound for a given row value uses constexpr dispatch for the case, common and indeed required for pandas and polars, in which the orderby column has no nulls. This shaves off some milliseconds.
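A minimal sketch of the idea (the functor and member names here are hypothetical, not the PR's):

template <bool has_nulls>
struct bound_functor {
  cudf::size_type null_count;  // only consulted when has_nulls is true

  __device__ cudf::size_type preceding_bound(cudf::size_type i) const
  {
    if constexpr (has_nulls) {
      // Nulls were sorted to one end of each group; rows in the null
      // region get a trivial bound.
      if (i < null_count) { return 0; }
    }
    // In the has_nulls == false instantiation the branch above is compiled
    // out, so the common pandas/polars path pays nothing for null handling.
    return i;  // placeholder for the real range search
  }
};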

In the case that the orderby column does have nulls, we extend the interface so that the caller (who must have arranged that the column is sorted correctly) tells us whether nulls were sorted BEFORE or AFTER the valid rows. This avoids multiple kernel launches to deduce the null order, saving some milliseconds and (in the grouped case) memory footprint. We polyfill this deduction so that the existing interface still works until we can deprecate and remove it.

Guide for review

The core logic is implemented in range_utils.cuh, specifically in the range_window_clamper, which is dispatched to by make_range_window. To keep compile times under control, the template instantiations for computing preceding and following windows are moved into separate translation units: locally, compilation takes ~10 minutes per TU. Things could be split out further if deemed necessary.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented Jan 24, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@github-actions bot added labels: libcudf (Affects libcudf (C++/CUDA) code), CMake (CMake build issue) on Jan 24, 2025
@wence- added labels: improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change) on Jan 24, 2025
@wence- (Contributor, Author) commented Jan 24, 2025

In draft while waiting for #17787.

@wence- wence- changed the base branch from branch-25.02 to branch-25.04 January 24, 2025 17:54
@mythrocks (Contributor) commented, quoting the description:
For low-cardinality grouped rolling aggregations this can be very slow, since we have only a single thread searching over each group. We replace this with a segmented sum with CUB to find the per-group null count (and hence the break).

This is brilliant.

@wence- wence- force-pushed the wence/fea/range-rolling-redux branch from 4fe1283 to a1f27ba on January 29, 2025 17:28
@wence- wence- marked this pull request as ready for review January 29, 2025 17:29
@wence- wence- requested review from a team as code owners January 29, 2025 17:29
@wence- wence- requested review from vyasr and bdice January 29, 2025 17:29
@wence- (Contributor, Author) left a comment:

signposts

Comment on lines +693 to +703
std::unique_ptr<table> grouped_range_rolling_window(
  table_view const& group_keys,
  column_view const& orderby,
  order order,
  null_order null_order,
  range_window_type preceding,
  range_window_type following,
  size_type min_periods,
  std::vector<std::pair<column_view const&, rolling_aggregation const&>> requests,
  rmm::cuda_stream_view stream = cudf::get_default_stream(),
  rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());
@wence-:

I haven't yet added tests of this function; I want to bikeshed the interface first.

For example, should min_periods be part of the request? Do we want a rolling_request object, similar to the groupby_request we have for grouped aggregations?
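For reference while bikeshedding, a hedged sketch of what a call to this proposed overload might look like (the column names, the delta scalar, and the request contents are illustrative only):

std::vector<std::pair<column_view const&, rolling_aggregation const&>> requests;
auto agg = cudf::make_sum_aggregation<rolling_aggregation>();
requests.emplace_back(value_column, *agg);
auto result = grouped_range_rolling_window(group_keys,
                                           orderby_column,
                                           order::ASCENDING,
                                           null_order::BEFORE,  // caller states where nulls were sorted
                                           bounded_closed{preceding_delta},
                                           unbounded{},
                                           /*min_periods=*/1,
                                           std::move(requests));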

@wence-:

These are no longer required.

Comment on lines +55 to +59
enum class window_tag : std::int8_t {
  BOUNDED_OPEN,    ///< Window is bounded by a value-based endpoint; the endpoint is excluded.
  BOUNDED_CLOSED,  ///< Window is bounded by a value-based endpoint; the endpoint is included.
  UNBOUNDED,       ///< Window runs to the beginning (or end) of the group the row is in.
  CURRENT_ROW,     ///< Window contains all rows that compare equal to the current row.
@wence-:

I suppose I can actually use the public types and do constexpr dispatch via is_same_v rather than comparing against a compile-time-known enum tag.
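Something like this hedged sketch, dispatching on the variant's public alternatives (the alternative names match the snippet further down; the surrounding code is an assumption):

std::visit(
  [&](auto const& window) {
    using W = std::decay_t<decltype(window)>;
    if constexpr (std::is_same_v<W, unbounded>) {
      // run to the group boundary
    } else if constexpr (std::is_same_v<W, current_row>) {
      // include rows that compare equal to the current row
    } else if constexpr (std::is_same_v<W, bounded_closed>) {
      // value-based endpoint, included
    } else {  // bounded_open
      // value-based endpoint, excluded
    }
  },
  window_type);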

Comment on lines +80 to +95
[[nodiscard]] __device__ constexpr cuda::std::
  tuple<size_type, size_type, size_type, size_type, size_type, size_type, size_type>
  row_info(size_type i) const noexcept
{
  if (nulls_at_start) {
    return {null_count, 0, num_rows, 0, null_count, null_count, num_rows};
  } else {
    return {null_count,
            num_rows,
            null_count,
            num_rows - null_count,
            num_rows,
            0,
            num_rows - null_count};
  }
}
@wence-:

I should go back and try again, but I had what seemed to be miscompilation issues if I exposed all of these values via individual function calls instead of returning everything all at once.

Comment on lines +151 to +153
 * Spark requires that orderby columns with floating-point type have a
 * total order on floats in which all NaNs compare equal to one another,
 * and greater than any non-NaN value. These structs implement that logic.
@wence-:

@mythrocks to confirm that I have this behaviour correct.
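For concreteness, a minimal device-side sketch of a comparator with that total order (an illustration only, not the PR's structs):

struct nan_aware_less {
  __device__ bool operator()(double x, double y) const noexcept
  {
    if (cuda::std::isnan(x)) { return false; }  // NaN is never less: NaNs tie with NaNs and beat non-NaNs
    if (cuda::std::isnan(y)) { return true; }   // any non-NaN value is less than NaN
    return x < y;
  }
};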

Comment on lines +410 to +427
auto get_window_type = [](range_window_bounds const& bound) -> range_window_type {
  if (bound.is_unbounded()) {
    return unbounded{};
  } else if (bound.is_current_row()) {
    return current_row{};
  } else {
    return bounded_closed{bound.range_scalar()};
  }
};
auto [preceding_column, following_column] =
  make_range_windows(group_keys,
                     order_by_column,
                     order,
                     get_window_type(preceding),
                     get_window_type(following),
                     stream,
                     cudf::get_current_device_resource_ref());

@wence-:

Polyfill to migrate the old interface to the new one. I propose deprecating this API in favour of the one where the caller passes range_window_types and, additionally, the null_order.

Comment on lines +96 to +100
[[nodiscard]] null_order deduce_null_order(column_view const& orderby,
                                           order order,
                                           rmm::device_uvector<size_type> const& offsets,
                                           rmm::device_uvector<size_type> const& per_group_nulls,
                                           rmm::cuda_stream_view stream)
@wence-:

This is needed to polyfill the old interface to the new one.

Comment on lines +57 to +84
{
  auto d_orderby        = column_device_view::create(orderby, stream);
  auto const num_groups = offsets.size() - 1;
  std::size_t bytes{0};
  // Iterator yielding 1 for null rows and 0 for valid rows; a segmented sum
  // over it gives the null count of each group.
  auto is_null_it = cudf::detail::make_counting_transform_iterator(
    cudf::size_type{0}, [orderby = *d_orderby] __device__(size_type i) -> size_type {
      return static_cast<size_type>(orderby.is_null_nocheck(i));
    });
  rmm::device_uvector<cudf::size_type> null_counts{num_groups, stream};
  // First CUB call only queries the required temporary storage size.
  cub::DeviceSegmentedReduce::Sum(nullptr,
                                  bytes,
                                  is_null_it,
                                  null_counts.begin(),
                                  num_groups,
                                  offsets.begin(),
                                  offsets.begin() + 1,
                                  stream.value());
  auto tmp = rmm::device_buffer(bytes, stream);
  // Second call performs the per-group segmented sum.
  cub::DeviceSegmentedReduce::Sum(tmp.data(),
                                  bytes,
                                  is_null_it,
                                  null_counts.begin(),
                                  num_groups,
                                  offsets.begin(),
                                  offsets.begin() + 1,
                                  stream.value());
  return null_counts;
}
@wence-:

This is the major performance improvement in this PR. It makes it much faster to find the per-group null counts in a grouped rolling window with a low-cardinality group key and nulls in the orderby column (in bad cases the old code would take more than a second; this new code takes a few milliseconds).

Comment on lines +198 to +205
std::pair<std::unique_ptr<column>, std::unique_ptr<column>> make_range_windows(
  table_view const& group_keys,
  column_view const& orderby,
  order order,
  range_window_type preceding,
  range_window_type following,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr)
@wence-:

This function goes away once the old APIs are deprecated.

Comment on lines +167 to +173
auto deduced_null_order = [&]() {
  if (null_order.has_value()) { return null_order.value(); }
  if (!orderby.has_nulls()) {
    // Doesn't matter in this case
    return null_order::BEFORE;
  }
  return deduce_null_order(orderby, order, offsets, per_group_nulls, stream);
@wence-:

This also goes away.

Labels: CMake (CMake build issue) · improvement (Improvement / enhancement to an existing function) · libcudf (Affects libcudf (C++/CUDA) code) · non-breaking (Non-breaking change)
Projects: None yet
2 participants