[CHIP-11] UnionJoin Algorithim #1054
Draft
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
CHIP-11: UnionJoin for Sequence Generation
_Author: Abby Whittier, Nikhil Simha
This CHIP introduces UnionJoin, an optimized join algorithm for sequence feature generation in Chronon.
Motivation
Sequence features are a critical component of modern machine learning systems, but generating them efficiently at scale remains challenging. The standard join approach processes each label event independently, leading to redundant computation when generating sequences for multiple events with overlapping time windows.
The primary goal of this CHIP is to introduce an optimized join algorithm that takes advantage of the natural grouping in sequential data to achieve order-of-magnitude performance improvements for sequence feature generation workloads.
Code Reference
Shout out to @nikhil-zlai for the implementation!
Most of my work was massaging into the Airbnb commit history.
https://github.com/zipline-ai/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/join/UnionJoin.scala
How It Works
What is a UnionJoin?
A UnionJoin is an incremental compute algorithm that takes advantage of the natural grouping of sequential data.
Traditional Join Approach:
In a normal join algorithm, each label event, denoted as (user, timestamp, label), is processed independently. Each event joins with the feature event stream, and the corresponding feature data is scattered across different rows, creating a cross product. This is reliable and generic, but leads to the same sequences being recomputed multiple times.
For example, if you have 1000 label events for the same user within a 30-day window, the scatter-join will compute overlapping 30-day sequences 1000 times, with massive redundancy. You can think of this like an O(N²) solution to a sliding window problem on LeetCode.
UnionJoin Approach:
A UnionJoin gathers all related keys, both on the label side and feature side, into a single partition. We then use a sliding window aggregation algorithm to generate sequences in O(N) time and space complexity using queue/dequeue operations.
Dealing with Skew
The implementation is very sensitive to skewed keys, such as skewed queries/skewed rows.
Adaptive Query Engine (AQE) was a significant help in handling skew by dynamically splitting large partitions.
PR here: #1001
Incremental aggregation will be the primary next step for handling very long windows and high-cardinality keys.
Limitations
Results
UnionJoin was 5-10x faster than the existing join algorithm for sequence feature generation within Netlfix compared to the existing.
UX
We expose a spark param for which join algorithm to use. It is configured in
teams.jsonOnce we land incremental aggregation, we hope to use UnionJoin exclusively.