Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(api): refactor the implementation of windowing #9200

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

chloeh13q
Copy link
Contributor

@chloeh13q chloeh13q commented May 15, 2024

Description of changes

Some design considerations:

  • Removed the original implementation of windowing to avoid confusion.
  • I thought about creating a window class to hold all the relevant windowing info like window type, window size, window slide, etc. The API would be slightly cleaner that way but I worry that it would be confused with the existing ibis.window().
  • Tumble and hop are the most basic types of windows that we should support across the board. Session windows and cumulate windows are not supported in every backend and the API is slightly different.

The new API:

>>> import ibis
>>> from ibis import _
>>> t = ibis.table(
...     ibis.schema(
...         {
...             "createTime": "timestamp(3)",
...             "orderId": "int64",
...             "payAmount": "float64",
...             "payPlatform": "int32",
...             "provinceId": "int32",
...         }
...     ),
...     name="payment_msg",
... )
>>> expr = (
...         t.window_by(time_col="createTime")
...         .tumble(size=ibis.interval(seconds=30))
...         .agg(by=["provinceId"], avgPayAmount=_.payAmount.mean())
...     )
>>> expr
r0 := UnboundTable: payment_msg
  createTime  timestamp(3)
  orderId     int64
  payAmount   float64
  payPlatform int32
  provinceId  int32

WindowAggregate[r0]
  window_type:
    tumble
  time_col:
    r0.createTime
  groups:
    provinceId: r0.provinceId
  metrics:
    avgPayAmount: Mean(r0.payAmount)
  window_size:
    30 s
  schema:
    window_start timestamp
    window_end   timestamp
    provinceId   int32
    avgPayAmount float64

Issues closed

#8847

@chloeh13q chloeh13q force-pushed the refactor/windowing branch 3 times, most recently from fc5d2fd to 25fe86d Compare May 31, 2024 21:51
@chloeh13q chloeh13q marked this pull request as ready for review June 5, 2024 19:52
@cpcloud cpcloud added this to the 9.2 milestone Jun 13, 2024
@chloeh13q chloeh13q force-pushed the refactor/windowing branch 2 times, most recently from 00e09e8 to 6f14e43 Compare July 1, 2024 23:24
@cpcloud
Copy link
Member

cpcloud commented Jul 5, 2024

@chloeh13q This LGTM! Just to clarify, are there any user-facing API changes, or is it only internals that have changed?

@cpcloud cpcloud added refactor Issues or PRs related to refactoring the codebase streaming Issue related to streaming APIs or backends labels Jul 5, 2024
@chloeh13q
Copy link
Contributor Author

chloeh13q commented Jul 9, 2024

@cpcloud Thanks for reviewing!

are there any user-facing API changes, or is it only internals that have changed?

There are user-facing API changes, although the changes are pretty subtle. We're changing:

expr = (
    t.window_by(time_col="i")
    .tumble(window_size=ibis.interval(minutes=15))
    .group_by(["window_start", "window_end", "g"])
    .aggregate(mean=_.d.mean())
)

to something like

expr = (
    t.window_by(time_col="i")
    .tumble(window_size=ibis.interval(minutes=15))
    .group_by(["g"])
    .aggregate(mean=_.d.mean())
)

The big difference is that we're trying to hide columns like "window_start" "window_end" from the user (as well as providing more alternative syntax in a more Ibis-like manner, e.g., you can use both .agg and .aggregate, and you can put the groupby into aggregate as well now).

[ps: only implemented windowed aggregations so far to see if we like this new API and IR changes. Planning to implement window top-n, window join, and cascading windows (aka chained window aggregations) (note that we didn't have alternative syntax for these previously, but you could do them with underlying ops like row number, so these are going to be new APIs) in subsequent PRs]

With changes merged from another PR, I was able to add tests for window aggregations in the Spark backend too.

@chloeh13q chloeh13q requested review from cpcloud and jcrist July 9, 2024 21:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactor Issues or PRs related to refactoring the codebase streaming Issue related to streaming APIs or backends
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants