Merged
108 changes: 99 additions & 9 deletions docs/how_it_works.md
Original file line number Diff line number Diff line change
Expand Up @@ -272,7 +272,104 @@ print((pn.col("a") + 1).mean())
For simple aggregations, Narwhals can just look at `_depth` and `function_name` and figure out
which (efficient) elementary operation this corresponds to in pandas.

## Broadcasting
## Expression Metadata

Let's try printing out a few expressions to the console to see what they show us:

```python exec="1" result="python" session="metadata" source="above"
import narwhals as nw

print(nw.col("a"))
print(nw.col("a").mean())
print(nw.col("a").mean().over("b"))
```

Note how they tell us something about their metadata. This section is all about
making sense of what that all means, what the rules are, and what it enables.

### Expression kinds

Each Narwhals expression can be of one of the following kinds:

- `LITERAL`: expressions which correspond to literal values, such as the `3` in `nw.col('a')+3`.
- `AGGREGATION`: expressions which reduce a column to a single value (e.g. `nw.col('a').mean()`).
- `TRANSFORM`: expressions which don't change length (e.g. `nw.col('a').abs()`).
- `WINDOW`: like `TRANSFORM`, but the last operation is a (row-order-dependent)
window function (`rolling_*`, `cum_*`, `diff`, `shift`, `is_*_distinct`).
- `FILTRATION`: expressions which change length but don't
aggregate (e.g. `nw.col('a').drop_nulls()`).

For example:

- `nw.col('a')` is not order-dependent, so it's `TRANSFORM`.
- `nw.col('a').abs()` is not order-dependent, so it's a `TRANSFORM`.
- `nw.col('a').cum_sum()`'s last operation is `cum_sum`, so it's `WINDOW`.
- `nw.col('a').cum_sum() + 1`'s last operation is `__add__`, and it preserves
the input dataframe's length, so it's a `TRANSFORM`.

How these change depends on the operation.

#### Chaining

Say we have `expr.expr_method()`. How does `expr`'s `ExprMetadata` change?
This depends on `expr_method`.
Comment on lines +310 to +315
Member


I'm not quite sure what to suggest, but some notes

  • Maybe the sentence before Chaining is redundant?
  • After Chaining I think there's a few too many expr/Expr 😅


- Element-wise expressions such as `abs`, `alias`, `cast`, `__invert__`, and
many more, preserve the input kind, unless `expr` is a `WINDOW`, in
which case the result is a `TRANSFORM`. This is because for an expression
to be `WINDOW`, its last operation needs to be the order-dependent one.
- `rolling_*`, `cum_*`, `diff`, `shift`, `ewm_mean`, and `is_*_distinct`
are window functions and result in `WINDOW`.
- `mean`, `std`, `median`, and other aggregations result in `AGGREGATION`,
and can only be applied to `TRANSFORM` and `WINDOW`.
- `drop_nulls` and `filter` result in `FILTRATION`, and can only be applied
to `TRANSFORM` and `WINDOW`.
- `over` always results in `TRANSFORM`. This is a bit more complicated,
so we elaborate on it in the ["You open a window ..."](#you-open-a-window-to-another-window-to-another-window-to-another-window) section.
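
The chaining rules above can be sketched with a toy model. This is only an illustration of the documented rules, not Narwhals' implementation: the `Kind` enum, the `chain` helper, the method sets, and the `ValueError` raised for disallowed chains are all assumptions made for this example.

```python
from enum import Enum, auto


class Kind(Enum):
    """Toy stand-in for the expression kinds listed above."""

    LITERAL = auto()
    AGGREGATION = auto()
    TRANSFORM = auto()
    WINDOW = auto()
    FILTRATION = auto()


# Small, non-exhaustive samples of each method family (illustrative only).
ELEMENTWISE = {"abs", "alias", "cast", "__invert__"}
WINDOW_FUNCTIONS = {"cum_sum", "diff", "shift", "ewm_mean"}
AGGREGATIONS = {"mean", "std", "median"}
FILTRATIONS = {"drop_nulls", "filter"}


def chain(kind: Kind, method: str) -> Kind:
    """Return the kind of `expr.method()` given the kind of `expr`."""
    if method in ELEMENTWISE:
        # A WINDOW stops being one once another operation follows it.
        return Kind.TRANSFORM if kind is Kind.WINDOW else kind
    if method in WINDOW_FUNCTIONS:
        return Kind.WINDOW
    if method in AGGREGATIONS or method in FILTRATIONS:
        if kind not in {Kind.TRANSFORM, Kind.WINDOW}:
            # Hypothetical error type and message, chosen for this sketch.
            msg = f"`{method}` cannot follow a {kind.name} expression"
            raise ValueError(msg)
        return Kind.AGGREGATION if method in AGGREGATIONS else Kind.FILTRATION
    if method == "over":
        return Kind.TRANSFORM
    raise NotImplementedError(method)


# Trace `nw.col('a').cum_sum().abs().mean()`, starting from a TRANSFORM selector.
kind = Kind.TRANSFORM
trace = []
for method in ["cum_sum", "abs", "mean"]:
    kind = chain(kind, method)
    trace.append(kind.name)
print(trace)  # ['WINDOW', 'TRANSFORM', 'AGGREGATION']
```

Modelling each rule as a function from input kind to output kind keeps the "last operation decides" behaviour of `WINDOW` explicit: the kind is recomputed at every step rather than accumulated.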

#### Binary operations (e.g. `nw.col('a') + nw.col('b')`)

How do expression kinds change under binary operations? For example,
if we do `expr1 + expr2`, then what can we say about the output kind?
The rules are:

- If both are `LITERAL`, then the output is `LITERAL`.
- If one is a `FILTRATION`, then:

- if the other is `LITERAL` or `AGGREGATION`, then the output is `FILTRATION`.
- else, we raise an error.

- If one is `TRANSFORM` or `WINDOW` and the other is not `FILTRATION`,
then the output is `TRANSFORM`.
- If one is `AGGREGATION` and the other is `LITERAL` or `AGGREGATION`,
the output is `AGGREGATION`.

For n-ary operations such as `nw.sum_horizontal`, the above logic is
extended across inputs. For example, `nw.sum_horizontal(expr1, expr2, expr3)`
is `LITERAL` if all of `expr1`, `expr2`, and `expr3` are.
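
As a rough sketch of the rules above, extended to n-ary inputs: the `Kind` enum and `combine` function below are hypothetical names for this example, and the `ValueError` is an assumed error type. The real logic lives in `combine_metadata` and also tracks expansion and window kinds.

```python
from enum import Enum, auto


class Kind(Enum):
    """Toy stand-in for the expression kinds listed above."""

    LITERAL = auto()
    AGGREGATION = auto()
    TRANSFORM = auto()
    WINDOW = auto()
    FILTRATION = auto()


def combine(*kinds: Kind) -> Kind:
    """Return the output kind of an n-ary operation over `kinds`."""
    if all(k is Kind.LITERAL for k in kinds):
        return Kind.LITERAL
    n_filtrations = sum(k is Kind.FILTRATION for k in kinds)
    has_transforms_or_windows = any(
        k in {Kind.TRANSFORM, Kind.WINDOW} for k in kinds
    )
    if n_filtrations:
        # A filtration may only be combined with literals and aggregations.
        if n_filtrations > 1 or has_transforms_or_windows:
            # Hypothetical error type and message, chosen for this sketch.
            msg = "cannot combine a length-changing expression with this input"
            raise ValueError(msg)
        return Kind.FILTRATION
    if has_transforms_or_windows:
        return Kind.TRANSFORM
    # Only aggregations, possibly mixed with literals, remain.
    return Kind.AGGREGATION


print(combine(Kind.WINDOW, Kind.AGGREGATION).name)  # TRANSFORM
print(combine(Kind.AGGREGATION, Kind.LITERAL).name)  # AGGREGATION
print(combine(Kind.FILTRATION, Kind.LITERAL).name)  # FILTRATION
```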
Comment on lines +330 to +349
Member


Don't worry about it for this PR, but I feel like this could really benefit from some kind of visual aid.

Not sure what format would make the most sense - just wanted to note it as tough to keep track of in my head 😔


### "You open a window to another window to another window to another window"

When we print out an expression, in addition to the expression kind,
we also see `window_kind`. There are four window kinds:

- `NONE`: non-order-dependent operations, like `.abs()` or `.mean()`.
- `CLOSEABLE`: expression where the last operation is order-dependent. For
example, `nw.col('a').diff()`.
- `UNCLOSEABLE`: expression where some operation is order-dependent but
the order-dependent operation wasn't the last one. For example,
`nw.col('a').diff().abs()`.
- `CLOSED`: expression contains `over` at some point, and any order-dependent
operation was immediately followed by `over(order_by=...)`.

When working with `DataFrame`s, row order is well-defined, as the dataframes
are assumed to be eager and in-memory. Therefore, all window kinds are
allowed.

When working with `LazyFrame`s, on the other hand, row order is undefined.
Therefore, the window kind must be either `NONE` or `CLOSED`.
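
The transitions between window kinds can be sketched as a small state machine. The enum members mirror the four kinds above, but the helper functions are hypothetical names for this example, and only the transitions described in this section are modelled. Anything else raises `NotImplementedError`.

```python
from enum import Enum, auto


class WindowKind(Enum):
    NONE = auto()
    CLOSEABLE = auto()
    UNCLOSEABLE = auto()
    CLOSED = auto()


def after_window_function(kind: WindowKind) -> WindowKind:
    """Apply an order-dependent operation such as `diff` or `cum_sum`."""
    if kind is WindowKind.NONE:
        return WindowKind.CLOSEABLE
    if kind is WindowKind.CLOSED:
        # Not covered by this sketch.
        raise NotImplementedError(kind)
    # An earlier order-dependent operation is no longer the last one.
    return WindowKind.UNCLOSEABLE


def after_elementwise(kind: WindowKind) -> WindowKind:
    """Apply an element-wise operation such as `abs`."""
    return WindowKind.UNCLOSEABLE if kind is WindowKind.CLOSEABLE else kind


def after_over_order_by(kind: WindowKind) -> WindowKind:
    """Apply `over(order_by=...)` directly after a window function."""
    if kind is WindowKind.CLOSEABLE:
        return WindowKind.CLOSED
    # Other inputs are out of scope for this sketch.
    raise NotImplementedError(kind)


def allowed_in_lazyframe(kind: WindowKind) -> bool:
    """Row order is undefined for `LazyFrame`, so open windows are rejected."""
    return kind in {WindowKind.NONE, WindowKind.CLOSED}


diff = after_window_function(WindowKind.NONE)  # nw.col('a').diff()
diff_abs = after_elementwise(diff)  # nw.col('a').diff().abs()
diff_over = after_over_order_by(diff)  # nw.col('a').diff().over(order_by='i')

for kind in (diff, diff_abs, diff_over):
    print(kind.name, allowed_in_lazyframe(kind))
# CLOSEABLE False
# UNCLOSEABLE False
# CLOSED True
```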

### Broadcasting

When performing comparisons between columns and aggregations or scalars, we operate as if the
aggregation or scalar was broadcasted to the length of the whole column. For example, if we
Expand All @@ -282,14 +379,7 @@ with values `[-1, 0, 1]`.

Different libraries do broadcasting differently. SQL-like libraries require an empty window
function for expressions (e.g. `a - sum(a) over ()`), Polars does its own broadcasting of
length-1 Series, and pandas does its own broadcasting of scalars. Narwhals keeps track of
when to trigger a broadcast by tracking the `ExprKind` of each expression. `ExprKind` is an
`Enum` with four variants:

- `TRANSFORM`: expressions which don't change length (e.g. `nw.col('a').abs()`).
- `AGGREGATION`: expressions which reduce a column to a single value (e.g. `nw.col('a').mean()`).
- `CHANGE_LENGTH`: expressions which change length but don't necessarily aggregate (e.g. `nw.col('a').drop_nulls()`).
- `LITERAL`: expressions which correspond to literal values, such as the `3` in `nw.col('a')+3`.
length-1 Series, and pandas does its own broadcasting of scalars.

Narwhals triggers a broadcast in these situations:

128 changes: 101 additions & 27 deletions narwhals/_expression_parsing.py
Expand Up @@ -121,7 +121,7 @@ class ExprKind(Enum):
- LITERAL vs LITERAL -> LITERAL
- FILTRATION vs (LITERAL | AGGREGATION) -> FILTRATION
- FILTRATION vs (FILTRATION | TRANSFORM | WINDOW) -> raise
- (TRANSFORM | WINDOW) vs (LITERAL | AGGREGATION) -> TRANSFORM
- (TRANSFORM | WINDOW) vs (...) -> TRANSFORM
- AGGREGATION vs (LITERAL | AGGREGATION) -> AGGREGATION
"""

Expand Down Expand Up @@ -191,30 +191,64 @@ def is_multi_output(
return expansion_kind in {ExpansionKind.MULTI_NAMED, ExpansionKind.MULTI_UNNAMED}


class WindowKind(Enum):
"""Describe what kind of window the expression contains."""

NONE = auto()
"""e.g. `nw.col('a').abs()`, no windows."""

CLOSEABLE = auto()
"""e.g. `nw.col('a').cum_sum()` - can be closed if immediately followed by `over(order_by=...)`."""

UNCLOSEABLE = auto()
"""e.g. `nw.col('a').cum_sum().abs()` - the window function (`cum_sum`) wasn't immediately followed by
`over(order_by=...)`, and so the window is uncloseable.

Uncloseable windows can be used freely in `nw.DataFrame`, but not in `nw.LazyFrame` where
row-order is undefined."""

CLOSED = auto()
"""e.g. `nw.col('a').cum_sum().over(order_by='i')`."""

def is_open(self) -> bool:
return self in {WindowKind.UNCLOSEABLE, WindowKind.CLOSEABLE}

def is_closed(self) -> bool:
return self is WindowKind.CLOSED

def is_uncloseable(self) -> bool:
return self is WindowKind.UNCLOSEABLE


class ExprMetadata:
__slots__ = ("_expansion_kind", "_kind", "_n_open_windows")
__slots__ = ("_expansion_kind", "_kind", "_window_kind")

def __init__(
self, kind: ExprKind, /, *, n_open_windows: int, expansion_kind: ExpansionKind
self,
kind: ExprKind,
/,
*,
window_kind: WindowKind,
expansion_kind: ExpansionKind,
) -> None:
self._kind: ExprKind = kind
self._n_open_windows = n_open_windows
self._window_kind = window_kind
self._expansion_kind = expansion_kind

def __init_subclass__(cls, /, *args: Any, **kwds: Any) -> Never: # pragma: no cover
msg = f"Cannot subclass {cls.__name__!r}"
raise TypeError(msg)

def __repr__(self) -> str:
return f"ExprMetadata(kind: {self._kind}, n_open_windows: {self._n_open_windows}, expansion_kind: {self._expansion_kind})"
return f"ExprMetadata(kind: {self._kind}, window_kind: {self._window_kind}, expansion_kind: {self._expansion_kind})"

@property
def kind(self) -> ExprKind:
return self._kind

@property
def n_open_windows(self) -> int:
return self._n_open_windows
def window_kind(self) -> WindowKind:
return self._window_kind

@property
def expansion_kind(self) -> ExpansionKind:
Expand All @@ -223,50 +257,77 @@ def expansion_kind(self) -> ExpansionKind:
def with_kind(self, kind: ExprKind, /) -> ExprMetadata:
"""Change metadata kind, leaving all other attributes the same."""
return ExprMetadata(
kind, n_open_windows=self._n_open_windows, expansion_kind=self._expansion_kind
kind,
window_kind=self._window_kind,
expansion_kind=self._expansion_kind,
)

def with_extra_open_window(self) -> ExprMetadata:
"""Increment `n_open_windows` leaving other attributes the same."""
def with_uncloseable_window(self) -> ExprMetadata:
"""Add uncloseable window, leaving other attributes the same."""
if self._window_kind is WindowKind.CLOSED: # pragma: no cover
msg = "Unreachable code, please report a bug."
raise AssertionError(msg)
return ExprMetadata(
self.kind,
n_open_windows=self._n_open_windows + 1,
window_kind=WindowKind.UNCLOSEABLE,
expansion_kind=self._expansion_kind,
)

def with_kind_and_closeable_window(self, kind: ExprKind, /) -> ExprMetadata:
"""Change metadata kind and add closeable window.

If we already have an uncloseable window, the window stays uncloseable.
"""
if self._window_kind is WindowKind.NONE:
window_kind = WindowKind.CLOSEABLE
elif self._window_kind is WindowKind.CLOSED: # pragma: no cover
msg = "Unreachable code, please report a bug."
raise AssertionError(msg)
else:
window_kind = WindowKind.UNCLOSEABLE
return ExprMetadata(
kind,
window_kind=window_kind,
expansion_kind=self._expansion_kind,
)

def with_kind_and_extra_open_window(self, kind: ExprKind, /) -> ExprMetadata:
"""Change metadata kind and increment `n_open_windows`."""
def with_kind_and_uncloseable_window(self, kind: ExprKind, /) -> ExprMetadata:
"""Change metadata kind and set window kind to uncloseable."""
return ExprMetadata(
kind,
n_open_windows=self._n_open_windows + 1,
window_kind=WindowKind.UNCLOSEABLE,
expansion_kind=self._expansion_kind,
)

@staticmethod
def simple_selector() -> ExprMetadata:
def selector_single() -> ExprMetadata:
# e.g. `nw.col('a')`, `nw.nth(0)`
return ExprMetadata(
ExprKind.TRANSFORM, n_open_windows=0, expansion_kind=ExpansionKind.SINGLE
ExprKind.TRANSFORM,
window_kind=WindowKind.NONE,
expansion_kind=ExpansionKind.SINGLE,
)

@staticmethod
def multi_output_selector_named() -> ExprMetadata:
def selector_multi_named() -> ExprMetadata:
# e.g. `nw.col('a', 'b')`
return ExprMetadata(
ExprKind.TRANSFORM, n_open_windows=0, expansion_kind=ExpansionKind.MULTI_NAMED
ExprKind.TRANSFORM,
window_kind=WindowKind.NONE,
expansion_kind=ExpansionKind.MULTI_NAMED,
)

@staticmethod
def multi_output_selector_unnamed() -> ExprMetadata:
def selector_multi_unnamed() -> ExprMetadata:
# e.g. `nw.all()`
return ExprMetadata(
ExprKind.TRANSFORM,
n_open_windows=0,
window_kind=WindowKind.NONE,
expansion_kind=ExpansionKind.MULTI_UNNAMED,
)


def combine_metadata(
def combine_metadata( # noqa: PLR0915
*args: IntoExpr | object | None,
str_as_lit: bool,
allow_multi_output: bool,
Expand All @@ -285,8 +346,10 @@ def combine_metadata(
has_transforms_or_windows = False
has_aggregations = False
has_literals = False
result_n_open_windows = 0
result_expansion_kind = ExpansionKind.SINGLE
has_closeable_windows = False
has_uncloseable_windows = False
has_closed_windows = False

for i, arg in enumerate(args):
if isinstance(arg, str) and not str_as_lit:
Expand All @@ -307,8 +370,6 @@ def combine_metadata(
result_expansion_kind = resolve_expansion_kind(
result_expansion_kind, arg._metadata.expansion_kind
)
if arg._metadata.n_open_windows:
result_n_open_windows += 1
kind = arg._metadata.kind
if kind is ExprKind.AGGREGATION:
has_aggregations = True
Expand All @@ -322,6 +383,14 @@ def combine_metadata(
msg = "unreachable code"
raise AssertionError(msg)

window_kind = arg._metadata.window_kind
if window_kind is WindowKind.UNCLOSEABLE:
has_uncloseable_windows = True
elif window_kind is WindowKind.CLOSEABLE:
has_closeable_windows = True
elif window_kind is WindowKind.CLOSED:
has_closed_windows = True

if (
has_literals
and not has_aggregations
Expand All @@ -342,10 +411,15 @@ def combine_metadata(
else:
result_kind = ExprKind.AGGREGATION

if has_uncloseable_windows or has_closeable_windows:
result_window_kind = WindowKind.UNCLOSEABLE
elif has_closed_windows:
result_window_kind = WindowKind.CLOSED
else:
result_window_kind = WindowKind.NONE

return ExprMetadata(
result_kind,
n_open_windows=result_n_open_windows,
expansion_kind=result_expansion_kind,
result_kind, window_kind=result_window_kind, expansion_kind=result_expansion_kind
)


2 changes: 1 addition & 1 deletion narwhals/dataframe.py
Expand Up @@ -2152,7 +2152,7 @@ def _extract_compliant(self: Self, arg: Any) -> Any:
plx = self.__narwhals_namespace__()
return plx.col(arg)
if isinstance(arg, Expr):
if arg._metadata.n_open_windows > 0:
if arg._metadata._window_kind.is_open():
msg = (
"Order-dependent expressions are not supported for use in LazyFrame.\n\n"
"Hints:\n"