feat: support window operations for DuckDB#2263
Conversation
56a7eb5 to
ba5b646
Compare
bc4da37 to
8235348
Compare
f16970d to
f2cc84d
Compare
314fa79 to
35e603c
Compare
FBruzzesi
left a comment
There was a problem hiding this comment.
@MarcoGorelli this was an amazing streaming! I left two nitpick comments and a suggestion for whoever wants to have some fun working on #2174 🥦
narwhals/_duckdb/expr.py
Outdated
| from typing import Sequence | ||
| from typing import cast | ||
|
|
||
| import duckdb |
There was a problem hiding this comment.
Should we follow the same pattern as below and directly import from duckdb import SQLExpression?
narwhals/_duckdb/expr.py
Outdated
| if reverse: | ||
| order_by_sql = "order by " + ", ".join( | ||
| f'"{x}" desc nulls last' for x in order_by | ||
| ) | ||
| else: | ||
| order_by_sql = "order by " + ", ".join( | ||
| f'"{x}" asc nulls first' for x in order_by | ||
| ) | ||
| if partition_by: | ||
| partition_by_sql = "partition by " + ",".join( | ||
| f'"{x}"' for x in partition_by | ||
| ) | ||
| else: | ||
| partition_by_sql = "" | ||
| sql = f"sum ({_input}) over ({partition_by_sql} {order_by_sql} rows between unbounded preceding and current row)" | ||
| return duckdb.SQLExpression(sql) |
There was a problem hiding this comment.
I can see a lot of re-usability if everything here for #2174 - only "word" to input is "sum" and other operations 🙌🏼
There was a problem hiding this comment.
yup - let's generalise when we add more of them
| - name: install duckdb nightly | ||
| run: uv pip install -U --pre duckdb --system |
There was a problem hiding this comment.
Should we wait for the duckdb release before merging this PR? You are ahead of them 😂
There was a problem hiding this comment.
nah let's be one step ahead so when they release we're ready 🔥
narwhals/_duckdb/expr.py
Outdated
| ) -> duckdb.Expression: | ||
| if reverse: | ||
| order_by_sql = "order by " + ", ".join( | ||
| f'"{x}" desc nulls last' for x in order_by |
There was a problem hiding this comment.
@MarcoGorelli I commented this on the stream, not sure it was ever tried out, anyways now in context 🙂
| f'"{x}" desc nulls last' for x in order_by | |
| f'{x!r} desc nulls last' for x in order_by |
There was a problem hiding this comment.
i don't think that's the same, we need double quotes around the column names as that's what sql expects (single quotes are string literals)
There was a problem hiding this comment.
i don't think that's the same, we need double quotes around the column names as that's what sql expects (single quotes are string literals)
Does it not respect the outer quotes?
I would've thought these two used different internal ones:
f'{x!r} desc nulls last' for x in order_by
f"{x!r} desc nulls last" for x in order_by| if self._backend_version < (1, 3): | ||
| msg = "At least version 1.3 of DuckDB is required for `over` operation." | ||
| raise NotImplementedError(msg) | ||
| if (window_function := self._window_function) is not None: |
There was a problem hiding this comment.
AFAIK, you only need the is not None when the type overrides __bool__.
Going to guess this is from experience battling pandas, polars, and (maybe) numpy who all tell you off for using __bool__
| if (window_function := self._window_function) is not None: | |
| if (window_function := self._window_function): |
| class WindowFunction(Protocol): | ||
| def __call__( | ||
| self, | ||
| _input: duckdb.Expression, | ||
| partition_by: Sequence[str], | ||
| order_by: Sequence[str], | ||
| ) -> duckdb.Expression: ... |
There was a problem hiding this comment.
I've been meaning to ask about this for a while now.
Is that naming convention intended to signal positional-only for _input?
It seems similar to a convention that was common before PEP 570 – Python Positional-Only Parameters.
Current
class WindowFunction(Protocol):
def __call__(
self,
_input: duckdb.Expression,
partition_by: Sequence[str],
order_by: Sequence[str],
) -> duckdb.Expression: ...Before PEP 570
class WindowFunction(Protocol):
def __call__(
self,
__input: duckdb.Expression,
partition_by: Sequence[str],
order_by: Sequence[str],
) -> duckdb.Expression: ...After PEP 570
class WindowFunction(Protocol):
def __call__(
self,
input: duckdb.Expression,
/,
partition_by: Sequence[str],
order_by: Sequence[str],
) -> duckdb.Expression: ...There was a problem hiding this comment.
yeah _input wasn't a very good name, i don't think i'd put much thought into it at the time and now it's all over
we could use native_expr in its place?
There was a problem hiding this comment.
we could use
native_exprin its place?
Whatever you feel works 👍
Side note
One of the cool things about positional-only args is that you can use different names - and it still works the same at runtime and to a type checker.
Was a bit of a 🤯 when I learned that.
Here, that might mean you could use expr when it is unambiguous; but native_expr when either compliant_expr or native_expr could be confused
|
thanks all for your reviews! |
| with contextlib.suppress(ImportError): # requires duckdb>=1.3.0 | ||
| from duckdb import SQLExpression | ||
|
|
There was a problem hiding this comment.
@MarcoGorelli this is still causing typing issues outside of CI on a fresh install.
Fix 1
I've been using this locally, but is more of a workaround:
diff --git a/narwhals/_duckdb/expr.py b/narwhals/_duckdb/expr.py
index fd371b73..0ec4681d 100644
--- a/narwhals/_duckdb/expr.py
+++ b/narwhals/_duckdb/expr.py
@@ -41,8 +41,11 @@ if TYPE_CHECKING:
from narwhals.utils import Version
from narwhals.utils import _FullContext
-with contextlib.suppress(ImportError): # requires duckdb>=1.3.0
- from duckdb import SQLExpression
+if not TYPE_CHECKING:
+ with contextlib.suppress(ImportError): # requires duckdb>=1.3.0
+ from duckdb import SQLExpression
+else:
+ from duckdb import Expression as SQLExpression
class DuckDBExpr(LazyExpr["DuckDBLazyFrame", "duckdb.Expression"]):Fix 2
Specifying this requirement in pyproject.toml somewhere.
Not sure on exactly how though
Lines 22 to 24 in de9f375
narwhals/.github/workflows/extremes.yml
Line 185 in de9f375
| if "sqlframe" in str(constructor): | ||
| # https://github.com/eakmanrq/sqlframe/issues/325 | ||
| request.applymarker(pytest.mark.xfail) |
There was a problem hiding this comment.
@MarcoGorelli I'm also needing this since the PR merged 🤔
The comment is the error that started showing up for me, but not in CI
diff --git a/tests/expr_and_series/str/to_datetime_test.py b/tests/expr_and_series/str/to_datetime_test.py
index 412485d0..4ed208f6 100644
--- a/tests/expr_and_series/str/to_datetime_test.py
+++ b/tests/expr_and_series/str/to_datetime_test.py
@@ -14,6 +14,7 @@ from tests.utils import PANDAS_VERSION
from tests.utils import PYARROW_VERSION
from tests.utils import assert_equal_data
from tests.utils import is_pyarrow_windows_no_tzdata
+from tests.utils import is_windows
if TYPE_CHECKING:
from tests.utils import Constructor
@@ -219,6 +220,15 @@ def test_to_datetime_tz_aware(
if "cudf" in str(constructor):
# cuDF does not yet support timezone-aware datetimes
request.applymarker(pytest.mark.xfail)
+ if "sqlframe" in str(constructor) and format is not None and is_windows():
+ #
+ # E duckdb.duckdb.InvalidInputException: Invalid Input Error:
+ # Could not parse string "2020-01-01 01:02:03+0100" according to format specifier "%Y-%-m-%-d %-H:%-M:%-SZ"
+ # E 2020-01-01 01:02:03+0100
+ # E ^
+ # E Error: Literal does not match, expected Z
+ #
+ request.applymarker(pytest.mark.xfail)
context = (
pytest.raises(NotImplementedError)
if any(x in str(constructor) for x in ("duckdb",)) and format is None
Was broken for me locally following narwhals-dev#2263 (comment)
What type of PR is this? (check all applicable)
Related issues
Checklist
If you have comments or can explain your changes, please do so below