
feat: support window operations for DuckDB #2263

Merged
MarcoGorelli merged 21 commits into main from duckdb-over
Mar 22, 2025

Conversation

@MarcoGorelli
Member

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

  • Related issue #<issue number>
  • Closes #<issue number>

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@MarcoGorelli MarcoGorelli added the enhancement New feature or request label Mar 21, 2025
Member

@FBruzzesi FBruzzesi left a comment

@MarcoGorelli this was an amazing stream! I left two nitpick comments and a suggestion for whoever wants to have some fun working on #2174 🥦

from typing import Sequence
from typing import cast

import duckdb
Member

Should we follow the same pattern as below and import it directly (from duckdb import SQLExpression)?

Comment on lines 521 to 536
    if reverse:
        order_by_sql = "order by " + ", ".join(
            f'"{x}" desc nulls last' for x in order_by
        )
    else:
        order_by_sql = "order by " + ", ".join(
            f'"{x}" asc nulls first' for x in order_by
        )
    if partition_by:
        partition_by_sql = "partition by " + ",".join(
            f'"{x}"' for x in partition_by
        )
    else:
        partition_by_sql = ""
    sql = f"sum ({_input}) over ({partition_by_sql} {order_by_sql} rows between unbounded preceding and current row)"
    return duckdb.SQLExpression(sql)
Member

I can see a lot of reusability in everything here for #2174: the only "word" that changes is "sum", which would be swapped for the other operations 🙌🏼

Member Author

yup - let's generalise when we add more of them
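
For reference, a rough sketch of what that generalisation could look like. The helper name window_sql and its parameters are hypothetical and not part of this PR; it only builds the SQL string, so wrapping the result in duckdb.SQLExpression is left out and the sketch runs without DuckDB installed.

```python
from typing import Sequence


def window_sql(
    func: str,
    column_sql: str,
    partition_by: Sequence[str],
    order_by: Sequence[str],
    *,
    reverse: bool = False,
) -> str:
    """Build a cumulative window expression as SQL text (hypothetical helper)."""
    # Same null placement as the hunk above: nulls last when descending,
    # nulls first when ascending.
    direction = "desc nulls last" if reverse else "asc nulls first"
    order_by_sql = "order by " + ", ".join(f'"{x}" {direction}' for x in order_by)
    if partition_by:
        partition_by_sql = "partition by " + ", ".join(f'"{x}"' for x in partition_by)
    else:
        partition_by_sql = ""
    # Only `func` varies between cumulative operations.
    return (
        f"{func} ({column_sql}) over ({partition_by_sql} {order_by_sql} "
        "rows between unbounded preceding and current row)"
    )


cum_sum = window_sql("sum", '"a"', ["g"], ["i"])
cum_max = window_sql("max", '"a"', [], ["i"], reverse=True)
```

With this shape, each cumulative operation would differ only in the func argument it passes.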

Comment on lines +87 to +88
    - name: install duckdb nightly
      run: uv pip install -U --pre duckdb --system
Member

Should we wait for the duckdb release before merging this PR? You are ahead of them 😂

Member Author

nah let's be one step ahead so when they release we're ready 🔥

    ) -> duckdb.Expression:
        if reverse:
            order_by_sql = "order by " + ", ".join(
                f'"{x}" desc nulls last' for x in order_by
Member

@MarcoGorelli I commented this on the stream, not sure it was ever tried out, anyways now in context 🙂

Suggested change
- f'"{x}" desc nulls last' for x in order_by
+ f'{x!r} desc nulls last' for x in order_by

Member Author

i don't think that's the same, we need double quotes around the column names as that's what sql expects (single quotes are string literals)

Member

i don't think that's the same, we need double quotes around the column names as that's what sql expects (single quotes are string literals)

Does it not respect the outer quotes?

I would've thought these two used different internal ones:

f'{x!r} desc nulls last' for x in order_by
f"{x!r} desc nulls last" for x in order_by

    if self._backend_version < (1, 3):
        msg = "At least version 1.3 of DuckDB is required for `over` operation."
        raise NotImplementedError(msg)
    if (window_function := self._window_function) is not None:
Member

@dangotbanned dangotbanned Mar 22, 2025

AFAIK, you only need the is not None when the type overrides __bool__.

Going to guess this is from experience battling pandas, polars, and (maybe) numpy who all tell you off for using __bool__

Suggested change
- if (window_function := self._window_function) is not None:
+ if (window_function := self._window_function):
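
A small illustration of when the two checks differ, using made-up stand-ins rather than the real window function type. Python's truth test consults __bool__ (and falls back to __len__), so a plain function is always truthy, while an object that overrides __bool__ can be falsy without being None.

```python
class AlwaysFalsy:
    """Hypothetical callable whose truth value is False."""

    def __bool__(self) -> bool:
        return False

    def __call__(self) -> str:
        return "called"


def plain() -> str:
    return "called"


def run_if_truthy(fn) -> str:
    # Truthiness check: skips None and anything else that is falsy.
    return fn() if fn else "skipped"


def run_if_not_none(fn) -> str:
    # Identity check: skips only None.
    return fn() if fn is not None else "skipped"


print(run_if_truthy(plain), run_if_not_none(plain))
print(run_if_truthy(AlwaysFalsy()), run_if_not_none(AlwaysFalsy()))
print(run_if_truthy(None), run_if_not_none(None))
```

For ordinary functions and lambdas the two checks behave identically, which is the case the suggestion relies on.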

Comment on lines +10 to +16
    class WindowFunction(Protocol):
        def __call__(
            self,
            _input: duckdb.Expression,
            partition_by: Sequence[str],
            order_by: Sequence[str],
        ) -> duckdb.Expression: ...
Member

I've been meaning to ask about this for a while now.

Is that naming convention intended to signal positional-only for _input?

It seems similar to a convention that was common before PEP 570 – Python Positional-Only Parameters.

Current

    class WindowFunction(Protocol):
        def __call__(
            self,
            _input: duckdb.Expression,
            partition_by: Sequence[str],
            order_by: Sequence[str],
        ) -> duckdb.Expression: ...

Before PEP 570

    class WindowFunction(Protocol):
        def __call__(
            self,
            __input: duckdb.Expression,
            partition_by: Sequence[str],
            order_by: Sequence[str],
        ) -> duckdb.Expression: ...

After PEP 570

    class WindowFunction(Protocol):
        def __call__(
            self,
            input: duckdb.Expression,
            /,
            partition_by: Sequence[str],
            order_by: Sequence[str],
        ) -> duckdb.Expression: ...

Member Author

yeah _input wasn't a very good name, i don't think i'd put much thought into it at the time and now it's all over

we could use native_expr in its place?

Member

@dangotbanned dangotbanned Mar 22, 2025

we could use native_expr in its place?

Whatever you feel works 👍

Side note

One of the cool things about positional-only args is that you can use different names - and it still works the same at runtime and to a type checker.
Was a bit of a 🤯 when I learned that.

Here, that might mean you could use expr when it is unambiguous; but native_expr when either compliant_expr or native_expr could be confused
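
A minimal sketch of the side note above. To keep it runnable without DuckDB, str stands in for duckdb.Expression, and cum_sum is a hypothetical implementation. The protocol names its positional-only parameter expr while the implementation names it native_expr; because of the / marker the name is not part of the call signature.

```python
from typing import Protocol, Sequence


class WindowFunction(Protocol):
    def __call__(
        self,
        expr: str,  # stand-in for duckdb.Expression
        /,
        partition_by: Sequence[str],
        order_by: Sequence[str],
    ) -> str: ...


def cum_sum(
    native_expr: str,  # different name from the protocol's `expr`
    /,
    partition_by: Sequence[str],
    order_by: Sequence[str],
) -> str:
    return f"sum({native_expr})"


# The assignment is accepted because positional-only names are not compared.
fn: WindowFunction = cum_sum
result = fn("a", partition_by=["g"], order_by=["i"])

# Passing the positional-only parameter by keyword fails at runtime.
try:
    cum_sum(native_expr="a", partition_by=["g"], order_by=["i"])  # type: ignore[call-arg]
    keyword_call_failed = False
except TypeError:
    keyword_call_failed = True
```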

@MarcoGorelli MarcoGorelli marked this pull request as ready for review March 22, 2025 12:41
@MarcoGorelli
Member Author

thanks all for your reviews!

@MarcoGorelli MarcoGorelli merged commit 295ee89 into main Mar 22, 2025
27 of 28 checks passed
@MarcoGorelli MarcoGorelli deleted the duckdb-over branch March 22, 2025 13:25
Comment on lines +45 to +47
    with contextlib.suppress(ImportError):  # requires duckdb>=1.3.0
        from duckdb import SQLExpression

Member

@dangotbanned dangotbanned Mar 23, 2025

@MarcoGorelli this is still causing typing issues outside of CI on a fresh install.

Fix 1

I've been using this locally, but it is more of a workaround:

diff --git a/narwhals/_duckdb/expr.py b/narwhals/_duckdb/expr.py
index fd371b73..0ec4681d 100644
--- a/narwhals/_duckdb/expr.py
+++ b/narwhals/_duckdb/expr.py
@@ -41,8 +41,11 @@ if TYPE_CHECKING:
     from narwhals.utils import Version
     from narwhals.utils import _FullContext
 
-with contextlib.suppress(ImportError):  # requires duckdb>=1.3.0
-    from duckdb import SQLExpression
+if not TYPE_CHECKING:
+    with contextlib.suppress(ImportError):  # requires duckdb>=1.3.0
+        from duckdb import SQLExpression
+else:
+    from duckdb import Expression as SQLExpression
 
 
 class DuckDBExpr(LazyExpr["DuckDBLazyFrame", "duckdb.Expression"]):

Fix 2

Specifying this requirement in pyproject.toml somewhere.
Not sure on exactly how though

narwhals/Makefile

Lines 22 to 24 in de9f375

typing: ## Run typing checks
	# install duckdb nightly so mypy recognises duckdb.SQLExpression
	$(VENV_BIN)/uv pip install -U --pre duckdb

uv pip install -U --pre duckdb --system

Comment on lines -219 to -221
if "sqlframe" in str(constructor):
# https://github.com/eakmanrq/sqlframe/issues/325
request.applymarker(pytest.mark.xfail)
Member

@MarcoGorelli I'm also needing this since the PR merged 🤔

The comment is the error that started showing up for me, but not in CI

diff --git a/tests/expr_and_series/str/to_datetime_test.py b/tests/expr_and_series/str/to_datetime_test.py
index 412485d0..4ed208f6 100644
--- a/tests/expr_and_series/str/to_datetime_test.py
+++ b/tests/expr_and_series/str/to_datetime_test.py
@@ -14,6 +14,7 @@ from tests.utils import PANDAS_VERSION
 from tests.utils import PYARROW_VERSION
 from tests.utils import assert_equal_data
 from tests.utils import is_pyarrow_windows_no_tzdata
+from tests.utils import is_windows
 
 if TYPE_CHECKING:
     from tests.utils import Constructor
@@ -219,6 +220,15 @@ def test_to_datetime_tz_aware(
     if "cudf" in str(constructor):
         # cuDF does not yet support timezone-aware datetimes
         request.applymarker(pytest.mark.xfail)
+    if "sqlframe" in str(constructor) and format is not None and is_windows():
+        #
+        # E       duckdb.duckdb.InvalidInputException: Invalid Input Error:
+        #         Could not parse string "2020-01-01 01:02:03+0100" according to format specifier "%Y-%-m-%-d %-H:%-M:%-SZ"
+        # E       2020-01-01 01:02:03+0100
+        # E                           ^
+        # E       Error: Literal does not match, expected Z
+        #
+        request.applymarker(pytest.mark.xfail)
     context = (
         pytest.raises(NotImplementedError)
         if any(x in str(constructor) for x in ("duckdb",)) and format is None

dangotbanned added a commit to MarcoGorelli/narwhals that referenced this pull request Mar 23, 2025

Labels

enhancement New feature or request


3 participants