chore: replace `zip` with `zip_strict` by raisadz · Pull Request #3003 · narwhals-dev/narwhals

raisadz · 2025-08-17T10:18:23Z

What type of PR is this? (check all applicable)

Related issues

Related issue feat: add DataFrame.top_k and LazyFrame.top_k #2977
Closes #<issue number>

Checklist

Code follows style guide (ruff)
Tests added
Documented the changes

If you have comments or can explain your changes, please do so below

raisadz · 2025-08-17T10:28:48Z

@FBruzzesi I tried to replace zip with a stricter version of zip_equal but it failed for the group_by test pytest tests/frame/group_by_test.py It looks that you could have been working on it #2325. Could you please take a look at it?

dangotbanned · 2025-08-17T10:56:11Z

@raisadz I haven't checked out #2977 yet, but from skimming this PR - it seems like this is being used in a lot of places that have an equal length guarantee elsewhere.
For example, a dict has equal length keys and values

Which of these changes are required to fix an issue for #2977?

FBruzzesi · 2025-08-17T11:03:42Z

@FBruzzesi I tried to replace zip with a stricter version of zip_equal but it failed for the group_by test pytest tests/frame/group_by_test.py It looks that you could have been working on it #2325. Could you please take a look at it?

Hey @raisadz if I understand you correctly, you tried to replace every zip operation.

The issue in group_by (and not sure if somewhere else) is that in the following snippet new_name is used only if the length is the same (i.e. if the key is not multi-output):

safe_keys = [
    # multi-output expression cannot have duplicate names, hence it's safe to suffix
    key.name.map(_temporary_name)
    if (metadata := key._metadata) and metadata.expansion_kind.is_multi_output()
    # otherwise it's single named and we can use Expr.alias
    else key.alias(_temporary_name(new_name))
    for key, new_name in zip(keys, output_names)
]

The test that is failing evaluates to the first expression: key.name.map(_temporary_name), since it's a multi output selector. Hence it does not really require the name to be considered in such point in time.

However this actually spots a hidden issue in case for which multiple "mixed" keys are passed. Let's consider the test case we have:

def test_group_by_selector(constructor: Constructor) -> None:
    data = {"a": [1, 1, 1], "b": [4, 4, 6], "c": [7.5, 8.5, 9.0]}
    result = (
        nw.from_native(constructor(data))
        .group_by(nw.selectors.by_dtype(nw.Int64))
        .agg(nw.col("c").mean())
        .sort("a", "b")
    )
    expected = {"a": [1, 1], "b": [4, 6], "c": [8.0, 9.0]}
    assert_equal_data(result, expected)

If we were to replace the group by statement with .group_by(nw.selectors.by_dtype(nw.Int64), "c"), then in the safe_keys evaluation we would have:

keys = [nw.selectors.by_dtype(nw.Int64), nw.col("c")] 
output_names = ['a', 'b', 'c']

which is bad as we would end up aliasing "c" as "b".

Let me try to address it in a dedicated PR.

Unrelated to the group by issue, could we rename zip_equal to zip_strict?

From python 3.10, zip has a strict keyword
We have a naive implementation in the tests called zip_strict (the one in this PR is much better and more generic)

MarcoGorelli · 2025-08-17T14:25:55Z

cool, thanks, definitely in favour of using strict zip everywhere by default, especially if it's helped uncover an issue

dangotbanned · 2025-08-17T14:40:09Z

cool, thanks, definitely in favour of using strict zip everywhere by default, especially if it's helped uncover an issue

I really disagree with this, the changes in (https://github.com/narwhals-dev/narwhals/pull/3003/files#diff-168a6da9cf3080d5790d562874fe82ff45411bc50b8a1bb90b6dc00ee19ea3ca) aren't safer - they just repeatedly check the same inputs are still the same length

I'm happy with strict zip when we actually don't know the length, but if we can't guarantee anywhere that we have equal length iterables - then there's a major design issue 🤔

MarcoGorelli · 2025-08-17T14:42:54Z

what's the downside of using this?

dangotbanned · 2025-08-17T14:48:04Z

what's the downside of using this?

It adds overhead multiple times to the same operation

How many times is zip_strict called for this test?

narwhals/tests/selectors_test.py

Lines 197 to 225 in a4d8ebf

    
           @pytest.mark.parametrize( 
        
               ("selector", "expected"), 
        
               [ 
        
                   (ncs.numeric() | ncs.boolean(), ["a", "c", "d"]), 
        
                   (ncs.numeric() & ncs.boolean(), []), 
        
                   (ncs.numeric() & ncs.by_dtype(nw.Int64), ["a"]), 
        
                   (ncs.numeric() | ncs.by_dtype(nw.Int64), ["a", "c"]), 
        
                   (~ncs.numeric(), ["b", "d"]), 
        
                   (ncs.boolean() & True, ["d"]), 
        
                   (ncs.boolean() | True, ["d"]), 
        
                   (ncs.numeric() - 1, ["a", "c"]), 
        
                   (ncs.all(), ["a", "b", "c", "d"]), 
        
               ], 
        
           ) 
        
           def test_set_ops( 
        
               constructor: Constructor, 
        
               selector: nw.selectors.Selector, 
        
               expected: list[str], 
        
               request: pytest.FixtureRequest, 
        
           ) -> None: 
        
               if ( 
        
                   any(x in str(constructor) for x in ("duckdb", "sqlframe", "ibis")) 
        
                   and not expected 
        
               ): 
        
                   # https://github.com/narwhals-dev/narwhals/issues/2469 
        
                   request.applymarker(pytest.mark.xfail) 
        
               df = nw.from_native(constructor(data)) 
        
               result = df.select(selector).collect_schema().names() 
        
               assert sorted(result) == expected

MarcoGorelli · 2025-08-17T14:58:26Z

true, but checking if it's necessary or not is a manual task and thus is error-prone. if we decide to use it everywhere, then it's a simple pre-commit check to enforce it

from what i can see, the overhead is minimal (and comparable to the rest of the Python operations we do) so I think I'd rather take this hit (if any, i'd be quite surprised if it actually showed up in a performance test, we're typically zipping over very few elements at a time anyway) and have it be safer. When we're Python3.10, we can just use strict=True

dangotbanned · 2025-08-17T15:07:39Z

Again I'm not concerned about using this where we actually need to validate it

But there are a very large number of changes here that are blindly doing it
For example, why would this be helpful?

https://github.com/raisadz/narwhals/blob/de3ffa46e266ce611578f6db9f990a7f29a0209d/narwhals/_arrow/dataframe.py#L294

If pyarrow.Schema is in such a broken state that it doesn't have equal length names and types - zip_strict isn't saving us

    @property
    def schema(self) -> dict[str, DType]:
        schema = self.native.schema
        return {
            name: native_to_narwhals_dtype(dtype, self._version)
            for name, dtype in zip_strict(schema.names, schema.types)
        }

https://github.com/raisadz/narwhals/blob/de3ffa46e266ce611578f6db9f990a7f29a0209d/narwhals/schema.py#L129-L149

This occurs directly after a length check

        return {
            name: to_native_dtype(dtype=dtype, dtype_backend=backend)
            for name, dtype, backend in zip_strict(self.keys(), self.values(), backends)
        }

This creates things to len, then checks the length?

https://github.com/raisadz/narwhals/blob/de3ffa46e266ce611578f6db9f990a7f29a0209d/narwhals/_duckdb/utils.py#L329-L332

https://github.com/raisadz/narwhals/blob/de3ffa46e266ce611578f6db9f990a7f29a0209d/narwhals/_duckdb/utils.py#L298-L307

All I'm asking for is to do this in a measured way and 100% if it fixes an issue (#3003 (comment))

narwhals/_utils.py

FBruzzesi · 2025-08-17T15:37:00Z

Adding my two cents in the conversation since this PR probably was triggered by a request I did to validate the input in top_k.

My understanding of the situation is the following:

Most of the time polars signature says "iterable" (e.g. in top_k we have by: IntoExpr | Iterable[IntoExpr]). However, does it really make sense? In this situation, I think it would make more sense to have the type as a sequence, after all it has to be finite length since:
```
 by
        Column(s) used to determine the top rows.
        Accepts expression input. Strings are parsed as column names.
```
Therefore we should be able to perform a length check at the top level as I was suggesting in the PR. Which in my opinion would be better than raising a vague "zip_strict: first iterable is (longer|shorter)" message to the user at a certain point in the computation.
I agree with @dangotbanned that internally more often than not we have already verified sequence lengths before zipping sequences (or iterables) together. Yet considering the almost nullable overhead shown in the stackoverflow thread (I am impressed), it can be a good safe guard against ourselves (see the bug I introduced in group_by), but I would not rely on it blindly - we should be always careful in reviews and/or tests

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

narwhals/_utils.py

From https://github.com/narwhals-dev/narwhals/blob/4966868bfd94655f395c808ae12562596fe6e73b/narwhals/schema.py#L95-L131

dangotbanned · 2025-08-17T16:24:21Z

To get coverage, this might be a starting point

def test_strict_zip(monkeypatch: pytest.MonkeyPatch) -> None:
    with monkeypatch.context() as mp:
        mp.setattr(sys, "version_info", (3, 9))

MarcoGorelli · 2025-08-18T13:32:29Z

thanks all

we're already running the test suite with python 3.9 so i'd be ok with just pragma no covering the else branch

FBruzzesi

Thanks @raisadz for taking care of this!

MarcoGorelli · 2025-08-18T20:15:20Z

thanks all - let's ship this then!

raisadz added 3 commits August 17, 2025 11:14

replace zip with zip_equal

d3e379b

Merge remote-tracking branch 'upstream/main' into chore/zip-equal

a7ea9b1

add no cover

4ee1060

FBruzzesi mentioned this pull request Aug 17, 2025

fix: GroupBy with mixed length exprs as keys #3004

Merged

10 tasks

rename zip_equal to zip_strict, remove the existing zip_strict

de3ffa4

dangotbanned reviewed Aug 17, 2025

View reviewed changes

narwhals/_utils.py Outdated Show resolved Hide resolved

dangotbanned added the internal label Aug 17, 2025

raisadz mentioned this pull request Aug 17, 2025

feat: add DataFrame.top_k and LazyFrame.top_k #2977

Merged

10 tasks

Update narwhals/_utils.py

2f7a68d

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

dangotbanned reviewed Aug 17, 2025

View reviewed changes

narwhals/_utils.py Show resolved Hide resolved

dangotbanned added 3 commits August 17, 2025 16:06

fix my goof

e4be9c4

ok ruff

b126417

refactor: avoid zip entirely

edcd83c

From https://github.com/narwhals-dev/narwhals/blob/4966868bfd94655f395c808ae12562596fe6e73b/narwhals/schema.py#L95-L131

Merge remote-tracking branch 'upstream/main' into chore/zip-equal

b775d3d

add no cover

98f2b5c

raisadz marked this pull request as ready for review August 18, 2025 13:36

raisadz added 2 commits August 18, 2025 14:40

fix merge

711f503

replace zip with zip_strict for group_by that was failing before

d735ffc

Merge remote-tracking branch 'upstream/main' into chore/zip-equal

1194bb1

FBruzzesi changed the title ~~chore: replace zip with zip equal~~ chore: replace zip with zip_strict Aug 18, 2025

FBruzzesi approved these changes Aug 18, 2025

View reviewed changes

MarcoGorelli merged commit 6109633 into narwhals-dev:main Aug 18, 2025
33 of 34 checks passed

Conversation

raisadz commented Aug 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this? (check all applicable)

Related issues

Checklist

If you have comments or can explain your changes, please do so below

Uh oh!

raisadz commented Aug 17, 2025

Uh oh!

dangotbanned commented Aug 17, 2025

Uh oh!

FBruzzesi commented Aug 17, 2025

Uh oh!

MarcoGorelli commented Aug 17, 2025

Uh oh!

dangotbanned commented Aug 17, 2025

Uh oh!

MarcoGorelli commented Aug 17, 2025

Uh oh!

dangotbanned commented Aug 17, 2025

Uh oh!

MarcoGorelli commented Aug 17, 2025

Uh oh!

dangotbanned commented Aug 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

FBruzzesi commented Aug 17, 2025

Uh oh!

Uh oh!

dangotbanned commented Aug 17, 2025

Uh oh!

MarcoGorelli commented Aug 18, 2025

Uh oh!

FBruzzesi left a comment

Choose a reason for hiding this comment

Uh oh!

MarcoGorelli commented Aug 18, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

raisadz commented Aug 17, 2025 •

edited

Loading

dangotbanned commented Aug 17, 2025 •

edited

Loading