Skip to content

chore: replace zip with zip_strict#3003

Merged
MarcoGorelli merged 13 commits intonarwhals-dev:mainfrom
raisadz:chore/zip-equal
Aug 18, 2025
Merged

chore: replace zip with zip_strict#3003
MarcoGorelli merged 13 commits intonarwhals-dev:mainfrom
raisadz:chore/zip-equal

Conversation

@raisadz
Copy link
Contributor

@raisadz raisadz commented Aug 17, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@raisadz
Copy link
Contributor Author

raisadz commented Aug 17, 2025

@FBruzzesi I tried to replace zip with a stricter version of zip_equal but it failed for the group_by test pytest tests/frame/group_by_test.py It looks that you could have been working on it #2325. Could you please take a look at it?

@dangotbanned
Copy link
Member

@raisadz I haven't checked out #2977 yet, but from skimming this PR - it seems like this is being used in a lot of places that have an equal length guarantee elsewhere.
For example, a dict has equal length keys and values

Which of these changes are required to fix an issue for #2977?

@FBruzzesi
Copy link
Member

@FBruzzesi I tried to replace zip with a stricter version of zip_equal but it failed for the group_by test pytest tests/frame/group_by_test.py It looks that you could have been working on it #2325. Could you please take a look at it?

Hey @raisadz if I understand you correctly, you tried to replace every zip operation.

The issue in group_by (and not sure if somewhere else) is that in the following snippet new_name is used only if the length is the same (i.e. if the key is not multi-output):

safe_keys = [
    # multi-output expression cannot have duplicate names, hence it's safe to suffix
    key.name.map(_temporary_name)
    if (metadata := key._metadata) and metadata.expansion_kind.is_multi_output()
    # otherwise it's single named and we can use Expr.alias
    else key.alias(_temporary_name(new_name))
    for key, new_name in zip(keys, output_names)
]

The test that is failing evaluates to the first expression: key.name.map(_temporary_name), since it's a multi output selector. Hence it does not really require the name to be considered in such point in time.

However this actually spots a hidden issue in case for which multiple "mixed" keys are passed. Let's consider the test case we have:

def test_group_by_selector(constructor: Constructor) -> None:
    data = {"a": [1, 1, 1], "b": [4, 4, 6], "c": [7.5, 8.5, 9.0]}
    result = (
        nw.from_native(constructor(data))
        .group_by(nw.selectors.by_dtype(nw.Int64))
        .agg(nw.col("c").mean())
        .sort("a", "b")
    )
    expected = {"a": [1, 1], "b": [4, 6], "c": [8.0, 9.0]}
    assert_equal_data(result, expected)

If we were to replace the group by statement with .group_by(nw.selectors.by_dtype(nw.Int64), "c"), then in the safe_keys evaluation we would have:

keys = [nw.selectors.by_dtype(nw.Int64), nw.col("c")] 
output_names = ['a', 'b', 'c']

which is bad as we would end up aliasing "c" as "b".

Let me try to address it in a dedicated PR.


Unrelated to the group by issue, could we rename zip_equal to zip_strict?

  • From python 3.10, zip has a strict keyword
  • We have a naive implementation in the tests called zip_strict (the one in this PR is much better and more generic)

@MarcoGorelli
Copy link
Member

cool, thanks, definitely in favour of using strict zip everywhere by default, especially if it's helped uncover an issue

@dangotbanned
Copy link
Member

cool, thanks, definitely in favour of using strict zip everywhere by default, especially if it's helped uncover an issue

I really disagree with this, the changes in (https://github.com/narwhals-dev/narwhals/pull/3003/files#diff-168a6da9cf3080d5790d562874fe82ff45411bc50b8a1bb90b6dc00ee19ea3ca) aren't safer - they just repeatedly check the same inputs are still the same length

I'm happy with strict zip when we actually don't know the length, but if we can't guarantee anywhere that we have equal length iterables - then there's a major design issue 🤔

@MarcoGorelli
Copy link
Member

what's the downside of using this?

@dangotbanned
Copy link
Member

what's the downside of using this?

It adds overhead multiple times to the same operation

How many times is zip_strict called for this test?

@pytest.mark.parametrize(
("selector", "expected"),
[
(ncs.numeric() | ncs.boolean(), ["a", "c", "d"]),
(ncs.numeric() & ncs.boolean(), []),
(ncs.numeric() & ncs.by_dtype(nw.Int64), ["a"]),
(ncs.numeric() | ncs.by_dtype(nw.Int64), ["a", "c"]),
(~ncs.numeric(), ["b", "d"]),
(ncs.boolean() & True, ["d"]),
(ncs.boolean() | True, ["d"]),
(ncs.numeric() - 1, ["a", "c"]),
(ncs.all(), ["a", "b", "c", "d"]),
],
)
def test_set_ops(
constructor: Constructor,
selector: nw.selectors.Selector,
expected: list[str],
request: pytest.FixtureRequest,
) -> None:
if (
any(x in str(constructor) for x in ("duckdb", "sqlframe", "ibis"))
and not expected
):
# https://github.com/narwhals-dev/narwhals/issues/2469
request.applymarker(pytest.mark.xfail)
df = nw.from_native(constructor(data))
result = df.select(selector).collect_schema().names()
assert sorted(result) == expected

@MarcoGorelli
Copy link
Member

true, but checking if it's necessary or not is a manual task and thus is error-prone. if we decide to use it everywhere, then it's a simple pre-commit check to enforce it

from what i can see, the overhead is minimal (and comparable to the rest of the Python operations we do) so I think I'd rather take this hit (if any, i'd be quite surprised if it actually showed up in a performance test, we're typically zipping over very few elements at a time anyway) and have it be safer. When we're Python3.10, we can just use strict=True

@dangotbanned
Copy link
Member

dangotbanned commented Aug 17, 2025

Again I'm not concerned about using this where we actually need to validate it

But there are a very large number of changes here that are blindly doing it
For example, why would this be helpful?

https://github.com/raisadz/narwhals/blob/de3ffa46e266ce611578f6db9f990a7f29a0209d/narwhals/_arrow/dataframe.py#L294

If pyarrow.Schema is in such a broken state that it doesn't have equal length names and types - zip_strict isn't saving us

    @property
    def schema(self) -> dict[str, DType]:
        schema = self.native.schema
        return {
            name: native_to_narwhals_dtype(dtype, self._version)
            for name, dtype in zip_strict(schema.names, schema.types)
        }

https://github.com/raisadz/narwhals/blob/de3ffa46e266ce611578f6db9f990a7f29a0209d/narwhals/schema.py#L129-L149

This occurs directly after a length check

        return {
            name: to_native_dtype(dtype=dtype, dtype_backend=backend)
            for name, dtype, backend in zip_strict(self.keys(), self.values(), backends)
        }

This creates things to len, then checks the length?

https://github.com/raisadz/narwhals/blob/de3ffa46e266ce611578f6db9f990a7f29a0209d/narwhals/_duckdb/utils.py#L329-L332

https://github.com/raisadz/narwhals/blob/de3ffa46e266ce611578f6db9f990a7f29a0209d/narwhals/_duckdb/utils.py#L298-L307


All I'm asking for is to do this in a measured way and 100% if it fixes an issue (#3003 (comment))

@FBruzzesi
Copy link
Member

Adding my two cents in the conversation since this PR probably was triggered by a request I did to validate the input in top_k.

My understanding of the situation is the following:

  • Most of the time polars signature says "iterable" (e.g. in top_k we have by: IntoExpr | Iterable[IntoExpr]). However, does it really make sense? In this situation, I think it would make more sense to have the type as a sequence, after all it has to be finite length since:
     by
            Column(s) used to determine the top rows.
            Accepts expression input. Strings are parsed as column names.
  • Therefore we should be able to perform a length check at the top level as I was suggesting in the PR. Which in my opinion would be better than raising a vague "zip_strict: first iterable is (longer|shorter)" message to the user at a certain point in the computation.
  • I agree with @dangotbanned that internally more often than not we have already verified sequence lengths before zipping sequences (or iterables) together. Yet considering the almost nullable overhead shown in the stackoverflow thread (I am impressed), it can be a good safe guard against ourselves (see the bug I introduced in group_by), but I would not rely on it blindly - we should be always careful in reviews and/or tests

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
@dangotbanned
Copy link
Member

To get coverage, this might be a starting point

def test_strict_zip(monkeypatch: pytest.MonkeyPatch) -> None:
    with monkeypatch.context() as mp:
        mp.setattr(sys, "version_info", (3, 9))

@MarcoGorelli
Copy link
Member

thanks all

we're already running the test suite with python 3.9 so i'd be ok with just pragma no covering the else branch

@raisadz raisadz marked this pull request as ready for review August 18, 2025 13:36
@FBruzzesi FBruzzesi changed the title chore: replace zip with zip equal chore: replace zip with zip_strict Aug 18, 2025
Copy link
Member

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @raisadz for taking care of this!

@MarcoGorelli
Copy link
Member

thanks all - let's ship this then!

@MarcoGorelli MarcoGorelli merged commit 6109633 into narwhals-dev:main Aug 18, 2025
33 of 34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants