
feat: add DataFrame.top_k and LazyFrame.top_k#2977

Merged
MarcoGorelli merged 17 commits into narwhals-dev:main from raisadz:feat/top_k
Aug 23, 2025

Conversation

@raisadz
Contributor

@raisadz raisadz commented Aug 12, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@raisadz raisadz added the pyspark Issue is related to pyspark backend label Aug 12, 2025
@raisadz raisadz marked this pull request as ready for review August 12, 2025 16:34
Member

@FBruzzesi FBruzzesi left a comment


Thanks @raisadz - I left a few comments; the only one I really care about is the length input validation at the narwhals level.

def top_k(
    self, k: int, *, by: str | Iterable[str], reverse: bool | Sequence[bool] = False
) -> Self:
    flatten_by = flatten([by])
Member


Can we add a check that if reverse is a sequence and its length is different from flatten_by, then an exception is raised? This guarantees that zip(by, reverse) at the compliant level is the same as zip_strict.

From polars:

df = pl.DataFrame(
    {
        "a": ["a", "b", "a", "b", "b", "c"],
        "b": [2, 1, 1, 3, 2, 1],
    }
)

df.top_k(4, by=["b", "a"], reverse=[True])

ValueError: the length of reverse (1) does not match the length of by (2)

Member

@FBruzzesi FBruzzesi Aug 19, 2025


@raisadz I would still prefer to add a check at this level to also align the error with polars (notice that the output of flatten is a list anyway), but feel free to merge. We can follow up on it

Member


I think there are some other places where this would be useful (like sort), so we could probably make a validation utility for this and use it in multiple places.

    return self._with_native(self.native.sort(*it))

def top_k(self, k: int, *, by: Iterable[str], reverse: bool | Sequence[bool]) -> Self:
    df = self.native  # noqa: F841
Member


If you prefix the variable name with an underscore (_df), you can avoid the # noqa: F841 comment. It's hacky, I know.

@FBruzzesi FBruzzesi added the enhancement New feature or request label Aug 16, 2025
@raisadz raisadz mentioned this pull request Aug 17, 2025
raisadz and others added 7 commits August 17, 2025 11:30
@raisadz
Contributor Author

raisadz commented Aug 17, 2025

Thanks for the review @FBruzzesi! I addressed your comments and will add zip_strict from #3003 after it is merged.

Member

@MarcoGorelli MarcoGorelli left a comment


Nice, thanks both @raisadz and @FBruzzesi!

Happy to ship it if there are no further comments.

@MarcoGorelli
Member

Merging then. I've opened #3026 for a follow-up, thanks all for the comments!

@MarcoGorelli MarcoGorelli merged commit 285815a into narwhals-dev:main Aug 23, 2025
31 of 33 checks passed

Labels

enhancement New feature or request pyspark Issue is related to pyspark backend

Projects

None yet

Development

Successfully merging this pull request may close these issues.

enh?: {DataFrame/LazyFrame}.top_k

3 participants