Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add Series|Expr.rank #1342

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from
Draft

feat: add Series|Expr.rank #1342

wants to merge 9 commits into from

Conversation

FBruzzesi
Copy link
Member

@FBruzzesi FBruzzesi commented Nov 9, 2024

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

This PR provides initial support for rank method. I will start it as a draft due to a bunch of shortcomings:

  • pandas:
    • there is a (nasty) trick to make it work with nulls/nans and nullable dtypes (related: to BUG: rank does not respect na_option='keep' for numpy nullable integer dtypes, which should be fixed in pandas 3) - I will comment the trick in the code
    • support in group by context seems quite hard, or at least I could not find a clear way to achieve it due to the inability to forward rank arguments. The API to rank on a grouped object is df.groupby(keys)[cols].rank(*args, **kwargs), and this does not even return an aggregated value. Maybe we could support it if that's the only expression passed in the context, yet we need to figure out how to pass the arguments along. This is relevant since over is implemented as a group_by under the hood.
  • pyarrow:
    • does not support polars default method (namely, "average"), therefore if rank is called without specifying another method, it will end up raising an error
    • does not implement rank in a group by context
    • I am using pyarrow.compute.rank which is available but not exposed/documented (?)
  • dask: it just does not support ranking
  • polars:
    • group by context always returns an aggregate, in this case a list of ranks - which is fairly useless as it is a list of increasing/descreasing range until the size of the group

@github-actions github-actions bot added the enhancement New feature or request label Nov 9, 2024
Comment on lines +747 to +749
# crazy workaround for the case of `na_option="keep"` and nullable
# integer dtypes. This should be supported in pandas > 3.0
# https://github.com/pandas-dev/pandas/issues/56976
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the workaround.

@MarcoGorelli I was not able to properly use the pandas like util function get_dtype_backend to figure out the nullable backend. It should not really matter as the non-nullable backend would not result in integer type if the series contains nulls anyway

constructor: Constructor,
method: Literal["average", "min", "max", "dense", "ordinal"],
) -> None:
if "polars" not in str(constructor):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well πŸ₯²

@FBruzzesi
Copy link
Member Author

Hey @adamblake, this is an initial implementation to support rank. In the description I tried to explain all the shortcomings and the challenges I am facing.

@FBruzzesi FBruzzesi changed the title feat: ass Series|Expr.rank feat: add Series|Expr.rank Nov 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant