-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-16469: [Python] Table.filter accepts a boolean expression in addition to boolean array #13155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| null_selection_behavior | ||
| How nulls in the mask should be handled. | ||
| How nulls in the mask should be handled, does nothing if | ||
| an :class:`.Expression` is used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not possible to pass through to the filter node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in any way that I can see, the filter node has a pretty straightforward constructor:
explicit FilterNodeOptions(Expression filter_expression, bool async_mode = true), it only accepts an expression.
I think that if you care about special handling nulls, you probably want to build an expression that evaluates as you wish for nulls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that if you care about special handling nulls, you probably want to build an expression that evaluates as you wish for nulls
I don't think is possible to get the "emit null" behaviour by changing the expression (for dropping/keeping, you can explicitly fill the null with False/True, but for preserving the row as null, that's only possible through this option). I suppose that is a good reason this is an option of the filter kernel and not eg comparison kernels.
Anyway, this is not that important given that the "drop" behaviour is the default for both (and is the typical behaviour you want, I think), but this might be something to open a JIRA for to add FilterOptions to the FilterNodeOptions (cc @westonpace would that make sense?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm, not sure I follow, why you can't use an expression?
Given
>>> pa.table({"rows": [1, 2, 3, None, 5, 6]})
pyarrow.Table
rows: int64
----
rows: [[1,2,3,null,5,6]]
If I want to drop the nulls, I do
>>> t.filter(pc.field("rows") < 5)
pyarrow.Table
rows: int64
----
rows: [[1,2,3]]
If instead I want to keep the nulls, I do
>>> t.filter((pc.field("rows") < 5) | (pc.field("rows").is_null()))
pyarrow.Table
rows: int64
----
rows: [[1,2,3,null]]
Regarding the "nulls" in the selection mask itself, I don't think FilterNode supports anything different from a boolean Expression, so the option doesn't make much sense in that context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The option is about introducing nulls in the output data where the mask is null, not about preserving nulls from the input data. So for preserving nulls in the input, you can change your expression. But for introducing nulls, I don't think that is possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using your example table:
In [29]: t.filter(pa.array([True, None, True, False, False, False]))
Out[29]:
pyarrow.Table
rows: int64
----
rows: [[1,3]]
vs
In [33]: t.filter(pa.array([True, None, True, False, False, False]), null_selection_behavior="emit_null")
Out[33]:
pyarrow.Table
rows: int64
----
rows: [[1,null,3]]
The null is in a place where the original data had a "2"
|
Benchmark runs are scheduled for baseline = 1483b82 and contender = 71737ea. 71737ea is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Expose a `Dataset.filter` method that applies a filter to the dataset without actually loading it in memory. Addresses what was discussed in #13155 (comment) - [x] Update documentation - [x] Ensure the filtered dataset preserves the filter when writing it back - [x] Ensure the filtered dataset preserves the filter when joining - [x] Ensure the filtered dataset preserves the filter when applying standard `Dataset.something` methods. - [x] Allow to extend the filter by adding more conditions subsequently `dataset(filter=X).filter(filter=Y).scanner(filter=Z)` (related to #13409 (comment)) - [x] Refactor to use only `Dataset` class instead of `FilteredDataset` as discussed with @ jorisvandenbossche - [x] Add support in replace_schema - [x] Error in get_fragments in case a filter is set. - [x] Verify support in UnionDataset Lead-authored-by: Alessandro Molina <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: Weston Pace <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Alessandro Molina <[email protected]>
No description provided.