ARROW-16469: [Python] Table.filter accepts a boolean expression in addition to boolean array #13155

amol- · 2022-05-13T16:52:27Z

No description provided.

github-actions · 2022-05-13T16:52:44Z

https://issues.apache.org/jira/browse/ARROW-16469

python/pyarrow/_dataset.pyx

python/pyarrow/table.pxi

jorisvandenbossche · 2022-05-17T17:22:26Z

python/pyarrow/table.pxi

        null_selection_behavior
-            How nulls in the mask should be handled.
+            How nulls in the mask should be handled, does nothing if
+            an :class:`.Expression` is used.


Is this not possible to pass through to the filter node?

Not in any way that I can see. The filter node has a pretty straightforward constructor, `explicit FilterNodeOptions(Expression filter_expression, bool async_mode = true)`, which only accepts an expression.

I think that if you care about special handling of nulls, you probably want to build an expression that evaluates as you wish for nulls.

> I think that if you care about special handling of nulls, you probably want to build an expression that evaluates as you wish for nulls

I don't think it is possible to get the "emit null" behaviour by changing the expression. For dropping or keeping, you can explicitly fill the null with False/True, but preserving the row as null is only possible through this option. I suppose that is a good reason why this is an option of the filter kernel and not of, e.g., the comparison kernels.

Anyway, this is not that important given that the "drop" behaviour is the default for both (and is the typical behaviour you want, I think), but it might be worth opening a JIRA to add FilterOptions to FilterNodeOptions (cc @westonpace would that make sense?)

Uhm, not sure I follow. Why can't you use an expression?
Given

    >>> pa.table({"rows": [1, 2, 3, None, 5, 6]})
    pyarrow.Table
    rows: int64
    ----
    rows: [[1,2,3,null,5,6]]

If I want to drop the nulls, I do

    >>> t.filter(pc.field("rows") < 5)
    pyarrow.Table
    rows: int64
    ----
    rows: [[1,2,3]]

If instead I want to keep the nulls, I do

    >>> t.filter((pc.field("rows") < 5) | (pc.field("rows").is_null()))
    pyarrow.Table
    rows: int64
    ----
    rows: [[1,2,3,null]]

Regarding the "nulls" in the selection mask itself, I don't think FilterNode supports anything different from a boolean Expression, so the option doesn't make much sense in that context.

The option is about introducing nulls in the output data where the mask is null, not about preserving nulls from the input data. So for preserving nulls in the input, you can change your expression. But for introducing nulls, I don't think that is possible.

Using your example table:

    In [29]: t.filter(pa.array([True, None, True, False, False, False]))
    Out[29]:
    pyarrow.Table
    rows: int64
    ----
    rows: [[1,3]]

vs

    In [33]: t.filter(pa.array([True, None, True, False, False, False]), null_selection_behavior="emit_null")
    Out[33]:
    pyarrow.Table
    rows: int64
    ----
    rows: [[1,null,3]]

The null is in the position where the original data had a "2".

ursabot · 2022-05-20T11:41:06Z

Benchmark runs are scheduled for baseline = 1483b82 and contender = 71737ea. 71737ea is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.74% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.28% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 71737eae ec2-t3-xlarge-us-east-2
[Failed] 71737eae test-mac-arm
[Failed] 71737eae ursa-i9-9960x
[Finished] 71737eae ursa-thinkcentre-m75q
[Finished] 1483b82b ec2-t3-xlarge-us-east-2
[Failed] 1483b82b test-mac-arm
[Failed] 1483b82b ursa-i9-9960x
[Finished] 1483b82b ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Expose a `Dataset.filter` method that applies a filter to the dataset without actually loading it in memory.

Addresses what was discussed in #13155 (comment)

- [x] Update documentation
- [x] Ensure the filtered dataset preserves the filter when writing it back
- [x] Ensure the filtered dataset preserves the filter when joining
- [x] Ensure the filtered dataset preserves the filter when applying standard `Dataset.something` methods.
- [x] Allow to extend the filter by adding more conditions subsequently `dataset(filter=X).filter(filter=Y).scanner(filter=Z)` (related to #13409 (comment))
- [x] Refactor to use only `Dataset` class instead of `FilteredDataset` as discussed with @jorisvandenbossche
- [x] Add support in replace_schema
- [x] Error in get_fragments in case a filter is set.
- [x] Verify support in UnionDataset

Lead-authored-by: Alessandro Molina <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Alessandro Molina <[email protected]>

Table.filter and Dataset.filter

bdea447

github-actions bot added the Component: Python label May 13, 2022

Fix docs

0d84f3a

amol- requested a review from jorisvandenbossche May 16, 2022 15:30

jorisvandenbossche reviewed May 17, 2022


Name argument back to mask

7945ed3

jorisvandenbossche changed the title ~~ARROW-16469: [Python] Table.filter and Dataset.filter~~ ARROW-16469: [Python] Table.filter accepts a boolean expression in addition to boolean array May 19, 2022

Remove Dataset.filter

f537d9f

jorisvandenbossche approved these changes May 19, 2022


jorisvandenbossche closed this in 71737ea May 19, 2022

amol- mentioned this pull request Jun 23, 2022

ARROW-16616: [Python] Add lazy Dataset.filter() method #13409

Merged

9 tasks
