Skip to content

Conversation

@amol-
Copy link
Member

@amol- amol- commented May 13, 2022

No description provided.

@github-actions
Copy link

@amol- amol- requested a review from jorisvandenbossche May 16, 2022 15:30
null_selection_behavior
How nulls in the mask should be handled.
How nulls in the mask should be handled, does nothing if
an :class:`.Expression` is used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not possible to pass through to the filter node?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in any way that I can see, the filter node has a pretty straightforward constructor:
explicit FilterNodeOptions(Expression filter_expression, bool async_mode = true), it only accepts an expression.

I think that if you care about special handling nulls, you probably want to build an expression that evaluates as you wish for nulls

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if you care about special handling nulls, you probably want to build an expression that evaluates as you wish for nulls

I don't think is possible to get the "emit null" behaviour by changing the expression (for dropping/keeping, you can explicitly fill the null with False/True, but for preserving the row as null, that's only possible through this option). I suppose that is a good reason this is an option of the filter kernel and not eg comparison kernels.

Anyway, this is not that important given that the "drop" behaviour is the default for both (and is the typical behaviour you want, I think), but this might be something to open a JIRA for to add FilterOptions to the FilterNodeOptions (cc @westonpace would that make sense?)

Copy link
Member Author

@amol- amol- May 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm, not sure I follow, why you can't use an expression?
Given

>>> pa.table({"rows": [1, 2, 3, None, 5, 6]})
pyarrow.Table
rows: int64
----
rows: [[1,2,3,null,5,6]]

If I want to drop the nulls, I do

>>> t.filter(pc.field("rows") < 5)
pyarrow.Table
rows: int64
----
rows: [[1,2,3]]

If instead I want to keep the nulls, I do

>>> t.filter((pc.field("rows") < 5) | (pc.field("rows").is_null()))
pyarrow.Table
rows: int64
----
rows: [[1,2,3,null]]

Regarding the "nulls" in the selection mask itself, I don't think FilterNode supports anything different from a boolean Expression, so the option doesn't make much sense in that context.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The option is about introducing nulls in the output data where the mask is null, not about preserving nulls from the input data. So for preserving nulls in the input, you can change your expression. But for introducing nulls, I don't think that is possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using your example table:

In [29]: t.filter(pa.array([True, None, True, False, False, False]))
Out[29]: 
pyarrow.Table
rows: int64
----
rows: [[1,3]]

vs

In [33]: t.filter(pa.array([True, None, True, False, False, False]), null_selection_behavior="emit_null")
Out[33]: 
pyarrow.Table
rows: int64
----
rows: [[1,null,3]]

The null is in a place where the original data had a "2"

@jorisvandenbossche jorisvandenbossche changed the title ARROW-16469: [Python] Table.filter and Dataset.filter ARROW-16469: [Python] Table.filter accepts a boolean expression in addition to boolean array May 19, 2022
@ursabot
Copy link

ursabot commented May 20, 2022

Benchmark runs are scheduled for baseline = 1483b82 and contender = 71737ea. 71737ea is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.74% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.28% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 71737eae ec2-t3-xlarge-us-east-2
[Failed] 71737eae test-mac-arm
[Failed] 71737eae ursa-i9-9960x
[Finished] 71737eae ursa-thinkcentre-m75q
[Finished] 1483b82b ec2-t3-xlarge-us-east-2
[Failed] 1483b82b test-mac-arm
[Failed] 1483b82b ursa-i9-9960x
[Finished] 1483b82b ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

amol- added a commit that referenced this pull request Dec 12, 2022
Expose a `Dataset.filter` method that applies a filter to the dataset without actually loading it in memory.

Addresses what was discussed in #13155 (comment)

- [x] Update documentation
- [x] Ensure the filtered dataset preserves the filter when writing it back
- [x] Ensure the filtered dataset preserves the filter when joining
- [x] Ensure the filtered dataset preserves the filter when applying standard `Dataset.something` methods.
- [x] Allow to extend the filter by adding more conditions subsequently `dataset(filter=X).filter(filter=Y).scanner(filter=Z)` (related to #13409 (comment))
- [x]  Refactor to use only `Dataset` class instead of `FilteredDataset` as discussed with @ jorisvandenbossche 
- [x] Add support in replace_schema
- [x] Error in get_fragments in case a filter is set.
- [x] Verify support in UnionDataset


Lead-authored-by: Alessandro Molina <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Alessandro Molina <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants