19 changes: 13 additions & 6 deletions python/pyarrow/table.pxi
@@ -2886,20 +2886,23 @@ cdef class Table(_PandasConvertible):
"""
Select rows from the table.

See :func:`pyarrow.compute.filter` for full usage.
The Table can be filtered based on a mask, which will be passed to
:func:`pyarrow.compute.filter` to perform the filtering, or it can
be filtered through a boolean :class:`.Expression`.

Parameters
----------
mask : Array or array-like
The boolean mask to filter the table with.
mask : Array or array-like or .Expression
The boolean mask or the :class:`.Expression` to filter the table with.
null_selection_behavior
How nulls in the mask should be handled.
How nulls in the mask should be handled; ignored if
an :class:`.Expression` is used.
Member

This is not possible to pass through to the filter node?

Member Author

Not in any way that I can see; the filter node has a pretty straightforward constructor,
`explicit FilterNodeOptions(Expression filter_expression, bool async_mode = true)`, and it only accepts an expression.

I think that if you care about special handling of nulls, you probably want to build an expression that evaluates as you wish for nulls.

Member

> I think that if you care about special handling of nulls, you probably want to build an expression that evaluates as you wish for nulls

I don't think it is possible to get the "emit null" behaviour by changing the expression (for dropping/keeping, you can explicitly fill the null with False/True, but for preserving the row as null, that's only possible through this option). I suppose that is a good reason this is an option of the filter kernel and not, e.g., the comparison kernels.

Anyway, this is not that important given that the "drop" behaviour is the default for both (and is the typical behaviour you want, I think), but this might be something to open a JIRA for, to add FilterOptions to FilterNodeOptions (cc @westonpace, would that make sense?)

Member Author

@amol- May 19, 2022

Uhm, not sure I follow; why can't you use an expression?
Given

>>> t = pa.table({"rows": [1, 2, 3, None, 5, 6]})
>>> t
pyarrow.Table
rows: int64
----
rows: [[1,2,3,null,5,6]]

If I want to drop the nulls, I do

>>> t.filter(pc.field("rows") < 5)
pyarrow.Table
rows: int64
----
rows: [[1,2,3]]

If instead I want to keep the nulls, I do

>>> t.filter((pc.field("rows") < 5) | (pc.field("rows").is_null()))
pyarrow.Table
rows: int64
----
rows: [[1,2,3,null]]

Regarding the "nulls" in the selection mask itself, I don't think FilterNode supports anything different from a boolean Expression, so the option doesn't make much sense in that context.

Member

The option is about introducing nulls in the output data where the mask is null, not about preserving nulls from the input data. So for preserving nulls in the input, you can change your expression. But for introducing nulls, I don't think that is possible.

Member

Using your example table:

In [29]: t.filter(pa.array([True, None, True, False, False, False]))
Out[29]: 
pyarrow.Table
rows: int64
----
rows: [[1,3]]

vs

In [33]: t.filter(pa.array([True, None, True, False, False, False]), null_selection_behavior="emit_null")
Out[33]: 
pyarrow.Table
rows: int64
----
rows: [[1,null,3]]

The null is in a place where the original data had a "2".


Returns
-------
filtered : Table
A table of the same schema, with only the rows selected
by the boolean mask.
by the applied filtering.

Examples
--------
@@ -2932,7 +2935,11 @@ cdef class Table(_PandasConvertible):
n_legs: [[2,4,null]]
animals: [["Flamingo","Horse",null]]
"""
return _pc().filter(self, mask, null_selection_behavior)
if isinstance(mask, _pc().Expression):
return _pc()._exec_plan._filter_table(self, mask,
output_type=Table)
else:
return _pc().filter(self, mask, null_selection_behavior)

def take(self, object indices):
"""
16 changes: 16 additions & 0 deletions python/pyarrow/tests/test_dataset.py
@@ -4620,3 +4620,19 @@ def test_dataset_join_collisions(tempdir):
[10, 20, None, 99],
["A", "B", None, "Z"],
], names=["colA", "colB", "colVals", "colB_r", "colVals_r"])


@pytest.mark.dataset
def test_dataset_filter(tempdir):
t1 = pa.table({
"colA": [1, 2, 6],
"col2": ["a", "b", "f"]
})
ds.write_dataset(t1, tempdir / "t1", format="parquet")
ds1 = ds.dataset(tempdir / "t1")

result = ds1.scanner(filter=pc.field("colA") < 3)
assert result.to_table() == pa.table({
"colA": [1, 2],
"col2": ["a", "b"]
})
24 changes: 24 additions & 0 deletions python/pyarrow/tests/test_table.py
@@ -2121,3 +2121,27 @@ def test_table_join_collisions():
[10, 20, None, 99],
["A", "B", None, "Z"],
], names=["colA", "colB", "colVals", "colB", "colVals"])


@pytest.mark.dataset
def test_table_filter_expression():
t1 = pa.table({
"colA": [1, 2, 6],
"colB": [10, 20, 60],
"colVals": ["a", "b", "f"]
})

t2 = pa.table({
"colA": [99, 2, 1],
"colB": [99, 20, 10],
"colVals": ["Z", "B", "A"]
})

t3 = pa.concat_tables([t1, t2])

result = t3.filter(pc.field("colA") < 10)
assert result.combine_chunks() == pa.table({
"colA": [1, 2, 6, 2, 1],
"colB": [10, 20, 60, 20, 10],
"colVals": ["a", "b", "f", "B", "A"]
})