-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-16616: [Python] Add lazy Dataset.filter() method #13409
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
|
I have no idea whether we want to expose such lazy construction APIs on Dataset. |
FYI, this task has spawned from #13155 (comment) discussion |
|
@pitrou Addressed all comments, should be ready for re-review |
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not completely against this but having FilteredDataset instead of something like Query might be a bit short-sighted. What happens if a user wants to add a dynamic column (project)?
If you had both a projection expression and a filter expression that might be more close to what scanner / datasets provides today.
|
I am personally also a bit wary of adding a new public class like |
|
|
@westonpace I'm not particularly attached to the I also dislike the idea of reusing the |
|
From #13409 (comment)
Could you expand on this a bit? (I don't know where or why it was asked to move to a dedicated class, the only reference I find in the other PR is the question if this shouldn't live on the Scanner) It seems to me that if we want to expose a helper |
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update!
There are a few more complexities:
There are some other Dataset methods that might also need to take into account this filter: replace_schema, get_fragments.
And what with a UnionDataset unioning filtered datasets ?
We should probably also note somewhere that Fragments don't inherit this filter.
The ParquetFileFragment has some methods that work with filters (split_by_row_group, subset).
python/pyarrow/includes/libarrow.pxd
Outdated
|
|
||
| cdef cppclass CSourceNodeOptions "arrow::compute::SourceNodeOptions"(CExecNodeOptions): | ||
| @staticmethod | ||
| CResult[shared_ptr[CSourceNodeOptions]] FromRecordBatchReader( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still used in the current version of the PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No anymore, but it still had a value on its own, so I didn't remove it. If it's a concern I can easily get rid of it. (Even though I would leave the C++ implementation around).
Co-authored-by: Joris Van den Bossche <[email protected]>
|
@jorisvandenbossche I checked for If the dataset is created using It seems that Asking because if the plan is to deprecate it some day, then it probably doesn't make much sense to invest the effort to work toward feature parity with |
|
I think you certainly don't have to care about ParquetDataset in this PR (we generally didn't add any of the additional methods from pyarrow.dataset.Dataset to it, so this PR just follows that). And let's have the discussion about what to do with ParquetDataset in a dedicated place (eg we already have https://issues.apache.org/jira/browse/ARROW-9720, although the description there is a bit outdated) |
Expose a
Dataset.filtermethod that applies a filter to the dataset without actually loading it in memory.Addresses what was discussed in #13155 (comment)
Dataset.somethingmethods.dataset(filter=X).filter(filter=Y).scanner(filter=Z)(related to ARROW-16616: [Python] Add lazy Dataset.filter() method #13409 (comment))Datasetclass instead ofFilteredDatasetas discussed with @jorisvandenbossche