ARROW-6341: [Python] Implement low-level bindings for Dataset #5237
Conversation
@kszucs is this still WIP? (I assume so based on the CI builds.)
@emkornfield yes, I'll continue working on it after the 0.15.1 release. Theoretically the Dataset API should be ready for bindings.
694087f to 76abfe1 (force-push)
Something I ran into yesterday: trying to access the … where "test_dataset" is a simple directory with a single small parquet file in it.
@jorisvandenbossche I've fixed that.
python/pyarrow/_dataset.pyx
You need to add an implicit cast like in R for this to be bearable. Otherwise you get annoying errors:
In [41]: cond = ds.ComparisonExpression(ds.CompareOperator.Greater, ds.FieldExpression("total_amount"), ds.ScalarExpression(1000.0))
In [42]: scanner_builder.filter(cond)
---------------------------------------------------------------------------
ArrowTypeError Traceback (most recent call last)
<ipython-input-42-bb6fba558cf8> in <module>
----> 1 scanner_builder.filter(cond)
~/src/db/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.ScannerBuilder.filter()
951 self : ScannerBuilder
952 """
--> 953 check_status(self.builder.Filter(filter_expression.unwrap()))
954 return self
955
~/src/db/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
86 raise ArrowNotImplementedError(message)
87 elif status.IsTypeError():
---> 88 raise ArrowTypeError(message)
89 elif status.IsCapacityError():
90 raise ArrowCapacityError(message)
ArrowTypeError: cannot compare expressions of differing type, float vs double
Well, I didn't want to use any implicit behaviour in the first iteration. Perhaps this should be done by the C++ filter method?
We should do it as a follow-up PR to handle it nicely.
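A minimal sketch of what such an implicit cast could look like at the Python level, assuming ds is the dataset module bound in this PR; the greater() helper is entirely hypothetical, and a real fix would still need a cast on the C++ side so that e.g. a Python float compares cleanly against a float32 column:

import pyarrow.dataset as ds  # module path assumed

def greater(field_name, value):
    # Hypothetical convenience wrapper: accept a plain Python value and
    # wrap it in a ScalarExpression so callers don't build it by hand.
    return ds.ComparisonExpression(
        ds.CompareOperator.Greater,
        ds.FieldExpression(field_name),
        ds.ScalarExpression(value),
    )

cond = greater("total_amount", 1000.0)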
python/pyarrow/_dataset.pyx
It's not fully clear to me what exactly this last sentence means.
Does it mean this scheme will be used to filter/project the output of the data sources, or that the other projections/filters specified when scanning should result in something that matches this schema?
Well, it was not clear to me either; it is copied from the C++ API docs. In the case of projections this schema seems to be omitted? cc @fsaintjacques
Hmm, GitHub doesn't allow me to request @jorisvandenbossche's review, so here it is :)
python/pyarrow/_dataset.pyx
We should have a JIRA about introducing InMemoryDataSource; if I recall correctly, we already have one?
I'm not planning to implement any new features here, although we should discuss the possible follow-up PRs. A couple of candidates:
jorisvandenbossche left a comment
I quickly tested some of the code snippets I wrote last week again, and that's all still working nicely.
One thing I noticed is that, due to the removal of DataFragment and FileSource, you can no longer check the file paths in a Dataset/FileSystemDataSource. Maybe something to consider later, if / how we want to expose something like that.
There are still several classes and methods/properties that need docstrings, but it would be fine for me to do that as a follow-up if that means we can merge this faster.
That sounds good to me.
Yes, something like the …
And we also still need to discuss to what extent we want to keep the existing parquet dataset implementation, or deprecate it, or try to implement it partly using the new machinery (e.g. if we want to support dask's usage, we need to keep it).
bkietz left a comment
This is looking good, thanks for doing this!
I think most of your follow-up candidates can wait for a different PR, but the unit tests need some work before this can be merged.
pitrou left a comment
Just some comments while I skimmed over this.
python/pyarrow/tests/test_dataset.py
It's not obvious to me why partition_expression should be equal to the source_partition constructor argument. Can you explain?
That argument is used to set the partition_expression property, so they should be equal.
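A hedged sketch of the test's intent, with entirely hypothetical constructor arguments (CompareOperator.Equal is assumed by analogy with Greater above, and the real FileSystemDataSource signature in this PR differs); the point is only that the source_partition argument and the partition_expression property refer to the same expression:

expr = ds.ComparisonExpression(
    ds.CompareOperator.Equal,
    ds.FieldExpression("year"),
    ds.ScalarExpression(2019),
)
source = ds.FileSystemDataSource(fs, paths, expr, file_format)  # hypothetical signature
assert source.partition_expression.equals(expr)  # assumes expressions expose equals()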
python/pyarrow/_dataset.pyx
Need to add docstrings for all the public classes here.
python/pyarrow/_dataset.pyx
Should also add docstrings for public methods.
wesm left a comment
In my quick read I didn't see anything too unreasonable. I left a handful of comments.
python/pyarrow/tests/test_dataset.py
Do you want to check that the memory pool you pass in is passed on correctly to the ScanContext? I think you can use logging_memory_pool to check.
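A small sketch of the suggested check; pa.logging_memory_pool wraps an existing pool and logs allocations, and bytes_allocated() lets the test assert the pool was actually used. How the pool is wired into the scan is hypothetical here:

import pyarrow as pa

pool = pa.logging_memory_pool(pa.default_memory_pool())
# Hypothetical wiring; the actual ScannerBuilder/ScanContext plumbing
# in this PR may accept the pool elsewhere.
builder = ds.ScannerBuilder(dataset, memory_pool=pool)
table = builder.finish().to_table()
assert pool.bytes_allocated() > 0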
python/pyarrow/_dataset.pyx
You might consider doing dispatch with a dict instead.
self.init(shared_ptr[CFileFormat](new CParquetFileFormat()))
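A sketch of the dict-based dispatch being suggested, replacing an if/elif chain over format names; ParquetFileFormat is the class this PR binds, while the lookup table and helper are hypothetical:

_FILE_FORMATS = {
    'parquet': ParquetFileFormat,
}

def _get_file_format(name, **kwargs):
    # Look up the format class by name instead of branching explicitly;
    # adding a new format then only means adding a dict entry.
    try:
        return _FILE_FORMATS[name](**kwargs)
    except KeyError:
        raise ValueError("unsupported file format: {!r}".format(name))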
cdef class PartitionScheme:
Is it necessary to expose partition scheme as a class at all? I think it would suffice to have factories like make_hive_partition_scheme()
What would a make_hive_partition_scheme() then return if the partition scheme object itself is not exposed?
ds.partition_scheme(pa.Schema schema, string flavor='hive') -> ds.PartitionScheme

Although I prefer not to have opaque return types, so I can distinguish between:

ds.partition_scheme(schema)
ds.partition_scheme(schema, flavor='hive')

If we return the same class then I don't have the ability to inspect how they differ.
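A hedged sketch reconciling the two positions: a single partition_scheme() factory as wesm suggests, which still returns distinguishable classes as kszucs prefers. SchemaPartitionScheme and HivePartitionScheme mirror the C++ names, but the function body is hypothetical:

def partition_scheme(schema, flavor=None):
    # Dispatch on the flavor, returning a distinct class per scheme so
    # callers can still inspect which scheme they got back.
    if flavor == 'hive':
        return HivePartitionScheme(schema)
    elif flavor is None:
        return SchemaPartitionScheme(schema)
    raise ValueError("unknown partition scheme flavor: {!r}".format(flavor))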
bkietz left a comment
LGTM.
Follow-up JIRA for winnowing the public classes: https://issues.apache.org/jira/browse/ARROW-7391
Follow-up on #5237 adding a higher-level API for datasets.

Closes #6022 from jorisvandenbossche/dataset-python and squashes the following commits:

745c218 <Joris Van den Bossche> rename keyword to partitioning + refactor tests + more coverage
8e03282 <Joris Van den Bossche> update for big renaming + doc updates
9c95938 <Joris Van den Bossche> Use FileSystem.from_uri
ac0d83d <Joris Van den Bossche> split into source / dataset functions
866f72c <Joris Van den Bossche> Add single partitioning() function from kszucs + tests
7481fb6 <Joris Van den Bossche> fix import for python 2
d59595d <Joris Van den Bossche> add partition scheme creation functions
260b737 <Joris Van den Bossche> add support for Pathlib
5e00c87 <Joris Van den Bossche> fix with new partition discovery option
757fe80 <Joris Van den Bossche> Add higher level open_dataset function

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>