ARROW-7413: [Python] Expose and test the partioning discovery #6151
Conversation
@bkietz this is further exposing the PartitionSchemeDiscovery to make it usable from Python. Do we have a preference on the API? cc @kszucs
format = ds.ParquetFileFormat()
options = ds.FileSystemDiscoveryOptions('subdir')
schema_discovery = ds.SchemaPartitionScheme.discover(['group', 'key'])
IMO ds.SchemaPartitionSchemeDiscovery(['group', 'key']) would be better; subclassing ds.PartitionSchemeDiscovery would follow the current semantics of the bindings.
I think the main counterargument is that such a SchemaPartitionSchemeDiscovery is not needed elsewhere and would have no functionality (the user only needs to pass an instance of the base class to the discovery options).
I seem to recall that @bkietz mentioned this for other classes in the Python bindings as well.
Another option would be a more functional approach in the user-facing API, e.g. a hive_partition() function that constructs either an actual PartitionScheme or a PartitionSchemeDiscovery depending on the input.
Exposing subclasses of PartitionSchemeDiscovery is indeed not necessary and should be avoided. There's a JIRA (https://issues.apache.org/jira/browse/ARROW-7391) for removing some class hierarchies that are currently exposed, like PartitionScheme and Expressions.
@jorisvandenbossche I think a single hive_partition() function like you describe is ideal; R currently does this. hive_partition(schema) -> explicit Scheme, while hive_partition(None) -> Discovery. Similarly I'd like to see ordered_partition(schema) -> explicit Scheme, ordered_partition(list_of_names) -> Discovery.
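The single-factory dispatch described above could be sketched roughly as follows. This is only an illustration of the idea: the class names HivePartitionScheme and HivePartitionSchemeDiscovery are stand-ins for the dataset classes under discussion, not the actual pyarrow bindings.

```python
# Hypothetical sketch of the single-factory approach discussed above.
# The class names are assumptions, not the real pyarrow.dataset API.

class HivePartitionScheme:
    """Explicit scheme: the partition schema is fully known up front."""
    def __init__(self, schema):
        self.schema = schema

class HivePartitionSchemeDiscovery:
    """Discovery: the partition schema is inferred from file paths later."""

def hive_partition(schema=None):
    # With an explicit schema, return a concrete scheme;
    # without one, return a discovery object instead.
    if schema is not None:
        return HivePartitionScheme(schema)
    return HivePartitionSchemeDiscovery()
```

Either return value would then be passed to the same keyword of the discovery options, which consumes whichever type it receives.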
I'm not in favor of returning different types with different APIs depending on the function argument. I intentionally introduced only classes before adding any factory functions (like open_dataset, partition_scheme, etc.). Returning objects with the appropriate type does no harm, and wrapping them with functions is easy and straightforward.
Could you give me an end-to-end example of the imagined API?
I think the "imagined" API exists in R already:
https://github.com/apache/arrow/blob/master/r/R/dataset.R
https://github.com/apache/arrow/blob/master/r/tests/testthat/test-dataset.R
One could argue that you should start with the factory functions and only expose the classes and methods you actually need--anything else is YAGNI. In R, for example, I found that I needed to create HivePartitionScheme and HivePartitionSchemeDiscovery classes because hive_partition() would return one or the other, but they didn't need any methods defined, because the only thing you'd do with either is pass it to a DataSourceDiscovery, which would consume it appropriately.
Getting back to this. One option is to merge as is: it's closest to the C++ API for those low-level classes, and a high-level function like hive_partitioning() can be added later in the open_dataset PR?
(but I'd also happily implement the additional subclasses to move this forward)
I'm not in favor of returning different types with different APIs depending on the function argument.
@kszucs in general I agree with this. However, for a high level user API in this case, I think it's also not great to have multiple functions like both hive_partitioning(schema) and hive_partitioning_factory(), and both ordered_partitioning(schema) and ordered_partitioning_factory(list_of_names) (as the result in both cases will be passed to the same keyword in open_dataset).
Agree. Either way, we should come up with a better API, but we can do that refinement in a follow-up PR.
    ("group", pa.int32()),
    ("key", pa.string()),
])
assert inspected_schema.remove_metadata().equals(expected_schema)
FYI There is a check_metadata keyword for Schema.equals.
FYI There is a check_metadata keyword for Schema.equals.
Ah, that's good to know ;)
Thanks, will update the other PR, so we can further discuss the higher-level user API there.
https://issues.apache.org/jira/browse/ARROW-7413