ARROW-7413: [Python] Expose and test the partioning discovery #6151
Conversation
@bkietz this is further exposing the PartitionSchemeDiscovery to make it usable from Python. Do we have a preference on the API? cc @kszucs
format = ds.ParquetFileFormat()
options = ds.FileSystemDiscoveryOptions('subdir')
schema_discovery = ds.SchemaPartitionScheme.discover(['group', 'key'])
IMO ds.SchemaPartitionSchemeDiscovery(['group', 'key']) would be better; subclassing ds.PartitionSchemeDiscovery would follow the current semantics of the bindings.
I think the main counterargument is that such a SchemaPartitionSchemeDiscovery is not needed elsewhere and would have no functionality (the user only needs to pass an instance of the base class to the discovery options).
I seem to recall that @bkietz mentioned this for other classes in the Python bindings as well.
Another option would be a more functional approach in the user-facing API, e.g. a hive_partition() function that constructs either an actual PartitionScheme or a PartitionSchemeDiscovery depending on the input.
Exposing subclasses of PartitionSchemeDiscovery is indeed not necessary and should be avoided. There's a JIRA (https://issues.apache.org/jira/browse/ARROW-7391) for removing some class hierarchies that are currently exposed, like PartitionScheme and Expressions.
@jorisvandenbossche I think a single hive_partition() function like you describe is ideal; R currently does this. hive_partition(schema) -> explicit Scheme, while hive_partition(None) -> Discovery. Similarly I'd like to see ordered_partition(schema) -> explicit Scheme, ordered_partition(list_of_names) -> Discovery.
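The single-factory dispatch described above could be sketched roughly as follows. This is only an illustration of the idea: the class names HivePartitionScheme and HivePartitionSchemeDiscovery are stand-ins for the dataset classes under discussion, not the actual pyarrow bindings.

```python
# Hypothetical sketch of the single-factory approach discussed above.
# The class names are assumptions, not the real pyarrow.dataset API.

class HivePartitionScheme:
    """Explicit scheme: the partition schema is fully known up front."""
    def __init__(self, schema):
        self.schema = schema

class HivePartitionSchemeDiscovery:
    """Discovery: the partition schema is inferred from file paths later."""

def hive_partition(schema=None):
    # With an explicit schema, return a concrete scheme;
    # without one, return a discovery object instead.
    if schema is not None:
        return HivePartitionScheme(schema)
    return HivePartitionSchemeDiscovery()
```

Either return value would then be passed to the same keyword of the discovery options, which consumes whichever type it receives.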
I'm not in favor of returning different types with different APIs depending on the function argument. I intentionally introduced only classes before adding any factory functions (like open_dataset, partition_scheme, etc.). Returning objects with the appropriate type does no harm, and wrapping them with functions is easy and straightforward.
Could you give me an end-to-end example of the imagined API?
I think the "imagined" API exists in R already:
https://github.com/apache/arrow/blob/master/r/R/dataset.R
https://github.com/apache/arrow/blob/master/r/tests/testthat/test-dataset.R
One could argue that you should start with the factory functions and only expose the classes and methods you actually need--anything else is YAGNI. In R, for example, I found that I needed to create HivePartitionScheme and HivePartitionSchemeDiscovery classes because hive_partition() would return one or the other, but they didn't need any methods defined, because the only thing you'd do with either is pass it to a DataSourceDiscovery, which would consume it appropriately.
Getting back to this. One option is to merge as is: it's closest to the C++ API for those low-level classes, and a high-level function like hive_partitioning() can be added later in the open_dataset PR?
(but I'd also happily implement the additional subclasses to move this forward)
I'm not in favor of returning different types with different APIs depending on the function argument.
@kszucs in general I agree with this. However, for a high level user API in this case, I think it's also not great to have multiple functions like both hive_partitioning(schema) and hive_partitioning_factory(), and both ordered_partitioning(schema) and ordered_partitioning_factory(list_of_names) (as the result in both cases will be passed to the same keyword in open_dataset).
Agree. Either way, we should come up with a better API, but we can do that refinement in a follow-up PR.
    ("group", pa.int32()),
    ("key", pa.string()),
])
assert inspected_schema.remove_metadata().equals(expected_schema)
FYI There is a check_metadata keyword for Schema.equals.
FYI There is a check_metadata keyword for Schema.equals.
Ah, that's good to know ;)
Thanks, will update the other PR, so we can further discuss the higher-level user API there.
https://issues.apache.org/jira/browse/ARROW-7413