Conversation

@jorisvandenbossche (Member Author)

@bkietz this further exposes PartitionSchemeDiscovery to make it usable from Python.

Do we have a preference on the API?
Currently I made it similar to the C++ API, with a static method on the concrete (hive/schema) PartitionScheme that returns the existing base PartitionSchemeDiscovery object, which can then be passed to the options.

cc @kszucs

github-actions bot commented Jan 9, 2020

@bkietz self-requested a review on January 9, 2020 19:27
format = ds.ParquetFileFormat()

options = ds.FileSystemDiscoveryOptions('subdir')
schema_discovery = ds.SchemaPartitionScheme.discover(['group', 'key'])
Member

IMO ds.SchemaPartitionSchemeDiscovery(['group', 'key']) would be better; subclassing ds.PartitionSchemeDiscovery would follow the current semantics of the bindings.
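For concreteness, here is a minimal sketch of the subclass-per-scheme binding suggested above. These are illustrative Python stubs only, not the actual pyarrow.dataset API: each discovery variant would be a thin subclass of a common base, and users would construct the concrete class directly.

```python
# Illustrative stubs only; not the actual pyarrow.dataset API.

class PartitionSchemeDiscovery:
    """Base class; the discovery options would accept any instance of this."""

class SchemaPartitionSchemeDiscovery(PartitionSchemeDiscovery):
    """Concrete subclass constructed directly from partition field names."""
    def __init__(self, field_names):
        self.field_names = list(field_names)

discovery = SchemaPartitionSchemeDiscovery(["group", "key"])
assert isinstance(discovery, PartitionSchemeDiscovery)
assert discovery.field_names == ["group", "key"]
```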

Member Author

I think the main counterargument is that such a SchemaPartitionSchemeDiscovery is not needed for anything else and would have no functionality of its own (the user only needs to pass an instance of the base class to the discovery options).

I seem to recall @bkietz mentioning the same for other classes in the Python bindings as well.

Another option would be a more functional approach in the user-facing API, e.g. a hive_partition() function that constructs either an actual PartitionScheme or a PartitionSchemeDiscovery depending on the input.

@bkietz (Member) commented Jan 10, 2020

Exposing subclasses of PartitionSchemeDiscovery is indeed not necessary and should be avoided. There's a JIRA open for removing some class hierarchies which are currently exposed, like PartitionScheme and Expressions: https://issues.apache.org/jira/browse/ARROW-7391

@jorisvandenbossche I think a single hive_partition() function like you describe is ideal; R currently does this: hive_partition(schema) -> explicit Scheme, while hive_partition(None) -> Discovery. Similarly, I'd like to see ordered_partition(schema) -> explicit Scheme and ordered_partition(list_of_names) -> Discovery.
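The single-factory approach described above can be sketched with plain Python stubs. The class and function names mirror the discussion and are hypothetical, not the actual pyarrow API: one hive_partition() function returns an explicit scheme when given a schema, and a discovery object when given None.

```python
# Illustrative stubs only; not the actual pyarrow API.

class HivePartitionScheme:
    """Explicit scheme: the partition fields and types are fully known."""
    def __init__(self, schema):
        self.schema = schema

class HivePartitionSchemeDiscovery:
    """Discovery object: the scheme will be inferred from the file paths."""

def hive_partition(schema=None):
    # Dispatch on the argument, as hive_partition() does in the R bindings.
    if schema is None:
        return HivePartitionSchemeDiscovery()
    return HivePartitionScheme(schema)

assert isinstance(hive_partition(None), HivePartitionSchemeDiscovery)
assert isinstance(hive_partition({"group": "int32"}), HivePartitionScheme)
```

Either return value would then be passed to the same discovery-options keyword, which is what makes the single entry point attractive.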

Member

I'm not in favor of returning different types with different APIs depending on the function argument. I have intentionally introduced only classes before adding any factory functions (like open_dataset, partition_scheme, etc.). Returning objects of the appropriate type does no harm, and wrapping them with functions is easy and straightforward.

Could you give me an end-to-end example of the imagined API?

Member

I think the "imagined" API exists in R already:

https://github.com/apache/arrow/blob/master/r/R/dataset.R
https://github.com/apache/arrow/blob/master/r/tests/testthat/test-dataset.R

One could argue that you should start with the factory functions and only expose the classes and methods you actually need; anything else is YAGNI. In R, for example, I found that I needed to create HivePartitionScheme and HivePartitionSchemeDiscovery classes because hive_partition() would return one or the other, but they didn't need any methods defined, because the only thing you'd do with either is pass it to a DataSourceDiscovery, which consumes it appropriately.

Member Author

Getting back to this. One option is to merge as is: it's closest to the C++ API for these low-level classes, and a high-level function like hive_partitioning() can be added later in the open_dataset PR.
(But I'd also happily implement the additional subclasses to move forward with this.)

I'm not in favor of returning different types with different APIs depending on the function argument.

@kszucs in general I agree with this. However, for a high-level user API in this case, I think it's also not great to have multiple functions, like both hive_partitioning(schema) and hive_partitioning_factory(), and both ordered_partitioning(schema) and ordered_partitioning_factory(list_of_names) (as the result in both cases will be passed to the same keyword in open_dataset).

Member

Agreed. Either way we should come up with a better API, but we can do that refinement in a follow-up PR.

expected_schema = pa.schema([
    ("group", pa.int32()),
    ("key", pa.string()),
])
assert inspected_schema.remove_metadata().equals(expected_schema)
Member

FYI There is a check_metadata keyword for Schema.equals.

Member Author

FYI There is a check_metadata keyword for Schema.equals.

Ah, that's good to know ;)

@kszucs closed this in 38bf178 on Jan 14, 2020
@jorisvandenbossche (Member Author)

Thanks, I will update the other PR, so we can further discuss the higher-level user API there.
