ARROW-7366: [C++][Dataset] Use PartitionSchemeDiscovery in DataSourceDiscovery #6008

bkietz · 2019-12-10T19:09:25Z

No description provided.

github-actions · 2019-12-10T19:15:39Z

https://issues.apache.org/jira/browse/ARROW-7366

fsaintjacques · 2019-12-11T13:47:56Z

Needs a rebase.

r/src/dataset.cpp

r/R/dataset.R

fsaintjacques

Looks good in general:

PartitionScheme/PartitionSchemeDiscovery should be optional, but a lot of the code assumes otherwise. Such internal details should not leak into the exposed interface. Python lost the ability to not pass a schema.
Drop the timestamp conversion stuff from this PR and make a separate issue.

cpp/src/arrow/dataset/dataset_test.cc

fsaintjacques · 2019-12-18T19:25:51Z

cpp/src/arrow/dataset/discovery.h

Use an optional parameter and merge the previous definitions. See discovery.cc about the behavior of this function, e.g. validation of explicit schema.

Actually I'm not sure I agree with this one; default parameters in a virtual method pass the default case handling down to each implementation. Is there any behavior which should be supported for "Finish without explicit schema" other than "inspect a schema then finish with that"?
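One way to read the point above is that the "no explicit schema" case can live in a single non-virtual overload, so implementations only ever override the explicit-schema path. The following is a minimal sketch under that assumption; `Schema`, `DataSource`, and `FixedDiscovery` are hypothetical stand-ins, not the Arrow classes from this PR.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for arrow::Schema: just a list of field names.
struct Schema {
  std::vector<std::string> field_names;
};

// Hypothetical stand-in for a discovered DataSource.
struct DataSource {
  std::shared_ptr<Schema> schema;
};

class DataSourceDiscovery {
 public:
  virtual ~DataSourceDiscovery() = default;

  // Implementations infer a schema from whatever they discover.
  virtual std::shared_ptr<Schema> Inspect() = 0;

  // Implementations build a DataSource from an explicit schema.
  virtual std::shared_ptr<DataSource> Finish(std::shared_ptr<Schema> schema) = 0;

  // Non-virtual: "inspect a schema, then finish with that".
  // No implementation has to repeat the default-case handling.
  std::shared_ptr<DataSource> Finish() { return Finish(Inspect()); }
};

// Toy implementation that always discovers the same two fields.
class FixedDiscovery : public DataSourceDiscovery {
 public:
  // Keep the zero-argument overload visible despite the override below.
  using DataSourceDiscovery::Finish;

  std::shared_ptr<Schema> Inspect() override {
    return std::make_shared<Schema>(Schema{{"a", "b"}});
  }

  std::shared_ptr<DataSource> Finish(std::shared_ptr<Schema> schema) override {
    return std::make_shared<DataSource>(DataSource{std::move(schema)});
  }
};
```

With this shape, `FixedDiscovery{}.Finish()` and `FixedDiscovery{}.Finish(schema)` both work, and only the second is a customization point.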

Then let's remove the virtual case for now :)

I think it's an anti-pattern to expect the user to call the Validate method so it doesn't blow up in their face. Anyhow, I'll tackle this in my next PR on DatasetDiscovery, as discrepancies will likely happen there more than in DataSourceDiscovery.

cpp/src/arrow/dataset/discovery.h

fsaintjacques · 2019-12-18T19:36:53Z

cpp/src/arrow/dataset/discovery.cc

The schema handling part of this body should be:

If a schema is provided, validate it against Inspect result.

If a schema is not provided, use Inspect result as the schema.

Think about other DataSources, e.g. OdbcData: how should they validate? The options struct in discovery is class-specific, so maybe use two functions:

// Validate the inferred schema against the provided one, or use the inferred schema.
Finish(const std::shared_ptr<Schema>& schema = nullptr);
// Take as-is with no validation, might blow up at runtime.
FinishUnsafe(const std::shared_ptr<Schema>& schema);
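To make the two proposed entry points concrete, here is a rough sketch of how the bodies could look. Everything in it is an assumption for illustration: `Schema`, `DataSource`, and `Discovery` are hypothetical stand-ins, "validation" is reduced to exact field-name equality, and errors are reported with an exception rather than Arrow's `Status`.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins, not the Arrow types.
struct Schema {
  std::vector<std::string> field_names;
};

struct DataSource {
  std::shared_ptr<Schema> schema;
};

class Discovery {
 public:
  explicit Discovery(std::shared_ptr<Schema> inferred)
      : inferred_(std::move(inferred)) {}

  // Returns the schema inferred from the discovered data.
  std::shared_ptr<Schema> Inspect() const { return inferred_; }

  // If a schema is provided, validate it against the Inspect result;
  // if not, use the Inspect result as the schema.
  std::shared_ptr<DataSource> Finish(
      const std::shared_ptr<Schema>& schema = nullptr) const {
    auto inferred = Inspect();
    if (schema == nullptr) {
      return FinishUnsafe(inferred);
    }
    // Assumption: a schema is "valid" only if its field names match exactly.
    if (schema->field_names != inferred->field_names) {
      throw std::invalid_argument("explicit schema does not match inspected schema");
    }
    return FinishUnsafe(schema);
  }

  // Takes the schema as-is with no validation.
  std::shared_ptr<DataSource> FinishUnsafe(
      const std::shared_ptr<Schema>& schema) const {
    return std::make_shared<DataSource>(DataSource{schema});
  }

 private:
  std::shared_ptr<Schema> inferred_;
};
```

A source such as the hypothetical OdbcData would presumably need a different notion of validity; the sketch only shows where that check would sit.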

I think this is a valid concern but it's out of scope for this PR. Also, rather than separate Finish methods I think keeping the unsafe Finish(schema) overload and adding a Validate(schema) method would fit the interactive case better:

1. Inspect schema
2. tweak
3. validate
4. if failed, goto 2
5. finish
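The interactive flow above could be driven by something like the loop below. This is only a sketch under stated assumptions: `Discovery`, its `Validate` rule (every requested field must exist in the inspected schema), the `tweak` callback, and the attempt limit are all hypothetical and not part of the PR.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins, not the Arrow types.
struct Schema {
  std::vector<std::string> field_names;
};

struct DataSource {
  Schema schema;
};

class Discovery {
 public:
  explicit Discovery(Schema inferred) : inferred_(std::move(inferred)) {}

  Schema Inspect() const { return inferred_; }

  // Assumption: a schema validates if each of its fields was also inspected.
  bool Validate(const Schema& schema) const {
    const auto& known = inferred_.field_names;
    for (const auto& name : schema.field_names) {
      if (std::find(known.begin(), known.end(), name) == known.end()) {
        return false;
      }
    }
    return true;
  }

  // Unsafe: takes the schema as-is.
  DataSource Finish(Schema schema) const { return DataSource{std::move(schema)}; }

 private:
  Schema inferred_;
};

// Returns true and fills *out once a tweaked schema validates;
// returns false if max_attempts tweaks all fail validation.
bool FinishInteractively(const Discovery& discovery,
                         const std::function<Schema(const Schema&)>& tweak,
                         int max_attempts, DataSource* out) {
  Schema candidate = discovery.Inspect();        // 1. inspect schema
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    candidate = tweak(candidate);                // 2. tweak
    if (discovery.Validate(candidate)) {         // 3. validate
      *out = discovery.Finish(candidate);        // 5. finish
      return true;
    }
    // 4. validation failed: go back to step 2
  }
  return false;
}
```

The attempt limit is there only so the sketch terminates; an interactive session would stop when the user gives up.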

cpp/src/arrow/dataset/discovery.cc

python/pyarrow/_dataset.pyx

python/pyarrow/includes/libarrow_dataset.pxd

bkietz · 2019-12-19T15:47:56Z

@fsaintjacques there will always be a partition scheme; the default is DefaultPartitionScheme, which is a no-op and attaches scalar(true) to everything. I'll extract the timestamp conversion changes.
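As a rough illustration of why a no-op default lets the rest of the code assume a scheme is always present: a scheme that tags every path with a literal `true` never excludes anything. The types below (`Expression`, `PartitionScheme`, `MayContain`) are hypothetical simplifications, not the Arrow definitions.

```cpp
#include <string>

// Hypothetical, heavily simplified partition expression.
struct Expression {
  enum class Kind { kLiteralTrue, kFieldEquals };
  Kind kind;
  std::string field;  // used only by kFieldEquals
  std::string value;  // used only by kFieldEquals
};

class PartitionScheme {
 public:
  virtual ~PartitionScheme() = default;
  // Maps a path to the expression known to hold for all rows under it.
  virtual Expression Parse(const std::string& path) const = 0;
};

// No-op scheme: ignores the path and attaches a literal `true`.
class DefaultPartitionScheme : public PartitionScheme {
 public:
  Expression Parse(const std::string& /*path*/) const override {
    return Expression{Expression::Kind::kLiteralTrue, "", ""};
  }
};

// True when data tagged with `partition` could contain rows where
// `field` equals `value`. A literal `true` never rules anything out.
bool MayContain(const Expression& partition, const std::string& field,
                const std::string& value) {
  if (partition.kind == Expression::Kind::kLiteralTrue) {
    return true;
  }
  return partition.field != field || partition.value == value;
}
```

Callers can then treat "no partitioning" and "partitioned" uniformly, at the cost of the scheme being visible in the interface, which is the concern raised in the review.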

bkietz requested a review from fsaintjacques December 10, 2019 19:09

bkietz force-pushed the 7366-Dataset-Use-PartitionSche branch from 717bcad to 153befc Compare December 11, 2019 16:10

bkietz marked this pull request as ready for review December 11, 2019 17:51

bkietz force-pushed the 7366-Dataset-Use-PartitionSche branch 2 times, most recently from 4481b6f to f6418af Compare December 11, 2019 21:59

nealrichardson reviewed Dec 12, 2019


r/src/dataset.cpp (outdated)

bkietz commented Dec 12, 2019


r/R/dataset.R (outdated)

bkietz force-pushed the 7366-Dataset-Use-PartitionSche branch 4 times, most recently from 49b23e6 to c11d056 Compare December 17, 2019 16:46

nealrichardson mentioned this pull request Dec 18, 2019

ARROW-7432: [Python] Add higher level open_dataset function #6022

Closed

fsaintjacques requested changes Dec 18, 2019


bkietz force-pushed the 7366-Dataset-Use-PartitionSche branch from 91b7672 to bdc8156 Compare December 19, 2019 18:58

fsaintjacques approved these changes Dec 20, 2019


ARROW-7366: [C++] Use PartitionSchemeDiscovery in DataSourceDiscovery

6990465

bkietz force-pushed the 7366-Dataset-Use-PartitionSche branch from bdc8156 to 6990465 Compare December 20, 2019 16:03

bkietz closed this in a8e3e41 Dec 20, 2019

bkietz deleted the 7366-Dataset-Use-PartitionSche branch February 25, 2021 16:50

asfimport mentioned this pull request Apr 10, 2020

[C++][Dataset] Use PartitionSchemeDiscovery in DataSourceDiscovery #23643

Closed




3 participants