ARROW-7366: [C++][Dataset] Use PartitionSchemeDiscovery in DataSourceDiscovery #6008
Needs a rebase.
fsaintjacques
left a comment
Looks good in general:
- PartitionScheme/PartitionSchemeDiscovery should be optional, but a lot of the code assumes otherwise. Such internal details should not leak into the exposed interface. Python lost the ability to not pass a schema.
- Drop the timestamp conversion stuff from this PR and open a separate issue for it.
cpp/src/arrow/dataset/discovery.h
Outdated
Use an optional parameter and merge the previous definitions. See discovery.cc about the behavior of this function, e.g. validation of explicit schema.
alright
Actually I'm not sure I agree with this one; default parameters in a virtual method pass the default case handling down to each implementation. Is there any behavior which should be supported for "Finish without explicit schema" other than "inspect a schema then finish with that"?
Then let's remove the virtual case for now :)
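One way to read the exchange above is to keep only the schema-taking `Finish` virtual and make the no-argument form a non-virtual wrapper, so that "inspect a schema then finish with that" lives in one place. The sketch below uses toy stand-ins (`Schema`, `DataSource`, `ToyDiscovery` are hypothetical and not the Arrow types), and the real interface returns status/result types rather than bare pointers.

```cpp
#include <memory>
#include <string>
#include <vector>

// Toy stand-ins for illustration only; these are NOT the Arrow classes.
struct Schema {
  std::vector<std::string> field_names;
};

struct DataSource {
  std::shared_ptr<Schema> schema;
};

class DataSourceDiscovery {
 public:
  virtual ~DataSourceDiscovery() = default;

  // Each implementation infers a schema from whatever it discovers.
  virtual std::shared_ptr<Schema> Inspect() = 0;

  // Each implementation builds the source using exactly the given schema.
  virtual std::shared_ptr<DataSource> Finish(
      const std::shared_ptr<Schema>& schema) = 0;

  // Non-virtual convenience: inspect a schema, then finish with it.
  // Since it is not virtual, no implementation has to repeat this logic,
  // and no default argument is attached to a virtual method.
  std::shared_ptr<DataSource> Finish() { return Finish(Inspect()); }
};

// Hypothetical implementation that always "discovers" two fields.
class ToyDiscovery : public DataSourceDiscovery {
 public:
  // Overriding Finish(schema) hides the base overloads by name;
  // this using-declaration keeps the zero-argument form visible.
  using DataSourceDiscovery::Finish;

  std::shared_ptr<Schema> Inspect() override {
    return std::make_shared<Schema>(
        Schema{std::vector<std::string>{"year", "month"}});
  }

  std::shared_ptr<DataSource> Finish(
      const std::shared_ptr<Schema>& schema) override {
    return std::make_shared<DataSource>(DataSource{schema});
  }
};
```

With this shape, `ToyDiscovery{}.Finish()` and `Finish(Inspect())` are equivalent by construction, which is the only behavior the thread identifies for the schema-less case.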
I think it's an anti-pattern to expect the user to call the Validate method so it doesn't blow up in their face. Anyhow, I'll tackle this in my next PR on DatasetDiscovery, as discrepancies will likely happen there more than in DataSourceDiscovery.
cpp/src/arrow/dataset/discovery.cc
Outdated
The schema handling part of this body should be:
- If a schema is provided, validate it against the Inspect result.
- If a schema is not provided, use the Inspect result as the schema.
Think about how other DataSources, e.g. OdbcData, should validate. The options struct in discovery is class-specific, so maybe with functions:
// Validate the inferred schema against the provided one, or use the inferred schema.
Finish(const std::shared_ptr<Schema>& schema = nullptr);
// Take as-is with no validation, might blow up at runtime.
FinishUnsafe(const std::shared_ptr<Schema>& schema);
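A minimal sketch of the two entry points proposed above, using toy types rather than the Arrow ones. `ToyDiscovery`, the exception-based error reporting, and the "field names must match exactly" validation rule are all assumptions made for illustration; the review does not specify what validation should check, and Arrow itself reports errors through status objects.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins for illustration only; these are NOT the Arrow classes.
struct Schema {
  std::vector<std::string> field_names;
};

struct DataSource {
  std::shared_ptr<Schema> schema;
};

class ToyDiscovery {
 public:
  explicit ToyDiscovery(std::vector<std::string> discovered)
      : discovered_(std::move(discovered)) {}

  // Infer a schema from what was discovered.
  std::shared_ptr<Schema> Inspect() const {
    return std::make_shared<Schema>(Schema{discovered_});
  }

  // Validate the provided schema against the inferred one,
  // or fall back to the inferred schema when none is provided.
  std::shared_ptr<DataSource> Finish(
      const std::shared_ptr<Schema>& schema = nullptr) const {
    const auto inferred = Inspect();
    if (schema == nullptr) {
      return FinishUnsafe(inferred);
    }
    // Assumed validation rule: field names must match exactly.
    if (schema->field_names != inferred->field_names) {
      throw std::invalid_argument("schema does not match inspected schema");
    }
    return FinishUnsafe(schema);
  }

  // Take the schema as-is with no validation; may blow up at runtime.
  std::shared_ptr<DataSource> FinishUnsafe(
      const std::shared_ptr<Schema>& schema) const {
    return std::make_shared<DataSource>(DataSource{schema});
  }

 private:
  std::vector<std::string> discovered_;
};
```

Here the default argument sits on a non-virtual method, so the concern about defaults on virtual methods raised elsewhere in this review does not apply to the sketch.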
I think this is a valid concern, but it's out of scope for this PR. Also, rather than separate Finish methods, I think keeping the unsafe Finish(schema) overload and adding a Validate(schema) method would fit the interactive case better:
1. inspect schema
2. tweak
3. validate
4. if failed, go to 2
5. finish
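The inspect/tweak/validate/finish loop above could look roughly like the following. This is only a sketch: `ToyDiscovery`, `Tweak`, `InteractiveFinish`, and the validation rule (a schema may drop or reorder discovered fields but not introduce unknown ones) are hypothetical, and the list of tweaks stands in for a user editing the schema by hand.

```cpp
#include <algorithm>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins for illustration only; these are NOT the Arrow classes.
struct Schema {
  std::vector<std::string> field_names;
};

struct DataSource {
  std::shared_ptr<Schema> schema;
};

class ToyDiscovery {
 public:
  explicit ToyDiscovery(std::vector<std::string> discovered)
      : discovered_(std::move(discovered)) {}

  std::shared_ptr<Schema> Inspect() const {
    return std::make_shared<Schema>(Schema{discovered_});
  }

  // Assumed rule: valid if every named field was actually discovered.
  bool Validate(const Schema& schema) const {
    return std::all_of(
        schema.field_names.begin(), schema.field_names.end(),
        [this](const std::string& name) {
          return std::find(discovered_.begin(), discovered_.end(), name) !=
                 discovered_.end();
        });
  }

  // Unsafe overload: takes the schema as-is, with no validation.
  std::shared_ptr<DataSource> Finish(std::shared_ptr<Schema> schema) const {
    return std::make_shared<DataSource>(DataSource{std::move(schema)});
  }

 private:
  std::vector<std::string> discovered_;
};

// A tweak edits a copy of the inspected schema in place.
using Tweak = std::function<void(Schema&)>;

// Returns nullptr if none of the attempted tweaks validates.
std::shared_ptr<DataSource> InteractiveFinish(
    const ToyDiscovery& discovery, const std::vector<Tweak>& attempts) {
  const auto inspected = discovery.Inspect();  // 1. inspect schema
  for (const auto& tweak : attempts) {
    auto candidate = std::make_shared<Schema>(*inspected);
    tweak(*candidate);                         // 2. tweak
    if (discovery.Validate(*candidate)) {      // 3. validate
      return discovery.Finish(candidate);      // 5. finish
    }
    // 4. validation failed: go back to step 2 with the next attempt
  }
  return nullptr;
}
```

The point of the design is that `Finish(schema)` stays cheap and unchecked, while `Validate(schema)` can be called as often as needed during the tweak loop.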
|
@fsaintjacques there will always be a partition scheme; the default is |