Skip to content

Conversation

@bkietz
Copy link
Member

@bkietz bkietz commented Dec 10, 2019

No description provided.

@bkietz bkietz requested a review from fsaintjacques December 10, 2019 19:09
@github-actions
Copy link

@fsaintjacques
Copy link
Contributor

Needs a rebase.

@bkietz bkietz force-pushed the 7366-Dataset-Use-PartitionSche branch from 717bcad to 153befc Compare December 11, 2019 16:10
@bkietz bkietz marked this pull request as ready for review December 11, 2019 17:51
@bkietz bkietz force-pushed the 7366-Dataset-Use-PartitionSche branch 2 times, most recently from 4481b6f to f6418af Compare December 11, 2019 21:59
@bkietz bkietz force-pushed the 7366-Dataset-Use-PartitionSche branch 4 times, most recently from 49b23e6 to c11d056 Compare December 17, 2019 16:46
Copy link
Contributor

@fsaintjacques fsaintjacques left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general:

  • PartitionScheme/PartitionSchemeDiscovery should be optional, there's a lot of the code that assumes otherwise. Such internal details should not leak in the exposed interface. Python lost the ability to not pass a schema.
  • Drop the timestamp conversion stuff from this PR, make a seperate issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use an optional parameter and merge the previous definitions. See discovery.cc about the behavior of this function, e.g. validation of explicit schema.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I'm not sure I agree with this one; default parameters in a virtual method pass the default case handling down to each implementation. Is there any behavior which should be supported for "Finish without explicit schema" other than "inspect a schema then finish with that"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then let's remove the virtual case for now :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's an anti-pattern to expect the user to call the Validate method so it doesn't blow up in his face. Anyhow, I'll tackle this in my next PR on DatasetDiscovery as discrepency will likely happen there more than in the DataSourceDiscovery.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema handling part of this body should be:

  • If a schema is provided, validate it against Inspect result.
  • If a schema is not provided, use Inspect result as the schema.

Think about other DataSource, e.g. OdbcData, how should it validate. The struct options in discovery is class specific, so maybe with functions:

// Validate infer against provided or use infered schema.
Finish(const std::shared_ptr<Schema>& schema = nullptr);
// Take as-is with no validation, might blow up at runtime.
FinishUnsafe(const std::shared_ptr<Schema>& schema);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a valid concern but it's out of scope for this PR. Also, rather than separate Finish methods I think keeping the unsafe Finish(schema) overload and adding a Validate(schema) method would fit the interactive case better:

  1. Inspect schema
  2. tweak
  3. validate
  4. if failed, goto 2
  5. finish

@bkietz
Copy link
Member Author

bkietz commented Dec 19, 2019

@fsaintjacques there will always be a partition scheme; the default is DefaultPartitionScheme which is a noop and attaches scalar(true) to everything. I'll extract the timestamp conversion changes

@bkietz bkietz force-pushed the 7366-Dataset-Use-PartitionSche branch from 91b7672 to bdc8156 Compare December 19, 2019 18:58
@bkietz bkietz force-pushed the 7366-Dataset-Use-PartitionSche branch from bdc8156 to 6990465 Compare December 20, 2019 16:03
@bkietz bkietz closed this in a8e3e41 Dec 20, 2019
@bkietz bkietz deleted the 7366-Dataset-Use-PartitionSche branch February 25, 2021 16:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants