Skip to content

Conversation

@xabriel
Copy link
Contributor

@xabriel xabriel commented Jan 30, 2019

In this PR we continue the work discussed in (#82), (#83), extending it to:

  1. Adding iceberg.case.sensitive flag to ConfigProperties.
  2. Make all Evaluators used in Iceberg read path honor this flag.
  3. Expose this flag for Spark Reader use in com.netflix.iceberg.spark.source.Reader

I acknowledge this is a big PR, but the caseSensitive flag had to be trickled down as needed. I'm happy to break this into multiple PRs if committers think changes (0),(1),(2) should be separate, but do note the bulk of work is (1).

@rdblue
Copy link
Contributor

rdblue commented Feb 8, 2019

Thanks for working on this, @xabriel! I haven't replied yet because I'm trying to debate between approaches to this problem.

There are two ways to go:

  1. Add a case sensitivity flag to all the cases that end up using a binder, like this PR.
  2. Requires that predicates passed to these evaluators are already bound.

I'm not sure which is the right way to go. There are quiet a few paths changed by this, and most of them are duplicating work that could be done before passing an expression in. Passing a bound expression would be safe because the bound expression visitor throws an exception when it hits an unbound predicate. Passing in a bound predicate means we don't need to pass options for expression binding in so many places.

The drawback is that these classes are less friendly to callers. Instead of preparing predicates as needed, they would throw runtime exceptions when the predicates don't meet some expectation. I think tests would cover all the cases, but it would be more difficult for people to work with.

What do you think?

@xabriel
Copy link
Contributor Author

xabriel commented Feb 11, 2019

There are quiet a few paths changed by this, and most of them are duplicating work that could be done before passing an expression in.

Agreed.

Passing in a bound predicate means we don't need to pass options for expression binding in so many places.

Agreed as well.

Still, I tend to side with the approach of this PR. Here is my rationale:

What refactoring would yield more cohesion and less coupling?

In this PR's approach, the caller only needs to know how to pass thru a variable, to be used as the callee sees fit. In the alternate, the caller now is in the business of binding, even though it is just passing thru the Expression object.

I also think that, in future, there may be other parameters that we will want to pass down to enhance Evaluators. At that moment, we could upgrade caseSensitive flag to be an object, say: EvaluatorParams, that would contain caseSensitive, and whatever else we find we need. In that case, should the caller be in the business of knowing how to apply each and every EvaluatorParam just to maintain the interface intact, or should the caller just pass it thru as another variable?

So, to me, with approach on this PR, even though verbose, we have less coupling by just passing thru the one flag, and more cohesion later on when, inevitably, we find other contextual flags we will need to pass down.

( On a side note, whether we go with approach in this PR or alternate, we would still need to pass caseSensitive to the other classes: TableScan, FilteredManifest, and the Spark Reader. Not in this PR, but I presume we'd need it in the Presto Reader as well. )

Let me know if this rationale makes sense @rdblue.

@xabriel
Copy link
Contributor Author

xabriel commented Feb 26, 2019

Quick ping here @rdblue.

Please let me know if you agree with discussion above, or if we should look into your proposed alternative?

@rdblue
Copy link
Contributor

rdblue commented Feb 26, 2019

Hey, sorry. I do agree. I just need to find some time to review this. Thanks for your patience!

@xabriel
Copy link
Contributor Author

xabriel commented Mar 9, 2019

Regarding PR conflicts, do we prefer merges so that we keep PR history, or rebase ?

@rdblue
Copy link
Contributor

rdblue commented Mar 12, 2019

@xabriel, sorry I didn't see you question sooner. I prefer rebases.

@xabriel xabriel force-pushed the expose-case-sensitivity-flag branch from fb4a759 to b289f4e Compare March 18, 2019 20:54
@xabriel
Copy link
Contributor Author

xabriel commented Mar 18, 2019

@rdblue this one is ready for re-review whenever you have some time. No rush.

*
* @return the Schema to project
*/
private Schema lazyColumnProjection() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not put this implementation in schema?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix in separate PR.

return this;
}

public ScanBuilder caseInsensitive() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other places use caseSensitive(boolean). I typically like to add both variants and I think it would be a good idea if it isn't difficult.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re IcebergGenerics, I was confused on when a caller should use that instead of Table.newScan() since TableScan allows you to refine?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IcebergGenerics is a builder, not a refinement pattern. It is used to build and execute a scan. We use this for our Java client, where users read directly from tables. The builder pattern is better for users. (The refinement pattern is good for passing partially configured scans to other components.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other places use caseSensitive(boolean). I typically like to add both variants and I think it would be a good idea if it isn't difficult.

Just to make sure I follow, you suggest that we add caseSensitive(boolean) to IcebergGenerics while keeping caseInsensitive(), correct?

If so, do you suggest we also add caseInsensitive() to all other interfaces that have caseSensitive(boolean)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, both. I think it is good to have a couple of versions of methods like these that set booleans. For example, Tasks exposes throwFailureWhenFinished() as well as suppressFailureWhenFinished() and throwFailureWhenFinished(boolean) that all control a boolean variable. That allows passing a boolean through using the last case, but also makes it easy to set the option to a constant.

This is a minor update and not that important, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Linking these thread to Issue #83 so that we don't loose track. Will fix in subsequent PR.

manifest -> {
ManifestReader reader = ManifestReader.read(ops.io().newInputFile(manifest.path()));
ManifestReader reader = ManifestReader
.read(ops.io().newInputFile(manifest.path()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: only indented 2 spaces instead of the normal 4 for continuation lines.

@rdblue
Copy link
Contributor

rdblue commented Mar 18, 2019

Looking good. I could merge this now, but I'd like to add the caseInsensitive refinement methods and clean up a few of the minor things, like unnecessary variables. If you need this right away, I'd be happy to commit it and clean up those minor problems later. Thanks!

@xabriel
Copy link
Contributor Author

xabriel commented Mar 19, 2019

How about I tackle the trivial ones right away and move the ones that require more though into a separate PR? That way we won't accumulate more conflicts here.

@rdblue rdblue merged commit 5e37ff4 into apache:master Mar 19, 2019
@rdblue
Copy link
Contributor

rdblue commented Mar 19, 2019

Merged! @xabriel, thank you for working on this, it was a really big project and I'm glad you were persistent and got it done!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants