Make read-path Evaluators honor case sensitivity flag. Expose flag in Spark Reader. #89
Thanks for working on this, @xabriel! I haven't replied yet because I'm trying to debate between approaches to this problem. There are two ways to go:
I'm not sure which is the right way to go. There are quite a few paths changed by this, and most of them are duplicating work that could be done before passing an expression in. Passing a bound expression would be safe because the bound expression visitor throws an exception when it hits an unbound predicate. Passing in a bound predicate means we don't need to pass options for expression binding in so many places. The drawback is that these classes are less friendly to callers. Instead of preparing predicates as needed, they would throw runtime exceptions when the predicates don't meet some expectation. I think tests would cover all the cases, but it would be more difficult for people to work with. What do you think?
Agreed.
Agreed as well. Still, I tend to side with the approach of this PR. Here is my rationale:

What refactoring would yield more cohesion and less coupling? In this PR's approach, the caller only needs to know how to pass through a variable, to be used as the callee sees fit. In the alternative, the caller is now in the business of binding, even though it is just passing thru the

I also think that, in the future, there may be other parameters that we will want to pass down to enhance

So, to me, with the approach in this PR, even though verbose, we have less coupling by just passing through the one flag, and more cohesion later on when, inevitably, we find other contextual flags we will need to pass down.

(On a side note, whether we go with the approach in this PR or the alternative, we would still need to pass

Let me know if this rationale makes sense @rdblue.
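For anyone following the thread, here is a minimal sketch of the two options being weighed. Everything in it (`SketchSchema`, `SketchPredicate`, `FlagPassingEvaluator`, `PreBoundEvaluator`) is a hypothetical stand-in written for illustration, not the Iceberg classes touched by this PR; it only assumes that binding means resolving a column name against a schema.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a table schema: an ordered list of column names.
final class SketchSchema {
  private final List<String> columns;

  SketchSchema(String... columns) {
    this.columns = Arrays.asList(columns);
  }

  // Resolve a column name to its position, honoring the case sensitivity flag.
  int positionOf(String name, boolean caseSensitive) {
    for (int i = 0; i < columns.size(); i++) {
      String column = columns.get(i);
      if (caseSensitive ? column.equals(name) : column.equalsIgnoreCase(name)) {
        return i;
      }
    }
    throw new IllegalArgumentException("Cannot find column: " + name);
  }
}

// Hypothetical "is null" predicate; position stays -1 until it is bound.
final class SketchPredicate {
  final String name;
  final int position;

  private SketchPredicate(String name, int position) {
    this.name = name;
    this.position = position;
  }

  static SketchPredicate isNull(String name) {
    return new SketchPredicate(name, -1);
  }

  SketchPredicate bind(SketchSchema schema, boolean caseSensitive) {
    return new SketchPredicate(name, schema.positionOf(name, caseSensitive));
  }

  boolean isBound() {
    return position >= 0;
  }
}

// Option taken in this PR: the caller passes the flag through and the callee binds.
final class FlagPassingEvaluator {
  private final SketchPredicate bound;

  FlagPassingEvaluator(SketchSchema schema, SketchPredicate unbound, boolean caseSensitive) {
    this.bound = unbound.bind(schema, caseSensitive);
  }

  boolean eval(Object[] row) {
    return row[bound.position] == null;
  }
}

// Alternative: the caller binds first; an unbound predicate fails at runtime.
final class PreBoundEvaluator {
  private final SketchPredicate bound;

  PreBoundEvaluator(SketchPredicate predicate) {
    if (!predicate.isBound()) {
      throw new IllegalStateException("Found unbound predicate on: " + predicate.name);
    }
    this.bound = predicate;
  }

  boolean eval(Object[] row) {
    return row[bound.position] == null;
  }
}
```

With the first class the caller writes `new FlagPassingEvaluator(schema, SketchPredicate.isNull("data"), false)`; with the second it must call `bind(schema, false)` itself before constructing the evaluator, which is the extra coupling discussed above.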
Quick ping here @rdblue. Please let me know if you agree with discussion above, or if we should look into your proposed alternative?
Hey, sorry. I do agree. I just need to find some time to review this. Thanks for your patience!
Regarding PR conflicts, do we prefer merges so that we keep PR history, or rebases?
@xabriel, sorry I didn't see your question sooner. I prefer rebases.
fb4a759 to b289f4e
@rdblue this one is ready for re-review whenever you have some time. No rush.
   *
   * @return the Schema to project
   */
  private Schema lazyColumnProjection() {
Why not put this implementation in schema?
Will fix in separate PR.
    return this;
  }

  public ScanBuilder caseInsensitive() {
Other places use caseSensitive(boolean). I typically like to add both variants and I think it would be a good idea if it isn't difficult.
Re IcebergGenerics, I was confused on when a caller should use that instead of Table.newScan() since TableScan allows you to refine?
IcebergGenerics is a builder, not a refinement pattern. It is used to build and execute a scan. We use this for our Java client, where users read directly from tables. The builder pattern is better for users. (The refinement pattern is good for passing partially configured scans to other components.)
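To make the distinction concrete, here is a rough sketch of the two styles. Both classes (`RefinedScanSketch`, `ScanBuilderSketch`) and their string filters are hypothetical, and the sketch assumes that "refinement" means each call returns a new immutable scan; it is not the actual `TableScan` or `IcebergGenerics` code.

```java
// Refinement style: immutable; every call returns a new, further-refined scan,
// so a partially configured scan can be handed to another component safely.
final class RefinedScanSketch {
  private final String filter;
  private final boolean caseSensitive;

  RefinedScanSketch() {
    this("true", true);
  }

  private RefinedScanSketch(String filter, boolean caseSensitive) {
    this.filter = filter;
    this.caseSensitive = caseSensitive;
  }

  RefinedScanSketch filter(String newFilter) {
    return new RefinedScanSketch(newFilter, caseSensitive);
  }

  RefinedScanSketch caseSensitive(boolean value) {
    return new RefinedScanSketch(filter, value);
  }

  String describe() {
    return "filter=" + filter + ", caseSensitive=" + caseSensitive;
  }
}

// Builder style: one mutable object, configured and then built in a single chain.
final class ScanBuilderSketch {
  private String filter = "true";
  private boolean caseSensitive = true;

  ScanBuilderSketch where(String newFilter) {
    this.filter = newFilter;
    return this;
  }

  ScanBuilderSketch caseInsensitive() {
    this.caseSensitive = false;
    return this;
  }

  // Stands in for "build and execute"; here it only reports the configuration.
  String build() {
    return "filter=" + filter + ", caseSensitive=" + caseSensitive;
  }
}
```

In the refinement sketch, a component that receives `base` can call `base.caseSensitive(false)` without affecting other holders of `base`; in the builder sketch, the user configures everything in one chain and calls `build()` once.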
> Other places use caseSensitive(boolean). I typically like to add both variants and I think it would be a good idea if it isn't difficult.
Just to make sure I follow, you suggest that we add caseSensitive(boolean) to IcebergGenerics while keeping caseInsensitive(), correct?
If so, do you suggest we also add caseInsensitive() to all other interfaces that have caseSensitive(boolean)?
Yes, both. I think it is good to have a couple of versions of methods like these that set booleans. For example, Tasks exposes throwFailureWhenFinished() as well as suppressFailureWhenFinished() and throwFailureWhenFinished(boolean) that all control a boolean variable. That allows passing a boolean through using the last case, but also makes it easy to set the option to a constant.
This is a minor update and not that important, though.
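A small sketch of what "both variants" could look like for the case sensitivity option. The class name `ScanOptionsSketch`, the no-argument `caseSensitive()` variant, and the default of `true` are assumptions for illustration only.

```java
// Hypothetical options holder exposing three setters for one boolean.
final class ScanOptionsSketch {
  private boolean caseSensitive = true; // assumed default

  // Pass-through variant: lets a caller forward a boolean it already has.
  ScanOptionsSketch caseSensitive(boolean value) {
    this.caseSensitive = value;
    return this;
  }

  // Constant variants: readable when the caller knows the value up front.
  ScanOptionsSketch caseSensitive() {
    return caseSensitive(true);
  }

  ScanOptionsSketch caseInsensitive() {
    return caseSensitive(false);
  }

  boolean isCaseSensitive() {
    return caseSensitive;
  }
}
```

The constant variants simply delegate to the boolean one, so there is still a single place where the field is set.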
Got it. Linking this thread to Issue #83 so that we don't lose track. Will fix in a subsequent PR.
  manifest -> {
-   ManifestReader reader = ManifestReader.read(ops.io().newInputFile(manifest.path()));
+   ManifestReader reader = ManifestReader
+     .read(ops.io().newInputFile(manifest.path()))
Nit: only indented 2 spaces instead of the normal 4 for continuation lines.
Looking good. I could merge this now, but I'd like to add the
How about I tackle the trivial ones right away and move the ones that require more thought into a separate PR? That way we won't accumulate more conflicts here.
Merged! @xabriel, thank you for working on this, it was a really big project and I'm glad you were persistent and got it done!
In this PR we continue the work discussed in (#82), (#83), extending it to:
iceberg.case.sensitive flag to ConfigProperties.
com.netflix.iceberg.spark.source.Reader
I acknowledge this is a big PR, but the caseSensitive flag had to be trickled down as needed. I'm happy to break this into multiple PRs if committers think changes (0), (1), (2) should be separate, but do note the bulk of work is (1).