Support Metrics mode when creating/writing Iceberg tables#9938
Support Metrics mode when creating/writing Iceberg tables#9938findepi merged 1 commit intotrinodb:masterfrom
Conversation
|
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: q-li.
|
|
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@trino.io. For more information, see https://github.com/trinodb/cla. |
2 similar comments
|
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@trino.io. For more information, see https://github.com/trinodb/cla. |
|
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please submit the signed CLA to cla@trino.io. For more information, see https://github.com/trinodb/cla. |
There was a problem hiding this comment.
Do we not handle metrics mode truncate?
There was a problem hiding this comment.
@rdsr yes we do, we handle truncate in the IcebergMinMax
There was a problem hiding this comment.
I see we added support for this below
There was a problem hiding this comment.
the whole table property map is already passed in as a part of the writable table handle, why do we need to construct the metrics config as a separated field?
There was a problem hiding this comment.
@jackye1995 yes you are right. There is no need to construct the metrics config since the table properties have already been passed to the writable table handle. I checked the code and we already have functions in place to retrieve metrics related properties so we can make this even simpler. Will add a fix to this, thanks!
There was a problem hiding this comment.
error message does not match.
Btw, I think it makes more sense to pass in the MetricsConfig to the constructor instead of storageProperties, because file format is also derived from storageProperties and we are passing it in separately.
There was a problem hiding this comment.
@jackye1995 Sorry you meant passing in the MetricsConfig to the pagesink or IcebergWritableTableHandle constructor?
The file format is derived from the property map and then passed to the writable table handle as a separate component, which is how we passed inMetricsConfig initially.
There was a problem hiding this comment.
to the page sink. But I don't have a strong opinion on this, there are many duplicated information passed across the codebase as of today. We can just fix the error message first.
|
@liqinrae can you please check the Ci build results? |
findepi
left a comment
There was a problem hiding this comment.
Please squash your commits.
When doing so, try not to change the base commit:
git rebase -i $(git merge-base head master)
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergFileWriterFactory.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSinkProvider.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
MetricsMode is an interface, open for extensions. The fact that something isn't Counts doesn't mean it's Full.
Also, there is no javadoc indicating .get() returns same instance every time, so using .equals() would be more appropriate here.
this should be handled in a defensive manner:
if (metricsMode.equals(MetricsModes.None.get())) {
// nothing to do
}
else if (metricsMode.equals(MetricsModes.Counts.get()) {
.. collect counts
}
else if (metricsMode.equals( full )) {
...
}
else {
throw new UnsupportedOperationException("Unsupported metrics mode: " + metricsMode);
}
There was a problem hiding this comment.
@findepi The conditional statement here handles the case where the metricsMode is truncate(length) or full. Both of these two cases need to store min/max except that truncate need to apply truncation if needed.
All of the metrics modes other than None need to collect counts. There would be duplication of code if we are to handle each mode separately. Would adding a verify function here more feasible if we want to handle it more defensively?
There was a problem hiding this comment.
binary (varbinary) and fixed too
(we don't support creating tables with fixed, but i think we do support writes to such tables)
There was a problem hiding this comment.
@findepi I was referring to the OrcMetrics implementation here: https://github.com/apache/iceberg/blob/ddc5aff9167c7e367e2b4313b91e55451f09ef16/orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java#L281
Seems like it only added truncation for string bound in the orc format.
There was a problem hiding this comment.
because they don't produce any stats for binary for ORC, see
right?
There was a problem hiding this comment.
still, not having an if for varbinary doesn't seem future proof, does it?
There was a problem hiding this comment.
@findepi yes I agree. I'll add varbinary here.
There was a problem hiding this comment.
if MetricsMode is != Full (eg Counts, None), no min/max should be stored.
again, this should be handled in defensive manner, instead of if-ing on a Full class
There was a problem hiding this comment.
lowerBounds.get(9) is the actual, while SELECT substring(min(comment), 1, 16) FROM tpch.tiny.orders is the expected. The expected/actual are reversed here.
There was a problem hiding this comment.
@findepi I tried reversing the order for all the assertQuery in the test function but encountered failures when using SELECT substring(min(comment), 1, 16) FROM tpch.tiny.orders as the expected sql statement. Checked other test files and looks like they placed all select statement as the expected and result as the actual.
There was a problem hiding this comment.
we shouldn't use assertQuery at all, when we want to assert a value that we have at hand.
this is preexisting, so ok to followup
031c2fd to
000ad30
Compare
|
Hi @findepi @jackye1995 I addressed the comments and left a few replies on this PR. Please let me know your thoughts. Thanks! |
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergFileWriterFactory.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergOrcFileWriter.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergOrcFileWriter.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
we shouldn't use assertQuery at all, when we want to assert a value that we have at hand.
this is preexisting, so ok to followup
There was a problem hiding this comment.
assert on upperBounds.get(9) without taking substring
on the left side, use substring(max(comment), 1, 15) || '...' to construct the expected value
|
@liqinrae have you had time to update this for my comments? |
000ad30 to
3c81be5
Compare
7f52b71 to
f319189
Compare
3040381 to
ffb9007
Compare
Added support for all metrics modes in orc format as mentioned in #9791