Compute min/max for too long orc string columns#11652
Conversation
53ca66e to
b38c22e
Compare
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
I think we have this logic in Slice / SliceUtf8 or somewhere.
94f8b84 to
f87e11c
Compare
There was a problem hiding this comment.
Is this the default? i think it should be true
There was a problem hiding this comment.
hm ok I can change this but I didn't want to change current behaviour for all connectors that are using trino-orc
There was a problem hiding this comment.
The new behavior is just better
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
lib/trino-orc/src/main/java/io/trino/orc/metadata/statistics/StringStatisticsBuilder.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
! io.airlift.slice.SliceUtf8#isContinuationByte instead
There was a problem hiding this comment.
right. let's copy it and mark with a comment it's copied from there
There was a problem hiding this comment.
isn't it cleaner to have a method that checks the condition we want instead of having one which result has to be inverted/negated?
There was a problem hiding this comment.
Also this one was copied for orc-core
There was a problem hiding this comment.
i wanted isContinuationByte because it's a somewhat familar thing for our codebase (even if it's a private method in arilift)
d9d5109 to
9deaca9
Compare
There was a problem hiding this comment.
This can end up returning empty bytes array (empty slice), so why are we special-casing the maxBytes==0 case above
(eg consider an input being 4-byte utf8 sequence, and maxBytes=3; please add a test)
There was a problem hiding this comment.
So I had a check here for a situation when maxBytes is longer than provided byte array but you wrote that it is a dead code. It will obviously fell now for situations when slice is longer than maxBytes
There was a problem hiding this comment.
So I had a check here for a situation when maxBytes is longer than provided byte array
yes, but that's not the situation i'm concerned about here
There was a problem hiding this comment.
this will throw (IllegalArgumentException("Provided byte array is not a valid utf8 string")) when input is eg 4-byte utf8 sequence, and maxBytes=3 (please add a test)
let's
- change so that
findLastCharacterInRangereturns-1(or perhaps Optional.empty) in such case- return early when we get -1 here
- remove
if (maxBytes == 0)above, as it becomes obsolete
There was a problem hiding this comment.
Ok but this will look very ugly
|
CI looks related |
CI is related and I already spent 2 hours trying to fix it ;) |
9deaca9 to
746e876
Compare
Description
it is a change to internal trino library
It allows to create better statistics for ORC files that contains column of type VARCHAR. In some specific cases it may slightly improve performance of reading orc tables.
Documentation
(.) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
(.) No release notes entries required.
( ) Release notes entries required with the following suggested text: