ORC: ORC supports rolling writers. #3784
Conversation
@hililiwei, can you describe how you're estimating the size of data that is buffered in memory for ORC? I think a description to explain it to reviewers would help.
If a file is being written, its size is estimated in three steps:
1. the length of the data already written to the file (the position of the last stripe);
2. the memory estimated by the writer for data buffered in the current stripe;
3. the size of the rows still sitting in the row batch that haven't been submitted to the writer yet.
Add these three values to estimate the data size.
@hililiwei, I don't understand what #3 is. Why is this tracking the data that hasn't been submitted to the writer? It seems like all you're doing is adding a constant to the estimated size. For Parquet, we use the current file offset plus the size that is buffered in memory.
#3 mainly refers to the data in iceberg/orc/src/main/java/org/apache/iceberg/orc/OrcFileAppender.java, lines 81 to 91 at 2208b24. The data is written to the row batch first and only submitted to the writer once the batch is full.
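Concretely, here is a minimal sketch of that buffering pattern (illustrative only, not the actual OrcFileAppender code; the class name, the abstract writeRow hook, and the constructor shape are made up): rows are written into a VectorizedRowBatch and only handed to the ORC Writer when the batch fills up, so rows still sitting in the batch are not reflected in writer.estimateMemory().

```java
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

// Minimal sketch of the buffering pattern under discussion; names are illustrative.
abstract class BufferedOrcAppender<D> {
  private final Writer writer;             // underlying ORC writer
  private final VectorizedRowBatch batch;  // rows buffered here before submission
  private final int maxBatchSize;

  BufferedOrcAppender(Writer writer, TypeDescription schema, int maxBatchSize) {
    this.writer = writer;
    this.batch = schema.createRowBatch(maxBatchSize);
    this.maxBatchSize = maxBatchSize;
  }

  // Placeholder hook: convert one record into the column vectors at row index batch.size.
  protected abstract void writeRow(D datum, VectorizedRowBatch batch);

  public void add(D datum) {
    writeRow(datum, batch);
    batch.size += 1;
    // Point #3 above: these rows stay in the batch until it is full ...
    if (batch.size == maxBatchSize) {
      try {
        // ... and only now become visible to the writer's own memory estimate.
        writer.addRowBatch(batch);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      batch.reset();
    }
  }
}
```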
Any update on this? We found that using ORC can save more storage space than Parquet, so I'd like to try the ORC file format.
ping @rdblue @liubo1022126
@coolderli yes, Parquet query performance is worse than ORC when querying with Trino. @hililiwei, does this PR have any remaining unfinished work? I'd like to merge this PR into my branch.
For now, there are no major changes. However, I'm still waiting for comments from @rdblue or anyone else, so it may be revised again. 😄
There are 3 failure cases in the Travis CI report:
    return 0;
    }

    switch (primitive.getCategory()) {
I think we need to align on the approach to estimating the avg width for each data type. The basic rule is: we need to read GenericOrcWriters to see how those data types are encoded into the ORC column vectors. That is the occupied in-memory byte size without any columnar compression.
The corresponding relationship is as follows:

BOOLEAN -> LongColumnVector
BYTE -> LongColumnVector
SHORT -> LongColumnVector
INT -> LongColumnVector
LONG -> LongColumnVector
FLOAT -> DoubleColumnVector
DOUBLE -> DoubleColumnVector
DATE -> LongColumnVector
TIMESTAMP -> TimestampColumnVector
BINARY -> BytesColumnVector
STRING -> BytesColumnVector
DECIMAL -> Decimal18Writer or Decimal38Writer

The byte estimation corresponds to:

LongColumnVector -> 8 bytes
DoubleColumnVector -> 8 bytes
TimestampColumnVector -> 12 bytes
Decimal18Writer/Decimal38Writer -> (precision + 4) / 2 bytes
BytesColumnVector -> 128 bytes

How about this?
The estimated byte size of decimal is just precision + 2. As I said in another comment, each digit will occupy just one byte; in fact, BigDecimal's unscaled value is usually a BigInteger, and the BigInteger will just encode each digit into a byte.
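To make the mapping concrete, here is a rough sketch of a per-primitive width estimate following the numbers above (using precision + 2 for decimals). The class name and exact switch arms are illustrative, not the final EstimateOrcAvgWidthVisitor; the constants approximate in-memory column vector footprints, not on-disk (compressed) sizes.

```java
import org.apache.orc.TypeDescription;

// Rough sketch of the per-type width estimate discussed above.
final class OrcAvgWidthEstimate {
  private OrcAvgWidthEstimate() {
  }

  static int estimate(TypeDescription primitive) {
    switch (primitive.getCategory()) {
      case BOOLEAN:
      case BYTE:
      case SHORT:
      case INT:
      case LONG:
      case DATE:
        return 8;                             // LongColumnVector
      case FLOAT:
      case DOUBLE:
        return 8;                             // DoubleColumnVector
      case TIMESTAMP:
        return 12;                            // TimestampColumnVector (millis + nanos)
      case DECIMAL:
        return primitive.getPrecision() + 2;  // roughly one byte per digit plus overhead
      case STRING:
      case VARCHAR:
      case CHAR:
      case BINARY:
        return 128;                           // BytesColumnVector, a guessed average
      default:
        return 0;                             // complex/unknown types contribute nothing here
    }
  }
}
```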
kbendick left a comment
Thanks @hililiwei. Left some further comments.
Additionally, is it possible for these changes to be backported to earlier Spark versions in subsequent PRs to make reviewing easier? It's possible I missed some discussion on this, so let me know if so.
    this.avgRowByteSize =
        OrcSchemaVisitor.visitSchema(orcSchema, new EstimateOrcAvgWidthVisitor()).stream().reduce(Integer::sum)
            .orElse(0);
The use of orElse(0) concerns me somewhat.
Looking at its usage, it seems as though an avgRowByteSize of 0 would mean that the entirety of batch.size is unaccounted for in the estimate in the length function:

    return (long) (dataLength + (estimateMemory + (long) batch.size * avgRowByteSize) * 0.2);

Under what situations would we expect this to reasonably return 0? Is that possible / expected in some edge case, or more indicative of a bug?
Would it make sense to default to some non-zero value (even 1) so that the ongoing batch.size isn't entirely dropped?
At the very least, it seems like we should potentially log a debug message stating that 0 is being used. If users are investigating ORC files being written at sizes they find strange, having a log would be beneficial.
I initially set it to 1, but as long as the schema has a field, it won't be 0. Setting it to 1 might mask some exceptions. When the value is 0, we can raise a WARN in the log.
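For context, a self-contained sketch of how the length estimate in the diff above and the suggested WARN could fit together. The method shape and logger wiring are illustrative; only the formula and the 0.2 factor come from the code under review.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch only: the length estimate from the diff above plus the
// WARN suggested when the computed average row width is 0.
final class OrcLengthEstimate {
  private static final Logger LOG = LoggerFactory.getLogger(OrcLengthEstimate.class);

  private OrcLengthEstimate() {
  }

  static long estimate(long dataLength, long estimateMemory, int batchSize, int avgRowByteSize) {
    if (avgRowByteSize == 0) {
      // Warn instead of silently dropping the buffered batch rows from the estimate.
      LOG.warn("Average row width is 0; rows buffered in the batch will not be counted.");
    }
    // dataLength: bytes already flushed to the file; estimateMemory: writer's buffered
    // stripe data; batchSize * avgRowByteSize: rows not yet submitted to the writer.
    return (long) (dataLength + (estimateMemory + (long) batchSize * avgRowByteSize) * 0.2);
  }
}
```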
Reverted the old-version changes for Flink and Spark.
    @Override
    protected FileWriter<T, DataWriteResult> newWriter(PartitionSpec spec, StructLike partition) {
      // TODO: support ORC rolling writers
      if (fileFormat == FileFormat.ORC) {
Now that ORC supports rolling writers, fileFormat is no longer used anywhere.
Should we deprecate/remove it?
Since these methods involve multiple Spark/Flink versions, I suggest a separate PR to clean them up after this is done.
openinx left a comment
Looks good to me now.
Got this merged now. Thanks all for reviewing, and thanks @hililiwei for the contribution!

Thanks openinx and all for reviewing. 😃

In OrcFileAppender, I found writer.estimateMemory() is 0 and writer.getStripes() is empty. Why?
Closes #3169.
The length() method of the OrcFileAppender class is modified. If the file is closed, the value of file.toInputFile().getLength() is returned; if not, the estimated memory usage in the treeWriter plus the position of the last stripe is used. Reflection is used to get the treeWriter (it's private).
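For reference, a hedged sketch of the reflection workaround described above. The field name "treeWriter" and the TreeWriter interface are internals of ORC's WriterImpl and are assumed here, not public API; the helper class is made up for illustration.

```java
import java.lang.reflect.Field;

import org.apache.orc.Writer;
import org.apache.orc.impl.writer.TreeWriter;

// Sketch of reading the private treeWriter field from the ORC writer via
// reflection, as described in the summary above. Internal names are assumptions.
final class TreeWriterAccess {
  private TreeWriterAccess() {
  }

  static long estimateBufferedMemory(Writer writer) {
    try {
      Field treeWriterField = writer.getClass().getDeclaredField("treeWriter");
      treeWriterField.setAccessible(true);
      TreeWriter treeWriter = (TreeWriter) treeWriterField.get(writer);
      return treeWriter.estimateMemory();  // in-memory size of data buffered for the current stripe
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException("Failed to access ORC treeWriter via reflection", e);
    }
  }
}
```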