-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Add Avro row position reader #1222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
core/src/main/java/org/apache/iceberg/data/avro/DataReader.java
Outdated
Show resolved
Hide resolved
shardulm94
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably an unhandled edge case, but otherwise +1
| constantList.add(idToConstant.get(field.fieldId())); | ||
| } else if (field.fieldId() == MetadataColumns.ROW_POSITION.fieldId()) { | ||
| // replace the _pos field reader with a position reader | ||
| // only if the position reader is set and this is a top-level field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this comment line
only if the position reader is set
reader is set where?
and this is a top-level field
I don't see where the top-level field restriction comes from
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is out of date. I had to move the check because the position reader is set after this constructor. I'll update the comment.
| long totalRows = 0; | ||
| long nextSyncPos = in.getPos(); | ||
|
|
||
| while (nextSyncPos < start) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this valid start state, but I think there can be an edge case here
row-count|compressed-size-in-bytes|block-bytes|sync|EOF
^
|
start
In this case it will seek and read the sync and then continue to read next block row count even after EOF
So should probably be while (nextSyncPos + 16 < start)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The loop condition is correct with respect to start because a block starts at the beginning of a sync. You're right about the EOF behavior, though. I'll take a look at that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the insightful comment! (before it went away 😃 ) I was not familiar Avro's split reading behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I should have edited, not deleted. I missed the part about it being just before the EOF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix is to catch EOFException and return the number of rows.
rdsr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 . LG apart from the one EOF issue
d107f77 to
df1cae9
Compare
| } | ||
|
|
||
| @Test | ||
| public void testPosWithEOFSplit() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shardulm94, here's a test for the EOF case.
|
+1 LGTM |
This adds an Avro
ValueReaderthat returns the position of a row within a file.The position reader's initial position is set from a callback that returns the starting row position of the split that is being read. The callback is passed to classes that implement a new interface,
SupportsRowPosition. This uses a callback so that if there is no position reader, the starting row position does not need to be calculated, which is expensive.Finding the row position at the start of a split requires scanning through an Avro file stream.
AvroIOnow includes a utility method that keeps track of the number of rows in each Avro block and seeks past the block content until the next block is after the given split starting point. This validates Avro sync bytes to ensure the count is accurate.Fixes #1019.