Skip to content

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Jul 20, 2020

This adds an Avro ValueReader that returns the position of a row within a file.

The position reader's initial position is set from a callback that returns the starting row position of the split that is being read. The callback is passed to classes that implement a new interface, SupportsRowPosition. This uses a callback so that if there is no position reader, the starting row position does not need to be calculated, which is expensive.

Finding the row position at the start of a split requires scanning through an Avro file stream. AvroIO now includes a utility method that keeps track of the number of rows in each Avro block and seeks past the block content until the next block is after the given split starting point. This validates Avro sync bytes to ensure the count is accurate.

Fixes #1019.

@rdblue rdblue added this to the Row-level Delete milestone Jul 20, 2020
Copy link
Contributor

@shardulm94 shardulm94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably an unhandled edge case, but otherwise +1

constantList.add(idToConstant.get(field.fieldId()));
} else if (field.fieldId() == MetadataColumns.ROW_POSITION.fieldId()) {
// replace the _pos field reader with a position reader
// only if the position reader is set and this is a top-level field
Copy link
Contributor

@shardulm94 shardulm94 Jul 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment line

only if the position reader is set

reader is set where?

and this is a top-level field

I don't see where the top-level field restriction comes from

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is out of date. I had to move the check because the position reader is set after this constructor. I'll update the comment.

long totalRows = 0;
long nextSyncPos = in.getPos();

while (nextSyncPos < start) {
Copy link
Contributor

@shardulm94 shardulm94 Jul 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this valid start state, but I think there can be an edge case here

row-count|compressed-size-in-bytes|block-bytes|sync|EOF
                                                 ^
                                                 |
                                               start

In this case it will seek and read the sync and then continue to read next block row count even after EOF

So should probably be while (nextSyncPos + 16 < start)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop condition is correct with respect to start because a block starts at the beginning of a sync. You're right about the EOF behavior, though. I'll take a look at that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the insightful comment! (before it went away 😃 ) I was not familiar Avro's split reading behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I should have edited, not deleted. I missed the part about it being just before the EOF.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix is to catch EOFException and return the number of rows.

Copy link
Contributor

@rdsr rdsr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 . LG apart from the one EOF issue

@rdblue rdblue force-pushed the add-avro-pos-reader branch from d107f77 to df1cae9 Compare July 23, 2020 20:36
}

@Test
public void testPosWithEOFSplit() throws IOException {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shardulm94, here's a test for the EOF case.

@shardulm94
Copy link
Contributor

+1 LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add _file and _pos metadata columns to Avro readers

3 participants