Skip to content

Conversation

@singhpk234
Copy link
Contributor

@singhpk234 singhpk234 commented Apr 6, 2022

  • At present we iterate the iterator of addedFiles to find the no of added files in the snapshot, this info is already present in snapshot summary. We should use this rather than iterating the addedFiles iterator to determine files added.

Came across this while working on


cc @rdblue, @RussellSpitzer, @jackye1995, @kbendick

@github-actions github-actions bot added the spark label Apr 6, 2022
Copy link
Collaborator

@SreeramGarlapati SreeramGarlapati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@singhpk234 - we explicitly opted out of reading the SnapshotSummary - based on discussion with @aokolnychyi, @RussellSpitzer & @rdblue.

rationale: SnapshotSummary is a free form dictionary & ADDED_FILES_PROP as a key in this dictionary - is NOT added to Iceberg Spec & is not NOT populated by all engines yet.

Pl. read thru this comment thread for full context.
#2660 (comment)

@singhpk234
Copy link
Contributor Author

singhpk234 commented Apr 7, 2022

Thanks @SreeramGarlapati !!! for sharing the #2660 (comment) thread, it really helps. Glad to know it was already considered while the implementation.

It looks like we were on the fence to establish the reliability of this metric and hence decided to re-visit it in future, refering this comment #2660 (comment)

SnapshotSummary is a free form dictionary & ADDED_FILES_PROP as a key in this dictionary - is NOT added to Iceberg Spec

Agree with you on this.

is not NOT populated by all engines yet

As per my understanding engines don't own the responsibility of calculating / updating the snapshot summary stats, IIUC it's done entirely by core library. The flow for ex : commit() [from engine] -> apply() in SnapshotProducer (core) -> Summary calculation (core).
Am I missing something here ?

@rdblue
Copy link
Contributor

rdblue commented Apr 10, 2022

@singhpk234, this is going to get the size of latestSnapshot.addedFiles() either way because it needs to calculate that value to pass it into the property check.

@SreeramGarlapati makes a good point, but if the added files property is present, then it seems to me like we could use it and fall back to reading the snapshot.

@singhpk234 singhpk234 force-pushed the fix/use-snapshot-summary-ss branch from 12f298c to 8ff5320 Compare April 26, 2022 04:05
@singhpk234 singhpk234 requested a review from jackye1995 May 1, 2022 15:59
long addedFilesCount = PropertyUtil.propertyAsLong(latestSnapshot.summary(), SnapshotSummary.ADDED_FILES_PROP, -1);
// If snapshotSummary doesn't have SnapshotSummary.ADDED_FILES_PROP, iterate through addedFiles iterator to find
// addedFilesCount.
addedFilesCount = addedFilesCount == -1 ? Iterables.size(latestSnapshot.addedFiles()) : addedFilesCount;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I see, good point we cannot just put Iterables.size(latestSnapshot.addedFiles()) as the input to the last method otherwise it will be evaluated.

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me!

@jackye1995
Copy link
Contributor

@SreeramGarlapati @rdblue any additional comments? If not I think it is good to be merged.

@singhpk234 singhpk234 force-pushed the fix/use-snapshot-summary-ss branch from 8ff5320 to 73211a1 Compare May 27, 2022 17:27
@jackye1995
Copy link
Contributor

I think this is good to go, and there is no further comment for quite a while now. I will merge it, thanks for the reviews! @rdblue @SreeramGarlapati

@jackye1995 jackye1995 merged commit ec1beb9 into apache:master Jun 7, 2022
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants