Collect row stats while writing manifests #738
Force-pushed from 0e9bd63 to 9e18ddc.
import org.junit.Assert;
import org.junit.Test;

public class TestManifestWriter extends TableTestBase {
Technically, this doesn't have to extend TableTestBase but we need writeManifest. I think we can refactor a bit and introduce a separate parent test class with the logic for writing/checking manifests/snapshots.
+1 to extract out base classes for just manifest and snapshots.
Either way is fine with me. I usually prefer to keep things simple.
I wanted to refactor, but after a closer look it seems the changes won't be worth the effort. For example, we would need to pass a partition spec and a FileIO to write manifests, both of which we can simply take from the table in TableTestBase. Let's keep it as is for now.
    ))),
    optional(512, "added_rows_count", Types.LongType.get()),
    optional(513, "existing_rows_count", Types.LongType.get()),
    optional(514, "deleted_rows_count", Types.LongType.get()));
Can we add these up by the data files counts?
@rdblue, I'm not sure I follow. Do you mean whether we can avoid storing these and add them up instead?
No, I meant that we don't need to add these at the end of the schema. We can add them up by the file count columns, like you do with all of the accessor methods.
Yes, let me update this.
Done.
Hm, reordering actually led to failures while writing manifest lists, as GenericAvroWriter doesn't take field ids into account.
@rdblue, do we want to build something like ProjectionDatumReader for the write side?
@rdblue, I reverted this change. I propose to merge this PR as is and create a follow-up issue to implement writers that take into account field ids.
Writers don't use field IDs?
@rdblue, yes, GenericAvroWriter doesn't. I think SparkAvroWriter does respect field ids.
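To make the issue in this thread concrete, here is a minimal sketch of the difference between a writer that matches values to fields by position and one that matches them by field id. It is not Iceberg or Avro code: FieldIdWriteSketch, Field, writeByPosition and writeById are hypothetical names made up for illustration, and only the field ids and names (512 to 514) come from the diff above. In this simplified case the positional writer silently mislabels values; with fields of different types, the same mismatch would presumably surface as a write failure.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: these are hypothetical types, not Iceberg or Avro classes.
class FieldIdWriteSketch {

  // A schema field identified by a numeric id and a name.
  static final class Field {
    final int id;
    final String name;

    Field(int id, String name) {
      this.id = id;
      this.name = name;
    }
  }

  // Positional write: the i-th record value is written as the i-th field of the
  // write schema, regardless of field ids.
  static Map<String, Object> writeByPosition(List<Field> writeSchema, List<Object> values) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (int i = 0; i < writeSchema.size(); i++) {
      out.put(writeSchema.get(i).name, values.get(i));
    }
    return out;
  }

  // Id-aware write: each write-schema field is resolved to its position in the
  // record's own schema by field id, so reordering the write schema is safe.
  static Map<String, Object> writeById(
      List<Field> writeSchema, List<Field> recordSchema, List<Object> values) {
    Map<Integer, Integer> positionById = new HashMap<>();
    for (int i = 0; i < recordSchema.size(); i++) {
      positionById.put(recordSchema.get(i).id, i);
    }
    Map<String, Object> out = new LinkedHashMap<>();
    for (Field field : writeSchema) {
      Integer pos = positionById.get(field.id);
      // A field missing from the record is written as null (assumes it is optional).
      out.put(field.name, pos == null ? null : values.get(pos));
    }
    return out;
  }
}
```

If the record holds its values in the order 512, 513, 514 and the write schema lists 514 first, writeByPosition stores the added-rows value under deleted_rows_count, while writeById keeps each value with its own field.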
Looks good to me.
Force-pushed from 9e18ddc to 4185fe0.
This reverts commit 873ae91
Thanks @aokolnychyi! I'll merge this.
This PR extends the information stored in the manifest list with row stats to avoid touching manifests in #675. This change is backward and forward-compatible.
This addresses #733.
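As a rough illustration of why row stats in the manifest list help, the sketch below sums added row counts directly from manifest list entries without opening any manifest. This is not the Iceberg API: ManifestRowStatsSketch, ManifestEntry and totalAddedRows are hypothetical names. It assumes the count may be null for entries written before this change (the new fields are declared optional in the diff), and returning null to signal that the caller must fall back to reading manifests is also an assumption.

```java
import java.util.List;

// Illustrative sketch only: hypothetical types, not Iceberg classes.
class ManifestRowStatsSketch {

  // One entry of a manifest list, reduced to the fields this example needs.
  static final class ManifestEntry {
    final String path;
    final Long addedRowsCount; // null when the entry predates the row stats fields

    ManifestEntry(String path, Long addedRowsCount) {
      this.path = path;
      this.addedRowsCount = addedRowsCount;
    }
  }

  // Sums added rows across all entries using only manifest list metadata.
  // Returns null if any entry lacks the stat, so the caller knows it has to
  // fall back to reading the manifests themselves.
  static Long totalAddedRows(List<ManifestEntry> entries) {
    long total = 0L;
    for (ManifestEntry entry : entries) {
      if (entry.addedRowsCount == null) {
        return null;
      }
      total += entry.addedRowsCount;
    }
    return total;
  }
}
```

Keeping the field nullable is what lets older manifest lists remain readable: a reader that finds no value simply takes the slower path.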