-
Notifications
You must be signed in to change notification settings - Fork 3k
Add v2 manifest lists #907
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| public TemporaryFolder temp = new TemporaryFolder(); | ||
|
|
||
| @Test | ||
| public void testManifestsWithoutRowStats() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved into TestManifestFileVersions.
|
+1 |
|
Instead of making spec changes in each v2 commit I'm using a v2-spec PR, #912. That way we don't trigger CI runs in PRs that are waiting to be merged for updates to the spec. |
456f2ca to
220c6a5
Compare
* Update GenericManifestFile to v2 schema * Update v1 manifest list writer to use the v1 schema * Add a v2 manifest list writer * Add tests for v1 and v2 manifest list formats
00cb37d to
f564456
Compare
| optional(512, "added_rows_count", Types.LongType.get()), | ||
| optional(513, "existing_rows_count", Types.LongType.get()), | ||
| optional(514, "deleted_rows_count", Types.LongType.get())); | ||
| PATH, LENGTH, SPEC_ID, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember we had issues with reordering fields in ManifestFile as GenericAvroWriter was using ordinal positions instead of field ids. Did we solve that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this is handled in V1Metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the IndexedRecord field order needs to match the schema order. This is why I added a test for this as well.
| this.fromProjectionPos = null; | ||
| } | ||
|
|
||
| public GenericManifestFile(String path, long length, int specId, Long snapshotId, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though it was public, I don't think anyone is using it. Seems OK to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this should be fine. Classes in core are only semi-public and not part of the API. That module has stronger guarantees.
| private Integer existingFilesCount = null; | ||
| private Long existingRowsCount = null; | ||
| private Integer deletedFilesCount = null; | ||
| private Long addedRowsCount = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we are trying to match the new ordering of fields in ManifestFile. Earlier, we co-located ...FilesCount with ...RowsCount to match the ordering of methods in ManifestFile and args in constructors. Is this change intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I think it's likely that these are going to be null in some cases, like manifests that contain equality delete files. Instead of mixing null, non-null, null, non-null, etc. I think it's better to keep the probably-null columns colocated for compression.
| static class V1Writer extends ManifestListWriter { | ||
| private V1Writer(OutputFile snapshotFile, long snapshotId, Long parentSnapshotId) { | ||
| super(snapshotFile, snapshotId, parentSnapshotId); | ||
| private final V1Metadata.IndexedManifestFile wrapper = new V1Metadata.IndexedManifestFile(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this is the place where we can pass a snapshot id similar to how we pass a sequence number to V2 and get rid of the logic for inheriting metadata for ManifestEntry via setSnapshotId and iterating through manifests during commit.
Do I get it correctly, @rdblue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I'm going to rework inheritance for snapshot ID in a separate commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!
aokolnychyi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
|
I've merged this. Thanks, @rdblue! |
This extends #903 with a v2 manifest list. The v2 schema is not final and may change. This is mainly to implement a separate write path and a framework for compatibility tests.
Specific changes: