Update v2 manifests to store only DataFile (WIP) #963
| void checkEntry(ManifestEntry entry, Long expectedSequenceNumber) { | ||
| Assert.assertEquals("Status", ManifestEntry.Status.ADDED, entry.status()); | ||
| void checkEntry(ManifestEntry entry, ManifestEntry.Status status, Long expectedSequenceNumber) { |
This changes how checkEntry and checkDataFile are used. checkEntry now validates the v1 view of the data by inspecting the ManifestEntry and the DataFile it wraps. Similarly, checkDataFile validates the v2 view, where all fields are part of DataFile.
All of the test cases now use both.
f29cd94 to c72b334
Let me go through this now.
aokolnychyi left a comment
Looks good to me. A couple of questions.
| pos = fromProjectionPos[i]; | ||
| } | ||
|
|
||
| if (!(skipEntryFields && pos < 3)) { |
nit: I feel !skipEntryFields || pos >= 3 would be easier to understand
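The two forms of the condition are equivalent by De Morgan's law. A minimal sketch that checks this exhaustively over a small range; the class and method names here are hypothetical and not part of the PR:

```java
public class SkipConditionCheck {
  // Form currently in the diff.
  static boolean original(boolean skipEntryFields, int pos) {
    return !(skipEntryFields && pos < 3);
  }

  // Form suggested in the review comment.
  static boolean suggested(boolean skipEntryFields, int pos) {
    return !skipEntryFields || pos >= 3;
  }

  public static void main(String[] args) {
    // Compare both forms for each flag value and a few positions around the boundary.
    for (boolean skip : new boolean[] {true, false}) {
      for (int pos = 0; pos < 6; pos++) {
        if (original(skip, pos) != suggested(skip, pos)) {
          throw new AssertionError("mismatch at skip=" + skip + ", pos=" + pos);
        }
      }
    }
    System.out.println("both forms agree");
  }
}
```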
| @Override | ||
| public int size() { | ||
| return 13; | ||
| return 15; |
We added 3 more fields but removed block size? The Indexed wrappers will still handle the block correctly?
Yes, this is what the v1 indexed wrapper is for. It can translate to the v1 format.
| .rename("r102", PartitionData.class.getName()) | ||
| .rename("data_file", GenericDataFile.class.getName()) | ||
| .rename("r2", GenericDataFile.class.getName()) | ||
| .classLoader(GenericManifestFile.class.getClassLoader()) |
just out of curiosity: do we use the class loader from GenericManifestFile on purpose?
| .reuseContainers() | ||
| .build(); | ||
| } else { | ||
| AvroIterable<GenericDataFile> files = Avro.read(file) |
External systems always write v2 manifests right now. This block ensures they will be read correctly even if the table metadata is still v1, correct? If somebody has already consumed the recent changes, could we have v2 manifests in different formats?
Yes. This ensures that the correct schema is used to read a v2 manifest.
It is very likely that we will track this in the manifest list as well, but this is a good work-around to keep the changes separate. And it will help us detect if other metadata is ever wrong.
| return CloseableIterable.transform( | ||
| ManifestFiles.read(manifest, io).project(fileSchema).allEntries(), | ||
| file -> (GenericManifestEntry) file); | ||
| GenericDataFile.AsManifestEntry.class::cast); |
This keeps the old schema because we are using ManifestEntry.getSchema(partitionType) in AsManifestEntry, right?
Yes. This table continues to use ManifestEntry as a view of the data.
@rdblue, seems like this breaks our metadata tests. Could you take a look?
This updates DataFile to contain metadata from ManifestEntry because the separation no longer makes sense. V1 metadata files still use ManifestEntry, but v2 will not.
This reverts commit 76bded6.
I'm closing this because we plan to keep manifest_entry and data_file as separate structs.
This includes commits from #952. Once that PR is merged, I'll remove them from this one.
This merges the fields from ManifestEntry into DataFile for v2 manifests. Now, v2 manifests store DataFile that has status, snapshot id, and sequence number. This should make v2 metadata easier to work with.
Other notable changes:
get(int pos, Class<?> javaType) was returning values based on the position in an Avro projection instead of the position in the schema that expressions bind to (see getInternal changes)
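
To illustrate the distinction between the two kinds of position, here is a simplified sketch. It is not the actual GenericDataFile code: the class ProjectedStruct, its methods, and the example column layout are all hypothetical. Only the idea of a fromProjectionPos mapping comes from the diff. Values are stored by schema position. Avro-facing accessors translate a projection index through fromProjectionPos, while the expression-facing get reads the schema position directly.

```java
public class ProjectedStruct {
  private final Object[] values;         // indexed by position in the full schema
  private final int[] fromProjectionPos; // projection index -> schema position

  public ProjectedStruct(int schemaSize, int[] fromProjectionPos) {
    this.values = new Object[schemaSize];
    this.fromProjectionPos = fromProjectionPos.clone();
  }

  // Avro-facing setter: i is an index into the projected (read) schema.
  public void put(int i, Object value) {
    values[fromProjectionPos[i]] = value;
  }

  // Avro-facing getter: i is an index into the projected (read) schema.
  public Object getProjected(int i) {
    return values[fromProjectionPos[i]];
  }

  // Expression-facing getter: pos is a position in the full schema,
  // so no projection translation is applied here.
  public <T> T get(int pos, Class<T> javaType) {
    return javaType.cast(values[pos]);
  }

  public static void main(String[] args) {
    // Assumed full schema: 0=status, 1=snapshot_id, 2=file_path, 3=record_count.
    // Assumed projection: [file_path, record_count].
    ProjectedStruct struct = new ProjectedStruct(4, new int[] {2, 3});
    struct.put(0, "a.parquet");
    struct.put(1, 10L);

    // An expression bound to file_path uses schema position 2, not projection index 0.
    System.out.println(struct.get(2, String.class));
    System.out.println(struct.get(3, Long.class));
  }
}
```

In this sketch, treating the argument of get as a projection index would hand an expression bound to position 0 the file path instead of the (unprojected) status column, which is the kind of mismatch the change above describes.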