-
Notifications
You must be signed in to change notification settings - Fork 3k
Add DeleteFile and manifest reader and writer for deletes #1064
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1152bc1 to
3dd6521
Compare
| private int[] fromProjectionPos; | ||
| private Types.StructType partitionType; | ||
|
|
||
| private FileContent content = FileContent.DATA; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the comment says that the BaseFile is the base class for DataFile and DeleteFile, is it suitable to make the FileContent use FileContent.DATA by default ? Just curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also curious whether we can rely on content() defined in ContentFile as both DeleteFile and DataFile override that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The FileContent for DeleteFile could be either POSITION_DELETES or EQUALITY_DELETES so we need to store it. DataFile always overrides it though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually need a field as DeleteFiles can be either positional or equality.
| protected enum FileType { | ||
| DATA_FILES(GenericDataFile.class.getName()), | ||
| DELETE_FILES("..."); | ||
| DELETE_FILES(GenericDeleteFile.class.getName()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice ..
| * @return a {@link ManifestReader} | ||
| */ | ||
| public static ManifestReader read(ManifestFile manifest, FileIO io, Map<Integer, PartitionSpec> specsById) { | ||
| Preconditions.checkArgument(manifest.content() == ManifestContent.DATA, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my understanding, the DATA manifest & DELETE manifest could share the same read / write path so I think we could use the common reader+writer. Is there any other reason that we need to make them separate paths ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They have the same schema to keep the format simpler, but in the APIs we want to keep them separate by using DataFile and DeleteFile types. The readers and writers mostly share the same code, but are separate so that they can use the correct file type interface.
|
Let me also go through this now. |
|
Will |
Probably not, but I don't think we need to update it in this PR. |
aokolnychyi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
This adds a new interface, DeleteFile, and implementations of ManfiestReader and ManifestWriter for deletes.
DeleteFile and DataFile now inherit from a common interface, ContentFile, with all of the metadata. The purpose of separate interfaces is to keep data and delete files separate in the APIs. DataFile can't be written to a delete manifest, for example. This also uses a common implementation, BaseFile for both GenericDataFile and GenericDeleteFile.
Because ManifestEntry stores a DataFile or a DeleteFile, this adds a type parameter to it. Adding this type parameter is why so many files changed, but most of the changes are in the last commit and are just parameter additions.
Some classes that may be used to return DeleteFiles (or ManifestEntry) currently return only DataFile, like ManifestGroup. This is currently safe because there is no code to read a DeleteManifest in those classes. We can update the implementations as we add support for delete manifests.