@rdblue
Contributor

@rdblue rdblue commented May 25, 2020

This adds a new interface, DeleteFile, and implementations of ManifestReader and ManifestWriter for deletes.

DeleteFile and DataFile now inherit from a common interface, ContentFile, which carries all of the metadata. The purpose of separate interfaces is to keep data and delete files separate in the APIs. A DataFile can't be written to a delete manifest, for example. This also uses a common implementation, BaseFile, for both GenericDataFile and GenericDeleteFile.
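
For readers following along, here is a minimal sketch of how such a hierarchy could look. It is a simplified illustration, not the code in this PR: the type names come from the description above, while the `path()` accessor, the constructors, and the `DeleteManifestWriter` class are assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Demo entry point (hypothetical). The compiler keeps data and delete files apart.
class ContentFileDemo {
  public static void main(String[] args) {
    DeleteManifestWriter writer = new DeleteManifestWriter();
    writer.add(new GenericDeleteFile("deletes-1.avro", FileContent.EQUALITY_DELETES));
    // writer.add(new GenericDataFile("data-1.avro")); // would not compile: not a DeleteFile
    for (DeleteFile file : writer.files()) {
      System.out.println(file.path() + " -> " + file.content());
    }
  }
}

enum FileContent { DATA, POSITION_DELETES, EQUALITY_DELETES }

// Common metadata interface; path() is an assumed accessor for illustration.
interface ContentFile {
  FileContent content();
  String path();
}

// Separate interfaces so APIs can accept only one kind of file.
interface DataFile extends ContentFile { }
interface DeleteFile extends ContentFile { }

// Shared implementation of the metadata accessors.
abstract class BaseFile implements ContentFile {
  private final String path;
  private final FileContent content;

  BaseFile(String path, FileContent content) {
    this.path = path;
    this.content = content;
  }

  @Override public String path() { return path; }
  @Override public FileContent content() { return content; }
}

final class GenericDataFile extends BaseFile implements DataFile {
  GenericDataFile(String path) { super(path, FileContent.DATA); }
}

final class GenericDeleteFile extends BaseFile implements DeleteFile {
  GenericDeleteFile(String path, FileContent content) { super(path, content); }
}

// Hypothetical writer that accepts only DeleteFile instances.
final class DeleteManifestWriter {
  private final List<DeleteFile> files = new ArrayList<>();

  void add(DeleteFile file) { files.add(file); }
  List<DeleteFile> files() { return files; }
}
```

The point of the sketch is that passing a data file to the delete writer is a compile-time error rather than a runtime check.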

Because ManifestEntry stores a DataFile or a DeleteFile, this adds a type parameter to it. Adding this type parameter is why so many files changed, but most of the changes are in the last commit and are just parameter additions.
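
As a rough illustration of why the type parameter touches so many call sites, consider the sketch below. It assumes a bound of the form `F extends ContentFile`; the `snapshotId` field and the `SimpleDataFile` / `SimpleDeleteFile` stand-ins are hypothetical, and the real declaration in this PR may differ.

```java
// Demo entry point (hypothetical). Each entry remembers which kind of file it holds.
class ManifestEntryDemo {
  public static void main(String[] args) {
    ManifestEntry<DataFile> dataEntry =
        new ManifestEntry<>(1L, new SimpleDataFile("data-1.parquet"));
    ManifestEntry<DeleteFile> deleteEntry =
        new ManifestEntry<>(2L, new SimpleDeleteFile("deletes-1.parquet"));

    // No casts are needed: file() returns the parameterized type.
    DataFile data = dataEntry.file();
    DeleteFile delete = deleteEntry.file();
    System.out.println(data.path() + ", " + delete.path());
  }
}

interface ContentFile { String path(); }
interface DataFile extends ContentFile { }
interface DeleteFile extends ContentFile { }

// Every declaration that used the raw entry type now has to name F.
final class ManifestEntry<F extends ContentFile> {
  private final long snapshotId; // assumed field, for illustration only
  private final F file;

  ManifestEntry(long snapshotId, F file) {
    this.snapshotId = snapshotId;
    this.file = file;
  }

  long snapshotId() { return snapshotId; }
  F file() { return file; }
}

// Minimal stand-in implementations so the sketch runs on its own.
final class SimpleDataFile implements DataFile {
  private final String path;
  SimpleDataFile(String path) { this.path = path; }
  @Override public String path() { return path; }
}

final class SimpleDeleteFile implements DeleteFile {
  private final String path;
  SimpleDeleteFile(String path) { this.path = path; }
  @Override public String path() { return path; }
}
```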

Some classes that may be used to return DeleteFiles (or ManifestEntry) currently return only DataFile, like ManifestGroup. This is currently safe because there is no code to read a DeleteManifest in those classes. We can update the implementations as we add support for delete manifests.

@rdblue rdblue force-pushed the v2-delete-files branch 2 times, most recently from 1152bc1 to 3dd6521 Compare May 25, 2020 23:06
@rdblue rdblue requested a review from aokolnychyi May 25, 2020 23:19
@rdblue rdblue added this to the Row-level Delete milestone May 28, 2020
@rdblue rdblue requested a review from danielcweeks May 28, 2020 22:58
private int[] fromProjectionPos;
private Types.StructType partitionType;

private FileContent content = FileContent.DATA;
Member

I see the comment says that BaseFile is the base class for DataFile and DeleteFile. Is it suitable for the FileContent to default to FileContent.DATA? Just curious.

Contributor

I am also curious whether we can rely on content() defined in ContentFile as both DeleteFile and DataFile override that.

Contributor Author

The FileContent for a DeleteFile could be either POSITION_DELETES or EQUALITY_DELETES, so we need to store it. DataFile always overrides it, though.

Contributor

We actually need a field as DeleteFiles can be either positional or equality.
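
To make the thread above concrete, here is a small sketch of the pattern being discussed: a stored `content` field that defaults to DATA, a data file that always reports DATA, and a delete file that carries whichever delete type it was created with. The constructors and the argument check in `GenericDeleteFile` are assumptions for illustration, not the PR's code.

```java
// Demo entry point (hypothetical).
class BaseFileContentDemo {
  public static void main(String[] args) {
    System.out.println(new GenericDataFile().content());                                // DATA
    System.out.println(new GenericDeleteFile(FileContent.EQUALITY_DELETES).content()); // EQUALITY_DELETES
  }
}

enum FileContent { DATA, POSITION_DELETES, EQUALITY_DELETES }

abstract class BaseFile {
  // Stored because a delete file can be either kind of delete; defaults to DATA.
  private FileContent content = FileContent.DATA;

  BaseFile() { }

  BaseFile(FileContent content) { this.content = content; }

  public FileContent content() { return content; }
}

final class GenericDataFile extends BaseFile {
  // A data file always reports DATA, regardless of the stored field.
  @Override public FileContent content() { return FileContent.DATA; }
}

final class GenericDeleteFile extends BaseFile {
  GenericDeleteFile(FileContent content) {
    super(content);
    // Assumed validation: a delete file should never be tagged as DATA.
    if (content == FileContent.DATA) {
      throw new IllegalArgumentException("Delete files cannot have DATA content");
    }
  }
}
```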

  protected enum FileType {
    DATA_FILES(GenericDataFile.class.getName()),
-   DELETE_FILES("...");
+   DELETE_FILES(GenericDeleteFile.class.getName());
Member

Nice ..

* @return a {@link ManifestReader}
*/
public static ManifestReader read(ManifestFile manifest, FileIO io, Map<Integer, PartitionSpec> specsById) {
Preconditions.checkArgument(manifest.content() == ManifestContent.DATA,
Member

As I understand it, DATA and DELETE manifests could share the same read/write path, so I think we could use a common reader and writer. Is there any other reason to keep them as separate paths?

Contributor Author

They have the same schema to keep the format simpler, but in the APIs we want to keep them separate by using DataFile and DeleteFile types. The readers and writers mostly share the same code, but are separate so that they can use the correct file type interface.
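
One way to picture this split is a single generic reader behind two typed entry points, each of which checks the manifest's content before handing back a reader. The sketch below is a simplified assumption of that shape: `ManifestContent.DELETES`, `readDeleteManifest`, the path-only `ManifestFile`, and the `Simple*` classes are hypothetical names for illustration and are not taken from the diff.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Demo entry point (hypothetical).
class ManifestReadDemo {
  public static void main(String[] args) {
    ManifestFile deletes = new ManifestFile(ManifestContent.DELETES, Arrays.asList("deletes-1.avro"));
    for (DeleteFile file : ManifestFiles.readDeleteManifest(deletes).files()) {
      System.out.println(file.path());
    }
    try {
      ManifestFiles.read(deletes); // wrong entry point for a delete manifest
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}

enum ManifestContent { DATA, DELETES } // DELETES is an assumed constant name

interface ContentFile { String path(); }
interface DataFile extends ContentFile { }
interface DeleteFile extends ContentFile { }

final class SimpleDataFile implements DataFile {
  private final String path;
  SimpleDataFile(String path) { this.path = path; }
  @Override public String path() { return path; }
}

final class SimpleDeleteFile implements DeleteFile {
  private final String path;
  SimpleDeleteFile(String path) { this.path = path; }
  @Override public String path() { return path; }
}

// Stand-in for a manifest: its content type plus the file paths it lists.
final class ManifestFile {
  private final ManifestContent content;
  private final List<String> paths;

  ManifestFile(ManifestContent content, List<String> paths) {
    this.content = content;
    this.paths = paths;
  }

  ManifestContent content() { return content; }
  List<String> paths() { return paths; }
}

// The shared read logic lives here; only the produced file type differs.
final class ManifestReader<F extends ContentFile> {
  private final ManifestFile manifest;
  private final Function<String, F> newFile;

  ManifestReader(ManifestFile manifest, Function<String, F> newFile) {
    this.manifest = manifest;
    this.newFile = newFile;
  }

  List<F> files() {
    List<F> result = new ArrayList<>();
    for (String path : manifest.paths()) {
      result.add(newFile.apply(path));
    }
    return result;
  }
}

final class ManifestFiles {
  private ManifestFiles() { }

  static ManifestReader<DataFile> read(ManifestFile manifest) {
    check(manifest, ManifestContent.DATA);
    return new ManifestReader<DataFile>(manifest, SimpleDataFile::new);
  }

  static ManifestReader<DeleteFile> readDeleteManifest(ManifestFile manifest) {
    check(manifest, ManifestContent.DELETES);
    return new ManifestReader<DeleteFile>(manifest, SimpleDeleteFile::new);
  }

  private static void check(ManifestFile manifest, ManifestContent expected) {
    if (manifest.content() != expected) {
      throw new IllegalArgumentException("Cannot read a " + manifest.content() + " manifest as " + expected);
    }
  }
}
```

In this shape the on-disk schema and read loop are shared, and only the thin typed wrappers differ, which matches the trade-off described in the reply above.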

@aokolnychyi
Contributor

Let me also go through this now.

@aokolnychyi
Contributor

Will ManifestGroup be only used for data files?

@rdblue
Contributor Author

rdblue commented May 29, 2020

> Will ManifestGroup be only used for data files?

Probably not, but I don't think we need to update it in this PR.

@rdblue rdblue force-pushed the v2-delete-files branch from 9183fa2 to 6a796c5 Compare May 29, 2020 18:47
Contributor

@aokolnychyi aokolnychyi left a comment

+1

@aokolnychyi aokolnychyi merged commit 527240b into apache:master May 29, 2020
cmathiesen pushed a commit to ExpediaGroup/iceberg that referenced this pull request Aug 19, 2020