-
Notifications
You must be signed in to change notification settings - Fork 3k
API: add an action API for rewrite deletes #2841
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: add an action API for rewrite deletes #2841
Conversation
|
@jackye1995 @RussellSpitzer Could you please take a look? |
RussellSpitzer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! We are gonna have a lot of fun new rewrites!
| import org.apache.iceberg.Table; | ||
|
|
||
| public interface RewriteDeleteStrategy { | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also need the methods similar to RewriteStrategy such as name, validOptions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't add validOptions because we haven't finalized what options are valid and it can be added late. But I'm fine to add them now.
For name, It looks like the class name already includes the strategy name. Do we still need it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name is mostly used for logging purpose. I agree it is already in the class name, but for consistency with RewriteStrategy I think we can still have it here.
| * | ||
| * @return iterable of original delete file to be replaced. | ||
| */ | ||
| Iterable<DeleteFile> selectDeletes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you explain a bit about how these 2 interfaces will be used? For RewriteStrategy the method Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) was quite straightforward, but here the method has no arguments so I am not very sure how would it be used to actually select deletes.
I think it goes to the fundamental question of what are some potential rewrite delete strategies? For rewrite data files we know there are well-known operations like bin-packing. For deletes, what are some strategies we plan to implement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The argument of selectFilesToRewrite(Iterable<FileScanTask> dataFiles) comes from planFileGroups and the filter logic. I think we could build similar logic by storing the iterable of FileScanTask as a local variable in the concrete RewriteDeleteStrategy class, and grouping them in rewriteDeletes() API. As for filter logic, we could add it later and apply it in selectDeletes() API.
As for strategies, it depends on different scenarios. In our cases, which mostly use HDFS as storage, we care more about the small files. they could impact the planning, reading performance and also bring overhead to the name node. So the strategy we want badly is to merge deletes. So we are planning two strategies at first: 1) merge all position deletes. 2) read all equality deletes under the partition and convert them into one position delete. I think the first one is kind of bin-packing, the second one is convert-and-bin-packing. In the future, we might also want to sort the merged position delete by file name and break it into pieces so that executors could read only the delete contents that match the data files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. The case you described are also what I am interested in. I think for RewriteStrategy we first had a design doc so when there was the PR everything was in place. I am fine if we are taking incremental progress for delete strategies to gradually evolve the interface and add configuration options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @chenjunjiedada , thanks for tracking the v2's compaction for long time, It's really challenging. I think we reviewers were distracted by other internal or external things , which blocked the v2 compaction progress for a long time. While the fact is: the current v2 compaction feature is the biggest blocker for v2 right now, I'd like to push this feature forward, so that we don't block v2 forever. (And thanks for your bandwidth @jackye1995, you always come out to help when the things get stuck).
About this PR, since it's a API definition, I'd like to take a look the whole delete rewrite design or preview PR if you have one because in that case I could evaluate whether the API abstraction is perfectly matching the real implementation. Of course, we could introduce a basic API, and then iterate the interface design when we get more implementation. It's looks better for me to reach a global agreement before we starting to provide a specific design, that can help us reduce context sync on the core paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for a general design. And please let me know if there is anything that can be parallelized for implementation once we reached an agreement on this. I am currently mostly spending time trying to push the work on Trino side, but should have some bandwidth to take some implementation work for this if necessary. As openinx said, this is the biggest blocker for v2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a design doc right now, but most of the ideas come from discussions in #2216 and #2364. I'm working on rebasing #2216 to apply #2372 (API definition) and #2841 (delete row reader). That would be a full implementation and hopefully could give you the full picture of how these APIs are used. I will post in the following days.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great, let me read through those, thank you!
| * | ||
| * @return iterable of original delete file to be replaced. | ||
| */ | ||
| Iterable<DeleteFile> selectDeletes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. The case you described are also what I am interested in. I think for RewriteStrategy we first had a design doc so when there was the PR everything was in place. I am fine if we are taking incremental progress for delete strategies to gradually evolve the interface and add configuration options.
|
@RussellSpitzer @jackye1995 Is this ready to merge now? |
|
@rdblue @aokolnychyi @openinx You may want to take a look as well. |
|
@openinx @jackye1995 I rebased the #2364, now you can take a quick look at how the APIs are used. |
|
Thank you! I reviewed the PR. I will paste my major comment also here: After reading And we can get the partition Any thoughts? |
|
+1 to Jack's comment on making the delete strategy API mimic the one for |
| * | ||
| * @return this for method chaining | ||
| */ | ||
| RewriteDeletes rewriteEqDeletes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious on the reason for having rewriteEqDeletes and rewritePosDeletes here, looks like in RewriteDataFiles we define methods based on the exact strategy to use, and for rewriting delete files there could be strategies like eqDelete->posDelete, combine posDelete into one, and even take both delete and data files and return data files (I think the last one should be a separate interface). Here it seems like the user can decide whether to rewrite certain type of delete files, but I would think strategy might be the thing that the user want to control more? Sorry I've been away from the community conversations for a while so might lack a lot of context if this was previously discussed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rewriteEqDeletes and rewritePosDeletes is a top-level choice for users to choose which kind of delete he or she wants to rewrite. Here's an example: actionsProvider().rewriteDeletes(table).rewriteEqDeletes().execute();, users could clearly know what kind of deletes are handled. But you remind me that, we might also don't need these two as well, use one strategy API to select its implementation is enough.
The strategy options are open and we could pass options from options interface from the specified strategy implementation.
jackye1995
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the change, this looks good to me!
| * | ||
| * @return this for method chaining | ||
| */ | ||
| RewriteDeletes strategy(String strategyImpl); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for us to pass in RewriteDeleteStrategy here rather than a free formed string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RewriteDeleteStrategy is defined in the core package, while this is in the api package. This is like the pattern for the Catalog, as the usage of loading catalog:
Catalog catalog = CatalogUtil.loadCatalog(HiveCatalog.class.getName(), "hive", ImmutableMap.of(), hadoopConf);we could have a similar usage like:
actionsProvider()
.rewriteDeletes(table)
.strategy(ConvertEqDeletesStrategy.class.getName())
.execute();There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For data file rewrites, we didn't allow custom extension because we don't think that there's a lot of value in it just yet and we want to keep things simple if there isn't a use for dynamically loaded classes. I think we should probably take the same approach here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for not having a custom impl here because if you think some strategy is valuable, it's likely also valuable for other users. Or maybe could you provide more context about what customization you would like to do in addition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we all agree that eq->pos is the one that can share with others. Let me update this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 here too. We debated this while designing the new data compaction API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think another common strategy would be to just compact delete files. For example, take 100 position deletes and write them as one (within a partition). Another use case is compacting a large number of small global deletes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. That is the next action, I have verified internally that it could bring some benefit for query.
|
@RussellSpitzer @jackye1995 @yyanyy @openinx , I think we are on the same page now, can we merge this to continue rewrite actions? |
|
ping @rdblue @openinx @jackye1995 again... This might be lost in PRs. |
| /** | ||
| * Sets options to be used with this strategy | ||
| */ | ||
| RewriteDeleteStrategy options(Map<String, String> options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: it would be nice to use the same method order as RewriteStrategy for consistency.
| * @param dataFiles iterable of FileScanTasks for data files in a given partition | ||
| * @return iterable of original delete file to be replaced. | ||
| */ | ||
| Iterable<DeleteFile> selectDeletesToRewrite(Iterable<FileScanTask> dataFiles); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why accept FileScanTask here? Will the delete action only find delete files based on a scan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was discussed in #2841 (comment). It is in order to align with RewriteStrategy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think as long as we follow a similar pattern it's fine, but it's more important to make the interface suitable for deletes. Here we can consider 2 cases:
- rewrite equality -> position: we actually need the reverse of the current information, which is a list of data files associated with the delete file.
- rewrite position -> data: the current interface using FileScanTask seems suitable because we need to generate something similar for rewrite anyway.
I see a few different ways we can go with this:
- use this current interface, the "scan task" is just a holder of delete and data files organized in the form ready for a scan. equality -> position can then reorganize information.
- invent a new interface suitable for both use cases, something like a
Iterable<RewriteDeleteTask> - instead of having a single rewrite strategy for both, having one for each, so that we can have something customized for equality delete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a third use case: simply compacting position deletes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, will position -> data be something we take here or in RewriteDataFiles? I can see it being part of the data compaction. Like if we detect that certain data files have too many matching delete files, we want to rewrite them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. This is the major compaction that we mentioned before. And I think this may be already available through RewriteDataFiles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jackye1995, In order to make API more generic, maybe we can go back to use empty args and let the strategy implementation maintain the map like the implementation here. So that users could choose the deletes in their own way, for example, read from manifest directly. And since we are rewriting deletes, it looks more natural to return Iterable<DeleteFile>.
@rdblue @jackye1995 WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aokolnychyi thanks for the feedback!
I think there is a third use case: simply compacting position deletes.
Yes agree, that maps to the Hive minor compaction as well.
will position -> data be something we take here or in RewriteDataFiles? I can see it being part of the data compaction. Like if we detect that certain data files have too many matching delete files, we want to rewrite them.
I remember I had a discussion around this with Russell when he was implementing bin packing. When we rewrite data files, yes position delete information is merged to that, but delete files are not dropped, because the delete file might still reference other files that are not included during bin-packing. We discussed the possibility for also adding that feature in the data compaction action, the conclusion was to have another action specifically for this use case to (1) not make data compaction too complicated, (2) have separated actions for different purposes, (3) have better SLA control for different actions.
| * @param deleteFilesToRewrite a group of files to be rewritten together | ||
| * @return iterable of delete files used to replace the original delete files. | ||
| */ | ||
| Set<DeleteFile> rewriteDeletes(Set<DeleteFile> deleteFilesToRewrite); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this strategy handle sequence numbers? Are all of the new delete files committed to the table at the next sequence number? If so, wouldn't that mean that this may only return position delete files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! Besides converting to position deletes we could split equality delete into small ones so that each file scan can carry small equality deletes as Jack mentioned before. But as you pointed out, I think it would be hard to handle the sequence number since the committed snapshot should be immutable. Can we do the trick?
I'm not sure how much benefit that splitting can provide, but I believe small files bring worse problems. How about we make rewriteDeletes more specific? Like removeEqDeletes and rewritePosDeletes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I was talking about that use case I was also in experimental phase, but so far what I see is that in practice the 2 most commonly used ones are still (1) equality -> position, (2) position -> data. Splitting equality deletes is most of the time not worth the computation effort because it's already doing a planning and it's more cost efficient to just rewrite it to position deletes.
Going back to the conversation we are having for the interface for selectDeletesToRewrite, it feels like we should have 2 different interfaces for rewriting equality and position deletes so that their interfaces can be a bit more specialized for their own use cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the verification! @jackye1995. Now, I'm inclined to use more specific ones. @rdblue WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we can choose a specific rewrite strategy in RewriteDeleteFiles, and we don't have another strategy to handle equality deletes, I think we could keep this one as it is. Does that sound reasonable to you? @rdblue @jackye1995 .
|
I'll do a pass today too. |
| import java.util.Set; | ||
| import org.apache.iceberg.DeleteFile; | ||
|
|
||
| public interface RewriteDeletes extends SnapshotUpdate<RewriteDeletes, RewriteDeletes.Result> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: shall we add some Javadoc similarly to RewriteDataFiles?
/**
* An action for rewriting delete files according to a rewrite strategy.
* Generally used for optimizing the sizing and layout of delete files within a table.
*/
| import java.util.Set; | ||
| import org.apache.iceberg.DeleteFile; | ||
|
|
||
| public interface RewriteDeletes extends SnapshotUpdate<RewriteDeletes, RewriteDeletes.Result> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: RewriteDeletes -> RewriteDeleteFiles to match RewriteDataFiles for consistency?
| * An action for converting the equality delete files according to a convert strategy. | ||
| * Generally used for optimizing the sizing and layout of delete files within a table. | ||
| */ | ||
| public interface ConvertDeleteFiles extends SnapshotUpdate<ConvertDeleteFiles, ConvertDeleteFiles.Result> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any other conversions we plan to support other than equality -> position? According to the doc, it seems to be the only conversion we consider.
If so, would it make sense to call it ConvertEqualityDeleteFiles? Just asking. Then we can drop the method below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense to me, the below function should be used in the original API definition which has both rewrite and converts stuff
|
|
||
| /** | ||
| * An action for converting the equality delete files according to a convert strategy. | ||
| * Generally used for optimizing the sizing and layout of delete files within a table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure the second sentence is accurate. It applies more to the rewrite action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I think the first one should be enough, let me delete them.
| /** | ||
| * Returns the count of the added position delete files. | ||
| */ | ||
| int addedDeleteFilesCount(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we think this action will be always about converting equality to position, then these methods can be a bit more specific just like their Javadoc.
For example, we can call them convertedEqualityDeleteFilesCount and addedPositionDeleteFilesCount.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense to me. Will do.
| import org.apache.iceberg.expressions.Expression; | ||
|
|
||
| /** | ||
| * An action for rewriting the position delete files according to a rewrite strategy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: In the doc, we are still debating whether a rewrite of equality deletes makes sense. I agree it is useful to have a generic name for this action and I like RewriteDeleteFiles. Should we ensure the Javadoc is generic too? What about dropping position in the class doc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of make API and doc generic, how about make them more specific? I mean change RewriteDeleteFiles to RewritePositionDeleteFiles.
| * | ||
| * @return this for method chaining | ||
| */ | ||
| RewriteDeleteFiles combinePositionDeletes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question. There is consensus that bin-packing position deletes is probably the most useful rewrite. However, we may add a rewrite that would also reshuffle deletes, not only bin-pack them. We don't know whether that will be needed and when we will implement it but what about calling this binPackPositionDeletes similar to binPack in data compaction? That way, we can add another strategy later as combineXXX seems vague.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I call this combineXXX because I think bin-pack is a kind of combined action. But this indeed may be vague if we have something like reshuffle in the future. Let me make this more specific.
| /** | ||
| * The action result that contains a summary of the execution. | ||
| */ | ||
| interface Result { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume we will compact each partition separately so we may want to provide some file group stats in the future, just like we do in data compaction. Too early to tell whether that will be useful, though. Just something to keep in mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. We could always improve the API later.
| import org.apache.iceberg.FileScanTask; | ||
| import org.apache.iceberg.Table; | ||
|
|
||
| public interface ConvertDeleteStrategy { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RussellSpitzer, we may want to rename RewriteStrategy into something data-specific since we are about to add more rewrite strategies.
|
Any other thoughts, @RussellSpitzer @rdblue @jackye1995? |
| extends SnapshotUpdate<ConvertEqualityDeleteFiles, ConvertEqualityDeleteFiles.Result> { | ||
|
|
||
| /** | ||
| * A filter for choosing the equality deletes to convert. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that "find" would be a better verb than "choose". This doesn't really allow choice, it limits the parts of the table that are considered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, will update.
| import org.apache.iceberg.expressions.Expression; | ||
|
|
||
| /** | ||
| * An action for converting the equality delete files to position delete files according to a convert strategy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that there is a conversion strategy here, right? We can probably remove that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
| * @param expression An iceberg expression used to choose deletes. | ||
| * @return this for method chaining | ||
| */ | ||
| ConvertEqualityDeleteFiles filter(Expression expression); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have a partition filter and a row/data filter, or just a row filter? Since equality delete files are stored by partition, I think that we will actually convert a filter to a partition filter and then rewrite any equality delete file in that partition. So it may make sense to allow both. If, for example, you have a table that is bucketed and you rewrite the deletes in 1/10th of the buckets every day. Then you'd want to be able to specify id_bucket IN (...) rather than supplying a data filter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense to me. Even a partition filter is enough, but let me keep both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdblue , Anton mentioned that we could always convert the data filter to a partition filter, just want to confirm that do we have any specific reason to keep the partition filter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I guess you could use in(bucket("id"), 1, 2, 3). That would be fine. We can remove the partition filter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, done!
api/src/main/java/org/apache/iceberg/actions/RewritePositionDeleteFiles.java
Outdated
Show resolved
Hide resolved
| /** | ||
| * An action for rewriting the position delete files. | ||
| * <p> | ||
| * Generally used for optimizing the sizing and layout of position delete files within a table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: should be "size" not "sizing"
| extends SnapshotUpdate<RewritePositionDeleteFiles, RewritePositionDeleteFiles.Result> { | ||
|
|
||
| /** | ||
| * bin-pack the position deletes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be capitalized.
| * | ||
| * @return this for method chaining | ||
| */ | ||
| RewritePositionDeleteFiles binPackPositionDeletes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add PositionDeletes to the end of the method name and not just binPack()? Also, if we aren't exposing multiple strategies in the near term, should we simply remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense to remove this, ConvertEqualityDeleteFiles already remove the convertEqualityDelete().
| extends SnapshotUpdate<ConvertEqualityDeleteFiles, ConvertEqualityDeleteFiles.Result> { | ||
|
|
||
| /** | ||
| * A row filter for finding the equality deletes to convert. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this should mention that this filter will be converted to a partition filter with an inclusive projection and that matching delete files will be rewritten. It should be clear that it isn't possible to match individual deletes, only whole files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will add doc.
| * @param expression An iceberg expression used to find deletes. | ||
| * @return this for method chaining | ||
| */ | ||
| RewritePositionDeleteFiles rowFilter(Expression expression); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we plan to convert this to a partition filter using an inclusive projection like the equality delete rewrite, correct? @jackye1995, is there a use case for matching files and then using stats to filter out position delete files that don't match? I can't think of one.
| * | ||
| * @return iterable of original delete file to be converted. | ||
| */ | ||
| Iterable<DeleteFile> selectDeleteFiles(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the data file rewrite strategy, the equivalent method accepts an Iterable<FileScanTask> Should this be passed an Iterable<DeleteFile> so that this is selecting from the already loaded files rather than responsible for finding delete files itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my POC implementation, there is a local variable that holds the iterable. Accepting the Itrerable<FileScanTask> also sounds reasonable to me. Let up add it.
| * @param dataFiles iterable of data files that contain the DeleteFile to be converted | ||
| * @return iterable of lists of FileScanTasks which will be processed together | ||
| */ | ||
| Iterable<Iterable<FileScanTask>> planDeleteFileGroups(Iterable<FileScanTask> dataFiles); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: The order of methods doesn't match RewriteStrategy for data files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will update.
| * A row filter for finding the equality deletes to convert. | ||
| * <p> | ||
| * The row filter will be converted to a partition filter with an inclusive projection, so that candidate deletes are | ||
| * selected if any row match the expression. The matching delete files will be converted to position delete files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't quite correct. An inclusive projection matches if a file can contain rows that match. But it doesn't necessarily. I would change this to "Any file that may contain rows matching this filter will be used by the action."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An inclusive metrics evaluator evaluates the file in that way. But when the expression was projected to partition filter, since the partition value of the file is determined, the matching logic should be determined. Please correct me if I understand wrongly.
| * @param expression An iceberg expression used to find deletes. | ||
| * @return this for method chaining | ||
| */ | ||
| ConvertEqualityDeleteFiles rowFilter(Expression expression); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we have just one filter, there is no need to add "row" in the method name. Let's change it back to filter to match RewriteDataFiles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
This looks good to me. Let's merge it and follow up with PRs that consume these interfaces. We can make adjustments as needed later. |
|
Thanks @chenjunjiedada for the hard work. Thanks everyone for reviewing! |
This defines an action API for rewriting deletes as we mentioned in here. With this API, users could define their own rewrite strategy to handle deletes according to different features of different storage or engine.
This reuses parts of the idea from @RussellSpitzer 's rewrite framework. It is a simple version so that it could be implemented with the previous rewrite delete PRs.