Skip to content

Conversation

@chenjunjiedada
Copy link
Collaborator

@chenjunjiedada chenjunjiedada commented Jul 19, 2021

This defines an action API for rewriting deletes as we mentioned in here. With this API, users could define their own rewrite strategy to handle deletes according to different features of different storage or engine.

This reuses parts of the idea from @RussellSpitzer 's rewrite framework. It is a simple version so that it could be implemented with the previous rewrite delete PRs.

@chenjunjiedada
Copy link
Collaborator Author

@jackye1995 @RussellSpitzer Could you please take a look?

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! We are gonna have a lot of fun new rewrites!

import org.apache.iceberg.Table;

public interface RewriteDeleteStrategy {

Copy link
Contributor

@jackye1995 jackye1995 Jul 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also need the methods similar to RewriteStrategy such as name, validOptions

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't add validOptions because we haven't finalized what options are valid and it can be added late. But I'm fine to add them now.

For name, It looks like the class name already includes the strategy name. Do we still need it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name is mostly used for logging purpose. I agree it is already in the class name, but for consistency with RewriteStrategy I think we can still have it here.

*
* @return iterable of original delete file to be replaced.
*/
Iterable<DeleteFile> selectDeletes();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you explain a bit about how these 2 interfaces will be used? For RewriteStrategy the method Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) was quite straightforward, but here the method has no arguments so I am not very sure how would it be used to actually select deletes.

I think it goes to the fundamental question of what are some potential rewrite delete strategies? For rewrite data files we know there are well-known operations like bin-packing. For deletes, what are some strategies we plan to implement?

Copy link
Collaborator Author

@chenjunjiedada chenjunjiedada Jul 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The argument of selectFilesToRewrite(Iterable<FileScanTask> dataFiles) comes from planFileGroups and the filter logic. I think we could build similar logic by storing the iterable of FileScanTask as a local variable in the concrete RewriteDeleteStrategy class, and grouping them in rewriteDeletes() API. As for filter logic, we could add it later and apply it in selectDeletes() API.

As for strategies, it depends on different scenarios. In our cases, which mostly use HDFS as storage, we care more about the small files. they could impact the planning, reading performance and also bring overhead to the name node. So the strategy we want badly is to merge deletes. So we are planning two strategies at first: 1) merge all position deletes. 2) read all equality deletes under the partition and convert them into one position delete. I think the first one is kind of bin-packing, the second one is convert-and-bin-packing. In the future, we might also want to sort the merged position delete by file name and break it into pieces so that executors could read only the delete contents that match the data files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. The case you described are also what I am interested in. I think for RewriteStrategy we first had a design doc so when there was the PR everything was in place. I am fine if we are taking incremental progress for delete strategies to gradually evolve the interface and add configuration options.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @chenjunjiedada , thanks for tracking the v2's compaction for long time, It's really challenging. I think we reviewers were distracted by other internal or external things , which blocked the v2 compaction progress for a long time. While the fact is: the current v2 compaction feature is the biggest blocker for v2 right now, I'd like to push this feature forward, so that we don't block v2 forever. (And thanks for your bandwidth @jackye1995, you always come out to help when the things get stuck).

About this PR, since it's a API definition, I'd like to take a look the whole delete rewrite design or preview PR if you have one because in that case I could evaluate whether the API abstraction is perfectly matching the real implementation. Of course, we could introduce a basic API, and then iterate the interface design when we get more implementation. It's looks better for me to reach a global agreement before we starting to provide a specific design, that can help us reduce context sync on the core paths.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for a general design. And please let me know if there is anything that can be parallelized for implementation once we reached an agreement on this. I am currently mostly spending time trying to push the work on Trino side, but should have some bandwidth to take some implementation work for this if necessary. As openinx said, this is the biggest blocker for v2.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a design doc right now, but most of the ideas come from discussions in #2216 and #2364. I'm working on rebasing #2216 to apply #2372 (API definition) and #2841 (delete row reader). That would be a full implementation and hopefully could give you the full picture of how these APIs are used. I will post in the following days.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great, let me read through those, thank you!

*
* @return iterable of original delete file to be replaced.
*/
Iterable<DeleteFile> selectDeletes();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. The case you described are also what I am interested in. I think for RewriteStrategy we first had a design doc so when there was the PR everything was in place. I am fine if we are taking incremental progress for delete strategies to gradually evolve the interface and add configuration options.

@chenjunjiedada
Copy link
Collaborator Author

@RussellSpitzer @jackye1995 Is this ready to merge now?

@chenjunjiedada
Copy link
Collaborator Author

@rdblue @aokolnychyi @openinx You may want to take a look as well.

@chenjunjiedada
Copy link
Collaborator Author

@openinx @jackye1995 I rebased the #2364, now you can take a quick look at how the APIs are used.

cc @rdblue @aokolnychyi @RussellSpitzer @yyanyy

@jackye1995
Copy link
Contributor

Thank you! I reviewed the PR. I will paste my major comment also here:

After reading ConvertEqDeletesStrategy.rewriteDeletes, I think we can make the RewriteDeleteStrategy interface closer to RewriteStrategy interface. What we have there is basically the equivalent of planFileGroups plus rewriteFiles in RewriteStrategy. So I would propose we have the following methods in RewriteDeleteStrategy to be more aligned:

Iterable<DeleteFile> selectDeletesToRewrite(Iterable<FileScanTask> dataFiles);

Iterable<List<FileScanTask>> planDeleteGroups(Iterable<DeleteFile> deleteFiles);

Set<DeleteFile> rewriteDeletes(List<DeleteFile> deleteFilesToRewrite);

And we can get the partition StructLike directly from the list of scan tasks instead of passing it through the task pair in EqualityDeleteRewriter. In this way, we can also enable partial progress for commits.

Any thoughts?

@yyanyy
Copy link
Contributor

yyanyy commented Jul 28, 2021

+1 to Jack's comment on making the delete strategy API mimic the one for RewriteStrategy since I believe the intention there was so that the rewrites can happen in batches/people can decide if they want to rewrite a certain partition/subset of files. If we directly take no argument for select files, we are assuming users will always want to rewrite the entire table.

*
* @return this for method chaining
*/
RewriteDeletes rewriteEqDeletes();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious on the reason for having rewriteEqDeletes and rewritePosDeletes here, looks like in RewriteDataFiles we define methods based on the exact strategy to use, and for rewriting delete files there could be strategies like eqDelete->posDelete, combine posDelete into one, and even take both delete and data files and return data files (I think the last one should be a separate interface). Here it seems like the user can decide whether to rewrite certain type of delete files, but I would think strategy might be the thing that the user want to control more? Sorry I've been away from the community conversations for a while so might lack a lot of context if this was previously discussed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rewriteEqDeletes and rewritePosDeletes is a top-level choice for users to choose which kind of delete he or she wants to rewrite. Here's an example: actionsProvider().rewriteDeletes(table).rewriteEqDeletes().execute();, users could clearly know what kind of deletes are handled. But you remind me that, we might also don't need these two as well, use one strategy API to select its implementation is enough.

The strategy options are open and we could pass options from options interface from the specified strategy implementation.

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the change, this looks good to me!

*
* @return this for method chaining
*/
RewriteDeletes strategy(String strategyImpl);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for us to pass in RewriteDeleteStrategy here rather than a free formed string?

Copy link
Collaborator Author

@chenjunjiedada chenjunjiedada Aug 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RewriteDeleteStrategy is defined in the core package, while this is in the api package. This is like the pattern for the Catalog, as the usage of loading catalog:

Catalog catalog = CatalogUtil.loadCatalog(HiveCatalog.class.getName(), "hive", ImmutableMap.of(), hadoopConf);

we could have a similar usage like:

actionsProvider()
  .rewriteDeletes(table)
  .strategy(ConvertEqDeletesStrategy.class.getName())
  .execute();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For data file rewrites, we didn't allow custom extension because we don't think that there's a lot of value in it just yet and we want to keep things simple if there isn't a use for dynamically loaded classes. I think we should probably take the same approach here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for not having a custom impl here because if you think some strategy is valuable, it's likely also valuable for other users. Or maybe could you provide more context about what customization you would like to do in addition?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we all agree that eq->pos is the one that can share with others. Let me update this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 here too. We debated this while designing the new data compaction API.

Copy link
Contributor

@aokolnychyi aokolnychyi Sep 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think another common strategy would be to just compact delete files. For example, take 100 position deletes and write them as one (within a partition). Another use case is compacting a large number of small global deletes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. That is the next action, I have verified internally that it could bring some benefit for query.

@chenjunjiedada
Copy link
Collaborator Author

@RussellSpitzer @jackye1995 @yyanyy @openinx , I think we are on the same page now, can we merge this to continue rewrite actions?

@chenjunjiedada
Copy link
Collaborator Author

cc @rdblue @aokolnychyi

@chenjunjiedada
Copy link
Collaborator Author

ping @rdblue @openinx @jackye1995 again... This might be lost in PRs.

/**
* Sets options to be used with this strategy
*/
RewriteDeleteStrategy options(Map<String, String> options);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: it would be nice to use the same method order as RewriteStrategy for consistency.

* @param dataFiles iterable of FileScanTasks for data files in a given partition
* @return iterable of original delete file to be replaced.
*/
Iterable<DeleteFile> selectDeletesToRewrite(Iterable<FileScanTask> dataFiles);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why accept FileScanTask here? Will the delete action only find delete files based on a scan?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was discussed in #2841 (comment). It is in order to align with RewriteStrategy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think as long as we follow a similar pattern it's fine, but it's more important to make the interface suitable for deletes. Here we can consider 2 cases:

  1. rewrite equality -> position: we actually need the reverse of the current information, which is a list of data files associated with the delete file.
  2. rewrite position -> data: the current interface using FileScanTask seems suitable because we need to generate something similar for rewrite anyway.

I see a few different ways we can go with this:

  1. use this current interface, the "scan task" is just a holder of delete and data files organized in the form ready for a scan. equality -> position can then reorganize information.
  2. invent a new interface suitable for both use cases, something like a Iterable<RewriteDeleteTask>
  3. instead of having a single rewrite strategy for both, having one for each, so that we can have something customized for equality delete.

Copy link
Contributor

@aokolnychyi aokolnychyi Sep 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a third use case: simply compacting position deletes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, will position -> data be something we take here or in RewriteDataFiles? I can see it being part of the data compaction. Like if we detect that certain data files have too many matching delete files, we want to rewrite them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. This is the major compaction that we mentioned before. And I think this may be already available through RewriteDataFiles.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackye1995, In order to make API more generic, maybe we can go back to use empty args and let the strategy implementation maintain the map like the implementation here. So that users could choose the deletes in their own way, for example, read from manifest directly. And since we are rewriting deletes, it looks more natural to return Iterable<DeleteFile>.

@rdblue @jackye1995 WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aokolnychyi thanks for the feedback!

I think there is a third use case: simply compacting position deletes.

Yes agree, that maps to the Hive minor compaction as well.

will position -> data be something we take here or in RewriteDataFiles? I can see it being part of the data compaction. Like if we detect that certain data files have too many matching delete files, we want to rewrite them.

I remember I had a discussion around this with Russell when he was implementing bin packing. When we rewrite data files, yes position delete information is merged to that, but delete files are not dropped, because the delete file might still reference other files that are not included during bin-packing. We discussed the possibility for also adding that feature in the data compaction action, the conclusion was to have another action specifically for this use case to (1) not make data compaction too complicated, (2) have separated actions for different purposes, (3) have better SLA control for different actions.

* @param deleteFilesToRewrite a group of files to be rewritten together
* @return iterable of delete files used to replace the original delete files.
*/
Set<DeleteFile> rewriteDeletes(Set<DeleteFile> deleteFilesToRewrite);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this strategy handle sequence numbers? Are all of the new delete files committed to the table at the next sequence number? If so, wouldn't that mean that this may only return position delete files?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! Besides converting to position deletes we could split equality delete into small ones so that each file scan can carry small equality deletes as Jack mentioned before. But as you pointed out, I think it would be hard to handle the sequence number since the committed snapshot should be immutable. Can we do the trick?

I'm not sure how much benefit that splitting can provide, but I believe small files bring worse problems. How about we make rewriteDeletes more specific? Like removeEqDeletes and rewritePosDeletes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I was talking about that use case I was also in experimental phase, but so far what I see is that in practice the 2 most commonly used ones are still (1) equality -> position, (2) position -> data. Splitting equality deletes is most of the time not worth the computation effort because it's already doing a planning and it's more cost efficient to just rewrite it to position deletes.

Going back to the conversation we are having for the interface for selectDeletesToRewrite, it feels like we should have 2 different interfaces for rewriting equality and position deletes so that their interfaces can be a bit more specialized for their own use cases?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the verification! @jackye1995. Now, I'm inclined to use more specific ones. @rdblue WDYT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we can choose a specific rewrite strategy in RewriteDeleteFiles, and we don't have another strategy to handle equality deletes, I think we could keep this one as it is. Does that sound reasonable to you? @rdblue @jackye1995 .

@aokolnychyi
Copy link
Contributor

I'll do a pass today too.

import java.util.Set;
import org.apache.iceberg.DeleteFile;

public interface RewriteDeletes extends SnapshotUpdate<RewriteDeletes, RewriteDeletes.Result> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: shall we add some Javadoc similarly to RewriteDataFiles?

/**
 * An action for rewriting delete files according to a rewrite strategy.
 * Generally used for optimizing the sizing and layout of delete files within a table.
 */

import java.util.Set;
import org.apache.iceberg.DeleteFile;

public interface RewriteDeletes extends SnapshotUpdate<RewriteDeletes, RewriteDeletes.Result> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: RewriteDeletes -> RewriteDeleteFiles to match RewriteDataFiles for consistency?

* An action for converting the equality delete files according to a convert strategy.
* Generally used for optimizing the sizing and layout of delete files within a table.
*/
public interface ConvertDeleteFiles extends SnapshotUpdate<ConvertDeleteFiles, ConvertDeleteFiles.Result> {
Copy link
Contributor

@aokolnychyi aokolnychyi Oct 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any other conversions we plan to support other than equality -> position? According to the doc, it seems to be the only conversion we consider.

If so, would it make sense to call it ConvertEqualityDeleteFiles? Just asking. Then we can drop the method below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense to me, the below function should be used in the original API definition which has both rewrite and converts stuff


/**
* An action for converting the equality delete files according to a convert strategy.
* Generally used for optimizing the sizing and layout of delete files within a table.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure the second sentence is accurate. It applies more to the rewrite action.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I think the first one should be enough, let me delete them.

/**
* Returns the count of the added position delete files.
*/
int addedDeleteFilesCount();
Copy link
Contributor

@aokolnychyi aokolnychyi Oct 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we think this action will be always about converting equality to position, then these methods can be a bit more specific just like their Javadoc.

For example, we can call them convertedEqualityDeleteFilesCount and addedPositionDeleteFilesCount.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense to me. Will do.

import org.apache.iceberg.expressions.Expression;

/**
* An action for rewriting the position delete files according to a rewrite strategy.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: In the doc, we are still debating whether a rewrite of equality deletes makes sense. I agree it is useful to have a generic name for this action and I like RewriteDeleteFiles. Should we ensure the Javadoc is generic too? What about dropping position in the class doc?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of make API and doc generic, how about make them more specific? I mean change RewriteDeleteFiles to RewritePositionDeleteFiles.

*
* @return this for method chaining
*/
RewriteDeleteFiles combinePositionDeletes();
Copy link
Contributor

@aokolnychyi aokolnychyi Oct 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question. There is consensus that bin-packing position deletes is probably the most useful rewrite. However, we may add a rewrite that would also reshuffle deletes, not only bin-pack them. We don't know whether that will be needed and when we will implement it but what about calling this binPackPositionDeletes similar to binPack in data compaction? That way, we can add another strategy later as combineXXX seems vague.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I call this combineXXX because I think bin-pack is a kind of combined action. But this indeed may be vague if we have something like reshuffle in the future. Let me make this more specific.

/**
* The action result that contains a summary of the execution.
*/
interface Result {
Copy link
Contributor

@aokolnychyi aokolnychyi Oct 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we will compact each partition separately so we may want to provide some file group stats in the future, just like we do in data compaction. Too early to tell whether that will be useful, though. Just something to keep in mind.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. We could always improve the API later.

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;

public interface ConvertDeleteStrategy {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RussellSpitzer, we may want to rename RewriteStrategy into something data-specific since we are about to add more rewrite strategies.

@aokolnychyi
Copy link
Contributor

Any other thoughts, @RussellSpitzer @rdblue @jackye1995?

extends SnapshotUpdate<ConvertEqualityDeleteFiles, ConvertEqualityDeleteFiles.Result> {

/**
* A filter for choosing the equality deletes to convert.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that "find" would be a better verb than "choose". This doesn't really allow choice, it limits the parts of the table that are considered.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will update.

import org.apache.iceberg.expressions.Expression;

/**
* An action for converting the equality delete files to position delete files according to a convert strategy.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that there is a conversion strategy here, right? We can probably remove that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

* @param expression An iceberg expression used to choose deletes.
* @return this for method chaining
*/
ConvertEqualityDeleteFiles filter(Expression expression);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a partition filter and a row/data filter, or just a row filter? Since equality delete files are stored by partition, I think that we will actually convert a filter to a partition filter and then rewrite any equality delete file in that partition. So it may make sense to allow both. If, for example, you have a table that is bucketed and you rewrite the deletes in 1/10th of the buckets every day. Then you'd want to be able to specify id_bucket IN (...) rather than supplying a data filter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense to me. Even a partition filter is enough, but let me keep both.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue , Anton mentioned that we could always convert the data filter to a partition filter, just want to confirm that do we have any specific reason to keep the partition filter?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess you could use in(bucket("id"), 1, 2, 3). That would be fine. We can remove the partition filter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done!

/**
* An action for rewriting the position delete files.
* <p>
* Generally used for optimizing the sizing and layout of position delete files within a table.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should be "size" not "sizing"

extends SnapshotUpdate<RewritePositionDeleteFiles, RewritePositionDeleteFiles.Result> {

/**
* bin-pack the position deletes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be capitalized.

*
* @return this for method chaining
*/
RewritePositionDeleteFiles binPackPositionDeletes();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add PositionDeletes to the end of the method name and not just binPack()? Also, if we aren't exposing multiple strategies in the near term, should we simply remove this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense to remove this, ConvertEqualityDeleteFiles already remove the convertEqualityDelete().

extends SnapshotUpdate<ConvertEqualityDeleteFiles, ConvertEqualityDeleteFiles.Result> {

/**
* A row filter for finding the equality deletes to convert.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this should mention that this filter will be converted to a partition filter with an inclusive projection and that matching delete files will be rewritten. It should be clear that it isn't possible to match individual deletes, only whole files.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will add doc.

* @param expression An iceberg expression used to find deletes.
* @return this for method chaining
*/
RewritePositionDeleteFiles rowFilter(Expression expression);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we plan to convert this to a partition filter using an inclusive projection like the equality delete rewrite, correct? @jackye1995, is there a use case for matching files and then using stats to filter out position delete files that don't match? I can't think of one.

*
* @return iterable of original delete file to be converted.
*/
Iterable<DeleteFile> selectDeleteFiles();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the data file rewrite strategy, the equivalent method accepts an Iterable<FileScanTask> Should this be passed an Iterable<DeleteFile> so that this is selecting from the already loaded files rather than responsible for finding delete files itself?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my POC implementation, there is a local variable that holds the iterable. Accepting the Itrerable<FileScanTask> also sounds reasonable to me. Let up add it.

* @param dataFiles iterable of data files that contain the DeleteFile to be converted
* @return iterable of lists of FileScanTasks which will be processed together
*/
Iterable<Iterable<FileScanTask>> planDeleteFileGroups(Iterable<FileScanTask> dataFiles);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The order of methods doesn't match RewriteStrategy for data files.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will update.

* A row filter for finding the equality deletes to convert.
* <p>
* The row filter will be converted to a partition filter with an inclusive projection, so that candidate deletes are
* selected if any row match the expression. The matching delete files will be converted to position delete files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't quite correct. An inclusive projection matches if a file can contain rows that match. But it doesn't necessarily. I would change this to "Any file that may contain rows matching this filter will be used by the action."

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An inclusive metrics evaluator evaluates the file in that way. But when the expression was projected to partition filter, since the partition value of the file is determined, the matching logic should be determined. Please correct me if I understand wrongly.

* @param expression An iceberg expression used to find deletes.
* @return this for method chaining
*/
ConvertEqualityDeleteFiles rowFilter(Expression expression);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have just one filter, there is no need to add "row" in the method name. Let's change it back to filter to match RewriteDataFiles.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@aokolnychyi
Copy link
Contributor

This looks good to me. Let's merge it and follow up with PRs that consume these interfaces. We can make adjustments as needed later.

@aokolnychyi aokolnychyi merged commit 1c62453 into apache:master Nov 2, 2021
@aokolnychyi
Copy link
Contributor

Thanks @chenjunjiedada for the hard work. Thanks everyone for reviewing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants