Core: add delete option for bin-packing #3454
Conversation
```java
posDeleteWriter.delete(dataFile.path(), 1);
posDeleteWriter.close();
rowDelta.addDeletes(posDeleteWriter.toDeleteFile());
}
```
I know there is some repeated code here for generating deletes. So far I am still not sure where the correct boundary for util methods lies. I am planning to refactor after I add more tests for the RewriteDeleteStrategy.
```java
 * <p>
 * Defaults to Integer.MAX_VALUE, which means this feature is not enabled by default.
 */
public static final String MIN_DELETES_PER_FILE = "min-deletes-per-file";
```
What value is recommended to the user? Or how does the user compute the proper value? I'm thinking of maybe adding an option to count a file group as valid if any file in it contains deletes, because you don't know how many records match an equality delete, for example one that deletes a set of records in an area/province.
Nit: this is disabled by default, but the comparison always runs.
I think it's fine to do this based on the number of files, since the read penalty is directly related to the number of delete files and less so to the number of rows actually deleted.
No strong feeling on the default. Since we already have the number of delete files in the task information, I don't think the check is very expensive.
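To make the selection logic under discussion concrete, here is a minimal, self-contained sketch. It is not the actual `BinPackStrategy` code; the class name, `filesToRewrite`, and the map-based bookkeeping are hypothetical. It only illustrates the semantics being discussed: a file is picked for rewrite once its associated delete-file count reaches the `min-deletes-per-file` threshold, and the `Integer.MAX_VALUE` default means no file is ever picked on that basis.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DeleteAwareSelection {
  // Matches the documented default: the feature is effectively disabled.
  static final int DEFAULT_MIN_DELETES_PER_FILE = Integer.MAX_VALUE;

  // Select data files whose associated delete-file count meets the threshold.
  static List<String> filesToRewrite(Map<String, Integer> deleteFileCounts, int minDeletesPerFile) {
    List<String> selected = new ArrayList<>();
    for (Map.Entry<String, Integer> entry : deleteFileCounts.entrySet()) {
      if (entry.getValue() >= minDeletesPerFile) {
        selected.add(entry.getKey());
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    Map<String, Integer> deleteFileCounts = new LinkedHashMap<>();
    deleteFileCounts.put("a.parquet", 0);
    deleteFileCounts.put("b.parquet", 2);
    deleteFileCounts.put("c.parquet", 5);

    // With the default threshold, nothing is selected.
    System.out.println(filesToRewrite(deleteFileCounts, DEFAULT_MIN_DELETES_PER_FILE));
    // With min-deletes-per-file=2, b and c are selected.
    System.out.println(filesToRewrite(deleteFileCounts, 2));
  }
}
```

The check is just a per-file integer comparison against counts already present in the task information, which is why the cost of running it even when "disabled" is negligible.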
Also, do we need separate settings for equality deletes and position deletes, since the read penalties are different?
I think that is equivalent to setting this value to 0?
I think the exact value depends on the user's tolerance for read performance: more deletes mean worse read performance and potentially running out of memory, so users can tune this value based on their system requirements.
> I think that is equivalent to setting this value to 0?

Hm, that's it.
RussellSpitzer left a comment:
This looks good to me based on our discussions of how the Binpack and Sort algorithms should be modified.
kbendick left a comment:
This looks good!
A question for my own understanding, but overall +1. Thanks @jackye1995.
```java
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
import org.assertj.core.util.Lists;
```
Question / Nit: Is there a reason to use Lists from AssertJ instead of Lists from the relocated Google common collections library?
Oh, nice catch, we should not use that. We need to add it to checkstyle.
```java
@Test
public void testBinPackWithDeletes() throws Exception {
  Table table = createTablePartitioned(4, 2);
```
Nit: Would it be worthwhile to add inline comments to explain these parameters, like createTablePartitioned(4 /* ??? */, 2 /* ??? */)? Right now it's hard to tell immediately what these arguments are, but I'll leave the choice to you.
I don't think this is needed, because you can see the meaning of the parameters in IntelliJ.
chenjunjiedada left a comment:
+1
@aokolnychyi any additional comments? Otherwise I think this is mostly ready to be merged.
Looks like there are a few approvals here. Given that @RussellSpitzer, who implemented this part of the code, has approved, I will wait until EOD today; if there are no additional comments, I will merge this. Thanks everyone.
```java
BinPackStrategy.MIN_INPUT_FILES, Integer.toString(5),
BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(550 * MB),
BinPackStrategy.MIN_FILE_SIZE_BYTES, Long.toString(490 * MB),
BinPackStrategy.MIN_DELETES_PER_FILE, Integer.toString(2)
```
@kbendick, can you check our checkstyle config? It looks like lines that have incorrect indentation are getting through.
@jackye1995, thanks for getting this done! Sorry I didn't have a chance to review it sooner. My only comment is that I don't think we are using "min" in a consistent way, so the option names may be confusing. For file sizes, min and max are the minimum and maximum allowed file sizes: anything larger than max or smaller than min gets rewritten. But min deletes per file is actually the minimum before taking action; in other words, it is the maximum allowed number of delete files. I'd prefer making these consistent and using
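The naming inconsistency raised above can be seen by putting both selection rules side by side. The sketch below is hypothetical (the class name and `shouldRewrite` are not from the PR); it only illustrates the semantics: the size options are bounds on the allowed range, where being *outside* [min, max] triggers a rewrite, while the delete option triggers a rewrite at or *above* the threshold, which is why it behaves like a maximum allowed delete-file count despite the "min" in its name.

```java
public class RewriteSelection {
  // A file is selected for rewrite if its size is outside the [min, max] target
  // range, or if it has at least the threshold number of associated delete files.
  static boolean shouldRewrite(long fileSizeBytes, long minFileSize, long maxFileSize,
                               int deleteFiles, int deleteThreshold) {
    return fileSizeBytes < minFileSize
        || fileSizeBytes > maxFileSize
        || deleteFiles >= deleteThreshold;
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    // A 512 MB file within the [490 MB, 550 MB] bounds is normally skipped by
    // bin packing; it is only rewritten once it accumulates enough delete files.
    System.out.println(shouldRewrite(512 * mb, 490 * mb, 550 * mb, 0, 2)); // false
    System.out.println(shouldRewrite(512 * mb, 490 * mb, 550 * mb, 2, 2)); // true
  }
}
```

Under this reading, "min deletes before acting" and "max deletes allowed" describe the same threshold from opposite sides, which is the source of the confusion.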
Add an option `min-deletes-per-file` to allow rewriting files with a certain number of deletes. This does not remove the deletes themselves. It addresses the situation where a file is already at an optimized size and so is never included in bin packing, but deletes are associated with it, meaning those deletes can never be expired.