Data: delete compaction optimization by bloom filter #5100
Conversation
Optimize the performance of delete compaction with a bloom filter; this PR covers only the Parquet format.
close parquet reader
@rdblue can you give a review? Thank you.
// load bloomfilter readers from data file
if (filePath.endsWith(".parquet")) {
  parquetReader = ParquetUtil.openFile(getInputFile(filePath));
Can we use try-with-resources here?
Maybe that's a big change. I want to keep this reader open during the iteration of delete files, and considering the ORC format, the delete iteration would need to be encapsulated in a new function. Any advice for this change?
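For illustration, a minimal sketch of how the reader could be scoped with try-with-resources while still staying open for the whole delete-file iteration, by extracting that iteration into its own method. The method names `readEqDeletes`, `applyDeleteWithBloomFilter`, and `applyDelete` are hypothetical, `ParquetUtil.openFile` / `getInputFile` are the helpers from this PR's diff, and the reader is assumed to be parquet-mr's `ParquetFileReader`:

```java
// Hypothetical sketch: keep the Parquet reader open only for the duration of
// the delete iteration by pulling that iteration into a dedicated method.
private void readEqDeletes(String filePath, List<DeleteFile> eqDeletes) throws IOException {
  if (filePath.endsWith(".parquet")) {
    // try-with-resources closes the reader once all delete files are processed
    try (ParquetFileReader parquetReader = ParquetUtil.openFile(getInputFile(filePath))) {
      for (DeleteFile delete : eqDeletes) {
        applyDeleteWithBloomFilter(parquetReader, delete); // hypothetical helper
      }
    }
  } else {
    for (DeleteFile delete : eqDeletes) {
      applyDelete(delete); // hypothetical helper, no bloom filter pruning
    }
  }
}
```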
Schema deleteSchema = TypeUtil.select(requiredSchema, ids);
if (filePath.endsWith(".parquet") && parquetReader != null) {
I think we can change the ctor parameter from String to DataFile and then here we can check file.format().
Yes, you are right, but the DataFile is not passed into DeleteFilter as a parameter; it was changed to filePath in #4381.
I see, any concern if we change it to DataFile?
The constructor parameter change was made only so that Trino can support merge-on-read. Trino currently wraps a dummy FileScanTask for the data file; the author wants to remove the FileScanTask implementation in Trino and use the filePath parameter instead. If we change it to DataFile, compatibility becomes a problem.
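For reference, a small sketch of the two checks being discussed, plus a possible middle ground that keeps the String parameter; `dataFile` and `filePath` stand for the candidate constructor parameters, and this is illustrative rather than the PR's code:

```java
// current approach: infer the format from the path suffix
boolean isParquet = filePath.endsWith(".parquet");

// proposed approach, if the constructor took a DataFile instead of a String
boolean isParquetFromDataFile = dataFile.format() == FileFormat.PARQUET;

// possible middle ground that keeps the String parameter
boolean isParquetFromName = FileFormat.fromFileName(filePath) == FileFormat.PARQUET;
```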
}

// load bloomfilter readers from data file
if (filePath.endsWith(".parquet")) {
Do we want to check whether the bloom filter is turned on to avoid reading the footer if it is not?
You mean we should check whether the bloom filter is enabled via the table properties, right? But the bloom filter properties may have been updated, so the bloom filter in the current file can be out of sync with the table properties.
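For illustration, a minimal sketch of the suggested property check, assuming Parquet bloom filter columns are enabled through table properties with the `write.parquet.bloom-filter-enabled.column.` prefix introduced in #4831 (the property key and the surrounding code are assumptions, not this PR's code). As noted above, a file written before a property change may still disagree with the current properties:

```java
// Assumption: bloom filters are enabled per column via table properties with
// the "write.parquet.bloom-filter-enabled.column.<name>" prefix (see #4831).
boolean bloomFilterEnabled =
    table.properties().keySet().stream()
        .anyMatch(key -> key.startsWith("write.parquet.bloom-filter-enabled.column."));

if (bloomFilterEnabled) {
  // only read the Parquet footer and load bloom filter readers in this case
  parquetReader = ParquetUtil.openFile(getInputFile(filePath));
}
```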
We also encountered the same problem.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
Purpose

V2 tables support equality deletes for row-level deletes. Delete rows are loaded into memory while a rewrite or query job is running, but with too many delete rows this causes OOM in real scenarios, especially in Flink upsert mode. As issue #4312 mentions, Flink rewrite jobs ran out of memory, and the same can happen with Spark or any other engine that supports v2 tables.
The delete rows held in a hash set occupy most of the heap memory.
Goal
Reduce the number of delete rows loaded into memory and optimize the performance of delete compaction with a bloom filter. This PR covers only the Parquet format; thanks to @huaxingao for the Parquet bloom filter support in #4831. We are working on the ORC format.
How

Before reading the equality delete data, load the bloom filter of the current data file, then filter out delete rows that cannot be present in this data file.
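A minimal, self-contained sketch of this idea using the parquet-mr bloom filter API (not the PR's actual code; the class and method are illustrative, and the example assumes a single long-typed equality field):

```java
import java.io.IOException;
import org.apache.parquet.column.values.bloomfilter.BloomFilter;
import org.apache.parquet.hadoop.BloomFilterReader;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;

public class BloomFilterProbe {

  /**
   * Returns true if the key may be present in any row group of the data file,
   * or if pruning is impossible (no bloom filter written for the column).
   * Returns false only when every row group has a bloom filter and none of
   * them can contain the key, so the delete row can be skipped safely.
   */
  static boolean mightContain(ParquetFileReader reader, String columnDotPath, long key)
      throws IOException {
    boolean sawFilter = false;
    for (BlockMetaData rowGroup : reader.getRowGroups()) {
      BloomFilterReader bloomReader = reader.getBloomFilterDataReader(rowGroup);
      for (ColumnChunkMetaData column : rowGroup.getColumns()) {
        if (!column.getPath().toDotString().equals(columnDotPath)) {
          continue;
        }
        BloomFilter filter = bloomReader.readBloomFilter(column);
        if (filter == null) {
          return true; // no filter for this chunk, cannot prune
        }
        sawFilter = true;
        if (filter.findHash(filter.hash(key))) {
          return true; // key may be present in this row group
        }
      }
    }
    return !sawFilter; // no filter seen at all: cannot prune, keep the delete row
  }
}
```

Delete rows whose equality key returns false here can be dropped before they are ever added to the in-memory delete set, which is where the memory savings come from.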
Verification
We verified the performance improvement with a test case.
Environment:
Job: Spark rewrite job with 300 million data rows and 30 million delete rows
Executor num: 2
Executor memory: 2 GB
Executor cores: 8
(One way such a rewrite job can be launched is sketched below.)
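A sketch of launching the rewrite, assuming the standard Iceberg Spark rewrite action; the application name and table loading are placeholders, not the test's actual setup:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class RewriteJob {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("rewrite-with-deletes").getOrCreate();

    // placeholder: load the Iceberg table however the catalog is configured
    Table table = loadTable("db.tbl");

    // rewrite data files, which applies the equality deletes and is where the
    // bloom filter pruning of delete rows pays off
    RewriteDataFiles.Result result = SparkActions.get(spark).rewriteDataFiles(table).execute();
    System.out.println("rewritten data files: " + result.rewrittenDataFilesCount());
  }

  private static Table loadTable(String name) {
    throw new UnsupportedOperationException("catalog-specific table loading goes here");
  }
}
```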
Before optimization:

The JVM did full GC frequently, and the job failed in the end.
After optimization:

The memory pressure was clearly reduced, and the job finished successfully.