Flink: Auto compact file #2867
Conversation
Hi @hameizi! Thank you for your contribution and interest in Iceberg! This is a non-trivial functionality addition. Would it be possible for you to create a GitHub issue for us to track and discuss this? GitHub issues are how we normally track new work, and since this functionality has not been discussed yet, I think it would benefit from going through the standard process of having an issue. By no means do you need to close the PR, but it would help to follow the normal workflow (and it also provides the benefit that people can search the issues to see the discussion). 🙂
Possibly I'm missing something, but I don't see any accounting for files that might be already near, or close to the optimal size. It's late and my eyes may deceive me, but this appears to be compacting all files to be ideally the target file size bytes, regardless of their existing size etc. In some cases, the cost of opening and rewriting provides less value than leaving the data as is. Can we account for this like we do in some other places? Or am I just missing the fact that that functionality is hidden elsewhere? This would be a good topic to consider discussing in the mentioned GitHub issue :)
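For illustration, a minimal sketch (not code from this PR) of the kind of size check suggested above; the class, the helper name, and the ratio are hypothetical, and only `DataFile.fileSizeInBytes()` is existing Iceberg API:

```java
import java.util.List;
import java.util.stream.Collectors;
import org.apache.iceberg.DataFile;

public class SmallFileSelector {
  // Keep only files that are clearly below the target size, so files that are
  // already near the target are not rewritten for little benefit.
  public static List<DataFile> selectSmallFiles(List<DataFile> candidates,
                                                long targetFileSizeBytes,
                                                double smallFileRatio) {
    long threshold = (long) (targetFileSizeBytes * smallFileRatio); // e.g. 0.75 * target
    return candidates.stream()
        .filter(file -> file.fileSizeInBytes() < threshold)
        .collect(Collectors.toList());
  }
}
```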
I added an issue: #2869
It will compact the files selected by the partition filter, so it will compact all files when the table has no partitions. But in my work scenario there is no problem, because we compact on every transaction and each transaction only generates a few small files. So compacting files on every transaction is quick and cheap, and the compaction time does not exceed the transaction time either.
@openinx Could you help review?
@hameizi, could you update the PR description with details about this feature? Auto-compaction is not very specific so I'd like to hear how you implemented it and what that means.
@hameizi, could you please be more specific? How does this determine which files to rewrite? In what tasks are they rewritten? Does this introduce new operators? Are files rewritten before initial commit or afterward in a replace commit? There are a lot of details for a feature like this that need to be clear.
@rdblue Hi, I have updated the PR description.
From the description, it sounds like the rewrite happens in the committer task rather than in parallel. Is there a good way to make this happen in parallel instead? What we discussed elsewhere was doing a compaction by adding a new parallel stage and second committer after the initial committer. The current commit task would output committed … I think that we should plan on having some parallelism here, or else this is not going to be a very useful feature.
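To make that idea concrete, here is a rough topology sketch (not code from this PR) of a parallel rewrite stage plus second committer; the operator instances, intermediate types (CommittedFiles, RewriteResult) and parallelism variables are placeholders, and only the writer/committer stage mirrors the existing Iceberg Flink sink:

```java
// Stage 1: existing writer + first committer (commits the newly written data files).
DataStream<CommittedFiles> committed = input
    .transform("IcebergStreamWriter", writeResultTypeInfo, streamWriter)
    .setParallelism(writerParallelism)
    .transform("IcebergFilesCommitter", committedTypeInfo, filesCommitter)
    .setParallelism(1);

// Stage 2: parallel rewrite of the files that were just committed.
DataStream<RewriteResult> rewritten = committed
    .transform("FileRewriter", rewriteResultTypeInfo, rewriteOperator)
    .setParallelism(rewriteParallelism);

// Stage 3: second committer that commits the rewrite as a replace operation.
rewritten
    .transform("RewriteCommitter", voidTypeInfo, rewriteCommitter)
    .setParallelism(1);
```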
I made the compaction parallel at first, but I abandoned that approach because I think it could lead to more than one compaction executing at the same time when the current compaction runs over time. In addition, in this PR the compaction does not block data processing in the Flink job, because the data channel is unblocked as soon as the snapshot function completes; so from the point of view of data processing, the compaction in this PR already runs in parallel.
I share the same concern as @rdblue. It seems to me that this implementation basically has a single committer task/thread that reads all rows from a CombinedScanTask (files batched by BaseRewriteDataFilesAction) and writes them out. How is this different from just configuring the StreamFileWriter with a parallelism of 1? If we make it a truly parallel rewrite/compaction action, I am a little concerned about the complexity we are adding to the Flink streaming ingestion path.
@stevenzwu
If the writers are parallel (like 100) and the compactor has a parallelism of 1, it is likely the compactor can't keep up with the workload. Even though compaction is running asynchronously with snapshotState(), it will eventually back up/block the threads executing notifyCheckpointComplete(). In the streaming ingestion path, there are a few things we can do or improve to mitigate the small files problem.
Even with the above changes, Flink streaming ingestion can still generate small files. The parallel compactors and 2nd committer that Ryan mentioned might be able to keep up with the throughput. However, personally I would rather not over-complicate the streaming ingestion path and make it less stable. Let's get the data into long-term data storage (like Iceberg tables) first. Other optimizations (like compaction or sorting) can happen in the background with scheduled (Spark) batch jobs.
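For reference, a separate scheduled compaction job with the existing Iceberg actions API looks roughly like this (a sketch; the partition column, its value, and how the Table is loaded are placeholders, and the exact Actions class depends on the Spark module/version in use):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.Actions;
import org.apache.iceberg.expressions.Expressions;

public class ScheduledCompaction {
  // Runs outside the streaming job, e.g. from a cron-scheduled batch application.
  public static void compactHourlyPartition(Table table, String hour) {
    Actions.forTable(table)
        .rewriteDataFiles()
        .filter(Expressions.equal("hour", hour))   // limit the rewrite to one partition
        .targetSizeInBytes(512 * 1024 * 1024L)     // ~512 MB target file size
        .execute();
  }
}
```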
@stevenzwu In theory that could happen, but the implementation in this PR is similar to the custom policy in Flink's Hive committer, and users (including me) often use a custom policy to compact Hive files. In most of my scenarios there is no problem, and I am doing stress testing for this PR and have not found clear errors so far.
@hameizi can you try some setup like this for the stress testing with your auto compaction change here?
I would also echo @kbendick's comment above. Currently, we are reading everything in (regardless of the file sizes). This assumes all/most files are small and can benefit from a compaction rewrite, but I am not sure that assumption is valid for broad use cases.
I'm sorry that my test maybe doesn't meet the standard you describe. I can run another test later.
@stevenzwu I tested one case: parallelism of 10, 30,000+ records/sec (max 60,000+/sec), partitioned by hour, 3 GB of memory per taskmanager. It works without problems for me.
I have also been following this thread although I did not make any comments. Let me add some thoughts since I see you are making some new changes. I am mostly on the same line of thought as @stevenzwu: I am a bit worried about the scalability of the current implementation, and I think the parallel commit proposal that @rdblue made could work, but in the end running compaction in a streaming pipeline is likely unnecessary complication. So far we have been advocating for streaming pipelines to just commit new files to storage, and to use a separate process to handle compaction at the same time. Having the streaming pipeline also do compaction would mean that there might be 2 compaction processes competing with each other. This becomes especially complicated and prone to error when you have both batch jobs and streaming pipelines running at the same time (e.g. normal streaming + daily loading of corrected and late data). I understand it is likely a good optimization for simple use cases, but I would expect it to be a feature that requires a lot of in-depth knowledge to use safely and correctly if we open it for general usage. I wonder what the initial drive behind this implementation is. Do you just want to avoid a separate Spark cluster to run compaction in Spark? If we have Flink actions specifically for …
@jackye1995 In this PR the Flink rewrite action runs in parallel with data processing and will not slow it down, because once the snapshot function succeeds Flink continues processing data without waiting for the result of notifyCheckpointComplete.
Auto-compacting files on every checkpoint in Flink solves several problems.
@hameizi by parallel, we mean multiple executors/tasks executing the rewrite. Last time I checked, this PR runs the whole rewrite action synchronously in the single committer task; that is the main scalability concern we have. Also, notifyCheckpointComplete (and snapshotState) executes in the mailbox thread; if the notifyCheckpointComplete/rewrite takes a long time to finish, it can delay checkpoint execution. I share the same philosophy as Jack on keeping the streaming ingestion simple and stable. It is critical to reliably ingest data into long-term data storage (like Iceberg) first, as streaming input (like Kafka) typically has short retention.
Regarding this issue, I agree that lock-stepping commit + compaction can avoid the problem. But it is not a solution for the general problem, because other users probably have separate compaction jobs (e.g. in Spark). There are other, more sophisticated compaction/rewrite actions that probably can't be supported by a single-task rewrite action at scale.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
This adds a feature so that Flink automatically compacts small files when writing to Iceberg, controlled by a new config option "write.auto-compact-files".
Inserting data into Iceberg can generate many small files, so this PR tries to automatically compact those small files when Flink inserts into Iceberg.
In this PR, the Flink function IcebergFilesCommitter.snapshotState (the first step of a Flink checkpoint) generates a rewrite action. We collect all partitions touched by the transaction to build a partition filter and set it on the rewrite action, so that files are compacted grouped by partition.
The rewrite action then runs in the Flink function IcebergFilesCommitter.notifyCheckpointComplete (the last step of a Flink checkpoint): notifyCheckpointComplete commits the transaction first, and then executes the rewrite action.
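As a condensed sketch of that flow (illustrative only, not the actual diff: buildPartitionFilter, pendingDataFiles, pendingRewrite, autoCompactEnabled and targetFileSizeBytes are hypothetical names, and the rewrite uses the Flink Actions API from org.apache.iceberg.flink.actions):

```java
// Inside IcebergFilesCommitter (sketch).

@Override
public void snapshotState(StateSnapshotContext context) throws Exception {
  super.snapshotState(context);
  // ... existing logic stages the data files collected for this checkpoint ...

  // Restrict the rewrite to the partitions touched by this checkpoint so the
  // compaction only scans partitions that just received new files.
  Expression partitionFilter = buildPartitionFilter(pendingDataFiles);
  this.pendingRewrite = Actions.forTable(table)
      .rewriteDataFiles()
      .filter(partitionFilter)
      .targetSizeInBytes(targetFileSizeBytes);
}

@Override
public void notifyCheckpointComplete(long checkpointId) throws Exception {
  // ... existing logic commits the transaction for this checkpoint first ...
  if (autoCompactEnabled && pendingRewrite != null) {
    pendingRewrite.execute();   // then compact the files written by that transaction
    pendingRewrite = null;
  }
}
```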