Conversation

@zhongyujiang
Contributor

This PR fixes the unit test TestFlinkSink#testHashDistributeMode, which fails occasionally in Flink CI and has been discussed at length in #2989 and #3365.
I think the root cause is the way notifyCheckpointComplete works in IcebergFilesCommitter:

public void notifyCheckpointComplete(long checkpointId) throws Exception {
  super.notifyCheckpointComplete(checkpointId);
  // It's possible that we have the following events:
  //   1. snapshotState(ckpId);
  //   2. snapshotState(ckpId+1);
  //   3. notifyCheckpointComplete(ckpId+1);
  //   4. notifyCheckpointComplete(ckpId);
  // For step#4, we don't need to commit iceberg table again because in step#3 we've committed all the files,
  // Besides, we need to maintain the max-committed-checkpoint-id to be increasing.
  if (checkpointId > maxCommittedCheckpointId) {
    commitUpToCheckpoint(dataFilesPerCheckpoint, flinkJobId, checkpointId);
    this.maxCommittedCheckpointId = checkpointId;
  }
}

private void commitDeltaTxn(NavigableMap<Long, WriteResult> pendingResults, String newFlinkJobId, long checkpointId) {
  int deleteFilesNum = pendingResults.values().stream().mapToInt(r -> r.deleteFiles().length).sum();
  if (deleteFilesNum == 0) {
    // To be compatible with iceberg format V1.
    AppendFiles appendFiles = table.newAppend();
    int numFiles = 0;
    for (WriteResult result : pendingResults.values()) {
      Preconditions.checkState(result.referencedDataFiles().length == 0, "Should have no referenced data files.");
      numFiles += result.dataFiles().length;
      Arrays.stream(result.dataFiles()).forEach(appendFiles::appendFile);
    }
    commitOperation(appendFiles, numFiles, 0, "append", newFlinkJobId, checkpointId);
  } else {

As shown above, the results of multiple checkpoints may be merged into one commit in streaming mode, and the checkpoint interval here (400 ms) is rather small, which makes this situation very likely.

Increasing the checkpoint interval would reduce such failures, but in theory they cannot be completely eliminated. So I simply made this unit test apply only to batch mode, which in my opinion is enough to validate hash distribute mode.
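For illustration, here is a minimal sketch of what gating the test to batch mode could look like, assuming a JUnit 4 test with a parameterized isStreamingJob flag (the flag and class names are placeholders, not the actual fields in the test):

import org.junit.Assume;
import org.junit.Test;

public class TestHashDistributeModeSketch {
  // Hypothetical parameterized flag; the real test wires this through its constructor.
  private final boolean isStreamingJob = false;

  @Test
  public void testHashDistributeMode() {
    // Skip the streaming variant: merged checkpoint results make the
    // one-file-per-partition assertion unreliable there.
    Assume.assumeFalse("Hash distribution is only validated in batch mode", isStreamingJob);
    // ... write the rows and assert one data file per partition ...
  }
}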

@openinx @szehon-ho could you help review this? Thanks!

@github-actions github-actions bot added the flink label Feb 14, 2022
@szehon-ho
Member

@stevenzwu what do you think?

@rdblue
Contributor

rdblue commented Feb 16, 2022

@zhongyujiang, I think I would prefer a fix that avoids the root cause but still runs the test in streaming mode. I understand your concern about not being able to necessarily guarantee we won't have a flaky test, but we can probably set that high enough (1s?) that we don't see it in practice.

@zhongyujiang
Contributor Author

@rdblue I have updated the PR. I'm not sure 1s is high enough, but let's give it a try first.

Assert.assertEquals("There should be 1 data file in partition 'ccc'", 1,
    SimpleDataUtil.matchingPartitions(dataFiles, table.spec(), ImmutableMap.of("data", "ccc")).size());
Assert.assertTrue("There should be no more than 1 data file in partition 'aaa'",
    SimpleDataUtil.matchingPartitions(dataFiles, table.spec(), ImmutableMap.of("data", "aaa")).size() < 2);
Contributor Author

I changed the assert condition because, if there are multiple checkpoints, data may arrive like this:
ck1: (1, "aaa")
ck2: (1, "bbb")
...
So I think we should assert that each snapshot has no more than 1 file per partition, since it could be 0 files as well.

Contributor

It is unclear to me how this change of assertion is related to the potential cause you described, where 2 checkpoint cycles can be committed in one shot. In that case we can get 2 files for one partition; why would we get 0 files for a partition?

Contributor Author

I didn't figure out a way to validate hash distribution when the results of multiple checkpoints are merged, so originally I simply disabled this test in streaming mode.
This update is for rdblue's comment:

I think I would prefer a fix that avoids the root cause but still runs the test in streaming mode. I understand your concern about not being able to necessarily guarantee we won't have a flaky test, but we can probably set that high enough (1s?) that we don't see it in practice.

I increased the checkpoint interval to 1000 ms to reduce the possibility of merged results. And in streaming mode, I think the original assertion is not right given the checkpoint scenario mentioned in my last comment.
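As a hedged sketch of that change (assuming the test drives checkpointing through StreamExecutionEnvironment; the exact configuration point in the real test may differ):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Previously 400 ms, which makes checkpoints frequent enough that notifications
// regularly subsume more than one checkpoint into a single commit.
env.enableCheckpointing(1000);  // a longer interval makes merged commits unlikely, not impossible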

Contributor

The error you encountered is that the value is 2 (not 1). Hence I said this change from == 1 to < 2 won't even work around the error. Anyway, it seems that the other discussions in the PR have already led us to the right root cause and solution.

@stevenzwu
Contributor

@zhongyujiang Flink by default only allows 1 concurrent checkpoint. Could the scenario you described happen in that case?

@openinx
Member

openinx commented Feb 17, 2022

@zhongyujiang, what's the current failure stacktrace you encountered? I'd like to take a careful look at this problem, and I hope we can fix it in this round of work.

@zhongyujiang
Contributor Author

@zhongyujiang Flink by default only allows 1 concurrent checkpoint. Could the scenario you described happen in that case?

I think it's not related to Flink's concurrent checkpoint limit but to the way notifyCheckpointComplete works. Quoting from the method docs of notifyCheckpointComplete:

Notifies the listener that the checkpoint with the given checkpointId completed and was committed.
These notifications are "best effort", meaning they can sometimes be skipped. To behave properly, implementers need to follow the "Checkpoint Subsuming Contract". Please see the class-level JavaDocs for details.
Please note that checkpoints may generally overlap, so you cannot assume that the notifyCheckpointComplete() call is always for the latest prior checkpoint (or snapshot) that was taken on the function/operator implementing this interface. It might be for a checkpoint that was triggered earlier. Implementing the "Checkpoint Subsuming Contract" (see above) properly handles this situation correctly as well.
Please note that throwing exceptions from this method will not cause the completed checkpoint to be revoked. Throwing exceptions will typically cause task/job failure and trigger recovery.

IcebergFilesCommitter commits to Iceberg once it gets notified, and such notifications do not have the same ordering guarantees as Flink checkpoints.

@zhongyujiang
Contributor Author

@zhongyujiang, what's the current failure stacktrace you encountered? I'd like to take a careful look at this problem, and I hope we can fix it in this round of work.

Like this:

java.lang.AssertionError: There should be 1 data file in partition 'aaa' expected:<1> but was:<2>

I haven't encountered it locally yet.

@zhongyujiang
Contributor Author

@openinx Found one in CI:

org.apache.iceberg.flink.TestFlinkTableSink > testHashDistributeMode[catalogName=testhadoop, baseNamespace=, format=ORC, isStreaming=true] FAILED
java.lang.AssertionError: There should be 1 data file in partition 'aaa' expected:<1> but was:<2>
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.apache.iceberg.flink.TestFlinkTableSink.testHashDistributeMode(TestFlinkTableSink.java:283)

@openinx
Member

openinx commented Feb 17, 2022

Reconsidering this test case, I think @zhongyujiang is heading in the right direction on the root cause. Let me explain the cause here:

In the unit test case, we are trying to write the following records into the Apache Iceberg table, shuffled by the partition field data (the parallelism is 2):

(1, 'aaa'), (1, 'bbb'), (1, 'ccc')
(2, 'aaa'), (2, 'bbb'), (2, 'ccc')
(3, 'aaa'), (3, 'bbb'), (3, 'ccc')

Since multiple checkpoints may be produced while the streaming job is running, it's possible that the records are written across the following checkpoints:

  • checkpoint#1

    • (1, 'aaa')
    • (1, 'bbb')
    • (1, 'ccc')
  • checkpoint#2

    • (2, 'aaa')
    • (2, 'bbb')
    • (2, 'ccc')
    • (3, 'aaa')
    • (3, 'bbb')
    • (3, 'ccc')

Then a separate data file will be produced for each partition within each checkpoint. Let's say:

  • checkpoint#1

    • produces data-file-1 for partition aaa
    • produces data-file-2 for partition bbb
    • produces data-file-3 for partition ccc
  • checkpoint#2

    • produces data-file-4 for partition aaa
    • produces data-file-5 for partition bbb
    • produces data-file-6 for partition ccc

Assume the snapshotState & notifyCheckpointComplete calls arrive in the following order:

  1. snapshotState(ckpt1);
  2. snapshotState(ckpt2);
  3. notifyCheckpointComplete(ckpt2); (this is possible, just as the Flink javadoc says)
  4. notifyCheckpointComplete(ckpt1);

Then in step#3, it will commit one transaction with all the data files from checkpoint#1 & checkpoint#2 (according to this IcebergFilesCommitter implementation), so the latest snapshot will include all the data files from data-file-1 to data-file-6. That is why we encounter the assertion failure:

java.lang.AssertionError: There should be 1 data file in partition 'aaa' expected:<1> but was:<2>
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at org.apache.iceberg.flink.TestFlinkTableSink.testHashDistributeMode(TestFlinkTableSink.java:283)
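To make the merging step concrete, here is a condensed, hypothetical illustration of the subsuming behavior (the map name mirrors the snippet at the top of the PR, but the value type and body are a sketch, not the actual Iceberg implementation):

import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Pending data files keyed by checkpoint id, with values simplified to file names.
NavigableMap<Long, List<String>> dataFilesPerCheckpoint = new TreeMap<>();
dataFilesPerCheckpoint.put(1L, Arrays.asList("data-file-1", "data-file-2", "data-file-3"));
dataFilesPerCheckpoint.put(2L, Arrays.asList("data-file-4", "data-file-5", "data-file-6"));

// notifyCheckpointComplete(ckpt2) arrives before notifyCheckpointComplete(ckpt1):
// committing "up to checkpoint 2" subsumes checkpoint 1 as well.
NavigableMap<Long, List<String>> pending = dataFilesPerCheckpoint.headMap(2L, true);
// pending spans both checkpoints, so the single resulting snapshot contains two data
// files per partition -- which is exactly the "expected:<1> but was:<2>" failure above.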

@openinx
Member

openinx commented Feb 17, 2022

But I generally don't think the current fix is in the correct direction; here are my points:

  • Indeed, increasing the checkpoint interval from 400 ms to 1000 ms reduces the probability of hitting this assertion failure, but it does not resolve the real underlying problem. So I don't think increasing the checkpoint interval is the right fix.
  • Asserting that the snapshot's data file count is less than 2 does not change anything, in my view.

I think the real intention behind this unit test is: we want to ensure that there is only one generated data file in each given partition when we commit those rows in one single deterministic Iceberg transaction, once the switch write.distribution-mode=hash is enabled in both Flink streaming & batch jobs.

The current root cause is that we cannot force only one checkpoint for the given 9 rows in the Flink streaming SQL job. So I think the correct direction is: make a single checkpoint write those 9 rows, and then we can still assert that there is only one data file in each given partition. To accomplish this, I think we can use the BoundedTestSource to reimplement this unit test. As for BoundedTestSource, here is a good example of how to produce multiple rows into a single checkpoint.
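Here is a rough sketch of that idea, assuming BoundedTestSource takes its input as per-checkpoint batches like in the example linked above (the import path, constructor shape, and surrounding wiring are assumptions, not the final test code):

import java.util.Arrays;
import java.util.List;
import org.apache.flink.types.Row;
import org.apache.iceberg.flink.source.BoundedTestSource;

// Put all nine rows into ONE inner list so the source emits them within a single
// checkpoint, which yields exactly one Iceberg commit to assert against.
List<List<Row>> elementsPerCheckpoint = Arrays.asList(
    Arrays.asList(
        Row.of(1, "aaa"), Row.of(1, "bbb"), Row.of(1, "ccc"),
        Row.of(2, "aaa"), Row.of(2, "bbb"), Row.of(2, "ccc"),
        Row.of(3, "aaa"), Row.of(3, "bbb"), Row.of(3, "ccc")));

BoundedTestSource<Row> source = new BoundedTestSource<>(elementsPerCheckpoint);
// env.addSource(source, rowTypeInfo) -> FlinkSink with write.distribution-mode=hash,
// then assert exactly one data file per partition in the single resulting snapshot.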

@zhongyujiang
Contributor Author

Asserting that the snapshot's data file count is less than 2 does not change anything, in my view.

Changing the assertion condition is not actually intended to solve the validation problem when there are merged results. Like you said, there could be more than one checkpoint in streaming mode, but there is no guarantee that each checkpoint contains data for every partition. The situation could be like this:

  • checkpoint#1
    • (1, 'aaa')
  • checkpoint#2
    • (1, 'bbb')
      ...

When the results of ck1 and ck2 are not merged, the snapshot of ck1 would have only 1 data file for partition aaa and 0 files for the other partitions, and the snapshot of ck2 is similar; that's why I changed the assertion condition.

To accomplish this, I think we can use the BoundedTestSource to reimplement this unit test. As for BoundedTestSource, here is a good example of how to produce multiple rows into a single checkpoint.

I also wanted to solve the problem by controlling the checkpointing in the beginning, but I didn't figure out a convenient way to do so. Using BoundedTestSource seems like a feasible way; I'll try it. @openinx Thanks for your advice.

@stevenzwu
Contributor

@openinx I read the javadoc that you linked. It seems that notifyCheckpointComplete can be skipped because it is best effort. It also says that the notification can't be assumed to be for the latest snapshot, but it doesn't say whether they can come out of order. So the scenario could be:

snapshotState(ckpt1);
// notifyCheckpointComplete(ckpt1); (missed)
snapshotState(ckpt2);
notifyCheckpointComplete(ckpt2); 

Agreed that precise control over the source could be the right solution here.

@yittg
Contributor

yittg commented Feb 21, 2022

@openinx @rdblue I added some logging here about this PR.
