@jshmchenxi
Contributor

@jshmchenxi jshmchenxi commented May 18, 2021

For #2482
If the parent snapshot has expired and the starting snapshot ID is null, we should add all manifests of the current snapshot and stop walking the history. This way we will not run into the exception "Cannot determine history ..."
Added tests covering Flink CDC writes with an expired snapshot.

@jshmchenxi jshmchenxi marked this pull request as draft May 18, 2021 07:37
@github-actions github-actions bot added the flink label May 19, 2021
@jshmchenxi jshmchenxi changed the title from "Core: Set snapshot parentId to null when parent is expired" to "Core: Deal with expired parent snapshot in MergingSnapshotProducer#validateDataFilesExist" May 19, 2021
@jshmchenxi jshmchenxi marked this pull request as ready for review May 20, 2021 02:41
@jshmchenxi
Contributor Author

@openinx Hi, this should fix #2482. Could you take a look, please?

// add all manifests of current snapshot and break
Long parentId = currentSnapshot.parentId();
boolean shouldAddManifestsAndBreak =
startingSnapshotId == null && parentId != null && ops.current().snapshot(parentId) == null;
Contributor

I think that the logic using shouldAddManifestsAndBreak is incorrect. The situation this is trying to account for is when the entire table history should be considered (startingSnapshotId is null), but only partial history is available (parentId is not null but the snapshot is missing). I think the idea here is that this can account for lost history by looking across all manifests for a delete.

I don't think that this is implemented correctly for a few reasons:

  • If the oldest available snapshot is an append snapshot, then the outer check on the snapshot operation below will prevent its manifests from being added as expected.
  • The manifest group that scans for deletes filters entries to just the snapshot ids that were new since the starting id (newSnapshots). When history is missing, that filter would need to be turned off or else the entries will be ignored even if you scan older manifests.
  • When reading the manifests to check for deletes, missing history could have removed those deletes in a manifest rewrite. If I delete file A, it will show up in a delete entry in a snapshot. But once I rewrite the manifest that encoded that delete, the delete is no longer kept. So expiring history can still cause a problem.

Because of that last flaw, I don't think that there is a way to do this safely by searching for deletes. The only way to do this validation without history is to scan the entire table for the required data files.
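The second and third points can be illustrated with a small, self-contained model. This is not Iceberg's API: the `Status`, `Entry`, and `Manifest` types and the `rewrite` and `deleteFound` helpers below are hypothetical simplifications, assuming only that a manifest entry records a file path, a snapshot ID, and an ADDED, EXISTING, or DELETED status, and that a manifest rewrite keeps only live (non-DELETED) entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class LostDeleteDemo {
  // Hypothetical, simplified stand-ins for manifest entry status and contents.
  enum Status { ADDED, EXISTING, DELETED }

  record Entry(String path, long snapshotId, Status status) {}

  record Manifest(List<Entry> entries) {}

  // Assumed rewrite semantics: live entries are carried over as EXISTING,
  // and DELETED entries are dropped.
  static Manifest rewrite(Manifest manifest) {
    List<Entry> kept = new ArrayList<>();
    for (Entry e : manifest.entries()) {
      if (e.status() != Status.DELETED) {
        kept.add(new Entry(e.path(), e.snapshotId(), Status.EXISTING));
      }
    }
    return new Manifest(kept);
  }

  // Delete-based validation: a required file counts as removed only if a DELETED
  // entry for it is found. When newSnapshots is non-null, only entries written
  // by those snapshots are considered (the filter from the second point).
  static boolean deleteFound(List<Manifest> manifests, String path, Set<Long> newSnapshots) {
    for (Manifest m : manifests) {
      for (Entry e : m.entries()) {
        boolean inRange = newSnapshots == null || newSnapshots.contains(e.snapshotId());
        if (e.status() == Status.DELETED && e.path().equals(path) && inRange) {
          return true;
        }
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Snapshot 2 deletes file A, which is recorded as a DELETED entry.
    Manifest afterDelete = new Manifest(List.of(
        new Entry("A", 2L, Status.DELETED),
        new Entry("B", 1L, Status.EXISTING)));
    System.out.println(deleteFound(List.of(afterDelete), "A", null)); // true
    // Filtering to "new" snapshots hides a delete written by an older snapshot.
    System.out.println(deleteFound(List.of(afterDelete), "A", Set.of(3L))); // false
    // After a manifest rewrite the DELETED entry is gone, so the delete is lost.
    System.out.println(deleteFound(List.of(rewrite(afterDelete)), "A", null)); // false
  }
}
```

Even with the snapshot filter turned off, the last call shows that searching for deletes cannot detect that A was removed once the manifest recording the delete has been rewritten.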

}

@Test
public void testChangeLogOnIdKeyWithSnapshotExpire() throws Exception {
Contributor

Since this is updating a file in core, I would expect the tests to be more specific and located in core, not in the Flink integration.

@rdblue
Contributor

rdblue commented Jun 15, 2021

@jshmchenxi, see my comments above for more detail, but this approach can't work for the situation you describe. If you want to support this situation, then you would need to scan all table metadata and check, using added and existing entries rather than delete entries, whether the files that need to exist are still present. Since delete entries can be lost when history is cleaned up, you have to check for the actual files.

I'm going to close this PR since the approach won't work. Feel free to open another one with an alternative implementation that checks for added or existing entries for the required files.

Thanks for working on this!
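The alternative suggested above, checking for added or existing entries of the required files instead of searching for deletes, can be sketched in the same simplified model. Again, `Status`, `Entry`, `Manifest`, and `missingFiles` are hypothetical names, not Iceberg's API; the sketch assumes the caller can list every manifest of the current snapshot.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LiveFileCheck {
  // Hypothetical, simplified stand-ins; not Iceberg's manifest entry API.
  enum Status { ADDED, EXISTING, DELETED }

  record Entry(String path, Status status) {}

  record Manifest(List<Entry> entries) {}

  // Returns the required paths that are NOT live in the given manifests.
  // Only ADDED and EXISTING entries count; DELETED entries are ignored because
  // they may have been dropped by a manifest rewrite and cannot be relied on.
  static Set<String> missingFiles(List<Manifest> currentManifests, Set<String> required) {
    Set<String> live = new HashSet<>();
    for (Manifest m : currentManifests) {
      for (Entry e : m.entries()) {
        if (e.status() == Status.ADDED || e.status() == Status.EXISTING) {
          live.add(e.path());
        }
      }
    }
    Set<String> missing = new HashSet<>(required);
    missing.removeAll(live);
    return missing;
  }

  public static void main(String[] args) {
    // File A was deleted earlier and its DELETED entry was lost in a rewrite,
    // so only B is still live in the current snapshot's manifests.
    List<Manifest> current = List.of(new Manifest(List.of(new Entry("B", Status.EXISTING))));
    Set<String> missing = missingFiles(current, Set.of("A", "B"));
    if (!missing.isEmpty()) {
      // A real commit would fail validation at this point.
      System.out.println("Missing required files: " + missing);
    }
  }
}
```

The trade-off is cost: this reads every manifest of the current snapshot rather than only those written since the starting snapshot, which suggests using it only when the needed history is unavailable.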


2 participants