
Conversation

@hameizi
Contributor

hameizi commented Sep 29, 2021

This PR handles the case where RewriteFiles and RowDelta commit transactions at the same time, which can cause duplicate records in the table.
The approach follows the discussion in #2308 (comment). We write the old sequence number into the manifest-list file, the same sequence number the current snapshot had when the rewrite action started. When the rewrite action finishes, the replace snapshot is recorded with a new sequence number in metadata.json but with the old sequence number in the manifest-list file. When we read the replace snapshot, any files committed during the rewrite action have a larger sequence number than the replace snapshot, so if an equality-delete file was committed during the rewrite, it is still applied to the replace snapshot's data files.
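
A rough usage sketch of the new API (the helper method and variable names are illustrative; only setSequenceNumber is added by this PR):

import java.util.Set;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;

class RewriteWithSequenceNumberOverride {
  // startingSequenceNumber is table.currentSnapshot().sequenceNumber(), captured
  // when the rewrite action started planning, before any concurrent commits.
  static void commitRewrite(Table table, long startingSequenceNumber,
                            Set<DataFile> filesToRewrite, Set<DataFile> newFiles) {
    table.newRewrite()
        .rewriteFiles(filesToRewrite, newFiles)
        // Commit the replace snapshot using the old sequence number so that
        // equality-delete files committed while the rewrite was running (and
        // therefore newer) still apply to the rewritten data files.
        .setSequenceNumber(startingSequenceNumber)
        .commit();
  }
}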

@hameizi
Contributor Author

hameizi commented Sep 29, 2021

@openinx Can you help take a look?

@hameizi
Contributor Author

hameizi commented Sep 29, 2021

@yyanyy @stevenzwu @rdblue Can you help take a look?

@jackye1995
Contributor

Is this a duplicate of #3069?

@hameizi
Contributor Author

hameizi commented Sep 30, 2021

Is this a duplicate of #3069?

@jackye1995 It's not. This PR focuses on handling duplicate records when delete files are committed during a rewrite action. I've also updated the PR description to explain how this case is handled.

}

@Test
public void testNewDeleteFile() {
Contributor

Why are you removing tests? This is probably incorrect.

Contributor

I see that this is related to removing the validation in BaseRewriteFiles. Like I noted there, we want to keep that validation and use it in certain cases.

    if (replacedDataFiles.size() > 0) {
      // if there are replaced data files, there cannot be any new row-level deletes for those data files
      validateNoNewDeletesForDataFiles(base, startingSnapshotId, replacedDataFiles);
    }
Contributor

This should not be removed. There is a valid use case for the operation to check whether there are conflicts instead of re-sequencing data files. If you want, you can add a configuration method to enable/disable the validation.
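
A hypothetical shape for such a toggle (the method name and wording here are illustrative, not the PR's actual API):

public interface RewriteFilesValidationExample {
  /**
   * Enable validation that no new row-level delete files were committed for the
   * replaced data files since the starting snapshot. When enabled, a conflicting
   * RowDelta fails the rewrite commit instead of relying on re-sequencing to keep
   * the deletes applicable.
   *
   * @return this for method chaining
   */
  RewriteFilesValidationExample validateNoNewDeleteFiles();
}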

* @param sequenceNumber a sequenceNumber
* @return this for method chaining
*/
RewriteFiles setSequenceNumber(long sequenceNumber);
Contributor

I think we need better documentation for what this method does and when you would call it, since this affects correctness.
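
For example, a javadoc along these lines (the wording is a suggestion, not the PR's final text):

/**
 * Set the sequence number to use when writing the replace snapshot's manifest list.
 * <p>
 * Callers should pass the sequence number of the table's current snapshot captured
 * when the rewrite action started. Delete files committed while the rewrite was
 * running then carry newer sequence numbers than the replace snapshot, so they
 * still apply to the rewritten data files and no deleted rows are resurrected.
 *
 * @param sequenceNumber the sequence number that was current when the rewrite started
 * @return this for method chaining
 */
RewriteFiles setSequenceNumber(long sequenceNumber);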

  try (ManifestListWriter writer = ManifestLists.write(
-     ops.current().formatVersion(), manifestList, snapshotId(), parentSnapshotId, sequenceNumber)) {
+     ops.current().formatVersion(), manifestList, snapshotId(), parentSnapshotId,
+     operation().equals(DataOperations.REPLACE) && sequenceNumber() != null ? sequenceNumber() : sequenceNumber)) {
Contributor

I think this is a clever way to fix the problem. What we had considered before is setting the sequence number of individual files rather than the sequence number used for the manifest list. This is an easier way to set the sequence number for all new data files in the snapshot.

I need to think about whether this is the right approach a bit more. As long as we have a static sequence number, using inheritance is only a convenience. We could set that static sequence number on the individual data or delete files that we add in the commit. I tend to lean toward that solution because it minimizes the places that use the overridden sequence number -- so we know that the manifest list and manifests all have the latest sequence number that is assigned to the snapshot rather than also being re-sequenced.

I'll think about the trade-off some more.

*
* @return a string operation
*/
protected Long sequenceNumber() {
Contributor

I think this needs a more specific name, like sequenceNumberOverride


  @Override
  protected Long sequenceNumber() {
    return replaceSequenceNumber;
  }
Contributor

With this approach, I think we need a validation that none of the data or delete files that are being replaced have sequence numbers newer than the override sequence number.
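
For example, something along these lines (an illustrative sketch, not the PR's exact code):

import java.util.Map;

import org.apache.iceberg.ContentFile;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

class SequenceNumberOverrideValidation {
  // Fail the commit if any replaced data or delete file carries a sequence number
  // newer than the override, since moving such a file backwards in sequence order
  // could change which deletes apply to it.
  static void validateReplacedFiles(long overrideSequenceNumber,
                                    Map<ContentFile<?>, Long> replacedFileSequenceNumbers) {
    for (Map.Entry<ContentFile<?>, Long> entry : replacedFileSequenceNumbers.entrySet()) {
      Preconditions.checkArgument(entry.getValue() <= overrideSequenceNumber,
          "Cannot replace %s: its sequence number %s is newer than the override %s",
          entry.getKey().path(), entry.getValue(), overrideSequenceNumber);
    }
  }
}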

Contributor Author

The override sequence number is initialized when the compaction action generates fileScanTasks, see https://github.com/apache/iceberg/pull/3204/files#:~:text=long%20sequenceNumber%20%3D%20table.currentSnapshot().sequenceNumber()%3B, so it is guaranteed that none of the data or delete files being replaced have sequence numbers newer than the override sequence number.
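
The referenced line (decoded from the link's text fragment) captures the starting sequence number before planning:

long sequenceNumber = table.currentSnapshot().sequenceNumber();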

@hameizi
Contributor Author

hameizi commented Oct 11, 2021

@rdblue I pushed a new commit for your suggestions; can you review it one more time? The changes are as follows:
1. Add a configuration option to control whether to validate that no new delete files were added while compacting files.
2. Rename sequenceNumber to sequenceNumberOverride.
3. Update the documentation for the setSequenceNumber method.

With this approach, I think we need a validation that none of the data or delete files that are being replaced have sequence numbers newer than the override sequence number.

The override sequence number is initialized when the compaction action generates fileScanTasks, see https://github.com/apache/iceberg/pull/3204/files#:~:text=long%20sequenceNumber%20%3D%20table.currentSnapshot().sequenceNumber()%3B, so it is guaranteed that none of the data or delete files being replaced have sequence numbers newer than the override sequence number.

@hameizi
Contributor Author

hameizi commented Oct 18, 2021

@rdblue @openinx Could you help take a look again?

@rdblue
Contributor

rdblue commented Oct 19, 2021

Running CI.

github-actions bot added the spark label Oct 20, 2021
@hameizi
Contributor Author

hameizi commented Oct 25, 2021

@rdblue Hello, is there any progress?

@rdblue
Contributor

rdblue commented Oct 25, 2021

I've been thinking about this case and I think that the right way to do this is to set the sequence number on individual files rather than at the snapshot level. I don't think that we should change the sequence number of the snapshot or manifest list. We should just set the sequence number of individual data files. Basically, I agree with @yyanyy's comment:

I think there are two seqNum concepts here: seqNum for the table/commit and seqNum for the file. I think it's a reasonable approach to mark the rewritten files with the old seqNum, but I'm not sure we necessarily need to use an old sequence number for the commit, since they are stored as part of the snapshot, and suddenly having an old seqNum within the snapshot could be confusing even if there's no other implication.

I think we need to set the file sequence numbers. That raises the question: what do we set them to? Ideally, we would use the latest, but the delete file commit has already claimed that number, so we need to go with a sequence number that is lower than that of any other commit. That would be the sequence number that was current when the rewrite operation started. It would be nice to find a way around reusing a sequence number from a different snapshot, but I don't see a good way to do that right now. We can possibly fix that up later by skipping sequence numbers.
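
A conceptual sketch of that alternative (hypothetical; this is not the API from #3480):

// Conceptual sketch only; not Iceberg's actual representation of per-file sequence numbers.
class PerFileSequenceNumberSketch {
  static final class FileEntry {
    final String path;
    final long dataSequenceNumber;

    FileEntry(String path, long dataSequenceNumber) {
      this.path = path;
      this.dataSequenceNumber = dataSequenceNumber;
    }
  }

  // The snapshot and its manifest list keep the newly assigned sequence number;
  // only the rewritten file entries are stamped with the older sequence number
  // that was current when the rewrite started, so deletes committed during the
  // rewrite (which are newer) still apply to them.
  static FileEntry resequencedEntry(String rewrittenFilePath, long startingSequenceNumber) {
    return new FileEntry(rewrittenFilePath, startingSequenceNumber);
  }
}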

@hameizi
Contributor Author

hameizi commented Oct 26, 2021

@rdblue I think introducing per-file sequence numbers would add unnecessary complexity. The manifest-list sequence number can already represent the file commit sequence. A per-file sequence number would only express that the snapshot and the data file have different sequences, but the manifest-list sequence number can express that too, because the manifest-list file contains the snapshot id, so we don't need the snapshot and the manifest list to hold the same sequence number for one snapshot. So even if we introduce per-file sequence numbers, the effect is the same as using the manifest-list sequence number.

@rdblue
Contributor

rdblue commented Nov 7, 2021

@hameizi, can you help review and test #3480? That's an alternative approach to what you're doing here that sets the sequence number per data file. I think that change is actually really important. While I was reviewing this, I thought that it was probably not a good idea to set the sequence number by reusing inheritance. Now that we've thought through the use case more, we've come up with a good reason not to: it makes it so we can't recover the sequence number where files were added, not just the sequence number where the data lives in time.
