Skip to content

Fix incorrect results when writing deletion vectors in Delta Lake#23231

Merged
ebyhr merged 4 commits intomasterfrom
ebi/delta-deletion-vectors-correctness
Sep 18, 2024
Merged

Fix incorrect results when writing deletion vectors in Delta Lake#23231
ebyhr merged 4 commits intomasterfrom
ebi/delta-deletion-vectors-correctness

Conversation

@ebyhr
Copy link
Copy Markdown
Member

@ebyhr ebyhr commented Sep 2, 2024

Description

Fixes #23229

Release notes

# Delta Lake
* Fix incorrect results when writing deletion vectors. ({issue}`23229`)

@cla-bot cla-bot bot added the cla-signed label Sep 2, 2024
@github-actions github-actions bot added the delta-lake Delta Lake connector label Sep 2, 2024
@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors-correctness branch from 5eb9e07 to e1cd4ba Compare September 2, 2024 03:56
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test that fails for this case deterministically ?
I don't get why this failure was flaky

@ebyhr
Copy link
Copy Markdown
Member Author

ebyhr commented Sep 2, 2024

The test is still flaky. Probably, Databricks/Delta Lake suspects deletion vectors is used even for rewriting all rows.

@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors-correctness branch from e1cd4ba to ff6fcb0 Compare September 2, 2024 05:15
@ebyhr ebyhr marked this pull request as draft September 2, 2024 06:19
@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors-correctness branch from 570ea53 to 513a53a Compare September 17, 2024 23:12
@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors-correctness branch from 513a53a to 6b47720 Compare September 18, 2024 00:29
@ebyhr ebyhr marked this pull request as ready for review September 18, 2024 08:12
@ebyhr
Copy link
Copy Markdown
Member Author

ebyhr commented Sep 18, 2024

Pushed a new commit to fix correctness issue when rewriting all rows.

long rowCount = parquetMetadata.getBlocks().stream().map(BlockMetadata::rowCount).mapToLong(Long::longValue).sum();
RoaringBitmapArray rowsRetained = new RoaringBitmapArray();
rowsRetained.addRange(0, rowCount);
rowsRetained.addRange(0, rowCount - 1);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a test which detects this issue ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following line throws an exception without this change:

        assertThat(getEntriesFromJson(3, tableLocation + "/_delta_log", FILE_SYSTEM).orElseThrow().get(1).getRemove().deletionVector().orElseThrow())
                .isEqualTo(deletionVector);

(Not ideal test, but I assume it's enough)

@ebyhr ebyhr merged commit 4c0e301 into master Sep 18, 2024
@ebyhr ebyhr deleted the ebi/delta-deletion-vectors-correctness branch September 18, 2024 11:52
long rowCount = parquetMetadata.getBlocks().stream().map(BlockMetadata::rowCount).mapToLong(Long::longValue).sum();
RoaringBitmapArray rowsRetained = new RoaringBitmapArray();
rowsRetained.addRange(0, rowCount);
rowsRetained.addRange(0, rowCount - 1);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was happening as a consequence of having a too long range before?

@raunaqmorarka
Copy link
Copy Markdown
Member

@ebyhr are tables which received writes using this feature in previous releases broken ?
We should add a note in previous releases notes about this and steps to fix if needed.
cc: @mosabua

{
String sourceRelativePath = relativePath(rootTableLocation.toString(), sourcePath);
DeltaLakeMergeResult result = new DeltaLakeMergeResult(deletion.partitionValues(), Optional.of(sourceRelativePath), Optional.empty());
DeletionVectorEntry deletionVector = deletionVectors.get(sourceRelativePath);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow-up: do we need any special handling for deletion vectors from shallowly cloned tables?

long deletionTimestamp,
boolean dataChange)
boolean dataChange,
Optional<DeletionVectorEntry> deletionVector)
Copy link
Copy Markdown
Contributor

@findinpath findinpath Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does delta lake need deletionVector in the remove file entries?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed

        Optional<DeletionVectorEntry> deletionVector = Optional.empty();
        if (deletionVectorsEnabled) {
            deletionVector = Optional.ofNullable(remove.getRow("deletionVector"))
                    .map(row -> parseDeletionVectorFromParquet(session, row, removeDeletionVectorType.orElseThrow()));
        }

@github-actions github-actions bot added this to the 459 milestone Sep 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

Flaky TestDeltaLakeDeleteCompatibility.testDeletionVectors

3 participants