Add support for writing deletion vectors in Delta Lake #22102

Merged
ebyhr merged 3 commits into master from ebi/delta-write-deletion-vector on Aug 14, 2024

@ebyhr (Member) commented May 24, 2024

Description

Fixes #17063

Release notes

(x) Release notes are required, with the following suggested text:

# Delta Lake
* Add support for writing [deletion vectors](https://docs.delta.io/latest/delta-deletion-vectors.html). ({issue}`17063`)

@cla-bot cla-bot bot added the cla-signed label May 24, 2024
@github-actions github-actions bot added the delta-lake Delta Lake connector label May 24, 2024
@ebyhr ebyhr force-pushed the ebi/delta-write-deletion-vector branch from 72af1af to 8ea6475 Compare May 24, 2024 06:57
@ebyhr ebyhr force-pushed the ebi/delta-write-deletion-vector branch 3 times, most recently from a55bc53 to 7d52f10 Compare May 27, 2024 06:43
@ebyhr ebyhr force-pushed the ebi/delta-write-deletion-vector branch from 7d52f10 to e3ce7cb Compare June 10, 2024 09:07
@ebyhr ebyhr force-pushed the ebi/delta-write-deletion-vector branch from e3ce7cb to 8f8cddc Compare June 26, 2024 00:54
@ebyhr ebyhr force-pushed the ebi/delta-write-deletion-vector branch 4 times, most recently from 4152170 to 8ab2964 Compare July 17, 2024 08:20
@ebyhr ebyhr force-pushed the ebi/delta-write-deletion-vector branch 4 times, most recently from 5242022 to 42769ca Compare July 25, 2024 04:40
@ebyhr ebyhr marked this pull request as ready for review July 25, 2024 04:52
@ebyhr ebyhr force-pushed the ebi/delta-write-deletion-vector branch 2 times, most recently from 37eaaa8 to c9bc2da Compare July 30, 2024 08:49
@ebyhr ebyhr force-pushed the ebi/delta-write-deletion-vector branch from c9bc2da to 70ead35 Compare July 30, 2024 10:46
@ebyhr ebyhr force-pushed the ebi/delta-write-deletion-vector branch from 70ead35 to b2de0c2 Compare July 30, 2024 15:23
deletedRows.or(deletion.rowsDeletedByDelete());
deletedRows.or(deletion.rowsDeletedByUpdate());

TrinoInputFile inputFile = fileSystem.newInputFile(Location.of(path.toStringUtf8()));
Member

Do we know the size of the deletion vector in advance, from io.trino.plugin.deltalake.transactionlog.DeletionVectorEntry#sizeInBytes or any other metadata?
If we do, passing it to newInputFile would probably save a file system call.

Member Author

@ebyhr ebyhr Aug 13, 2024

This inputFile is a Parquet file, not a deletion vector. I think that's possible, but it requires some refactoring. Let me handle it in a follow-up.
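
The saving under discussion comes from handing a known length to the file system up front, so that no metadata lookup is needed later. Below is a minimal sketch of that idea, assuming a toy in-memory file system. `KnownLengthSketch`, `CountingFileSystem`, and `InputFile` are hypothetical names for illustration and are not Trino classes; the two `newInputFile` overloads are only modelled on the pattern of passing a length alongside the location.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: none of these types are Trino's.
final class KnownLengthSketch
{
    private KnownLengthSketch() {}

    // Toy file system that counts how often it must look up a file's length.
    static final class CountingFileSystem
    {
        private final Map<String, Long> lengths = new HashMap<>();
        private int metadataCalls;

        void addFile(String path, long length)
        {
            lengths.put(path, length);
        }

        // Length unknown: the input file must ask the file system later.
        InputFile newInputFile(String path)
        {
            return new InputFile(this, path, -1);
        }

        // Length supplied by the caller, for example from table metadata.
        InputFile newInputFile(String path, long length)
        {
            return new InputFile(this, path, length);
        }

        long lookUpLength(String path)
        {
            metadataCalls++; // stands in for a remote metadata request
            return lengths.get(path);
        }

        int metadataCalls()
        {
            return metadataCalls;
        }
    }

    static final class InputFile
    {
        private final CountingFileSystem fileSystem;
        private final String path;
        private long length; // negative means "not known yet"

        InputFile(CountingFileSystem fileSystem, String path, long length)
        {
            this.fileSystem = fileSystem;
            this.path = path;
            this.length = length;
        }

        long length()
        {
            if (length < 0) {
                length = fileSystem.lookUpLength(path);
            }
            return length;
        }
    }

    public static void main(String[] args)
    {
        CountingFileSystem fileSystem = new CountingFileSystem();
        fileSystem.addFile("data.parquet", 1024L);

        fileSystem.newInputFile("data.parquet", 1024L).length();
        System.out.println("lookups with known length: " + fileSystem.metadataCalls());

        fileSystem.newInputFile("data.parquet").length();
        System.out.println("lookups without known length: " + fileSystem.metadataCalls());
    }
}
```

In this toy model the first call performs no lookup and the second performs one, which is the file system call the suggestion aims to avoid.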

@ebyhr ebyhr force-pushed the ebi/delta-write-deletion-vector branch 2 times, most recently from 6883a04 to a1587b5 Compare August 13, 2024 12:13
Member

When we write a new deletion vector, is it mandatory per the spec that it be a union of all rows deleted or updated so far, or is that just how we've implemented it?
Is cleanup of old deletion vectors already handled by some optimize/vacuum procedure in Trino?

Member Author

@ebyhr ebyhr Aug 14, 2024

It's mandatory for compatibility with Delta Lake. Otherwise, Delta Lake returns wrong results, if I remember correctly.

The cleanup should be handled in #22809
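
To make the union requirement concrete: as I understand the Delta protocol, a data file's current entry references at most one deletion vector, so a newly written vector has to carry every row deleted so far, not only the rows removed by the latest statement. A minimal sketch of that merge follows, using `java.util.BitSet` as a stand-in for the connector's `RoaringBitmapArray`. The class and method names are hypothetical and not part of Trino.

```java
import java.util.BitSet;

// Hypothetical illustration: BitSet stands in for RoaringBitmapArray.
final class DeletionVectorUnionSketch
{
    private DeletionVectorUnionSketch() {}

    // Returns the row positions the new deletion vector must contain:
    // everything already deleted plus the rows removed by this operation.
    static BitSet mergeDeletedRows(BitSet existing, BitSet deletedByDelete, BitSet deletedByUpdate)
    {
        BitSet merged = (BitSet) existing.clone(); // leave the input untouched
        merged.or(deletedByDelete);
        merged.or(deletedByUpdate);
        return merged;
    }

    // Convenience helper to build a bitmap from row positions.
    static BitSet rows(int... positions)
    {
        BitSet bits = new BitSet();
        for (int position : positions) {
            bits.set(position);
        }
        return bits;
    }

    public static void main(String[] args)
    {
        BitSet existing = rows(1, 4);       // rows deleted by earlier operations
        BitSet merged = mergeDeletedRows(existing, rows(2), rows(4, 7));
        System.out.println(merged);         // prints "{1, 2, 4, 7}"
    }
}
```

If the new vector held only rows 2, 4, and 7, a reader that consults just that vector would treat row 1 as live again, which is the kind of wrong result mentioned above.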

@ebyhr ebyhr force-pushed the ebi/delta-write-deletion-vector branch from a1587b5 to 37622e1 Compare August 14, 2024 08:27
@ebyhr ebyhr force-pushed the ebi/delta-write-deletion-vector branch from 37622e1 to e9ebfcb Compare August 14, 2024 09:08
@ebyhr (Member Author) commented Aug 14, 2024

Addressed comments.

@ebyhr ebyhr merged commit c8491f7 into master Aug 14, 2024
@ebyhr ebyhr deleted the ebi/delta-write-deletion-vector branch August 14, 2024 10:11
@github-actions github-actions bot added this to the 454 milestone Aug 14, 2024
Comment on lines +359 to +361
RoaringBitmapArray deletedRows = loadDeletionVector(Location.of(path.toStringUtf8()));
deletedRows.or(deletion.rowsDeletedByDelete());
deletedRows.or(deletion.rowsDeletedByUpdate());
Contributor

There was a discussion with @findepi about whether we should proactively rewrite a file once the number of rows deleted from it exceeds a certain threshold.
Is there a potential follow-up here?

Member Author

It's a potential follow-up task. I don't expect we will handle it soon, though.
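
For reference, the proposed follow-up could take roughly the shape below: compare the fraction of deleted rows in a file against a configurable limit and rewrite the file once the limit is exceeded. This is only a sketch of the idea raised in the thread. The class name, the method, and the 25% limit are all hypothetical, and nothing like this is part of the PR.

```java
// Hypothetical illustration of a "rewrite when too many rows are deleted" check.
final class RewriteThresholdSketch
{
    private RewriteThresholdSketch() {}

    // True when the share of deleted rows is strictly above the allowed fraction.
    static boolean shouldRewrite(long deletedRows, long totalRows, double maxDeletedFraction)
    {
        if (totalRows <= 0) {
            return false; // nothing to rewrite in an empty file
        }
        return (double) deletedRows / totalRows > maxDeletedFraction;
    }

    public static void main(String[] args)
    {
        double maxDeletedFraction = 0.25; // assumed limit, for illustration only
        System.out.println(shouldRewrite(100, 1000, maxDeletedFraction)); // prints "false"
        System.out.println(shouldRewrite(300, 1000, maxDeletedFraction)); // prints "true"
    }
}
```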

Labels

cla-signed delta-lake Delta Lake connector

Development

Successfully merging this pull request may close these issues.

Use Delta Deletion Vectors for row-level deletes

4 participants