
Support Databricks 12.2 LTS and disallow unsupported reader versions and features in Delta Lake #16905

Merged
ebyhr merged 5 commits into master from ebi/delta-databricks-12.2
Apr 16, 2023
Conversation

@ebyhr (Member) commented Apr 6, 2023

Description

Support Databricks 12.2 LTS and disallow unsupported reader versions and features in Delta Lake
Fixes #16884

Release notes

(x) Release notes are required, with the following suggested text:

# Delta Lake
* Add support for Databricks 12.2 LTS. ({issue}`16905`)
* Disallow reading tables with [deletion vectors](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vectors). Previously, it returned incorrect results. ({issue}`16884`)

@cla-bot cla-bot bot added the cla-signed label Apr 6, 2023
@github-actions github-actions bot added delta-lake Delta Lake connector docs tests:hive labels Apr 6, 2023
@ebyhr ebyhr force-pushed the ebi/delta-databricks-12.2 branch 6 times, most recently from 9804a65 to b2fe20a on April 10, 2023 04:11
@ebyhr ebyhr marked this pull request as ready for review April 11, 2023 03:20
@ebyhr ebyhr force-pushed the ebi/delta-databricks-12.2 branch 2 times, most recently from cb639f4 to 96a34cd on April 11, 2023 03:33
@ebyhr ebyhr requested review from findinpath and krvikash April 11, 2023 03:34
@ebyhr ebyhr force-pushed the ebi/delta-databricks-12.2 branch from 96a34cd to 50c321e on April 11, 2023 05:32
@ebyhr ebyhr force-pushed the ebi/delta-databricks-12.2 branch 2 times, most recently from fac5148 to 30c8056 on April 12, 2023 09:38
@ebyhr ebyhr force-pushed the ebi/delta-databricks-12.2 branch 2 times, most recently from 47047bf to daf7173 on April 13, 2023 01:23
@ebyhr ebyhr force-pushed the ebi/delta-databricks-12.2 branch from daf7173 to 196027c on April 13, 2023 08:21
@ebyhr ebyhr force-pushed the ebi/delta-databricks-12.2 branch from 196027c to 813d33d on April 13, 2023 08:31
@findinpath (Contributor) left a comment

LGTM

@ebyhr ebyhr requested a review from findepi April 13, 2023 08:50
@ebyhr ebyhr force-pushed the ebi/delta-databricks-12.2 branch from 813d33d to dcde28f on April 13, 2023 23:11
Contributor

List only the files under _change_data. In this way you can keep the initial assertion:

Assertions.assertThat(s3.listObjectsV2(bucketName, changeDataPrefix).getObjectSummaries()).hasSize(0);

List<S3ObjectSummary> summaries = s3.listObjectsV2(new ListObjectsV2Request()
        .withBucketName(bucketName)
        .withPrefix(changeDataPrefix)
        .withDelimiter("/"))
        .getObjectSummaries();

We could use a constant for "/" - PATH_SEPARATOR

Reference implementation:

ListObjectsV2Request request = new ListObjectsV2Request()
        .withBucketName(getBucketName(uri))
        .withPrefix(key)
        .withDelimiter(recursive ? null : PATH_SEPARATOR)
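For context, S3's delimiter parameter stops the listing at the first delimiter after the prefix, so a request with delimiter "/" returns only objects directly under the prefix while deeper keys are grouped into common prefixes. A minimal self-contained sketch of that semantics with plain collections (illustrative names, not the AWS SDK):

```java
import java.util.List;
import java.util.stream.Collectors;

class DelimiterListing {
    // Mimics S3 ListObjectsV2 with a delimiter: keep only keys whose
    // remainder after the prefix contains no further delimiter, i.e.
    // the direct children of the prefix.
    static List<String> directChildren(List<String> keys, String prefix, String delimiter) {
        return keys.stream()
                .filter(key -> key.startsWith(prefix))
                .filter(key -> !key.substring(prefix.length()).contains(delimiter))
                .collect(Collectors.toList());
    }
}
```

With delimiter "/", a file in a subdirectory of _change_data would thus be excluded from the listing, which is what the suggestion above relies on.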

Member Author

List only the files under _change_data. In this way you can keep the initial assertion

In my opinion, it would be better not to hide the existence of the _change_data directory.

We could use a constant for "/" - PATH_SEPARATOR

Disagreed. There's no benefit to replacing a single character (it's obviously a path separator) with such a constant.

@findinpath (Contributor) commented Apr 14, 2023

Do note that keeping _change_data may be related to

// Ignore directories (and symlinks, etc.). If a partition directory is old and empty (or made empty by us), removing it could break concurrent writes
// on file systems with directories (e.g. HDFS).
// TODO Note: Databricks can delete directories during vacuum on s3. This might need to be revisited.

#12018

Probably not applicable here, though, since that concern is specific to HDFS.

@ebyhr ebyhr force-pushed the ebi/delta-databricks-12.2 branch from dcde28f to 464f999 on April 14, 2023 06:08
@ebyhr ebyhr merged commit 602cd0f into master Apr 16, 2023
@ebyhr ebyhr deleted the ebi/delta-databricks-12.2 branch April 16, 2023 23:32
@ebyhr ebyhr mentioned this pull request Apr 16, 2023
@github-actions github-actions bot added this to the 414 milestone Apr 16, 2023
Comment on lines +443 to +451
if (protocolEntry.getMinReaderVersion() > MAX_READER_VERSION) {
    LOG.debug("Skip %s because the reader version is unsupported: %d", dataTableName, protocolEntry.getMinReaderVersion());
    return null;
}
Set<String> unsupportedReaderFeatures = unsupportedReaderFeatures(protocolEntry.getReaderFeatures().orElse(ImmutableSet.of()));
if (!unsupportedReaderFeatures.isEmpty()) {
    LOG.debug("Skip %s because the table contains unsupported reader features: %s", dataTableName, unsupportedReaderFeatures);
    return null;
}
Member

We should fail instead of reporting "table not found".
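For illustration, the gating quoted above can be sketched with plain collections. MAX_READER_VERSION and the supported feature names below are hypothetical placeholders, not the connector's actual values:

```java
import java.util.HashSet;
import java.util.Set;

class ReaderSupport {
    // Illustrative limits: the highest reader protocol version and the
    // reader features this sketch "supports".
    static final int MAX_READER_VERSION = 3;
    static final Set<String> SUPPORTED_READER_FEATURES = Set.of("timestampNtz", "columnMapping");

    // Returns the subset of requested reader features that are not supported.
    static Set<String> unsupportedReaderFeatures(Set<String> readerFeatures) {
        Set<String> unsupported = new HashSet<>(readerFeatures);
        unsupported.removeAll(SUPPORTED_READER_FEATURES);
        return unsupported;
    }

    // A table is readable only if its minimum reader version is within the
    // supported range and it requests no unsupported reader features.
    static boolean canRead(int minReaderVersion, Set<String> readerFeatures) {
        return minReaderVersion <= MAX_READER_VERSION
                && unsupportedReaderFeatures(readerFeatures).isEmpty();
    }
}
```

Under this sketch, a table requesting a feature such as "deletionVectors" would be rejected rather than silently read, which matches the intent of this PR.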



Successfully merging this pull request may close these issues.

Delta Lake connector returns incorrect results when table has deletion vectors

4 participants