Skip to content

Test Delta Lake connector against Databricks runtime#12896

Merged
ebyhr merged 10 commits intotrinodb:masterfrom
findinpath:delta-product-tests-databricks
Jul 4, 2022
Merged

Test Delta Lake connector against Databricks runtime#12896
ebyhr merged 10 commits intotrinodb:masterfrom
findinpath:delta-product-tests-databricks

Conversation

@findinpath
Copy link
Copy Markdown
Contributor

@findinpath findinpath commented Jun 18, 2022

Delta Lake Databricks Product Tests infrastructure

We currently have in Trino Delta Lake product tests for Delta OSS.
This PR aims to add support for Delta Lake product tests for Databricks environment as well.

The product test environment used to test Delta Lake connector functionality on top of Databricks relies on:

  • AWS S3: a S3 bucket which will hold the content of the delta tables
  • AWS Glue: metastore for the delta tables which can be accessed also by Trino (opposed to the internal Databricks default metastore)
  • Databricks

SuiteDeltaLake has been renamed to SuiteDeltaLakeOss and contains virtually no changes compared to master branch.

SuiteDeltaLakeDatabricks contains the Delta Lake tests to be executed on top of Databricks infrastructure.
This suite needs special credentials in order to perform operations on top of Databricks:

  • credentials for AWS
  • credentials for Databricks

A new job delta-lake-databricks-pt has been added to ci.yml.
This job is to be executed only when the credentials required by the SuiteDeltaLakeDatabricks are provided.

@cla-bot cla-bot bot added the cla-signed label Jun 18, 2022
@findinpath findinpath marked this pull request as draft June 18, 2022 05:58
@findinpath findinpath force-pushed the delta-product-tests-databricks branch from 1305140 to 5b35f90 Compare June 20, 2022 08:31
@findinpath findinpath marked this pull request as ready for review June 20, 2022 09:23
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable name can be a bit misleading.

I'd advocate for changing the name of the variable to : DATABRICKS104_TEST_JDBC_URL and also changing the class name EnvSinglenodeDeltaLakeDatabricks to EnvSinglenodeDeltaLakeDatabricks104

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename constant to match new meaning. Separate commit.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before the change, what did this code execute against?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was never executed before in the context of Trino.
This code has been open sourced along with the other product tests, but not used until now.

@findinpath findinpath force-pushed the delta-product-tests-databricks branch 3 times, most recently from 59a2adc to 7c9b827 Compare June 21, 2022 09:06
Comment on lines 750 to 751
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you take the TRINO_ prefix off aws clients will pick them up automatically

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention was to avoid colliding with AWS credentials used by default for the CI runner for other product tests.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have AWS credentials used by default for the CI?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still may want to configure env for the containers only, and not the PTL, but it seems like a redundant complication to me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression that there are s3 / glue related product tests, but now after looking after any product test environments that have these needs, I didn't find any.
I can change the name of the variables, but they still need to be passed to the corresponding containers.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression that there are s3 / glue related product tests,

They should exist, but they don't yet #5426

Once we have them, we will be invoked explicitly in ci.yml (my assumption)
and then they can set same env vars (with potentially different values)

we would not have ability to define one suite that exercises Hive+Glue and Delta+Databricks, but i don't know whether it will be a loss actually.

I can change the name of the variables, but they still need to be passed to the corresponding containers.

of course

@findinpath findinpath force-pushed the delta-product-tests-databricks branch from 7c9b827 to 141baa3 Compare June 22, 2022 04:47
@findinpath findinpath force-pushed the delta-product-tests-databricks branch from 141baa3 to 631bd69 Compare June 23, 2022 07:11
@findepi
Copy link
Copy Markdown
Member

findepi commented Jun 24, 2022

Test PR with secrets: #12973
(cc @ilfrin @nineinchnick for potentially streamlining this)

@nineinchnick
Copy link
Copy Markdown
Member

Test PR with secrets: #12973 (cc @ilfrin @nineinchnick for potentially streamlining this)

#12817 is waiting for your review

@findinpath findinpath force-pushed the delta-product-tests-databricks branch 2 times, most recently from fd783e3 to 3291132 Compare June 27, 2022 04:47
@findinpath
Copy link
Copy Markdown
Contributor Author

Rebased on master to include the latest Delta Lake product tests in the build.

@findinpath findinpath force-pushed the delta-product-tests-databricks branch from de441ae to 6a07350 Compare June 28, 2022 15:22
@findinpath findinpath force-pushed the delta-product-tests-databricks branch 2 times, most recently from 63d5852 to 6f3425b Compare June 29, 2022 16:14
As part of the open sourcing process of the product tests for the `trino-delta-lake`
connector, the product tests dedicated for the Databricks runtime were dependent
on Hive Metastore Service.

However, using AWS Glue comes with the advantage of not managing an extra service
for running product tests against the Databricks runtime. This is why the product
tests are now adapted to run with AWS Glue as metastore service for the
Databricks runtime.
When trying to create tables on Databricks on the
schema `extraordinary`, the creation of the tables
was failing with the message:

```
Error in SQL statement: IllegalArgumentException: Path must be absolute: test100-__PLACEHOLDER__
```

Specifying the location for the custom schema helps
avoiding the above-mentioned issue.
The tests in the class TestHiveAndDeltaLakeRedirect run on both
Delta OSS and Databricks infrastructure and not only on Databricks.
Avoid running into a test failure when dealing with a
test depending on Databricks AWS S3 and AWS Glue environment.

Consult for further details:

trinodb#13017
@findinpath findinpath force-pushed the delta-product-tests-databricks branch from 6f3425b to d63456f Compare July 1, 2022 10:46
@findinpath findinpath changed the title Adapt Delta product tests for execution on top of Databricks backed by AWS Glue Test Delta Lake connector against Databricks runtime Jul 1, 2022
@findinpath findinpath force-pushed the delta-product-tests-databricks branch from d63456f to 4b428f9 Compare July 1, 2022 13:45
Tables created for testing purposes on AWS Glue / AWS S3
have to be eventually removed.
Use the prefix "test_" to identify easier the tests
in the routine responsible for removing the test Glue tables.
@findinpath findinpath force-pushed the delta-product-tests-databricks branch from 4b428f9 to e73186f Compare July 1, 2022 13:59
@findinpath findinpath force-pushed the delta-product-tests-databricks branch from e73186f to bd49f3c Compare July 4, 2022 09:44
@ebyhr ebyhr merged commit e4093da into trinodb:master Jul 4, 2022
@ebyhr
Copy link
Copy Markdown
Member

ebyhr commented Jul 4, 2022

Merged, thanks!

@github-actions github-actions bot added this to the 389 milestone Jul 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

5 participants