31 changes: 30 additions & 1 deletion .github/workflows/ci.yml
@@ -528,6 +528,7 @@ jobs:
runs-on: ubuntu-latest
outputs:
have_azure_secrets: ${{ steps.check-secrets.outputs.have_azure_secrets }}
have_databricks_secrets: ${{ steps.check-databricks-secrets.outputs.have_databricks_secrets }}
steps:
- uses: actions/checkout@v2
with:
@@ -553,6 +554,17 @@ jobs:
echo "::set-output name=have_azure_secrets::false"
fi
id: check-secrets
- name: Check Delta Databricks secrets
id: check-databricks-secrets
run: |
if [[ "${{ secrets.DATABRICKS_TOKEN }}" != "" ]]; \
then
echo "Secrets to run Delta Databricks product tests were configured in the repo"
echo "::set-output name=have_databricks_secrets::true"
else
echo "Secrets to run Delta Databricks product tests were not configured in the repo"
echo "::set-output name=have_databricks_secrets::false"
fi
- name: Maven Install
run: |
export MAVEN_OPTS="${MAVEN_INSTALL_OPTS}"
@@ -598,6 +610,7 @@ jobs:
# suite-4 does not exist
- suite-5
- suite-azure
- suite-delta-lake-databricks
jdk:
- 11
exclude:
@@ -623,6 +636,14 @@
ignore exclusion if: >-
${{ needs.build-pt.outputs.have_azure_secrets == 'true' }}

- suite: suite-delta-lake-databricks
config: cdh5
- suite: suite-delta-lake-databricks
config: hdp3
- suite: suite-delta-lake-databricks
ignore exclusion if: >-
${{ needs.build-pt.outputs.have_databricks_secrets == 'true' }}

ignore exclusion if:
# Do not use this property outside of the matrix configuration.
#
@@ -681,7 +702,7 @@ jobs:
jdk: 11
# this suite is not meant to be run with different configs
- config: default
suite: suite-delta-lake
suite: suite-delta-lake-oss
jdk: 11
# PT Launcher's timeout defaults to 2h, add some margin
timeout-minutes: 130
@@ -720,6 +741,14 @@ jobs:
ABFS_CONTAINER: ${{ secrets.AZURE_ABFS_CONTAINER }}
ABFS_ACCOUNT: ${{ secrets.AZURE_ABFS_ACCOUNT }}
ABFS_ACCESS_KEY: ${{ secrets.AZURE_ABFS_ACCESSKEY }}
S3_BUCKET: trino-ci-test
AWS_REGION: us-east-2
DATABRICKS_AWS_ACCESS_KEY_ID: ${{ secrets.DATABRICKS_AWS_ACCESS_KEY_ID }}
DATABRICKS_AWS_SECRET_ACCESS_KEY: ${{ secrets.DATABRICKS_AWS_SECRET_ACCESS_KEY }}
DATABRICKS_73_JDBC_URL: ${{ secrets.DATABRICKS_73_JDBC_URL }}
DATABRICKS_91_JDBC_URL: ${{ secrets.DATABRICKS_91_JDBC_URL }}
DATABRICKS_LOGIN: token
DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
run: |
testing/bin/ptl suite run \
--suite ${{ matrix.suite }} \
178 changes: 178 additions & 0 deletions plugin/trino-delta-lake/README.md
@@ -0,0 +1,178 @@
# Delta Lake Connector Developer Notes

The Delta Lake connector can be used to interact with [Delta Lake](https://delta.io/) tables.

Trino has product tests in place for testing its compatibility with the
following Delta Lake implementations:

- Delta Lake OSS
- Delta Lake Databricks


## Delta Lake OSS Product tests

Testing against Delta Lake OSS is straightforward: spin up the corresponding
product test environment:

```
testing/bin/ptl env up --environment singlenode-delta-lake-oss
```
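After the environment is up, the matching suite can be run the way CI invokes it (a sketch; `run_suite` is a hypothetical wrapper, and CI passes additional options to `ptl suite run` that are omitted here):

```shell
# Bring up a product test environment and run its suite, mirroring
# the ptl commands used by the CI workflow.
run_suite() {
  local env_name="$1" suite_name="$2"
  testing/bin/ptl env up --environment "$env_name" || return 1
  testing/bin/ptl suite run --suite "$suite_name"
}
# Usage: run_suite singlenode-delta-lake-oss suite-delta-lake-oss
```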


## Delta Lake Databricks Product tests

At the time of this writing, Databricks Delta Lake and OSS Delta Lake differ in the functionality they provide.

Setting up a Databricks testing environment requires several steps.

### Delta Lake Databricks on AWS

Start by setting up a Databricks account via https://databricks.com/try-databricks and, after
filling in your contact details, choose *AWS* as the preferred cloud provider.

Create an AWS S3 bucket for storing the content of the Delta Lake tables managed
by the Databricks runtime.

Follow the guide [Secure access to S3 buckets using instance profiles](https://docs.databricks.com/administration-guide/cloud-configurations/aws/instance-profiles.html)
to allow the Databricks cluster to access both the AWS S3 bucket on which the Delta Lake tables
are stored and the AWS Glue table metastore.

To verify that the setup is correct, use the Databricks Web UI to create a
notebook and create a simple table:

```
%sql
CREATE TABLE default.test1 (
a_bigint BIGINT)
USING DELTA LOCATION 's3://my-s3-bucket/test1'
```
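To double-check that the table content actually landed in the bucket, look for the Delta transaction log under the table location. A minimal sketch assuming a configured AWS CLI (`has_delta_log` is a hypothetical helper; bucket and path are placeholders):

```shell
# Every Delta table directory contains a _delta_log/ prefix holding JSON
# commit files; finding one confirms the CREATE TABLE reached S3.
has_delta_log() {
  local bucket="$1" table_path="$2"
  aws s3 ls "s3://${bucket}/${table_path}/_delta_log/" | grep -q '\.json'
}
# Usage: has_delta_log my-s3-bucket test1 && echo "Delta log found"
```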

### Use AWS Glue Data Catalog as the metastore for Databricks Runtime

[AWS Glue](https://aws.amazon.com/glue) is the metastore of choice for Databricks Delta Lake product tests
on Trino because it is a managed solution that lets Trino connect to the same metastore backing the
Databricks runtime while executing the product tests.

Follow the guide [Use AWS Glue Data Catalog as the metastore for Databricks Runtime](https://docs.databricks.com/data/metastores/aws-glue-metastore.html#configure-glue-data-catalog-as-the-metastore)
to set up Glue as the Data Catalog on your Databricks cluster.

After completing this step successfully, you should be able to run either of the following statements:

```
SHOW DATABASES;

SHOW TABLES;
```

The output of these statements should match what is shown in the
AWS Glue administration Web UI.
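This cross-check can also be scripted against Glue with the AWS CLI (a sketch; `glue_table_names` is a hypothetical helper and assumes credentials with Glue read access):

```shell
# List the table names Glue has registered for a database; comparing this
# with SHOW TABLES output from the notebook confirms both sides share
# one catalog.
glue_table_names() {
  aws glue get-tables --database-name "$1" \
      --query 'TableList[].Name' --output text
}
# Usage: glue_table_names default
```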


### Create AWS user to be used by Trino for managing Delta Lake tables

Trino needs security credentials to connect to the AWS infrastructure
in order to create/drop tables on AWS Glue and read/modify table content on AWS S3.

Via AWS IAM, create a user with the appropriate policies for interacting with AWS.
Below is a set of simple permission policies that can be configured for this
user:

`GlueAccess`

```
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "GrantCatalogAccessToGlue",
"Effect": "Allow",
"Action": [
"glue:BatchCreatePartition",
"glue:BatchDeletePartition",
"glue:BatchGetPartition",
"glue:CreateDatabase",
"glue:CreateTable",
"glue:CreateUserDefinedFunction",
"glue:DeleteDatabase",
"glue:DeletePartition",
"glue:DeleteTable",
"glue:DeleteUserDefinedFunction",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:GetPartition",
"glue:GetPartitions",
"glue:GetTable",
"glue:GetTables",
"glue:GetUserDefinedFunction",
"glue:GetUserDefinedFunctions",
"glue:UpdateDatabase",
"glue:UpdatePartition",
"glue:UpdateTable",
"glue:UpdateUserDefinedFunction"
],
"Resource": [
"*"
]
}
]
}
```

`S3Access`

```
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-s3-bucket"
]
},
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:DeleteObject",
"s3:PutObjectAcl"
],
"Resource": [
"arn:aws:s3:::my-s3-bucket/*"
]
}
]
}
```

For the newly created AWS IAM user, make sure to retrieve the security credentials;
Trino uses them to communicate with AWS.
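The user creation can be scripted with the AWS CLI. A sketch assuming the two policies above were saved as `GlueAccess.json` and `S3Access.json`, with an illustrative user name:

```shell
# Create the IAM user, attach both inline policies, and mint an access key.
create_pt_user() {
  local user="$1"
  aws iam create-user --user-name "$user"
  aws iam put-user-policy --user-name "$user" \
      --policy-name GlueAccess --policy-document file://GlueAccess.json
  aws iam put-user-policy --user-name "$user" \
      --policy-name S3Access --policy-document file://S3Access.json
  # The response carries AccessKeyId/SecretAccessKey, which become
  # AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY for the product tests.
  aws iam create-access-key --user-name "$user"
}
# Usage: create_pt_user trino-delta-pt
```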


### Setup token authentication on the Databricks cluster

Follow the guide [Authentication using Databricks personal access tokens](https://docs.databricks.com/dev-tools/api/latest/authentication.html)
to set up your Databricks personal access token.
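A quick way to confirm the token works is to call any authenticated Databricks REST endpoint with it. A sketch (`check_databricks_token` is a hypothetical helper; replace the workspace URL with your own):

```shell
# A 2xx response from the clusters/list endpoint shows the token is
# accepted by the workspace; --fail makes curl return non-zero otherwise.
check_databricks_token() {
  local workspace_url="$1" token="$2"
  curl --silent --fail --output /dev/null \
      --header "Authorization: Bearer ${token}" \
      "${workspace_url}/api/2.0/clusters/list"
}
# Usage: check_databricks_token https://dbc-xxxx.cloud.databricks.com "$DATABRICKS_TOKEN"
```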


### Test the functionality of the Databricks Delta Lake product test environment


Run the following command to spin up the Databricks 9.1 Delta Lake product test
environment for Trino:

```
env S3_BUCKET=my-s3-bucket \
AWS_REGION=us-east-2 \
AWS_SECRET_ACCESS_KEY=xxx \
AWS_ACCESS_KEY_ID=xxx \
DATABRICKS_91_JDBC_URL='xxx' \
DATABRICKS_LOGIN=token \
DATABRICKS_TOKEN=xxx \
testing/bin/ptl env up --environment singlenode-delta-lake-databricks91
```
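Since a missing variable surfaces only as a failure inside the launched environment, a small preflight check can fail fast instead (a sketch; `check_env` is a hypothetical helper whose variable list matches the command above):

```shell
# Report every required variable that is unset or empty; returns non-zero
# if anything is missing. Uses bash indirect expansion (${!var}).
check_env() {
  local missing=0 var
  for var in S3_BUCKET AWS_REGION AWS_SECRET_ACCESS_KEY AWS_ACCESS_KEY_ID \
             DATABRICKS_91_JDBC_URL DATABRICKS_LOGIN DATABRICKS_TOKEN; do
    if [ -z "${!var}" ]; then
      echo "missing: $var"
      missing=1
    fi
  done
  return "$missing"
}
# Usage: check_env && testing/bin/ptl env up --environment singlenode-delta-lake-databricks91
```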
@@ -47,18 +47,17 @@ public AbstractSinglenodeDeltaLakeDatabricks(Standard standard, DockerFiles dock
public void extendEnvironment(Environment.Builder builder)
{
String databricksTestJdbcUrl = databricksTestJdbcUrl();
String databricksTestJdbcDriverClass = requireNonNull(System.getenv("DATABRICKS_TEST_JDBC_DRIVER_CLASS"), "Environment DATABRICKS_TEST_JDBC_DRIVER_CLASS was not set");
String databricksTestLogin = requireNonNull(System.getenv("DATABRICKS_TEST_LOGIN"), "Environment DATABRICKS_TEST_LOGIN was not set");
String databricksTestToken = requireNonNull(System.getenv("DATABRICKS_TEST_TOKEN"), "Environment DATABRICKS_TEST_TOKEN was not set");
String hiveMetastoreUri = requireNonNull(System.getenv("HIVE_METASTORE_URI"), "Environment HIVE_METASTORE_URI was not set");
String databricksTestLogin = requireNonNull(System.getenv("DATABRICKS_LOGIN"), "Environment DATABRICKS_LOGIN was not set");
String databricksTestToken = requireNonNull(System.getenv("DATABRICKS_TOKEN"), "Environment DATABRICKS_TOKEN was not set");
String awsRegion = requireNonNull(System.getenv("AWS_REGION"), "Environment AWS_REGION was not set");
String s3Bucket = requireNonNull(System.getenv("S3_BUCKET"), "Environment S3_BUCKET was not set");
DockerFiles.ResourceProvider configDir = dockerFiles.getDockerFilesHostDirectory("conf/environment/singlenode-delta-lake-databricks");

builder.configureContainer(COORDINATOR, dockerContainer -> exportAWSCredentials(dockerContainer)
.withEnv("HIVE_METASTORE_URI", hiveMetastoreUri)
.withEnv("DATABRICKS_TEST_JDBC_URL", databricksTestJdbcUrl)
.withEnv("DATABRICKS_TEST_LOGIN", databricksTestLogin)
.withEnv("DATABRICKS_TEST_TOKEN", databricksTestToken));
.withEnv("AWS_REGION", awsRegion)
.withEnv("DATABRICKS_JDBC_URL", databricksTestJdbcUrl)
.withEnv("DATABRICKS_LOGIN", databricksTestLogin)
.withEnv("DATABRICKS_TOKEN", databricksTestToken));
builder.addConnector("hive", forHostPath(configDir.getPath("hive.properties")));
builder.addConnector(
"delta-lake",
@@ -67,23 +66,22 @@ public void extendEnvironment(Environment.Builder builder)

builder.configureContainer(TESTS, container -> exportAWSCredentials(container)
.withEnv("S3_BUCKET", s3Bucket)
.withEnv("DATABRICKS_TEST_JDBC_DRIVER_CLASS", databricksTestJdbcDriverClass)
.withEnv("DATABRICKS_TEST_JDBC_URL", databricksTestJdbcUrl)
.withEnv("DATABRICKS_TEST_LOGIN", databricksTestLogin)
.withEnv("DATABRICKS_TEST_TOKEN", databricksTestToken)
.withEnv("HIVE_METASTORE_URI", hiveMetastoreUri));
.withEnv("AWS_REGION", awsRegion)
.withEnv("DATABRICKS_JDBC_URL", databricksTestJdbcUrl)
.withEnv("DATABRICKS_LOGIN", databricksTestLogin)
.withEnv("DATABRICKS_TOKEN", databricksTestToken));

configureTempto(builder, configDir);
}

private DockerContainer exportAWSCredentials(DockerContainer container)
{
container = exportAWSCredential(container, "AWS_ACCESS_KEY_ID", true);
container = exportAWSCredential(container, "AWS_SECRET_ACCESS_KEY", true);
return exportAWSCredential(container, "AWS_SESSION_TOKEN", false);
container = exportAWSCredential(container, "DATABRICKS_AWS_ACCESS_KEY_ID", "AWS_ACCESS_KEY_ID", true);
container = exportAWSCredential(container, "DATABRICKS_AWS_SECRET_ACCESS_KEY", "AWS_SECRET_ACCESS_KEY", true);
return exportAWSCredential(container, "DATABRICKS_AWS_SESSION_TOKEN", "AWS_SESSION_TOKEN", false);
}

private DockerContainer exportAWSCredential(DockerContainer container, String credentialEnvVariable, boolean required)
private DockerContainer exportAWSCredential(DockerContainer container, String credentialEnvVariable, String containerEnvVariable, boolean required)
{
String credentialValue = System.getenv(credentialEnvVariable);
if (credentialValue == null) {
@@ -92,6 +90,6 @@ private DockerContainer exportAWSCredential(DockerContainer container, String cr
}
return container;
}
return container.withEnv(credentialEnvVariable, credentialValue);
return container.withEnv(containerEnvVariable, credentialValue);
}
}
@@ -21,19 +21,19 @@
import static java.util.Objects.requireNonNull;

@TestsEnvironment
public class EnvSinglenodeDeltaLakeDatabricks
public class EnvSinglenodeDeltaLakeDatabricks73
extends AbstractSinglenodeDeltaLakeDatabricks

{
@Inject
public EnvSinglenodeDeltaLakeDatabricks(Standard standard, DockerFiles dockerFiles)
public EnvSinglenodeDeltaLakeDatabricks73(Standard standard, DockerFiles dockerFiles)
{
super(standard, dockerFiles);
}

@Override
String databricksTestJdbcUrl()
{
return requireNonNull(System.getenv("DATABRICKS_TEST_JDBC_URL"), "Environment DATABRICKS_TEST_JDBC_URL was not set");
return requireNonNull(System.getenv("DATABRICKS_73_JDBC_URL"), "Environment DATABRICKS_73_JDBC_URL was not set");
}
}
Expand Up @@ -33,6 +33,6 @@ public EnvSinglenodeDeltaLakeDatabricks91(Standard standard, DockerFiles dockerF
@Override
String databricksTestJdbcUrl()
{
return requireNonNull(System.getenv("DATABRICKS_91_TEST_JDBC_URL"), "Environment DATABRICKS_91_TEST_JDBC_URL was not set");
return requireNonNull(System.getenv("DATABRICKS_91_JDBC_URL"), "Environment DATABRICKS_91_JDBC_URL was not set");
}
}