Skip to content

Test Delta Lake on GCS#12800

Merged
ebyhr merged 3 commits intotrinodb:masterfrom
alexjo2144:alexjo/test-delta-on-gcs
Aug 23, 2022
Merged

Test Delta Lake on GCS#12800
ebyhr merged 3 commits intotrinodb:masterfrom
alexjo2144:alexjo/test-delta-on-gcs

Conversation

@alexjo2144
Copy link
Copy Markdown
Member

Description

Integration and Product tests for running Delta Lake on Google Cloud Storage.

Is this change a fix, improvement, new feature, refactoring, or other?

Test only

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Delta Lake connector

How would you describe this change to a non-technical end user or system administrator?

Not important

Related issues, pull requests, and links

GCS was enabled in #12144

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

}
}
catch (FileNotFoundException e) {
return ImmutableList.of();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not fail in case that the directory is not present?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are tests that rely on this behavior. Directory xyz doesn't exist, and the test checks how many files are in xyz. Seems like a reasonable contract to me

@findepi findepi mentioned this pull request Jun 13, 2022
@alexjo2144 alexjo2144 self-assigned this Jun 14, 2022
@alexjo2144 alexjo2144 requested a review from findinpath June 14, 2022 17:20
@alexjo2144
Copy link
Copy Markdown
Member Author

@findepi @findepi pushed some fixups when you get a chance

Copy link
Copy Markdown
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not look at SmokeTests yet and the changes in DeltaLake plugin.

Comment on lines +10 to +11
prepare_statement:
- USE ${databases.hive.schema}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question unrelated to this PR - can we use this to for example run CREATE SCHEMA against trino runner?

e.g. File HMS doesn't allow creating views in non-existent schemas so I have some PTs where the test itself runs CREATE SCHEMA IF NOT EXISTS and I'd maybe want the schema creation to happen here.

private final Path gcpCredentialsFile;
private final FileSystem fileSystem;

@Parameters({"gcp-storage-bucket", "gcp-credentials-key"})
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding delta. prefix so that we can avoid conflicts in IntelliJ configuration template in other connectors?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about testing.? I was hoping to use the same parameters for Hive and Iceberg as well


builder.addConnector("hive", forHostPath(dockerFiles.getDockerFilesHostPath("conf/environment/multinode-gcs/hive.properties")), CONTAINER_PRESTO_HIVE_PROPERTIES);
builder.addConnector("delta", forHostPath(dockerFiles.getDockerFilesHostPath("conf/environment/multinode-gcs/delta.properties")), CONTAINER_PRESTO_ETC + "/catalog/delta.properties");
builder.addConnector("iceberg", forHostPath(dockerFiles.getDockerFilesHostPath("conf/environment/multinode-gcs/iceberg.properties")), CONTAINER_PRESTO_ETC + "/catalog/iceberg.properties");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hive & iceberg connectors look unused. Is it for future work?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I was going to put tests for Hive and Iceberg up next. I also just thought it'd be handy to have for development on GCS as well, but I can remove them until the next PR if you'd like.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. There is no need to remove them.

@alexjo2144 alexjo2144 force-pushed the alexjo/test-delta-on-gcs branch from f6ca780 to 58ea1ed Compare June 16, 2022 17:01
@alexjo2144
Copy link
Copy Markdown
Member Author

Applied comments above, and included a commit to reduce the test parallelism. I'm hoping that will help the OOM issues on the test branch #12663 (comment)

If anyone has better ideas though, let me know

@alexjo2144
Copy link
Copy Markdown
Member Author

I ran the suite with JMX logging, and looking at the heap usage it increases gradually up to about 3GB and then I guess GC runs and the heap goes back down to about 0.75GB.

@findinpath
Copy link
Copy Markdown
Contributor

If anyone has better ideas though, let me know

@alexjo2144 please see #12663 (comment)

Increasing air.test.jvmsize could be also an option.

@findepi
Copy link
Copy Markdown
Member

findepi commented Jun 22, 2022

@alexjo2144 please see #12663 (comment)

Increasing air.test.jvmsize could be also an option.

That commit suggests 4g, which sounds like a lot for non-stress tests.
I am concerned by the fact that tests seem to consume lot more memory when run on CI vs locally.
Are you sure you're running with same configs?

@ebyhr
Copy link
Copy Markdown
Member

ebyhr commented Jul 4, 2022

GCS test succeeded in #12663, but it took 59 min (test timeout is 60min). We need to increase the timeout or find a way to shorten the time.

@findinpath
Copy link
Copy Markdown
Contributor

+1 for increasing the the timeout-minutes for cloud tests

@alexjo2144 alexjo2144 force-pushed the alexjo/test-delta-on-gcs branch 3 times, most recently from fe408fb to a2e60bf Compare August 3, 2022 21:02
@alexjo2144
Copy link
Copy Markdown
Member Author

@ebyhr @findepi I rebased and added a tmp commit that runs the Delta suite a bunch of times. Can we test with secrets when you get a chance?

@ebyhr
Copy link
Copy Markdown
Member

ebyhr commented Aug 4, 2022

Updated #12663

@alexjo2144
Copy link
Copy Markdown
Member Author

@findepi @ebyhr pushed one more commit to put the GCS tests in a separate step. Can we try that on CI when you get a chance?

@findepi findepi mentioned this pull request Aug 11, 2022
@alexjo2144 alexjo2144 force-pushed the alexjo/test-delta-on-gcs branch from e280bc2 to 5102b80 Compare August 11, 2022 16:21
@alexjo2144
Copy link
Copy Markdown
Member Author

Sorry, need one more mirror. Didn't notice the new merge tests during the rebase

@alexjo2144
Copy link
Copy Markdown
Member Author

One more thing to test. I switched the bucket to a regioned one in us-east so it should be close to the AWS region we run CI in

@alexjo2144 alexjo2144 force-pushed the alexjo/test-delta-on-gcs branch from 11001bb to 9c37bce Compare August 18, 2022 15:39
@alexjo2144
Copy link
Copy Markdown
Member Author

@ebyhr can we do another test with secrets when you get a chance?

@alexjo2144
Copy link
Copy Markdown
Member Author

@ebyhr the cloud tests passed all five times. The main Delta integration tests failed twice because of #13348 and the Databricks failure was just a cleanup failure, all the actual tests passed.

@alexjo2144 alexjo2144 force-pushed the alexjo/test-delta-on-gcs branch from 509a9a1 to 9c37bce Compare August 19, 2022 13:40
@alexjo2144 alexjo2144 force-pushed the alexjo/test-delta-on-gcs branch from 9c37bce to 85de93c Compare August 22, 2022 20:18
@alexjo2144
Copy link
Copy Markdown
Member Author

All set, thanks @ebyhr

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GCS suite looks not running. Probably, this should be steps.check-gcp-secrets.outputs.have_gcp_secrets.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fixed now.
Screen Shot 2022-08-23 at 13 52 56

Includes catalogs for the hive, delta lake, and iceberg connectors.
@ebyhr ebyhr force-pushed the alexjo/test-delta-on-gcs branch from 85de93c to ef59122 Compare August 23, 2022 02:41
@ebyhr ebyhr merged commit 52bf668 into trinodb:master Aug 23, 2022
@ebyhr
Copy link
Copy Markdown
Member

ebyhr commented Aug 23, 2022

Merged, thanks!

@github-actions github-actions bot added this to the 394 milestone Aug 23, 2022
@findepi
Copy link
Copy Markdown
Member

findepi commented Aug 23, 2022

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

6 participants