Conversation
pom.xml
Outdated
There was a problem hiding this comment.
If you are using GCS connector only through HDFS/HCFS API you can depend on shaded JAR instead (<classifier>shaded</classifier>). This way you will have only one dependency instead of 4 dependencies now.
pom.xml
Outdated
There was a problem hiding this comment.
2.1.3 is the latest, should we use that instead?
pom.xml
Outdated
There was a problem hiding this comment.
No longer needed with the shaded gcs-connector jar.
7092b42 to
29fb8ce
Compare
pom.xml
Outdated
There was a problem hiding this comment.
All these dependencies should be redundant if you are using shaded GCS connector.
There was a problem hiding this comment.
They are there due to enforcer errors. Without those dependencies amazon-aws-sdk and a few other artifacts complain about upper bounds.
There was a problem hiding this comment.
If the versions for gson, httpclient and httpcore dependencies are not explicitly defined in dependency management, maven-enforcer fails with the following upper bound dependency violations:
[WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireUpperBoundDeps failed with message:
Failed while enforcing RequireUpperBoundDeps. The error(s) are [
Require upper bound dependencies error for org.apache.httpcomponents:httpclient:4.5.9 paths to dependency are:
+-io.prestosql:presto-hive:334-SNAPSHOT
+-com.amazonaws:aws-java-sdk-core:1.11.749
+-org.apache.httpcomponents:httpclient:4.5.9
and
+-io.prestosql:presto-hive:334-SNAPSHOT
+-com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.1.3
+-com.google.oauth-client:google-oauth-client:1.30.6
+-com.google.http-client:google-http-client:1.34.2
+-org.apache.httpcomponents:httpclient:4.5.11
,
Require upper bound dependencies error for org.apache.httpcomponents:httpcore:4.4.11 paths to dependency are:
+-io.prestosql:presto-hive:334-SNAPSHOT
+-com.amazonaws:aws-java-sdk-core:1.11.749
+-org.apache.httpcomponents:httpclient:4.5.9
+-org.apache.httpcomponents:httpcore:4.4.11
and
+-io.prestosql:presto-hive:334-SNAPSHOT
+-com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.1.3
+-com.google.oauth-client:google-oauth-client:1.30.6
+-com.google.http-client:google-http-client:1.34.2
+-org.apache.httpcomponents:httpcore:4.4.13
,
Require upper bound dependencies error for com.google.code.gson:gson:2.2.4 paths to dependency are:
+-io.prestosql:presto-hive:334-SNAPSHOT
+-com.google.cloud.bigdataoss:gcsio:2.1.3
+-com.google.code.gson:gson:2.2.4
and
+-io.prestosql:presto-hive:334-SNAPSHOT
+-com.google.cloud.bigdataoss:gcsio:2.1.3
+-com.google.protobuf:protobuf-java-util:3.11.4
+-com.google.code.gson:gson:2.8.6
and
+-io.prestosql:presto-hive:334-SNAPSHOT
+-com.google.cloud.bigdataoss:gcsio:2.1.3
+-io.grpc:grpc-alts:1.29.0
+-io.grpc:grpc-core:1.29.0
+-com.google.code.gson:gson:2.8.6
There was a problem hiding this comment.
If the gcsio version is not explicitly defined in dependencyManagement, maven-enforcer fails with the following upper bound dependency violations:
[WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireUpperBoundDeps failed with message:
Failed while enforcing RequireUpperBoundDeps. The error(s) are [
Require upper bound dependencies error for io.grpc:grpc-context:1.22.1 paths to dependency are:
+-io.prestosql:presto-hive:334-SNAPSHOT
+-com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.1.3
+-com.google.oauth-client:google-oauth-client:1.30.6
+-com.google.http-client:google-http-client:1.34.2
+-io.opencensus:opencensus-api:0.24.0
+-io.grpc:grpc-context:1.22.1
and
+-io.prestosql:presto-hive:334-SNAPSHOT
+-com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.1.3
+-com.google.cloud.bigdataoss:gcsio:2.1.3
+-io.grpc:grpc-auth:1.29.0
+-io.grpc:grpc-api:1.29.0
+-io.grpc:grpc-context:1.29.0
There was a problem hiding this comment.
If util-hadoop is not explicitly defined then presto-hive fails with an used but undeclared dependency error.
There was a problem hiding this comment.
I think it depends on next Presto release date, if it scheduled after new GCS connector release in 2 weeks or so, then it makes sense to wait and update straight to shaded GCS connector 2.1.4. But if next Presto release is earlier, then we can go ahead with unshaded artifacts for now, I think.
There was a problem hiding this comment.
I'll update the pr with non-shaded artifacts and save this one, can always go back to it - we'll see which one happens first:)
There was a problem hiding this comment.
GCS connector 2.1.4 was just released, so you can update this PR to use it now.
7bfe57c to
d398a5a
Compare
d398a5a to
d8b2693
Compare
|
Hi @medb, update: tried 2.1.4 locally, will post issues - one of them is that |
Yep, if you directly depend on Why there a need to directly use classes in |
d8b2693 to
4ce8a43
Compare
|
It's only. used to get the scheme in GcsConfigurationProvider, i.e. Also, I noticed that if I select with a limit then I see the following stack trace in the logs, but the query succeeds, is there some config that would avoid the stack trace? Here it is: |
4ce8a43 to
9510f5b
Compare
|
Yeah, we don't have |
presto-hive/src/main/java/io/prestosql/plugin/hive/gcs/GoogleGcsConfigurationInitializer.java
Outdated
Show resolved
Hide resolved
presto-hive/src/main/java/io/prestosql/plugin/hive/gcs/GoogleGcsConfigurationInitializer.java
Outdated
Show resolved
Hide resolved
9510f5b to
c13d592
Compare
c13d592 to
b82079f
Compare
|
Looks like 2.1.5 was just released, will update shortly. |
b82079f to
453fff8
Compare
516fc87 to
4bba8f1
Compare
|
Sent #13969 within the repository. |
|
Hey @ebyhr @brokenjacobs @medb - I did some testing and cross cloud (from trino in azure -> gcs) there was a drastic latency increase in gcs calls due to keep alive cache. I added the Would you know if there is a way to disable the keepalive cache - found this - but I tried 2.1.6 and it had similar behavior. Anyone else familiar with this issue? This affected metadata calls most, not reads - and the use case was for reading tons of tiny files that make up hudi timelines (on the coordinator). For reading data there was no degradation. |
|
Is it http or tcp keepalives causing your issue? That link seems to be referring to tcp keepalives and the sslsocketfactory not http keepalives. (Http 1.1 feature). It seems odd that http keepalives (implied by the custom header) would be causing delays? |
|
By disabling http keepalive (custom header) the connection is closed after 1 request. This causes the next request to fail and the code will create a new http client which eliminates the latency but is unreliable. Keep alive cache is causing locking which results in delays (specifically for the hudi use case w tons of tiny metadata files). |
|
So main issue is that all threads which use the http client are blocked except one due to the keep alive cache. The custom headers, etc. was just for testing but showed that the latency is due to the synchronized method which puts connections back in the keep alive cache |
|
Also, to be clear: since the behavior seems to be in older versions also, it is not a regresssion wtih 2.2.7. |
Maybe we split this conversation into a separate GH Issue then? |
|
@alexjo2144 absolutely, I will create an issue for it. Might not be something people encounter outside of this specific use case (tons of tiny files, firewall, cross cloud, etc.). |
pom.xml
Outdated
| <dep.drift.version>1.14</dep.drift.version> | ||
| <dep.tempto.version>189</dep.tempto.version> | ||
| <dep.gcs.version>2.0.0</dep.gcs.version> | ||
| <dep.gcs.version>2.2.7</dep.gcs.version> |
pom.xml
Outdated
There was a problem hiding this comment.
I think it is because we can have issues with other dependencies without shading.
| </plugin> | ||
| <plugin> | ||
| <groupId>org.basepom.maven</groupId> | ||
| <artifactId>duplicate-finder-maven-plugin</artifactId> |
There was a problem hiding this comment.
What it takes to fix issues instead of ignoring them? Can you please list issues that got without this?
There was a problem hiding this comment.
Here are the issues:
[WARNING] Found duplicate (but equal) resources in [com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.2.8:jar:shaded, org.apache.httpcomponents:httpclient:4.5.13]:
[WARNING] mozilla/public-suffix-list.txt
[WARNING] Found duplicate classes/resources in compile classpath.
[WARNING] Found duplicate (but equal) resources in [com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.2.8:jar:shaded, org.apache.httpcomponents:httpclient:4.5.13]:
[WARNING] mozilla/public-suffix-list.txt
[WARNING] Found duplicate classes/resources in runtime classpath.
[WARNING] Found duplicate (but equal) resources in [com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.2.8:jar:shaded, org.apache.httpcomponents:httpclient:4.5.13]:
[WARNING] mozilla/public-suffix-list.txt
[WARNING] Found duplicate classes/resources in test classpath.
There was a problem hiding this comment.
Second warning is from the trino-faulttolerant-tests module:
[WARNING] Found duplicate and different resources in [com.google.api:gax:2.17.0, com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.2.8:jar:shaded]:
[WARNING] dependencies.properties
[WARNING] Found duplicate classes/resources in test classpath.
|
@elonazoulay Are you working on this actively or would like me or somebody else to take it over? |
pom.xml
Outdated
| @@ -1124,33 +1125,53 @@ | |||
| <groupId>org.slf4j</groupId> | |||
| <artifactId>slf4j-log4j12</artifactId> | |||
| </exclusion> | |||
There was a problem hiding this comment.
Can we use a wildcard exclusion here?
There was a problem hiding this comment.
@kokosing either way works: I think the changes in 2.2.8 will help our use case and we were able to deploy 2.2.7 (this pr) recently and it worked as good or better than previous version we were using (2.2.2). Either way I will be testing 2.2.8 in the next day or so - lmk what works best for you. If you want to use the other pr I can test that as well.
There was a problem hiding this comment.
Updated. I will remove the duplicate finder and paste the results as well and update after we deploy it. And if you want to use the other pr I will test it also.
4bba8f1 to
2e2b668
Compare
|
CI hit: #14248, retrying to verify if is truly flaky. EDIT it is truly flaky. |
|
Merged, thanks! |
This fixes the issue with performance cache: GoogleCloudDataproc/hadoop-connectors#342