7f134b4 to 2b3ac04
2b3ac04 to 81e5215
We should remove the retry logic for materialized view fetches. I have no idea why it was added there; none of the other operations I looked at have it.
81e5215 to 26ed395
Did you mean remove retries from or both? I see that When it comes to |
It seems fairly arbitrary which of these exceptions are made non-retriable. For example, why couldn't you retry on a listing failure?
I followed the logic that if we catch a raw SDK exception, the underlying library has already failed to retry (which it does by default for exceptions that can be retried), so there is no point in trying again. I checked that all AWS/Azure/GCP clients are set up with default retry policies that we are not overriding, so we get retries for free. Adding more retries on top just makes things fail after more time. I described it in more detail here: #22678 (comment)
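To illustrate the point about stacked retries, here is a minimal sketch, not code from this PR. `StackedRetries`, `retry`, and `callsIssued` are hypothetical names, and the inner `retry` call only stands in for an SDK client's built-in retry policy. It counts how many calls reach an always-failing service when an outer retry loop wraps a client that already retries on its own.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical illustration; none of these names come from the PR
final class StackedRetries
{
    private StackedRetries() {}

    // Calls the operation up to maxAttempts times and rethrows the last failure
    static <T> T retry(int maxAttempts, Supplier<T> operation)
    {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return operation.get();
            }
            catch (RuntimeException e) {
                last = e;
            }
        }
        throw last;
    }

    // Counts the calls issued against an always-failing service when an outer
    // retry loop wraps a client that already retries internally
    static int callsIssued(int clientAttempts, int outerAttempts)
    {
        AtomicInteger calls = new AtomicInteger();
        Supplier<String> alwaysFails = () -> {
            calls.incrementAndGet();
            throw new IllegalStateException("service unavailable");
        };
        try {
            // The inner retry stands in for the SDK client's built-in policy
            StackedRetries.<String>retry(outerAttempts, () -> retry(clientAttempts, alwaysFails));
        }
        catch (IllegalStateException expected) {
            // Both layers exhausted their attempts
        }
        return calls.get();
    }
}
```

With 3 client attempts and 5 outer attempts, the failing call is issued 15 times before the error surfaces, which is the "fail after more time" effect described above.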
26ed395 to cd96b7e
@electrum I added a new commit on top of the previous ones that changes the Iceberg retries.
"Change iceberg related retries" looks good |
lib/trino-filesystem-azure/src/main/java/io/trino/filesystem/azure/AzureFileSystem.java
lib/trino-filesystem-azure/src/main/java/io/trino/filesystem/azure/AzureUtils.java
lib/trino-filesystem-gcs/src/main/java/io/trino/filesystem/gcs/GcsUtils.java
lib/trino-filesystem-s3/src/main/java/io/trino/filesystem/s3/S3FileSystem.java
plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/glue/GlueMetastoreModule.java
plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/glue/v1/GlueClientUtil.java
cd96b7e to be998b5
lib/trino-filesystem/src/main/java/io/trino/filesystem/UnrecoverableIOException.java
lib/trino-filesystem-s3/src/main/java/io/trino/filesystem/s3/S3InputFile.java
.../trino-hive/src/main/java/io/trino/plugin/hive/metastore/SemiTransactionalHiveMetastore.java
lib/trino-filesystem-azure/src/main/java/io/trino/filesystem/azure/AzureFileSystem.java
lib/trino-filesystem-gcs/src/main/java/io/trino/filesystem/gcs/GcsUtils.java
plugin/trino-bigquery/src/main/java/io/trino/plugin/bigquery/ReadSessionCreator.java
Makes sense. Thanks for explaining.
be998b5 to e3f4909
e3f4909 to 7b650fb
7b650fb to 7ce30b1
I checked and we can drop the throws clause from that interface, as none of the implementations throw checked exceptions. We could actually replace it with UnaryOperator<Table>.
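As a rough sketch of that suggestion (the interface name `TableUpdate` and the simplified `Table` class below are hypothetical stand-ins, not the actual types touched in this PR), a custom functional interface that declares an unused checked exception forces wrapping at every call site, while `UnaryOperator<Table>` does not:

```java
import java.util.function.UnaryOperator;

// Hypothetical illustration; not the real metastore types
final class UnaryOperatorSketch
{
    private UnaryOperatorSketch() {}

    // Simplified stand-in for the metastore Table type
    static final class Table
    {
        final String name;
        final String owner;

        Table(String name, String owner)
        {
            this.name = name;
            this.owner = owner;
        }
    }

    // Hypothetical custom interface declaring a checked exception that no implementation throws
    interface TableUpdate
    {
        Table apply(Table table)
                throws Exception;
    }

    // Callers of the custom interface must handle the checked exception anyway
    static Table updateWithCustomInterface(Table table, TableUpdate update)
    {
        try {
            return update.apply(table);
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // With UnaryOperator<Table> the call is direct and needs no wrapping
    static Table update(Table table, UnaryOperator<Table> update)
    {
        return update.apply(table);
    }
}
```

Both variants accept the same lambdas, for example `t -> new Table(t.name, "bob")`, so switching should not require changes at the call sites that pass them.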
The handling needs to be updated, as there is no longer a FailsafeException wrapping something. We should treat these like other Glue exceptions, probably by changing this to
    catch (SdkException e) {
        throw new TrinoException(HIVE_METASTORE_ERROR, e);
    }
This outer try block can be removed, and this handling moved to the block above.
plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/glue/GlueHiveMetastore.java
I don't think this is correct. This retry is in the Glue client and software.amazon.awssdk.services.glue.model.ConcurrentModificationException is a GlueException, so it should be thrown directly from the client.
The original retry code for Glue v1 dropTable and the associated test seem to be based on a stack trace captured after the failure was wrapped in TrinoException, which wouldn't be the case here. It would, however, be correct for the later-added updateTableStatistics, as that retry happens outside of the wrapping. The test TestHiveConcurrentModificationGlueMetastore should be updated to throw ConcurrentModificationException directly.
Ideally, we can find an actual exception from the Glue service to validate this.
I think it is fine. Looking at the v2 client retries, it uses AwsRetryStrategy#configureStrategy, which uses the same mechanism and adds some default handling of AwsServiceException classes, so handling another child of this class (ConcurrentModificationException) just extends this behavior.
However, Throwables.getRootCause is not needed here. The exception can be checked with instanceof directly, as we are not wrapping it in a TrinoException at this point.
When it comes to v1, it is analogous: com.amazonaws.retry.PredefinedRetryPolicies also checks whether the exception is an AmazonServiceException when performing retries, so this check is used not only externally to the client but also internally. It is therefore perfectly fine to check it in the lambda, but again, we don't need to check the root cause; a plain instanceof should do.
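The difference between the two checks can be sketched as follows. This is an illustration under assumptions: the `Fake*` exception classes are hypothetical stand-ins for the SDK hierarchy, and the root-cause walk is hand-written here rather than calling Guava's `Throwables.getRootCause`.

```java
// Hypothetical illustration; the Fake* classes stand in for SDK exception types
final class RetryConditionSketch
{
    private RetryConditionSketch() {}

    static class FakeServiceException
            extends RuntimeException
    {
        FakeServiceException(String message)
        {
            super(message);
        }
    }

    static class FakeConcurrentModificationException
            extends FakeServiceException
    {
        FakeConcurrentModificationException(String message)
        {
            super(message);
        }
    }

    // Condition evaluated inside the client, where the failure has not been wrapped yet
    static boolean retryInsideClient(Throwable failure)
    {
        return failure instanceof FakeConcurrentModificationException;
    }

    // Condition needed by a retry that sits outside any wrapping layer
    static boolean retryOutsideClient(Throwable failure)
    {
        Throwable root = failure;
        while (root.getCause() != null) {
            root = root.getCause();
        }
        return root instanceof FakeConcurrentModificationException;
    }
}
```

When the condition runs inside the client, the failure arrives unwrapped and both checks agree, so the plain `instanceof` is enough. The root-cause walk only matters once some outer layer has wrapped the exception.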
There is an article on this, which says that external retries shouldn't be done and retry customizers should be used instead:
https://docs.aws.amazon.com/codeguru/detector-library/java/aws-custom-retries/
TestHiveConcurrentModificationGlueMetastore and proxying updateTable no longer work, as those retries now happen deeper than GlueClient.
I think the way to test this is to build the Glue client and assert that the retry strategy that is set will retry when ConcurrentModificationException is encountered.
I changed the second commit, "Do not rely on Failsafe in glue metastore", to test whether the retry policy is set up correctly on the Glue client, instead of proxying a method to throw the exception, as that no longer works now that the exception is handled deeper.
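A minimal sketch of that testing approach, assuming a hypothetical `FakeClient` that keeps its retry condition as a `Predicate<Throwable>` (the real Glue clients expose their retry configuration differently, so this only shows the shape of the test):

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical illustration; FakeClient is not the AWS Glue client
final class InternalRetrySketch
{
    private InternalRetrySketch() {}

    static class FakeConcurrentModificationException
            extends RuntimeException
    {
        FakeConcurrentModificationException(String message)
        {
            super(message);
        }
    }

    // Client that retries transport calls internally, below any method a proxy could intercept
    static final class FakeClient
    {
        private final Predicate<Throwable> retryCondition;
        private final int maxAttempts;

        FakeClient(Predicate<Throwable> retryCondition, int maxAttempts)
        {
            this.retryCondition = retryCondition;
            this.maxAttempts = maxAttempts;
        }

        Predicate<Throwable> retryCondition()
        {
            return retryCondition;
        }

        <T> T execute(Supplier<T> transportCall)
        {
            for (int attempt = 1; ; attempt++) {
                try {
                    return transportCall.get();
                }
                catch (RuntimeException e) {
                    // Give up when attempts are exhausted or the failure is not retryable
                    if (attempt >= maxAttempts || !retryCondition.test(e)) {
                        throw e;
                    }
                }
            }
        }
    }

    // Transport that fails with a concurrent modification the given number of times, then succeeds
    static Supplier<String> failingTransport(int failures)
    {
        int[] remaining = {failures};
        return () -> {
            if (remaining[0]-- > 0) {
                throw new FakeConcurrentModificationException("table was modified concurrently");
            }
            return "updated";
        };
    }
}
```

A test can then build the client and assert on `client.retryCondition().test(...)` directly, instead of wrapping `updateTable` in a proxy that throws from a layer the internal retries never see.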
@electrum can I get a re-review after the latest changes? I'd like to proceed with this PR.
plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/glue/v1/GlueClientUtil.java
lib/trino-filesystem/src/main/java/io/trino/filesystem/TrinoFileSystem.java
lib/trino-filesystem/src/main/java/io/trino/filesystem/TrinoFileSystemException.java
7ce30b1 to f7db45d
1bdf40d to f3dd5e7
Utilize it in filesystems to mark operations that are terminal and shouldn't be retried
Add retries to the Glue v1 and v2 clients for ConcurrentModificationException instead of relying on custom retries with Failsafe. Change the tests related to ConcurrentModificationException, as those are now handled by the Glue client internally. Instead of proxying the client to throw this exception, check whether the Glue client's retry policy is able to retry on this exception.
Instead of aborting Failsafe's retries on certain conditions, make retries happen only under specific circumstances. This should yield more predictable retries that are explicitly set in code. Abort retries on TrinoFileSystemException, as it's not meant to be retried.
Remove the retry in AbstractTrinoCatalog, as it can never catch the exception on which the retry was set up. Reduce the number of retries in AbstractIcebergTableOperations, as 20 retries with a max time of 10 minutes seems far too many.
f3dd5e7 to ade596b
Retry only when certain conditions occur, instead of aborting when they do not occur
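The difference between the two styles can be sketched with plain predicates. This is an illustration only: `TerminalException` and `TransientException` are hypothetical, and the predicates are not Failsafe API calls.

```java
import java.util.function.Predicate;

// Hypothetical illustration; not Failsafe configuration
final class RetryListSketch
{
    private RetryListSketch() {}

    // Failure that is known to be pointless to retry
    static class TerminalException
            extends RuntimeException
    {
        TerminalException(String message)
        {
            super(message);
        }
    }

    // Failure that is known to be worth retrying
    static class TransientException
            extends RuntimeException
    {
        TransientException(String message)
        {
            super(message);
        }
    }

    // Old style: retry everything except the failures explicitly marked as terminal
    static final Predicate<Throwable> DENY_LIST = failure -> !(failure instanceof TerminalException);

    // New style: retry only the failures explicitly marked as retryable
    static final Predicate<Throwable> ALLOW_LIST = failure -> failure instanceof TransientException;
}
```

The two predicates agree on the known exception types and differ only for failures nobody anticipated, such as an `IllegalStateException`: the deny list retries it, the allow list fails fast.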
ade596b to 7540ae9
Thanks!
@oskar-szwajkowski
Looking into it
Description
This PR improves retries and exception handling around file system operations. Additional context can be found in the commit messages and in #22678.
Changed places in the S3/Azure/GCS file systems where we throw IOException to use UnrecoverableIOException whenever it makes sense (it extends IOException, so it is backward compatible).
All SDK exceptions are wrapped with UnrecoverableIOException, as third-party clients have default retry logic, and we shouldn't retry further when it fails.
Additional context and related issues
Closes #22678
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(X) Release notes are required, with the following suggested text:
* Skip retrying some file system operations that are meant to fail, failing fast instead.
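The backward-compatibility claim in the description can be sketched as follows. The class body below is a hypothetical reconstruction that only reuses the name `UnrecoverableIOException` from the description, and `IoOperation`, `outcome`, and `legacyCatches` are made-up helpers.

```java
import java.io.IOException;

// Hypothetical illustration; only the exception name comes from the PR description
final class UnrecoverableSketch
{
    private UnrecoverableSketch() {}

    // Marker subclass: signals that retrying the operation will not help
    static class UnrecoverableIOException
            extends IOException
    {
        UnrecoverableIOException(String message)
        {
            super(message);
        }
    }

    interface IoOperation
    {
        void run()
                throws IOException;
    }

    // New callers can tell terminal failures apart and fail fast
    static String outcome(IoOperation operation)
    {
        try {
            operation.run();
            return "ok";
        }
        catch (UnrecoverableIOException e) {
            return "fail fast";
        }
        catch (IOException e) {
            return "may retry";
        }
    }

    // Existing callers that only catch IOException keep working unchanged
    static boolean legacyCatches(IoOperation operation)
    {
        try {
            operation.run();
            return false;
        }
        catch (IOException e) {
            return true;
        }
    }
}
```

Because the marker type is a subclass, code that predates it still compiles and still catches the failure, while retry logic can opt in to treating it as terminal.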