
Improve retry logic#22814

Merged
electrum merged 5 commits into trinodb:master from oskar-szwajkowski:osz/improve-retry-logic
Aug 13, 2024
Conversation

@oskar-szwajkowski
Contributor

Description

This PR improves retries and exception handling around file system operations; additional context can be found in the commit messages and in #22678.

Changed the places in the S3/Azure/GCS filesystems where we throw IOException to use UnrecoverableIOException wherever it makes sense (it extends IOException, so it is backward compatible).

All SDK exceptions are wrapped with UnrecoverableIOException, since the third-party clients have their own default retry logic; when that fails, we shouldn't retry further on top.
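As a rough illustration of the approach described above, such a marker exception is just an IOException subclass, which keeps existing `catch (IOException e)` handlers working while letting retry loops single it out. The class name comes from this PR; the exact implementation in Trino may differ:

```java
import java.io.IOException;

// Hypothetical sketch of the marker exception described above.
// It extends IOException, so existing handlers remain compatible,
// while retry logic can recognize it and fail fast.
class UnrecoverableIOException
        extends IOException
{
    public UnrecoverableIOException(String message)
    {
        super(message);
    }

    public UnrecoverableIOException(String message, Throwable cause)
    {
        super(message, cause);
    }
}
```

Wrapping an SDK exception would then look like `throw new UnrecoverableIOException("S3 GET failed", sdkException)`, signaling that the SDK client already exhausted its own retries.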

Additional context and related issues

Closes #22678

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(X) Release notes are required, with the following suggested text:

* Skip retrying file system operations that are bound to fail, and fail fast instead.

@cla-bot cla-bot bot added the cla-signed label Jul 25, 2024
@github-actions github-actions bot added iceberg Iceberg connector hive Hive connector bigquery BigQuery connector labels Jul 25, 2024
@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch 5 times, most recently from 7f134b4 to 2b3ac04 Compare July 25, 2024 18:04
@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch from 2b3ac04 to 81e5215 Compare July 25, 2024 18:37
@electrum
Member

We should remove the retry logic for materialized view fetches. I have no idea why it was added there -- none of the other operations I looked at have it.

@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch from 81e5215 to 26ed395 Compare July 25, 2024 21:21
@oskar-szwajkowski
Contributor Author

We should remove the retry logic for materialized view fetches. I have no idea why it was added there -- none of the other operations I looked at have it.

Did you mean remove retries from AbstractIcebergTableOperations#refreshFromMetadataLocation or AbstractTrinoCatalog#getMaterializedView ?

or both?

I see that AbstractTrinoCatalog#getMaterializedView is retried only on MaterializedViewMayBeBeingRemovedException, which is never instantiated, so it can be removed rather safely.

As for AbstractIcebergTableOperations#refreshFromMetadataLocation, it seems to retry in more cases, but 20 retries seems like a lot for this operation.

@electrum
Member

It seems fairly arbitrary which of these exceptions are made non-retryable. For example, why couldn't you retry on a listing failure?

@oskar-szwajkowski
Contributor Author

It seems fairly arbitrary which of these exceptions are made non-retryable. For example, why couldn't you retry on a listing failure?

I followed the logic that if we catch a raw SDK exception, it means the underlying library has already failed to retry (which it does by default for retryable exceptions), so there is no point trying again.

I checked that all AWS/Azure/GCP clients are set up with default retry policies that we are not overriding, so we get retries for free; adding more on top just makes things fail after more time.

Described it more here: #22678 (comment)
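The fail-fast behavior described above can be sketched in plain Java (the helper and exception names are hypothetical, not Trino's actual code): retry transient I/O failures a bounded number of times, but rethrow immediately when the failure is marked terminal, since the SDK already retried internally:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical sketch: retry transient I/O failures, but fail fast on
// exceptions marked terminal (modeled here by a marker subclass).
class FailFastRetry
{
    static class TerminalIOException
            extends IOException
    {
        TerminalIOException(String message)
        {
            super(message);
        }
    }

    static <T> T withRetries(Callable<T> operation, int maxAttempts)
            throws Exception
    {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            }
            catch (TerminalIOException e) {
                // The SDK client already retried internally;
                // retrying again only delays the inevitable failure
                throw e;
            }
            catch (IOException e) {
                last = e; // transient failure: try again
            }
        }
        throw last;
    }
}
```

A terminal failure such as "bucket does not exist" surfaces after a single attempt, while transient failures still get the bounded retry budget.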

@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch from 26ed395 to cd96b7e Compare July 25, 2024 22:21
@oskar-szwajkowski
Contributor Author

Added a new commit on top of the previous ones, changing the Iceberg retries @electrum

@electrum
Member

"Change iceberg related retries" looks good

@electrum
Member

I followed the logic that if we catch a raw SDK exception, it means the underlying library has already failed to retry (which it does by default for retryable exceptions), so there is no point trying again

Makes sense. Thanks for explaining.

@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch from be998b5 to e3f4909 Compare July 30, 2024 11:01
@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch from 7b650fb to 7ce30b1 Compare July 30, 2024 12:59
Member

I checked and we can drop the throws clause from that interface, as none of the implementations throw checked exceptions. We could actually replace it with UnaryOperator<Table>.
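The suggested cleanup can be sketched as follows (`Table` here is a stand-in for the real Trino type, and the interface name is illustrative): a custom functional interface carrying an unused throws clause can be replaced by the standard UnaryOperator once no implementation throws checked exceptions:

```java
import java.util.function.UnaryOperator;

// Illustrative sketch of the refactoring suggested above.
class TableOperatorExample
{
    // Stand-in for the real Table type
    static final class Table
    {
        final String name;
        final int version;

        Table(String name, int version)
        {
            this.name = name;
            this.version = version;
        }
    }

    // Before: a custom interface whose throws clause no implementation uses
    interface TableMutator
    {
        Table apply(Table table)
                throws Exception;
    }

    // After: once the throws clause is dropped, the standard
    // functional interface does the same job with no custom type
    static Table update(Table table, UnaryOperator<Table> mutator)
    {
        return mutator.apply(table);
    }
}
```

Besides removing a type, this lets callers pass plain lambdas and method references without wrapping checked exceptions.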

Member

The handling needs to be updated, as there is no longer a FailsafeException wrapping something. We should treat these like other Glue exceptions, probably by changing this to

```java
catch (SdkException e) {
    throw new TrinoException(HIVE_METASTORE_ERROR, e);
}
```

This outer try block can be removed, and this handling moved to the block above.

Member

@electrum electrum Aug 1, 2024

I don't think this is correct. This retry is in the Glue client and software.amazon.awssdk.services.glue.model.ConcurrentModificationException is a GlueException, so it should be thrown directly from the client.

The original retry code for Glue v1 dropTable and associated test seems to be based on a stack trace after the failure was wrapped in TrinoException, which wouldn't be the case. Although it would be correct for the later added updateTableStatistics as that retry happens outside of the wrapping. The test TestHiveConcurrentModificationGlueMetastore should be updated to throw ConcurrentModificationException directly.

Ideally, we can find an actual exception from the Glue service to validate this.

Contributor Author

I think it is fine. Looking at the v2 client retries, it uses AwsRetryStrategy#configureStrategy, which uses the same mechanism and adds some default handling of AwsServiceException classes, so handling another subclass (ConcurrentModificationException) just extends that behavior.

But Throwables.getRootCause is not needed here; it can be checked with instanceof directly, as we are not wrapping it with a Trino exception at this point.
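The instanceof-versus-root-cause point can be illustrated in plain Java (the exception classes here are stand-ins for the SDK hierarchy, not the real AWS types): a direct instanceof check suffices when the exception is thrown unwrapped, while walking the cause chain is only needed once something has wrapped it:

```java
// Illustrative sketch of checking an exception directly vs. via its root cause.
class RetryCheckExample
{
    // Stand-ins for the SDK exception hierarchy discussed above
    static class ServiceException
            extends RuntimeException
    {
        ServiceException() {}

        ServiceException(Throwable cause)
        {
            super(cause);
        }
    }

    static class ConcurrentModificationException
            extends ServiceException {}

    // Direct check: sufficient when the exception arrives unwrapped
    static boolean shouldRetry(Throwable t)
    {
        return t instanceof ConcurrentModificationException;
    }

    // Root-cause walk: only needed when the exception may be wrapped
    static boolean shouldRetryRootCause(Throwable t)
    {
        Throwable root = t;
        while (root.getCause() != null && root.getCause() != root) {
            root = root.getCause();
        }
        return root instanceof ConcurrentModificationException;
    }
}
```

The direct check is both simpler and stricter: it will not accidentally match an exception that merely has a retryable cause buried somewhere in its chain.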

Contributor Author

For v1 it is analogous: com.amazonaws.retry.PredefinedRetryPolicies also checks whether the exception is an AmazonServiceException when performing retries, so this check is used not only externally to the client but also internally. It is perfectly fine to check it in the lambda, but again, we don't need to check the root cause; a plain instanceof should do.

Contributor Author

There is an article on this, saying that external retries shouldn't be added and retry customizers should be used instead:
https://docs.aws.amazon.com/codeguru/detector-library/java/aws-custom-retries/

Contributor Author

TestHiveConcurrentModificationGlueMetastore and the proxied updateTable no longer hold, as those retries now happen deeper than GlueClient.

I think the way to test this is to build the Glue client and assert that its configured retry strategy will retry when a ConcurrentModificationException is encountered.

Contributor Author

I changed the second commit, "Do not rely on Failsafe in glue metastore", to test whether the retry policy is set up correctly on the Glue client instead of proxying a method to throw the exception, as that no longer works because the exception is handled deeper.

Contributor Author

@electrum can I have a re-review after the latest changes? I'd like to move this PR forward

@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch from 7ce30b1 to f7db45d Compare August 1, 2024 13:14
@github-actions github-actions bot added the delta-lake Delta Lake connector label Aug 1, 2024
@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch 2 times, most recently from 1bdf40d to f3dd5e7 Compare August 1, 2024 17:22
Utilize it in filesystems to mark operations
that are terminal and shouldn't be retried
Add retries to Glue v1 and v2 client for ConcurrentModificationException
instead of relying on custom retries with Failsafe

Change tests related to ConcurrentModificationException
as those are now handled by glue client internally

Instead of proxying client to throw this exception,
check if glue client's retry policy is able to retry on this exception
Instead of aborting Failsafe's retries on certain conditions,
make retries happen under specific circumstances

This should yield more predictable retries that are
explicitly set in code

Abort retries on TrinoFileSystemException, as it's
not meant to be retried
Remove retry in AbstractTrinoCatalog, as it can never
catch the exception on which the retry was set up

Reduce the number of retries in AbstractIcebergTableOperations,
as 20 retries with a max time of 10 minutes seems far too high
@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch from f3dd5e7 to ade596b Compare August 1, 2024 22:14
Retry only when certain conditions occur,
instead of aborting when they do not
@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch from ade596b to 7540ae9 Compare August 2, 2024 07:03
@electrum electrum merged commit 5f191af into trinodb:master Aug 13, 2024
@electrum
Member

Thanks!

@github-actions github-actions bot added this to the 454 milestone Aug 13, 2024
@ebyhr
Member

ebyhr commented Aug 14, 2024

@oskar-szwajkowski TestIcebergGlueCatalogConnectorSmokeTest.testDeleteRowsConcurrently looks very flaky after this PR. Could you investigate the failure? https://github.com/trinodb/trino/actions/runs/10377395729/job/28731540853

@oskar-szwajkowski
Contributor Author

@oskar-szwajkowski TestIcebergGlueCatalogConnectorSmokeTest.testDeleteRowsConcurrently looks very flaky after this PR. Could you investigate the failure? https://github.com/trinodb/trino/actions/runs/10377395729/job/28731540853

Looking into it


Labels

bigquery (BigQuery connector), cla-signed, delta-lake (Delta Lake connector), hive (Hive connector), iceberg (Iceberg connector)

Development

Successfully merging this pull request may close these issues.

Retrying filesystem operations doesn't make sense for terminal exceptions (like bucket not exists)

5 participants