
Improve retry logic#22814

Merged
electrum merged 5 commits into trinodb:master from oskar-szwajkowski:osz/improve-retry-logic
Aug 13, 2024
Conversation

@oskar-szwajkowski
Contributor

Description

This PR improves retries and exception handling around file system operations; additional context can be found in the commit messages and in #22678.

Changed the places in the S3/Azure/GCS filesystems where we throw IOException to use UnrecoverableIOException wherever it makes sense (it extends IOException, so it is backward compatible).

All SDK exceptions are wrapped with UnrecoverableIOException, since the third-party clients have their own default retry logic; when that fails, we shouldn't retry further on top.
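As a rough illustration of the approach described above, such a marker exception is just an IOException subclass, which keeps existing `catch (IOException e)` handlers working while letting retry loops single it out. The class name comes from this PR; the exact implementation in Trino may differ:

```java
import java.io.IOException;

// Hypothetical sketch of the marker exception described above.
// It extends IOException, so existing handlers remain compatible,
// while retry logic can recognize it and fail fast.
class UnrecoverableIOException
        extends IOException
{
    public UnrecoverableIOException(String message)
    {
        super(message);
    }

    public UnrecoverableIOException(String message, Throwable cause)
    {
        super(message, cause);
    }
}
```

Wrapping an SDK exception would then look like `throw new UnrecoverableIOException("S3 GET failed", sdkException)`, signaling that the SDK client already exhausted its own retries.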

Additional context and related issues

Closes #22678

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(X) Release notes are required, with the following suggested text:

* Skip retrying file system operations that are bound to fail, and fail fast instead.

@cla-bot cla-bot bot added the cla-signed label Jul 25, 2024
@github-actions github-actions bot added iceberg Iceberg connector hive Hive connector bigquery BigQuery connector labels Jul 25, 2024
@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch 5 times, most recently from 7f134b4 to 2b3ac04 Compare July 25, 2024 18:04
@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch from 2b3ac04 to 81e5215 Compare July 25, 2024 18:37
@electrum
Member

We should remove the retry logic for materialized view fetches. I have no idea why it was added there -- none of the other operations I looked at have it.

@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch from 81e5215 to 26ed395 Compare July 25, 2024 21:21
@oskar-szwajkowski
Contributor Author

We should remove the retry logic for materialized view fetches. I have no idea why it was added there -- none of the other operations I looked at have it.

Did you mean remove retries from AbstractIcebergTableOperations#refreshFromMetadataLocation or AbstractTrinoCatalog#getMaterializedView ?

or both?

I see that AbstractTrinoCatalog#getMaterializedView is retried only on MaterializedViewMayBeBeingRemovedException, which is never instantiated, so it can be removed rather safely.

As for AbstractIcebergTableOperations#refreshFromMetadataLocation, it seems to retry in more cases, but 20 retries seems like a lot for this operation.

@electrum
Member

It seems fairly arbitrary which of these exceptions are made non-retryable. For example, why couldn't you retry on a listing failure?

@oskar-szwajkowski
Contributor Author

It seems fairly arbitrary which of these exceptions are made non-retryable. For example, why couldn't you retry on a listing failure?

I followed the logic that if we catch a raw SDK exception, it means the underlying library has already failed to retry (which it does by default for retryable exceptions), so there is no point trying again.

I checked that all AWS/Azure/GCP clients are set up with default retry policies that we are not overriding, so we get retries for free; adding more on top just makes things fail after more time.

Described it more here: #22678 (comment)
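The fail-fast behavior described above can be sketched in plain Java (the helper and exception names are hypothetical, not Trino's actual code): retry transient I/O failures a bounded number of times, but rethrow immediately when the failure is marked terminal, since the SDK already retried internally:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical sketch: retry transient I/O failures, but fail fast on
// exceptions marked terminal (modeled here by a marker subclass).
class FailFastRetry
{
    static class TerminalIOException
            extends IOException
    {
        TerminalIOException(String message)
        {
            super(message);
        }
    }

    static <T> T withRetries(Callable<T> operation, int maxAttempts)
            throws Exception
    {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            }
            catch (TerminalIOException e) {
                // The SDK client already retried internally;
                // retrying again only delays the inevitable failure
                throw e;
            }
            catch (IOException e) {
                last = e; // transient failure: try again
            }
        }
        throw last;
    }
}
```

A terminal failure such as "bucket does not exist" surfaces after a single attempt, while transient failures still get the bounded retry budget.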

@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch from 26ed395 to cd96b7e Compare July 25, 2024 22:21
@oskar-szwajkowski
Contributor Author

Added a new commit on top of the previous ones, changing the Iceberg retries @electrum

@electrum
Member

"Change iceberg related retries" looks good

@electrum
Member

I followed the logic that if we catch a raw SDK exception, it means the underlying library has already failed to retry (which it does by default for retryable exceptions), so there is no point trying again

Makes sense. Thanks for explaining.

@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch from be998b5 to e3f4909 Compare July 30, 2024 11:01
@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch from 7b650fb to 7ce30b1 Compare July 30, 2024 12:59
Member

I checked and we can drop the throws clause from that interface, as none of the implementations throw checked exceptions. We could actually replace it with UnaryOperator<Table>.
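The suggested cleanup can be sketched as follows (`Table` here is a stand-in for the real Trino type, and the interface name is illustrative): a custom functional interface carrying an unused throws clause can be replaced by the standard UnaryOperator once no implementation throws checked exceptions:

```java
import java.util.function.UnaryOperator;

// Illustrative sketch of the refactoring suggested above.
class TableOperatorExample
{
    // Stand-in for the real Table type
    static final class Table
    {
        final String name;
        final int version;

        Table(String name, int version)
        {
            this.name = name;
            this.version = version;
        }
    }

    // Before: a custom interface whose throws clause no implementation uses
    interface TableMutator
    {
        Table apply(Table table)
                throws Exception;
    }

    // After: once the throws clause is dropped, the standard
    // functional interface does the same job with no custom type
    static Table update(Table table, UnaryOperator<Table> mutator)
    {
        return mutator.apply(table);
    }
}
```

Besides removing a type, this lets callers pass plain lambdas and method references without wrapping checked exceptions.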

Member

The handling needs to be updated, as there is no longer a FailsafeException wrapping something. We should treat these like other Glue exceptions, probably by changing this to

```java
catch (SdkException e) {
    throw new TrinoException(HIVE_METASTORE_ERROR, e);
}
```

This outer try block can be removed, and this handling moved to the block above.

Member

@electrum electrum Aug 1, 2024

I don't think this is correct. This retry is in the Glue client and software.amazon.awssdk.services.glue.model.ConcurrentModificationException is a GlueException, so it should be thrown directly from the client.

The original retry code for Glue v1 dropTable and associated test seems to be based on a stack trace after the failure was wrapped in TrinoException, which wouldn't be the case. Although it would be correct for the later added updateTableStatistics as that retry happens outside of the wrapping. The test TestHiveConcurrentModificationGlueMetastore should be updated to throw ConcurrentModificationException directly.

Ideally, we can find an actual exception from the Glue service to validate this.

Contributor Author

I think it is fine. Looking at the v2 client retries, it uses AwsRetryStrategy#configureStrategy, which uses the same mechanism and adds some default handling of AwsServiceException classes, so handling another subclass (ConcurrentModificationException) just extends that behavior.

But Throwables.getRootCause is not needed here; it can be checked with instanceof directly, as we are not wrapping it with a Trino exception at this point.
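The instanceof-versus-root-cause point can be illustrated in plain Java (the exception classes here are stand-ins for the SDK hierarchy, not the real AWS types): a direct instanceof check suffices when the exception is thrown unwrapped, while walking the cause chain is only needed once something has wrapped it:

```java
// Illustrative sketch of checking an exception directly vs. via its root cause.
class RetryCheckExample
{
    // Stand-ins for the SDK exception hierarchy discussed above
    static class ServiceException
            extends RuntimeException
    {
        ServiceException() {}

        ServiceException(Throwable cause)
        {
            super(cause);
        }
    }

    static class ConcurrentModificationException
            extends ServiceException {}

    // Direct check: sufficient when the exception arrives unwrapped
    static boolean shouldRetry(Throwable t)
    {
        return t instanceof ConcurrentModificationException;
    }

    // Root-cause walk: only needed when the exception may be wrapped
    static boolean shouldRetryRootCause(Throwable t)
    {
        Throwable root = t;
        while (root.getCause() != null && root.getCause() != root) {
            root = root.getCause();
        }
        return root instanceof ConcurrentModificationException;
    }
}
```

The direct check is both simpler and stricter: it will not accidentally match an exception that merely has a retryable cause buried somewhere in its chain.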

Contributor Author

For v1 it is analogous: com.amazonaws.retry.PredefinedRetryPolicies also checks whether the exception is an AmazonServiceException when performing retries, so this check is used not only externally to the client but also internally. It is perfectly fine to check it in the lambda, but again, we don't need to check the root cause; a plain instanceof should do.

Contributor Author

There is an article on this, saying that external retries shouldn't be added and retry customizers should be used instead:
https://docs.aws.amazon.com/codeguru/detector-library/java/aws-custom-retries/

Contributor Author

TestHiveConcurrentModificationGlueMetastore and the proxied updateTable no longer hold, as those retries now happen deeper than GlueClient.

I think the way to test this is to build the Glue client and assert that its configured retry strategy will retry when a ConcurrentModificationException is encountered.

Contributor Author

I changed the second commit, "Do not rely on Failsafe in glue metastore", to test whether the retry policy is set up correctly on the Glue client instead of proxying a method to throw the exception, as that no longer works because the exception is handled deeper.

Contributor Author

@electrum can I have a re-review after the latest changes? I'd like to move this PR forward

@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch from 7ce30b1 to f7db45d Compare August 1, 2024 13:14
@github-actions github-actions bot added the delta-lake Delta Lake connector label Aug 1, 2024
@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch 2 times, most recently from 1bdf40d to f3dd5e7 Compare August 1, 2024 17:22
Utilize it in filesystems to mark operations
that are terminal and shouldn't be retried
Add retries to Glue v1 and v2 client for ConcurrentModificationException
instead of relying on custom retries with Failsafe

Change tests related to ConcurrentModificationException
as those are now handled by glue client internally

Instead of proxying client to throw this exception,
check if glue client's retry policy is able to retry on this exception
Instead of aborting Failsafe's retries on certain conditions,
make retries happen under specific circumstances

This should yield more predictable retries that are
explicitly set in code

Abort retries on TrinoFileSystemException, as it's
not meant to be retried
Remove retry in AbstractTrinoCatalog, as it can never
catch the exception on which the retry was set up

Reduce the number of retries in AbstractIcebergTableOperations,
as 20 retries with a max time of 10 minutes seems far too high
@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch from f3dd5e7 to ade596b Compare August 1, 2024 22:14
Retry only when certain conditions occur,
instead of aborting when they do not
@oskar-szwajkowski oskar-szwajkowski force-pushed the osz/improve-retry-logic branch from ade596b to 7540ae9 Compare August 2, 2024 07:03
@electrum electrum merged commit 5f191af into trinodb:master Aug 13, 2024
@electrum
Member

Thanks!

@github-actions github-actions bot added this to the 454 milestone Aug 13, 2024
@ebyhr
Member

ebyhr commented Aug 14, 2024

@oskar-szwajkowski TestIcebergGlueCatalogConnectorSmokeTest.testDeleteRowsConcurrently looks very flaky after this PR. Could you investigate the failure? https://github.com/trinodb/trino/actions/runs/10377395729/job/28731540853

@oskar-szwajkowski
Contributor Author

@oskar-szwajkowski TestIcebergGlueCatalogConnectorSmokeTest.testDeleteRowsConcurrently looks very flaky after this PR. Could you investigate the failure? https://github.com/trinodb/trino/actions/runs/10377395729/job/28731540853

Looking into it


Labels

bigquery (BigQuery connector), cla-signed, delta-lake (Delta Lake connector), hive (Hive connector), iceberg (Iceberg connector)

Development

Successfully merging this pull request may close these issues.

Retrying filesystem operations doesn't make sense for terminal exceptions (like bucket not exists)

5 participants