
Conversation

@fuatbasik (Collaborator)

Description of change

This change adapts the retry logic to the new PhysicalIO. Timeout and retry handling are now merged into a single place.

With this change, consumers can also pass a timeout setting as part of the retry strategy. If one is passed, AAL will not use the timeout values from the PhysicalIO configuration.

I also removed timeouts from Telemetry, as they are no longer relevant there; Failsafe now manages both retries and timeouts.
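As a rough sketch of the override rule above (all names here are illustrative, not AAL's actual API): a timeout supplied with the retry strategy wins, otherwise the PhysicalIO configuration value is used.

```java
// Sketch of the timeout-precedence rule described above: a timeout supplied
// with the retry strategy overrides the PhysicalIO configuration default.
// All names here are illustrative, not AAL's real API.
public class TimeoutResolution {

    /**
     * @param retryStrategyTimeoutMillis timeout passed by the consumer via the
     *        retry strategy, or null if none was supplied
     * @param blockReadTimeoutMillis the PhysicalIO configuration default
     * @return the effective timeout in milliseconds
     */
    public static long effectiveTimeoutMillis(
            Long retryStrategyTimeoutMillis, long blockReadTimeoutMillis) {
        if (retryStrategyTimeoutMillis != null && retryStrategyTimeoutMillis > 0) {
            return retryStrategyTimeoutMillis; // consumer-supplied value overrides
        }
        return blockReadTimeoutMillis; // fall back to PhysicalIO configuration
    }

    public static void main(String[] args) {
        System.out.println(effectiveTimeoutMillis(5_000L, 30_000L)); // 5000
        System.out.println(effectiveTimeoutMillis(null, 30_000L));   // 30000
    }
}
```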

Relevant issues

N/A

Does this contribution introduce any breaking changes to the existing APIs or behaviors?

No.

Does this contribution introduce any new public APIs or behaviors?

Yes, please see the description. Javadocs are updated accordingly.

How was the contribution tested?

Added new unit tests and modified existing unit tests.

Does this contribution need a changelog entry?

  • I have updated the CHANGELOG or README if appropriate

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and I agree to the terms of the Developer Certificate of Origin (DCO).

@fuatbasik fuatbasik temporarily deployed to integration-tests August 4, 2025 16:17 — with GitHub Actions Inactive
@fuatbasik changed the title from "Adopt retries to new PhysicalIO" to "Adapt retries to new PhysicalIO" Aug 4, 2025
@ozkoca (Contributor) left a comment

LGTM. Added a few minor requests.

* @param durationInMillis Timeout duration, in milliseconds, for reading from storage
* @param retryCount Number of times to retry if the timeout is exceeded
*/
void timeout(long durationInMillis, int retryCount);

nit: could be renamed to something like setTimeoutPolicy()

@fuatbasik fuatbasik temporarily deployed to integration-tests August 6, 2025 12:35 — with GitHub Actions Inactive
@ahmarsuhail (Collaborator) left a comment

Thanks @fuatbasik, really like how simple this is becoming. Left a few questions.

Also, do we already have an ITest for the retry behaviour from a previous PR, or is there another test that covers this behaviour?

It would be good to test the number of GET requests made when there is a timeout (maybe the gray failure tests cover this?), and also that when a timeout is set in openStream information, it overrides the PhysicalIO configuration (sorry if these tests already exist and I've just missed them!)

boolean shouldEvict = false;

// Check for IO errors while reading data
if (e instanceof IOException
Why are we no longer evicting on a timeout?

@fuatbasik (Author) replied:

Retry should retry the timed-out blocks. I am not sure a timeout should be considered a persistent change (like an etag change); therefore, evicting already-existing blocks sounds wrong to me.
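The transient-vs-persistent distinction being made here could be sketched like this (ObjectModifiedException is a hypothetical marker, not a real AAL class):

```java
import java.io.IOException;

// Sketch of the eviction decision discussed above. A timeout is transient:
// the retry policy re-reads the block and cached data stays valid. Only a
// persistent change (e.g. the object's etag changed) should evict blocks.
// ObjectModifiedException is a hypothetical marker, not a real AAL class.
public class EvictionDecision {

    public static class ObjectModifiedException extends IOException {}

    public static boolean shouldEvict(Throwable e) {
        // Persistent inconsistency: cached blocks may be stale, evict them.
        if (e instanceof ObjectModifiedException) {
            return true;
        }
        // Timeouts and other transient IO errors are handled by retrying
        // the read; existing blocks remain usable.
        return false;
    }
}
```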

} catch (IOException e) {
LOG.error("IOException while reading blocks", e);
setErrorOnBlocksAndRemove(blocks, e);
if (objectContent == null) {
Not related to your PR, but in what scenario do we expect ObjectContent to be null?

@fuatbasik (Author) replied:

Actually, we don't. I think this is just an additional check to avoid an NPE.

provided = new DefaultRetryStrategyImpl();
}

if (this.physicalIOConfiguration.getBlockReadTimeout() > 0) {
ok, so the major change is that Failsafe handles the timeout for us: if processReadTask() takes longer than X seconds for whatever reason, it gets retried. This is so much simpler now, nice :)
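A stdlib-only sketch of that behaviour (Failsafe's real API is different; this only illustrates "bound each attempt by a timeout, retry a timed-out attempt"):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Stdlib-only illustration of what a Failsafe timeout-plus-retry policy
// does for a read task: each attempt is bounded by timeoutMillis, and a
// timed-out attempt is retried up to maxRetries times. Names here are
// illustrative, not AAL's real code.
public class TimedRetry {

    public static <T> T callWithTimeoutAndRetry(
            Callable<T> task, long timeoutMillis, int maxRetries)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            TimeoutException last = null;
            for (int attempt = 0; attempt <= maxRetries; attempt++) {
                Future<T> future = executor.submit(task);
                try {
                    return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    future.cancel(true); // abandon the slow attempt
                    last = e;
                }
            }
            throw last; // all attempts timed out
        } finally {
            executor.shutdownNow();
        }
    }
}
```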

ArgumentCaptor<GetRequest> requestCaptor = ArgumentCaptor.forClass(GetRequest.class);
verify(objectClient, timeout(1_000).times(3)).getObject(requestCaptor.capture(), any());

List<GetRequest> getRequestList = requestCaptor.getAllValues();
Why did you need to update this test?

@fuatbasik (Author), Aug 7, 2025:

In this test we know we will make 3 reads (two 5 MB and one 3 MB), but we do not know their order due to async behaviour. Therefore, I updated the test.
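One way to make such an assertion order-insensitive is to compare the captured ranges as a multiset rather than a sequence (the byte ranges below are illustrative for two 5 MB reads and one 3 MB read):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of an order-insensitive check: when async execution makes the
// order of the three GET requests nondeterministic, sort copies of both
// lists so that any permutation of the expected ranges passes.
public class OrderInsensitiveCheck {

    public static boolean sameRanges(List<String> captured, List<String> expected) {
        List<String> a = new ArrayList<>(captured);
        List<String> b = new ArrayList<>(expected);
        Collections.sort(a);
        Collections.sort(b);
        return a.equals(b); // equal as multisets, regardless of arrival order
    }
}
```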

@fuatbasik fuatbasik temporarily deployed to integration-tests August 7, 2025 10:36 — with GitHub Actions Inactive
openStreamInfo);

// Asserting there is a timeout
assertThrows(IOException.class, overrideStream::read);
It would be good to also assert that the cause of the IOException is a timeout.
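The suggested assertion could walk the cause chain roughly like this (that the cause is a java.util.concurrent.TimeoutException, rather than Failsafe's own timeout exception type, is an assumption to verify against the real code):

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

// Sketch of the reviewer's suggestion: don't just assert that an
// IOException is thrown; also inspect its cause chain for a timeout.
// Using java.util.concurrent.TimeoutException as the cause type is an
// assumption, not confirmed against AAL's actual wrapping.
public class CauseCheck {

    public static boolean causedByTimeout(IOException e) {
        Throwable cause = e.getCause();
        while (cause != null) {
            if (cause instanceof TimeoutException) {
                return true;
            }
            cause = cause.getCause(); // keep unwrapping nested causes
        }
        return false;
    }
}
```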

@ahmarsuhail (Collaborator) left a comment
+1, LGTM

@fuatbasik fuatbasik merged commit 30393bd into awslabs:main Aug 7, 2025
4 checks passed