@Sparks0219
Contributor

Object fetches may take a while in the presence of transient network errors: reconstruction chains can be long, and when a push/pull fails we need to resend all the chunks again. As a result, we were seeing some ObjectFetchTimeout errors in the cross-AZ transient network error release tests. This PR bumps the timeout from 10 min to 30 min accordingly.
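As a quick check of the units, the millisecond value can be derived from minutes. This is only a minimal sketch: the helper `minutes_to_ms` and the constant names are hypothetical and not part of Ray; only the environment variable name and the 10 min / 30 min figures come from this PR.

```python
def minutes_to_ms(minutes: int) -> int:
    """Convert whole minutes to milliseconds (hypothetical helper)."""
    return minutes * 60 * 1000


# Previous timeout, per the description above (10 minutes).
OLD_TIMEOUT_MS = minutes_to_ms(10)
# New timeout set by this PR (30 minutes).
NEW_TIMEOUT_MS = minutes_to_ms(30)

# Render the setting in the NAME=VALUE form used by the test config.
setting = f"RAY_fetch_fail_timeout_milliseconds={NEW_TIMEOUT_MS}"
print(setting)
```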

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request increases the object fetch timeout to 30 minutes for a release test that simulates transient network errors. This is a reasonable change to fix test flakiness. My feedback includes one suggestion to add a comment to the timeout value for better readability.

- RAY_health_check_timeout_ms=100000
- RAY_health_check_failure_threshold=10
- RAY_gcs_rpc_server_connect_timeout_s=60
- RAY_fetch_fail_timeout_milliseconds=1800000
medium

For better readability and maintainability, consider adding a comment to clarify that this value represents 30 minutes. This helps other developers understand the value at a glance without having to perform the calculation.

        - RAY_fetch_fail_timeout_milliseconds=1800000 # 30 minutes
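One way to keep such a comment in sync with the value is to generate it. The sketch below is hypothetical helper code, not part of Ray or its release tooling; it assumes each entry is a `NAME=VALUE` string whose value is an integer number of milliseconds.

```python
def annotate_ms_setting(entry: str) -> str:
    """Append a human-readable duration comment to a NAME=VALUE entry.

    Hypothetical helper: assumes VALUE is an integer in milliseconds.
    """
    name, sep, raw = entry.partition("=")
    if not sep:
        raise ValueError(f"expected NAME=VALUE, got {entry!r}")
    ms = int(raw)
    # Pick the largest unit that divides the value exactly.
    if ms % 60_000 == 0:
        label = f"{ms // 60_000} min"
    elif ms % 1_000 == 0:
        label = f"{ms // 1_000} s"
    else:
        label = f"{ms} ms"
    return f"{name}={ms} # {label}"


print(annotate_ms_setting("RAY_fetch_fail_timeout_milliseconds=1800000"))
```

The helper only reports exact unit conversions, so a value such as 100000 ms is labelled in seconds rather than as a fractional number of minutes.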

@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core) and release-test (release test) labels on Jan 10, 2026
