[serve][tests] Add a timeout for resnet app image request #51569

Merged

Contributor

@akyang-anyscale akyang-anyscale commented Mar 20, 2025

Why are these changes needed?

The request for the image in the resnet50 application could hang indefinitely, which could block the event loop and make tests flaky. This PR adds a 5s timeout to the requests.get call.
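For illustration, here is a minimal sketch of the kind of change described above. It is not the actual diff. The function name `fetch_image_bytes` and the injectable `http_get` parameter are hypothetical, added only so the sketch can be exercised without network access. Note that the `timeout` argument of `requests.get` bounds the connection attempt and each wait for data from the server, not the total download time.

```python
def fetch_image_bytes(url, timeout_s=5.0, http_get=None):
    """Download `url` and return the body, failing fast if the server stalls."""
    if http_get is None:
        # Imported lazily so the sketch can be exercised with a stub getter.
        import requests

        http_get = requests.get
    # Without `timeout`, requests.get waits indefinitely for the server.
    response = http_get(url, timeout=timeout_s)
    response.raise_for_status()
    return response.content
```

With the real `requests.get`, a stalled server would surface as a `requests.exceptions.Timeout` instead of a hung replica.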

The replica is stuck for an unknown reason. As a result, requests make no progress and clients disconnect after timing out:

Level | Timestamp | Component | Request ID | Node | Replica ID | Message
-- | -- | -- | -- | -- | -- | --
I | 2025-03-20, 0:02:18.331 | replica | d4c5cca8-c654-4d45-b212-ae0061194a86 | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 60001.2ms
I | 2025-03-20, 0:02:19.247 | replica | d486af45-d50e-468f-903f-d4bf723ea143 | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59998.3ms
I | 2025-03-20, 0:02:19.785 | replica | 173f5209-2324-43f4-a59d-d4e5ee974462 | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59998.1ms
I | 2025-03-20, 0:02:22.534 | proxy | edeba18e-7af1-4881-a584-7f528f6d9ed2 | ip-10-0-43-65 |   | Replica(id='6v27b6by', deployment='Model', app='default') rejected request because it is at max capacity of 5 ongoing requests. Retrying request edeba18e-7af1-4881-a584-7f528f6d9ed2.
I | 2025-03-20, 0:03:16.520 | replica | 100a8f5d-b8be-4263-ada3-518419eb6673 | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59995.2ms
I | 2025-03-20, 0:03:17.747 | replica | d433ecfa-a55b-4a56-bbad-0de5c6f1f6d9 | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59999.2ms
I | 2025-03-20, 0:03:19.081 | replica | ed2a9925-5e9b-474c-a644-ff701c8c7899 | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59998.9ms
I | 2025-03-20, 0:03:19.273 | replica | f67b7e3a-d33c-43e6-88a8-415c9aa6be69 | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59998.4ms
I | 2025-03-20, 0:03:20.918 | replica | 696ca5d9-f9b4-4899-a813-c2640a24c1ff | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59999.3ms
I | 2025-03-20, 0:03:22.772 | proxy | 5dafad1a-455e-46ab-b584-686fafb7f420 | ip-10-0-43-65 |   | Replica(id='6v27b6by', deployment='Model', app='default') rejected request because it is at max capacity of 5 ongoing requests. Retrying request 5dafad1a-455e-46ab-b584-686fafb7f420.
I | 2025-03-20, 0:04:16.590 | replica | 54e4994f-efaa-413e-b45b-11080b60573f | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59996.8ms
I | 2025-03-20, 0:04:18.386 | replica | cc74890f-9369-4664-9dd7-4bbadc979245 | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59998.7ms
I | 2025-03-20, 0:04:20.177 | proxy | 3d09d80c-9180-47cb-b8d2-16e54420447a | ip-10-0-43-65 |   | Replica(id='6v27b6by', deployment='Model', app='default') rejected request because it is at max capacity of 5 ongoing requests. Retrying request 3d09d80c-9180-47cb-b8d2-16e54420447a.
I | 2025-03-20, 0:04:20.922 | replica | 769f085e-17d2-4fec-ab78-9051cb2ae1ee | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59999.1ms
I | 2025-03-20, 0:04:21.137 | replica | a1515707-2daf-4c85-9894-9303351ac64f | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59997.8ms
I | 2025-03-20, 0:04:21.486 | replica | 9db28a80-a45a-4a0a-a209-ccd8f7d4a559 | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59998.7ms
I | 2025-03-20, 0:04:29.288 | proxy | 58cb1d1e-fb68-4367-8b71-a704c140f8ab | ip-10-0-43-65 |   | Replica(id='6v27b6by', deployment='Model', app='default') rejected request because it is at max capacity of 5 ongoing requests. Retrying request 58cb1d1e-fb68-4367-8b71-a704c140f8ab.
I | 2025-03-20, 0:05:16.701 | replica | 868d7cd3-e205-4a8b-9747-7a36a362be0c | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59997.5ms
I | 2025-03-20, 0:05:20.062 | replica | 413193a9-1162-4dfd-a4bd-714cc3973cfe | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59999.1ms
I | 2025-03-20, 0:05:21.596 | replica | 7a5109e9-d975-4319-b8b1-412b6b90b6c7 | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 60001.3ms
I | 2025-03-20, 0:05:21.946 | replica | 2dd3a7d2-0988-46db-abed-088241b1f065 | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59999.3ms
I | 2025-03-20, 0:05:22.908 | proxy | d95da992-9e5e-40f6-8746-afcf8f3cbc34 | ip-10-0-43-65 |   | Replica(id='6v27b6by', deployment='Model', app='default') rejected request because it is at max capacity of 5 ongoing requests. Retrying request d95da992-9e5e-40f6-8746-afcf8f3cbc34.
I | 2025-03-20, 0:05:23.046 | replica | 65460812-87b2-4410-8360-c82521233400 | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59997.2ms
I | 2025-03-20, 0:05:24.479 | proxy | a8bc058a-04f6-41c8-9c66-49956f32dad0 | ip-10-0-41-135 |   | Replica(id='6v27b6by', deployment='Model', app='default') rejected request because it is at max capacity of 5 ongoing requests. Retrying request a8bc058a-04f6-41c8-9c66-49956f32dad0.
I | 2025-03-20, 0:06:16.765 | replica | b34ad934-8662-4e4c-a109-9cb44c636613 | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 60000.9ms
I | 2025-03-20, 0:06:17.227 | proxy | a571adce-3fe6-4799-a72c-d2ad11594edc | ip-10-0-41-135 |   | Replica(id='6v27b6by', deployment='Model', app='default') rejected request because it is at max capacity of 5 ongoing requests. Retrying request a571adce-3fe6-4799-a72c-d2ad11594edc.
I | 2025-03-20, 0:06:20.594 | replica | 3175da08-d8ed-4928-ba62-1a9dc774b690 | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59999.2ms
I | 2025-03-20, 0:06:21.329 | proxy | 2b6cb246-0ff3-4a03-a129-4e5c69e99d6e | ip-10-0-43-65 |   | Replica(id='6v27b6by', deployment='Model', app='default') rejected request because it is at max capacity of 5 ongoing requests. Retrying request 2b6cb246-0ff3-4a03-a129-4e5c69e99d6e.
I | 2025-03-20, 0:06:22.382 | replica | b8adf8db-a21a-490f-98aa-ce4e2eb347fd | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 60000.3ms
I | 2025-03-20, 0:06:22.476 | replica | 03b6fe72-d630-479a-8072-0e1665192db9 | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 60000.7ms
I | 2025-03-20, 0:06:24.476 | replica | e260d657-0975-45dd-8503-98a53d8f201e | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59999.4ms
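A quick way to check that the cancellations in the log cluster around a single duration is to pull the numbers out of the rows. The sketch below assumes pipe-delimited rows exactly as pasted here; the helper name `cancelled_durations_ms` is hypothetical. The two sample rows are copied from the log.

```python
import re

# Matches the message emitted for a cancelled request, capturing the duration.
CANCELLED_RE = re.compile(r"GET / CANCELLED (\d+(?:\.\d+)?)ms")


def cancelled_durations_ms(rows):
    """Return the duration in ms of every cancelled GET found in `rows`."""
    durations = []
    for row in rows:
        message = row.split("|")[-1].strip()  # the message is the last column
        match = CANCELLED_RE.search(message)
        if match:
            durations.append(float(match.group(1)))
    return durations


rows = [
    "I | 2025-03-20, 0:02:19.247 | replica | d486af45-d50e-468f-903f-d4bf723ea143 | ip-10-0-41-135 | 6v27b6by | GET / CANCELLED 59998.3ms",
    "I | 2025-03-20, 0:02:22.534 | proxy | edeba18e-7af1-4881-a584-7f528f6d9ed2 | ip-10-0-43-65 |   | Replica(id='6v27b6by', deployment='Model', app='default') rejected request because it is at max capacity of 5 ongoing requests. Retrying request edeba18e-7af1-4881-a584-7f528f6d9ed2.",
]

if __name__ == "__main__":
    print(cancelled_durations_ms(rows))
```

Every cancellation in the log lands within a few milliseconds of 60000 ms, which would be consistent with a fixed client-side timeout of about 60 s, though the log alone does not confirm where that timeout is configured.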

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: akyang-anyscale <[email protected]>
@akyang-anyscale akyang-anyscale added the "go" label (add ONLY when ready to merge, run all tests) Mar 20, 2025
@akyang-anyscale akyang-anyscale requested a review from zcin March 20, 2025 23:46
Contributor

@zcin zcin left a comment

Is there anything else we might want to guard against, like the torch ML APIs, to be completely sure the bug isn't within Serve?

Signed-off-by: akyang-anyscale <[email protected]>
@akyang-anyscale
Contributor Author

Is there anything else we might want to guard against, like the torch ML APIs, to be completely sure the bug isn't within Serve?

Yeah, it's not the prettiest, but it could be worth it to surface any potential Serve issues faster. I wrapped them in a thread with a timeout.
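A rough sketch of what "a thread with a timeout" can look like using only the standard library. This is an illustration under assumptions, not the code in this PR: the helper `run_with_timeout` and the shared `_EXECUTOR` are hypothetical names. A module-level executor is used on purpose, because a `with ThreadPoolExecutor()` block would wait for the worker on exit and defeat the timeout.

```python
import concurrent.futures
import time

# Shared pool; a `with` block would join the worker on exit and hide the timeout.
_EXECUTOR = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def run_with_timeout(fn, *args, timeout_s=5.0, **kwargs):
    """Run `fn` in a worker thread and stop waiting after `timeout_s` seconds."""
    future = _EXECUTOR.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the call has not started; a running thread
        # cannot be interrupted and keeps going in the background.
        future.cancel()
        name = getattr(fn, "__name__", repr(fn))
        raise TimeoutError(f"{name} did not finish within {timeout_s}s") from None


if __name__ == "__main__":
    print(run_with_timeout(sum, [1, 2, 3], timeout_s=1.0))
    try:
        run_with_timeout(time.sleep, 0.3, timeout_s=0.05)
    except TimeoutError as exc:
        print(f"timed out: {exc}")
```

The caller gets an error quickly, which makes a hang visible in the test, but the underlying call is not stopped. That trade-off is part of why this reads as a stopgap rather than a fix.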

Signed-off-by: akyang-anyscale <[email protected]>
Contributor

@abrarsheikh abrarsheikh left a comment

I have a feeling that this is a symptom of a larger issue and that this change is masking some underlying problem.

Okay to ship it as a stopgap, but let's file a follow-up and link that ticket in the diff.

Also, please include the tracebacks that were observed in the test failure.

@akyang-anyscale
Contributor Author

akyang-anyscale commented Mar 21, 2025

I have a feeling that this is a symptom of a larger issue and that this change is masking some underlying problem.

Okay to ship it as a stopgap, but let's file a follow-up and link that ticket in the diff.

Yeah, there may be a larger problem we don't know about yet. Hopefully this PR will tell us whether the flakiness is caused by the resnet_50 application code or by some hidden bug in Serve. This PR focuses on eliminating the former as a possibility, so if the test still flakes, there's likely a bug in the replica Serve code. The theory right now is that the resnet_50 application is blocking (on either requests.get or some torch operation), causing the replica to hang and the test to occasionally fail.
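To make the theory concrete, here is a small standalone sketch (plain asyncio, no Serve involved) of how a blocking call inside an async handler starves everything else on the event loop, and how offloading it to a worker thread avoids that. All names (`count_ticks`, `handle_request`, `slow_call`) are hypothetical, and `time.sleep` stands in for a stalled download or a long torch operation.

```python
import asyncio
import time


async def count_ticks(stop, interval_s=0.01):
    """Count how many sleep intervals complete before `stop` is set."""
    ticks = 0
    while not stop.is_set():
        await asyncio.sleep(interval_s)
        ticks += 1
    return ticks


async def handle_request(blocking_fn, offload):
    """Run `blocking_fn` inline or in a worker thread; return ticks seen meanwhile."""
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)  # give the ticker a chance to start
    if offload:
        # The event loop stays free while the worker thread blocks.
        await asyncio.get_running_loop().run_in_executor(None, blocking_fn)
    else:
        # Nothing else on the loop can run until this returns.
        blocking_fn()
    stop.set()
    return await ticker


def slow_call():
    time.sleep(0.3)  # stand-in for a stalled requests.get or a long torch op


if __name__ == "__main__":
    inline = asyncio.run(handle_request(slow_call, offload=False))
    offloaded = asyncio.run(handle_request(slow_call, offload=True))
    print(f"ticks while blocked inline: {inline}, while offloaded: {offloaded}")
```

When the call runs inline, the ticker barely advances, which is the same shape as a replica that stops making progress on its other requests. Offloading keeps the loop responsive but does not bound the call itself, so it would still need a timeout.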

@zcin zcin merged commit a42e658 into ray-project:master Mar 21, 2025
5 checks passed
Labels
go add ONLY when ready to merge, run all tests
3 participants