-
-
Notifications
You must be signed in to change notification settings - Fork 11.8k
Drop flaky test_healthcheck_response_time #22539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Drop flaky test_healthcheck_response_time #22539
Conversation
This test is flaky in CI. https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests/c4322c0f-95c8-8ab8-a80f-01a25f114d5e?period=28days This test is very timing sensitive and may fail depending on what system it is run on, or what else is going on in the system under test. If we ever try to run more tests in parallel, it will make this worse. One example failure has a message like this: [2025-08-04T10:45:00Z] > assert load_response_time < 0.1 [2025-08-04T10:45:00Z] E assert 0.10481084599996393 < 0.1 The error here is that the response time took just over 0.1 seconds. We could adjust the timing parameters, but I don't think adjusting to where it won't be flaky will result in a test we get enough value out of. I propose just dropping it. Signed-off-by: Russell Bryant <[email protected]>
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request removes a flaky test test_healthcheck_response_time. While I understand the motivation to remove flaky tests from CI, I believe this particular test serves a valuable purpose in guarding against a potential denial-of-service vulnerability. Instead of removing it, I recommend fixing the flakiness. My review provides a detailed comment on why the test is important and how it could be improved.
|
Agreed. There can be a better way to test this. In addition, the behavior it is testing shouldn’t be an issue anymore as we don’t run blocking ops in api server. |
|
cc @njhill |
Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>
Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Noam Gat <[email protected]>
Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Paul Pak <[email protected]>
Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Diego-Castan <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Xiao Yu <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
This test is flaky in CI.
https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests/c4322c0f-95c8-8ab8-a80f-01a25f114d5e?period=28days
This test is very timing sensitive and may fail depending on what
system it is run on, or what else is going on in the system under
test. If we ever try to run more tests in parallel, it will make this
worse.
One example failure has a message like this:
[2025-08-04T10:45:00Z] > assert load_response_time < 0.1
[2025-08-04T10:45:00Z] E assert 0.10481084599996393 < 0.1
The error here is that the response time took just over 0.1 seconds.
We could adjust the timing parameters, but I don't think adjusting to
where it won't be flaky will result in a test we get enough value out
of. I propose just dropping it.
Signed-off-by: Russell Bryant [email protected]