Release test long_running_many_actor_tasks failed #40568
Comments
@vitsai is this blocking the ray 2.8 release?
I believe all failed release tests on the release branch are blockers, yes.
https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_e9jfg33ubd76vpgh6vjtuctfft |
Logs for the OOM, after a good 6h of working on the workload:
The node has a steady 15.6 GB memory usage until the very last minute (2023-10-21 10:38), when the memory jumped to 27.8 GB and Ray OOMed. Q: why did the memory jump all at once at the end instead of growing slowly?
Note: this Grafana screenshot is for the unit test control plane session, not the real workload session.
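For reference, a minimal sketch of how one could sample node memory over time to see whether usage creeps up slowly or jumps at the end; this is just a psutil-based observer (assuming psutil is installed), not part of the release test itself:

```python
# Minimal memory sampler; not the release test, just a way to watch for a
# sudden jump like the one described above.
import time
import psutil

def sample_memory(interval_s=5.0, duration_s=3600.0):
    end = time.time() + duration_s
    while time.time() < end:
        used_gb = psutil.virtual_memory().used / 1e9
        print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} used={used_gb:.1f} GB")
        time.sleep(interval_s)

if __name__ == "__main__":
    sample_memory()
```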
@rynewang is it possible to run this test on 2.7 and see if it shows similar mem usage?
I think the risk is that it is a memory leak introduced in 2.8.
Running the workload side by side: 2.7.1 (optimized) vs 3.0 (a random commit on master).
Btw, for the answers:
Maybe you can also post the log of gcs_server.out when this happens?
Was there any failure or anything like that from the actors? When actors only run tasks, they should not touch the GCS at all.
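For context, a minimal sketch of the shape of workload the test name suggests (long-lived actors that only execute tasks); the actor class, task body, and counts below are illustrative assumptions, not the actual release test code:

```python
import ray

ray.init()

@ray.remote
class Worker:
    def run_task(self, i):
        # Pure computation inside the actor; executing these tasks should not
        # need to go through the GCS after the actor has been created.
        return i * i

# Illustrative sizes; the real long_running_many_actor_tasks test has its own config.
actors = [Worker.remote() for _ in range(16)]
for step in range(1_000):
    ray.get([a.run_task.remote(step) for a in actors])
```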
Here is the full range of commits, with this being the most likely culprit: 25b57d0 add dask version (#40537)
So it's confirmed that 3e8278d is the root cause. Reason:
The only unknown is why, on the metrics page, this doesn't show up as a slow increase but as a burst. But the bisection is pretty conclusive:
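As a side note, a bisection like this can be automated with `git bisect run` pointed at a check script that exits 0 for a good commit and 1 for a bad one. A sketch of such a script is below (not necessarily how it was done here); the `workload.py` name and the 20 GB threshold are illustrative assumptions:

```python
#!/usr/bin/env python3
# Hypothetical check script for `git bisect run`: exit 0 = good commit, 1 = bad.
# Assumes Ray has already been rebuilt/installed for the checked-out commit.
import subprocess
import sys
import time
import psutil

MEM_LIMIT_GB = 20.0  # illustrative threshold, not from the actual test

def main():
    # Run a shortened version of the workload and track peak node memory.
    proc = subprocess.Popen([sys.executable, "workload.py"])
    peak_gb = 0.0
    while proc.poll() is None:
        peak_gb = max(peak_gb, psutil.virtual_memory().used / 1e9)
        time.sleep(5)
    # Bad if the workload crashed (e.g. OOM) or memory blew past the limit.
    sys.exit(1 if proc.returncode != 0 or peak_gb >= MEM_LIMIT_GB else 0)

if __name__ == "__main__":
    main()
```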
This is merged into the release branch; can we close it soon?
https://buildkite.com/ray-project/release-tests-branch/builds/2282#018b5007-4248-48e1-b2af-40f41b7ba51f