
Release test long_running_many_actor_tasks failed #40568

Closed · vitsai opened this issue Oct 23, 2023 · 12 comments
Labels: bug (Something that is supposed to be working, but isn't), core (Issues that should be addressed in Ray Core), P0 (Issues that should be fixed in short order), ray 2.8, release-blocker (P0 Issue that blocks the release)

Comments

vitsai (Contributor) commented Oct 23, 2023

https://buildkite.com/ray-project/release-tests-branch/builds/2282#018b5007-4248-48e1-b2af-40f41b7ba51f

vitsai added the release-blocker and triage labels on Oct 23, 2023
anyscalesam added the core label on Oct 23, 2023
anyscalesam (Contributor):

@vitsai is this blocking the Ray 2.8 release?

jjyao added the P0 label on Oct 23, 2023
jjyao assigned rynewang and unassigned scv119 on Oct 23, 2023
vitsai (Contributor, Author) commented Oct 23, 2023

I believe all failed release tests on the release branch are blockers, yes.

hora-anyscale removed the triage label on Oct 24, 2023
jjyao added the ray 2.8 label on Oct 24, 2023
rynewang (Contributor):

ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.

https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_e9jfg33ubd76vpgh6vjtuctfft

jjyao added the bug label on Oct 24, 2023
rynewang (Contributor) commented Oct 24, 2023

Logs for the OOM after a good 6 hours of running the many_actor_tasks.py workload:

Memory on the node (IP: 10.0.30.108, ID: 5a2a082d5ae843d5f7a75365957cbc63ea1ee5c18b112b7e568553b8) where the task (actor ID: 7610140d413102328133bef401000000, name=Actor.__init__, pid=1900, memory used=0.06GB) was running was 27.57GB / 28.80GB (0.95745), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: a05567e1855b961abd91ad70da14b4c14774c76da94cd9bdcbd8b556) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 10.0.30.108`. To see the logs of the worker, use `ray logs worker-a05567e1855b961abd91ad70da14b4c14774c76da94cd9bdcbd8b556*out -ip 10.0.30.108. Top 10 memory users:
PID     MEM(GB) COMMAND
912     22.01   /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s...
897     0.23    python workloads/many_actor_tasks.py
179     0.18    /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s...
288     0.08    /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/agen...
85      0.07    /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/anyscale session web_terminal_server --deploy...
290     0.07    /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/runti...
230     0.06    /home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/dashboa...
836     0.06    ray::JobSupervisor
73      0.06    /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/jupyter-lab --allow-root --ip=127.0.0.1 --no-...
1899    0.06    ray::Actor
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
Unexpected error occurred: Task was killed due to the node running low on memory.
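
As a side note on the last part of that message: the kill threshold and the monitor's refresh interval are controlled by the environment variables it names, which have to be set before Ray starts. A minimal sketch with illustrative values (not a recommendation for this workload):

```python
import os

# Illustrative values only; RAY_memory_usage_threshold and
# RAY_memory_monitor_refresh_ms are the variables named in the OOM message above.
os.environ["RAY_memory_usage_threshold"] = "0.98"   # raise the kill threshold from 0.95
os.environ["RAY_memory_monitor_refresh_ms"] = "0"   # 0 disables worker killing entirely

import ray

# The locally launched raylet inherits this environment, so the memory monitor
# picks up the settings; for an existing cluster they would instead need to be
# set when running `ray start`.
ray.init()
```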

The gcs_server eats 22GB of memory, which OOMs the whole node. Looking at the dashboard:

[Grafana screenshot: node memory usage over time]

The node has a steady 15.6GB memory usage until the very last minute (2023-10-21 10:38), when memory jumps to 27.8GB and Ray OOMs.

Q:

  • Why is the OOM killer reporting 22GB vs the 27.8GB seen on the dashboard?
  • Why is the GCS server eating a lot more memory at that point, given that the workload is steady (an infinite loop making hundreds of 1MB method calls)?

Note: this grafana screenshot is for the unit test control plane session, not the real workload session.
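
For context on the workload shape, here is a minimal sketch of an "infinite loop of hundreds of 1MB actor method calls"; the actual many_actor_tasks.py will differ in details such as actor count and batch size:

```python
import numpy as np
import ray

ray.init()

@ray.remote
class Actor:
    def echo(self, payload):
        # Each call becomes an actor task whose metadata Ray tracks.
        return payload

actors = [Actor.remote() for _ in range(16)]  # actor count is illustrative
payload = np.zeros(1024 * 1024 // 8)          # ~1MB of float64 per call

while True:
    # Hundreds of small actor method calls per iteration, forever.
    refs = [a.echo.remote(payload) for a in actors for _ in range(16)]
    ray.get(refs)
```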

rkooo567 (Contributor) commented Oct 24, 2023

@rynewang is it possible to run this test on 2.7 to see if it shows similar memory usage?

rkooo567 (Contributor):

I think the risk is that it is a memory leak introduced in 2.8

rynewang (Contributor):

Running the workload side by side: 2.7.1optimized vs. 3.0 (a random commit on master).

rkooo567 (Contributor):

Btw, for the answers:

> why is OOM killer reporting 22GB vs 27.8GB?

27.57GB / 28.80GB (0.95745) -> actually this seems correct.

> why is gcs server eating a lot more mem at that time, since the workload is steady (infinite loop of making hundreds of 1MB method calls)?

Maybe you can also post the log of gcs_server.out when this happens?

rkooo567 (Contributor):

Was there any failure or something like that from the actors? When actors only run tasks, they should not touch the GCS at all.

rynewang assigned rickyyx and unassigned rynewang on Oct 26, 2023
rickyyx (Contributor) commented Oct 26, 2023

Here is the full range of commits, with this being the most likely culprit:
3e8278d [core][dashboard] Task backend GC policy - GCS refactor [2/3] (#38792)

25b57d0 add dask version (#40537)
0fc52f7 Change version numbers in 2.8 release (#40515)
dfdd43c pick of #40525
e6c6705 pick of #40525
96efc33 make sure tests run serially within each docker (#40509)
add9561 Increase timeout for test basic 4 (#40492)
3e8278d [core][dashboard] Task backend GC policy - GCS refactor [2/3] (#38792)
cfdc6e0 [ml] remove alpa release tests (#40510)
66cfec9 Revert "[core] Fix placement groups scheduling when no resources are specified (#39946)" (#40506)
a1ac74f [ci] Change the owner of cluster launcher related tests to clusters team. (#40424)
2d72609 [Doc] [KubeRay] [RayJob] Add info about submitter pod template (#40158)
3c0476a [ci] support gpu core assignment per test shard
af332f4 [core] Fix placement groups scheduling when no resources are specified (#39946)
b5ef0ae [serve] Add microbenchmarks for streaming HTTP and DeploymentHandle calls (#40498)
f9de855 [Data] Move _fetch_metadata_parallel to file_meta_provider (#40295)
318fd57 [Data] Fix bug where _StatsActor errors with PandasBlock (#40481)
d8f2527 [ci] move train/serve/default minimal tests to civ2 (#40454)
1fb6147 [data] add dataset name (#40430)
f3bc522 [Data] Remove deprecated do_write (#40422)
7c44833 [RLlib] Issue 39586: Fix dict space restoration from serialized (ordered dict vs normal dict provided by user). (#39627)
7fa1c28 [serve] Migrate workflow tests using v1 api (#40472)
f4b5f6b [owners] remove code owners that are no longer active. (#40476)
b83b591 [Cluster launcher] [vSphere] avoid to fetch private ip (#40204)
9ba85ae [Serve] Get rid of ray cluster setup for test_schema test (#40469)
bdc9f83 [RLlib] Add on_checkpoint_loaded callback AND also store eval workers' policy_mapping_fn in algo state. (#40350)
4f6c28f [Serve] [Docs] Update Serve docs to use the dashboard head instead of the agent (#40474)
1845f1c [Serve] Support arg builders with non-Pydantic type hints (#40471)
a6bc5ac [RLlib] Issue 40312: Better documentation on how to do inference with DreamerV3 (once trained). (#40448)
d28f645 [core] Error check Redis get requests (#40333)
56f6adc [core] Fix placement group invariant of PlacementResources being superset of Resources (#40038)
149536d migrate rllib gpu tests to civ2 (#40439)
f077834 [ci] support debug builds (#40466)
1a08c16 [Train][Templates] Add LoRA support to Llama-2 finetuning example (#37794)
d6baf12 [core] Fix session key check (#40468)
f905171 [Doc] Fix streaming generator doc code #40447
bf1c581 [Data][Docs] Add Dataset.write_sql to API reference (#40473)
779c08a [Train] Update Lightning RayDDPStrategy docstring (#40376)
819733f [Data] Improve error message when reading HTTP files (#40462)
8e86f25 [serve] Fix linkcheck + remove deprecated rest api (#40464)
4d93e37 [Data] Deflake Data CI test suites: test_stats, test_streaming_executor, test_object_gc (#40457)
b60c172 [Core] [runtime env] Fix get_wheel_filename being out of date (#39965)
d80fd1d [data] link dataset ids in constructor, return correct metrics id for materialize (#40413)
7c5b275 [ci] move ray on spark test to civ2 (#40438)
bfe026f [civ2][gpu/4] migrate rllib multi-gpu tests to civ2 (#40379)
c6347d5 [serve] Remove custom FastAPI encoders (#40449)
b5eae24 minimal (#40433)
3052a8d migrate ml tests to civ02 (#40440)
c44765a [ci] mark dataset_shuffle_push_based_random_shuffle_100tb.aws as unstable (#40437)
3e13e7c [serve] Remove extra comment (#40441)
8f5cd61 [data] ray data dashboard config (#40195)
2da60b7 [runtime_env]: Remove hypen from profiler config (#40395)
5f832b3 [core][streaming][python] Fix asyncio.wait coroutines args deprecated warnings #40292
8a7d674 [ci] migrate debug + asan core builds to civ2 (#40418)
512e6ad [civ2][gpu/3] create rllib gpu builds (#40364)
6a5215c [unjail] test_redis_tls (#40423)
6ba659f [Data] Cap op concurrency with exponential ramp-up (#40275)
929b445 [tune] remove test_client.py (#40415)
b3c1424 move serve release tests to civ2 (#40414)
5205a2d [ci] change to oss tags (#40428)
f10c259 Revert "[docker image] use buildkit to build ray image (#40365)" (#40427)
11fb194 [Data] Move BlockWritePathProvider to separate file (#40302)
56b72a5 [Data] Remove out-of-date Data examples (#40127)
c91ee0f [tune] remove TuneRichReporter (#40169)
4d0c05b release test infra still relies on python 3.7 (#40407)
9d9e7c3 [serve] Clean up test_metrics.py::test_queued_queries_disconnect (#40410)
58b2614 [serve] Migrate v1 api release tests (#40372)
ff58667 [serve] Outdated API cleanup in docs (#40404)
c16082f [ci] add special tag for ray and ray-ml image steps (#40394)
bc80271 Update CODEOWNERS (#40268)
ac73b15 [serve] Fix flaky test_autoscaling_policy on windows (#40411)
c4ad3d5 [jobs] Fix recovery race condition in JobManager (#40068)
7bd1d3a [Data] Deprecate extraneous Dataset parameters and methods (#40385)
405e82a [RLlib] Issue deprecation warnings for all rllib_contrib algos. (#40147)
e28e2a6 [serve] Deprecate DAG API (#40290)
f787843 [Doc][KubeRay] Add a section for Redis cleanup (#40308)
71e893c [doc] Add vSphere version requirement in user guide (#40284)
67ec447 [Doc] Logging: Add Fluent Bit DaemonSet and CloudWatch for persistence (#39895)
bde327f [Jobs] append error trace to job driver logs (#40380)
a8b24da [Serve] Fix the benchmark import error (#40381)
574eb54 [serve] Migrate v1 api tests (#40363)
16da484 [Core] Introduce AcceleratorManager interface (#40286)
56affb7 [RLlib] Fix BC release test failure. (#40371)
dc944fe [Dependencies] Remove pickle5 backport (#40338)
58cd807 [dashboard] Remove /api/snapshot endpoint (#40269)
a2ef28d [Core] Bugfix/runtime agent binding (#40092) (#40311)
8fa1565 [deflakey] Deflakey test_redis_tls (#40378)
820aad1 [civ2][gpu/2] migrate ml gpu tests to civ2 (#40362)
ad7e1fc [Train] Deprecate TransformersTrainer (#40277)
89eb6da [Train] Update checkpoint path for RayTrainReportCallbacks. (#40174)
dd6eb71 [Dependencies] Remove typing_extensions (#40336)
4113ab4 [runtime env]: Integrating Nsight to Ray worker process (#39998)
c6baff2 [Train] Fix lightning 2.0 import path (#40266)
199b6ca [RLlib][Docs] Add mobile-env to RLlib community examples (#37641)
1a286fd [Train] Deprecate AccelerateTrainer (#40274)
b3c7af5 fix (#40374)
40275a9 [KubeRay][Autoscaler] Make KubeRay CRD version configurable (#40357)
5c3f100 [serve] Deprecate single app config (#40329)
0c06bb9 [data] store ray dashboard metrics in _StatsActor (#40118)
941ac71 [serve] Fix deploy config edge case bug (#40326)
3d2d4fe [data] Allow setting target max block size per-op instead of per-Dataset and reduce for streaming maps (#39710)
da5046e [docker image] use buildkit to build ray image (#40365)
d9e24f2 [civ2][gpu/1] create ml gpu builds (#40322)
563a9bf Jail //python/ray/tests:test_redis_tls (#40366)
09d4f0a [Data] Fix return type and docstring for iter APIs (#40361)
c49b8ed [Data] Fix documentation link for local shuffle (#40291)
8310ce1 [Data] Remove BulkExecutor code path (#40200)
56337e0 [data] Add function arg params to map and flat_map (#40010)
ba581a3 [serve] Initial pydantic>=2.0 compatibility (#40222)
b31a5aa [serve] Remove v1 api (#40218)
4ab0ba0 [Data] Remove FileMetadataShuffler (#40341)
306c714 [Doc] Streaming generator alpha doc (#39914)
8d286f0 [RLlib-contrib] Dreamer(V1) (won't be moved into rllib_contrib, b/c we now have DreamerV3). (#36621)
f097cd4 [RLlib] Remove some deprecation warnings that should not be there. (#39984)


rickyyx (Contributor) commented Oct 26, 2023

So it's confirmed that 3e8278d is the root cause.

Reason:

  • In that PR we started tracking all lost tasks in the GCS (so that we can report data loss at task-level granularity).
  • With more than 100M actor tasks generated over the lifetime of the job, this tracking data grows larger and larger.
  • We do have GC for this info, but it only runs when the job finishes.
  • As a result, a long-running job that generates many tasks gradually blows up GCS memory.

The only unknown is why, on the metrics page, this shows up as a sudden burst rather than a slow increase. But the bisection is fairly conclusive.
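
A minimal sketch of the failure mode (not Ray's actual GCS code): per-task records accumulate in a job-scoped store that is only cleared when the job finishes, so a job that never finishes grows without bound.

```python
from collections import defaultdict

# Not Ray internals; just the shape of the leak described above.
task_records = defaultdict(list)  # job_id -> a record for every tracked task

def record_task(job_id: str, task_id: str) -> None:
    task_records[job_id].append(task_id)  # grows with every task the job runs

def on_job_finished(job_id: str) -> None:
    task_records.pop(job_id, None)  # the only point where memory is reclaimed

# A long-running job keeps calling record_task() (~100M times here) and never
# reaches on_job_finished(), so the store keeps growing until the node OOMs.
```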

vitsai (Contributor, Author) commented Oct 26, 2023

This is merged into the release branch; can we close it soon?
