[core] Serve microbenchmarks occasionally crash with segfault or invalid memory access #50802

Open
edoakes opened this issue Feb 21, 2025 · 3 comments
Labels: bug, core, P0

edoakes commented Feb 21, 2025

The Serve serve_microbenchmark.aws test has been failing periodically with some very nasty stack traces related to either a segfault or SIGABRT due to a malloc-related issue.

Example: https://buildkite.com/ray-project/release/builds/30788#019487a6-dd46-4e84-bb84-66b09f24ab97/787-841

Through trial and error I've found:

edoakes added the bug, core, and P0 labels Feb 21, 2025
edoakes self-assigned this Feb 21, 2025

edoakes commented Feb 21, 2025

I am able to reproduce this issue by running the handle throughput microbenchmark in a loop using the same instance type as the release test.

From the core dump, here is the full stack trace:

#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=129941970589248) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=129941970589248) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=129941970589248, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x000076386e07c476 in __GI_raise (sig=6) at ../sysdeps/posix/raise.c:26
#4  <signal handler called>
#5  __pthread_kill_implementation (no_tid=0, signo=6, threadid=129941970589248) at ./nptl/pthread_kill.c:44
#6  __pthread_kill_internal (signo=6, threadid=129941970589248) at ./nptl/pthread_kill.c:78
#7  __GI___pthread_kill (threadid=129941970589248, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#8  0x000076386e07c476 in __GI_raise (sig=6) at ../sysdeps/posix/raise.c:26
#9  <signal handler called>
#10 __pthread_kill_implementation (no_tid=0, signo=6, threadid=129941970589248) at ./nptl/pthread_kill.c:44
#11 __pthread_kill_internal (signo=6, threadid=129941970589248) at ./nptl/pthread_kill.c:78
#12 __GI___pthread_kill (threadid=129941970589248, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#13 0x000076386e07c476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#14 0x000076386e0627f3 in __GI_abort () at ./stdlib/abort.c:79
#15 0x000076386e0c3677 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x76386e215b77 "%s\n") at ../sysdeps/posix/libc_fatal.c:156
#16 0x000076386e0dacfc in malloc_printerr (str=str@entry=0x76386e2187d8 "free(): invalid next size (normal)") at ./malloc/malloc.c:5664
#17 0x000076386e0dccdc in _int_free (av=0x762e7c000030, p=0x762e7c0928b0, have_lock=<optimized out>) at ./malloc/malloc.c:4596
#18 0x000076386e0df453 in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3391
#19 0x000076386caeb76d in absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashSetPolicy<ray::ObjectID>, absl::lts_20230802::hash_internal::Hash<ray::ObjectID>, std::equal_to<ray::ObjectID>, std::allocator<ray::ObjectID> >::prepare_insert(unsigned long) () from /home/ray/anaconda3/lib/python3.9/site-packages/ray/_raylet.so
#20 0x000076386caeb8cc in std::pair<absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashSetPolicy<ray::ObjectID>, absl::lts_20230802::hash_internal::Hash<ray::ObjectID>, std::equal_to<ray::ObjectID>, std::allocator<ray::ObjectID> >::iterator, bool> absl::lts_20230802::container_internal::raw_hash_set<absl::lts_20230802::container_internal::FlatHashSetPolicy<ray::ObjectID>, absl::lts_20230802::hash_internal::Hash<ray::ObjectID>, std::equal_to<ray::ObjectID>, std::allocator<ray::ObjectID> >::EmplaceDecomposable::operator()<ray::ObjectID, ray::ObjectID const&>(ray::ObjectID const&, ray::ObjectID const&) const () from /home/ray/anaconda3/lib/python3.9/site-packages/ray/_raylet.so
#21 0x000076386caeba40 in ray::core::CoreWorker::AsyncDelObjectRefStream(ray::ObjectID const&) () from /home/ray/anaconda3/lib/python3.9/site-packages/ray/_raylet.so
#22 0x000076386c990489 in __pyx_pw_3ray_7_raylet_10CoreWorker_159async_delete_object_ref_stream(_object*, _object* const*, long, _object*) () from /home/ray/anaconda3/lib/python3.9/site-packages/ray/_raylet.so
#23 0x000076386c95c359 in __pyx_pw_3ray_7_raylet_18ObjectRefGenerator_43__del__(_object*, _object* const*, long, _object*) () from /home/ray/anaconda3/lib/python3.9/site-packages/ray/_raylet.so
#24 0x00005dac4dcf62a4 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=9223372036854775809, args=0x762e78ff7d28, callable=0x76386a0ede10, tstate=0x762e7c00cb60) at /usr/local/src/conda/python-3.9.21/Include/cpython/abstract.h:118
#25 PyObject_CallOneArg (func=0x76386a0ede10, arg=<optimized out>) at /usr/local/src/conda/python-3.9.21/Include/cpython/abstract.h:188
#26 0x00005dac4dd8cf2f in call_unbound_noarg (self=0x762e786d97f0, func=0x76386a0ede10, unbound=<optimized out>) at /usr/local/src/conda/python-3.9.21/Objects/typeobject.c:1529
#27 slot_tp_finalize (self=0x762e786d97f0) at /usr/local/src/conda/python-3.9.21/Objects/typeobject.c:7020
#28 0x00005dac4dcd397b in PyObject_CallFinalizer (self=0x762e786d97f0) at /usr/local/src/conda/python-3.9.21/Objects/object.c:195
#29 PyObject_CallFinalizerFromDealloc (self=0x762e786d97f0) at /usr/local/src/conda/python-3.9.21/Objects/object.c:213
#30 0x00005dac4dcd036f in subtype_dealloc (self=0x762e786d97f0) at /usr/local/src/conda/python-3.9.21/Objects/typeobject.c:1275
#31 0x00005dac4dcaad86 in _Py_Dealloc (op=<optimized out>) at /usr/local/src/conda/python-3.9.21/Objects/object.c:2209
#32 _Py_DECREF (op=<optimized out>) at /usr/local/src/conda/python-3.9.21/Include/object.h:430
#33 _Py_XDECREF (op=<optimized out>) at /usr/local/src/conda/python-3.9.21/Include/object.h:497
#34 dict_dealloc (mp=0x762e787d0f40) at /usr/local/src/conda/python-3.9.21/Objects/dictobject.c:2018
#35 0x00005dac4dcd01ba in _Py_Dealloc (op=<optimized out>) at /usr/local/src/conda/python-3.9.21/Objects/object.c:2209
#36 _Py_DECREF (op=<optimized out>) at /usr/local/src/conda/python-3.9.21/Include/object.h:430
#37 subtype_dealloc (self=0x762e787ec040) at /usr/local/src/conda/python-3.9.21/Objects/typeobject.c:1331
#38 0x00005dac4dcb67c7 in _Py_Dealloc (op=<optimized out>) at /usr/local/src/conda/python-3.9.21/Objects/object.c:2209
#39 _Py_DECREF (op=<optimized out>) at /usr/local/src/conda/python-3.9.21/Include/object.h:430
#40 frame_dealloc (f=0x762e50017230) at /usr/local/src/conda/python-3.9.21/Objects/frameobject.c:585
#41 0x00005dac4dd5eeca in _Py_Dealloc (op=0x762e50017230) at /usr/local/src/conda/python-3.9.21/Objects/object.c:2209
#42 _Py_DECREF (op=0x762e50017230) at /usr/local/src/conda/python-3.9.21/Include/object.h:430
#43 gen_send_ex (gen=0x762e78711840, arg=<optimized out>, exc=<optimized out>, closing=<optimized out>) at /usr/local/src/conda/python-3.9.21/Objects/genobject.c:272
#44 0x00005dac4dcb46d8 in _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x762e78712dd0, throwflag=<optimized out>) at /usr/local/src/conda/python-3.9.21/Python/ceval.c:2202
#45 0x00005dac4dd5ed92 in _PyEval_EvalFrame (throwflag=<optimized out>, f=0x762e78712dd0, tstate=0x762e7c00cb60) at /usr/local/src/conda/python-3.9.21/Include/internal/pycore_ceval.h:40
#46 gen_send_ex (gen=0x762e787117c0, arg=<optimized out>, exc=<optimized out>, closing=<optimized out>) at /usr/local/src/conda/python-3.9.21/Objects/genobject.c:215
#47 0x00005dac4dcb46d8 in _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x762e7c082090, throwflag=<optimized out>) at /usr/local/src/conda/python-3.9.21/Python/ceval.c:2202
#48 0x00005dac4dd5ed92 in _PyEval_EvalFrame (throwflag=<optimized out>, f=0x762e7c082090, tstate=0x762e7c00cb60) at /usr/local/src/conda/python-3.9.21/Include/internal/pycore_ceval.h:40
#49 gen_send_ex (gen=0x762e78795f40, arg=<optimized out>, exc=<optimized out>, closing=<optimized out>) at /usr/local/src/conda/python-3.9.21/Objects/genobject.c:215
#50 0x000076386b745175 in task_step_impl (exc=0x0, task=0x762e7873ad40) at /usr/local/src/conda/python-3.9.21/Modules/_asynciomodule.c:2669
#51 task_step (task=task@entry=0x762e7873ad40, exc=exc@entry=0x0) at /usr/local/src/conda/python-3.9.21/Modules/_asynciomodule.c:2969
#52 0x000076386b745a41 in task_wakeup (o=<optimized out>, task=0x762e7873ad40) at /usr/local/src/conda/python-3.9.21/Modules/_asynciomodule.c:3018
#53 TaskWakeupMethWrapper_call (o=o@entry=0x762e7879d5b0, args=args@entry=0x762e78682ca0, kwds=kwds@entry=0x0) at /usr/local/src/conda/python-3.9.21/Modules/_asynciomodule.c:1882
#54 0x000076386a160ade in __Pyx_PyObject_Call (func=0x762e7879d5b0, arg=0x762e78682ca0, kw=0x0) at uvloop/loop.c:171447
#55 0x000076386a23a98c in __pyx_f_6uvloop_4loop_6Handle__run (__pyx_v_self=__pyx_v_self@entry=0x762e78688ee0) at uvloop/loop.c:60747
#56 0x000076386a23e4d5 in __pyx_f_6uvloop_4loop_4Loop__on_idle (__pyx_v_self=0x762e7c0263c0) at uvloop/loop.c:14597
#57 0x000076386a23a8f0 in __pyx_f_6uvloop_4loop_6Handle__run (__pyx_v_self=__pyx_v_self@entry=0x762e8a0e19d0) at uvloop/loop.c:60773
#58 0x000076386a23fdc8 in __pyx_f_6uvloop_4loop_cb_idle_callback (__pyx_v_handle=<optimized out>) at uvloop/loop.c:79836
#59 0x000076386a251901 in uv__run_idle (loop=loop@entry=0x762e7c024040) at src/unix/loop-watcher.c:68
#60 0x000076386a24eded in uv_run (loop=0x762e7c024040, mode=mode@entry=UV_RUN_DEFAULT) at src/unix/core.c:438
#61 0x000076386a1a3563 in __pyx_f_6uvloop_4loop_4Loop___run (__pyx_v_self=0x762e7c0263c0, __pyx_v_mode=UV_RUN_DEFAULT) at uvloop/loop.c:15092
#62 0x000076386a1fdf7d in __pyx_f_6uvloop_4loop_4Loop__run (__pyx_v_self=0x762e7c0263c0, __pyx_v_mode=UV_RUN_DEFAULT) at uvloop/loop.c:15471
#63 0x000076386a1f2bd5 in __pyx_pf_6uvloop_4loop_4Loop_24run_forever (__pyx_v_self=0x762e7c0263c0) at uvloop/loop.c:28166
#64 __pyx_pw_6uvloop_4loop_4Loop_25run_forever (__pyx_v_self=0x762e7c0263c0, unused=<optimized out>) at uvloop/loop.c:27987
#65 0x00005dac4dcbc5c6 in cfunction_vectorcall_NOARGS (func=0x762e898a5ae0, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.9.21/Objects/methodobject.c:489
#66 0x00005dac4dcb3e81 in do_call_core (kwdict=0x762e8a0b0f00, callargs=0x76386df9b040, func=0x762e898a5ae0, tstate=<optimized out>) at /usr/local/src/conda/python-3.9.21/Python/ceval.c:5097
#67 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x762e898a9580, throwflag=<optimized out>) at /usr/local/src/conda/python-3.9.21/Python/ceval.c:3582
#68 0x00005dac4dcbed2a in _PyEval_EvalFrame (throwflag=0, f=0x762e898a9580, tstate=0x762e7c00cb60) at /usr/local/src/conda/python-3.9.21/Include/internal/pycore_ceval.h:40
#69 function_code_fastcall (tstate=0x762e7c00cb60, co=0x76386dba2df0, args=<optimized out>, nargs=1, globals=0x76386dcd3c80) at /usr/local/src/conda/python-3.9.21/Objects/call.c:330
#70 0x00005dac4dcae984 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x762e898ab1b8, callable=0x76386dbb0310, tstate=0x762e7c00cb60) at /usr/local/src/conda/python-3.9.21/Include/cpython/abstract.h:118
#71 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x762e898ab1b8, callable=0x76386dbb0310) at /usr/local/src/conda/python-3.9.21/Include/cpython/abstract.h:127
#72 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x762e7c00cb60) at /usr/local/src/conda/python-3.9.21/Python/ceval.c:5077
#73 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x762e898ab040, throwflag=<optimized out>) at /usr/local/src/conda/python-3.9.21/Python/ceval.c:3506
#74 0x00005dac4dcbed2a in _PyEval_EvalFrame (throwflag=0, f=0x762e898ab040, tstate=0x762e7c00cb60) at /usr/local/src/conda/python-3.9.21/Include/internal/pycore_ceval.h:40
#75 function_code_fastcall (tstate=0x762e7c00cb60, co=0x76386dba5190, args=<optimized out>, nargs=1, globals=0x76386dcd3c80) at /usr/local/src/conda/python-3.9.21/Objects/call.c:330
#76 0x00005dac4dcae984 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x762e898a9538, callable=0x76386dbb05e0, tstate=0x762e7c00cb60) at /usr/local/src/conda/python-3.9.21/Include/cpython/abstract.h:118
#77 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x762e898a9538, callable=0x76386dbb05e0) at /usr/local/src/conda/python-3.9.21/Include/cpython/abstract.h:127
#78 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x762e7c00cb60) at /usr/local/src/conda/python-3.9.21/Python/ceval.c:5077
#79 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x762e898a93c0, throwflag=<optimized out>) at /usr/local/src/conda/python-3.9.21/Python/ceval.c:3506
#80 0x00005dac4dcbed2a in _PyEval_EvalFrame (throwflag=0, f=0x762e898a93c0, tstate=0x762e7c00cb60) at /usr/local/src/conda/python-3.9.21/Include/internal/pycore_ceval.h:40
#81 function_code_fastcall (tstate=0x762e7c00cb60, co=0x76386dba2ea0, args=<optimized out>, nargs=1, globals=0x76386dcd3c80) at /usr/local/src/conda/python-3.9.21/Objects/call.c:330
#82 0x00005dac4dccc363 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=1, args=0x762e78ff8da8, callable=0x76386dbb03a0, tstate=0x762e7c00cb60) at /usr/local/src/conda/python-3.9.21/Include/cpython/abstract.h:118
#83 method_vectorcall (method=<optimized out>, args=0x76386df9b058, nargsf=<optimized out>, kwnames=0x0) at /usr/local/src/conda/python-3.9.21/Objects/classobject.c:61
#84 0x00005dac4dda0662 in t_bootstrap (boot_raw=0x762e898a8660) at /usr/local/src/conda/python-3.9.21/Modules/_threadmodule.c:1054
#85 0x00005dac4dda0614 in pythread_wrapper (arg=<optimized out>) at /usr/local/src/conda/python-3.9.21/Python/thread_pthread.h:245
#86 0x000076386e0ceac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#87 0x000076386e160850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

edoakes commented Feb 21, 2025

And another core dump:

#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=128538598417984) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=128538598417984) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=128538598417984, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x000074e820c25476 in __GI_raise (sig=6) at ../sysdeps/posix/raise.c:26
#4  <signal handler called>
#5  __pthread_kill_implementation (no_tid=0, signo=6, threadid=128538598417984) at ./nptl/pthread_kill.c:44
#6  __pthread_kill_internal (signo=6, threadid=128538598417984) at ./nptl/pthread_kill.c:78
#7  __GI___pthread_kill (threadid=128538598417984, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#8  0x000074e820c25476 in __GI_raise (sig=6) at ../sysdeps/posix/raise.c:26
#9  <signal handler called>
#10 __pthread_kill_implementation (no_tid=0, signo=6, threadid=128538598417984) at ./nptl/pthread_kill.c:44
#11 __pthread_kill_internal (signo=6, threadid=128538598417984) at ./nptl/pthread_kill.c:78
#12 __GI___pthread_kill (threadid=128538598417984, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#13 0x000074e820c25476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#14 0x000074e820c0b7f3 in __GI_abort () at ./stdlib/abort.c:79
#15 0x000074e820c6c677 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x74e820dbeb77 "%s\n") at ../sysdeps/posix/libc_fatal.c:156
#16 0x000074e820c83cfc in malloc_printerr (str=str@entry=0x74e820dc1790 "double free or corruption (out)") at ./malloc/malloc.c:5664
#17 0x000074e820c85e70 in _int_free (av=0x74e820dfdc80 <main_arena>, p=0x74de30075a80, have_lock=<optimized out>) at ./malloc/malloc.c:4588
#18 0x000074e820c88453 in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3391
#19 0x000074e81f65bcca in std::_Function_handler<void (std::shared_ptr<ray::RayObject>), ray::core::CoreWorker::GetAsync(ray::ObjectID const&, std::function<void (std::shared_ptr<ray::RayObject>, ray::ObjectID, void*)>, void*)::{lambda(std::shared_ptr<ray::RayObject>)#1}>::_M_manager(std::_Any_data&, std::_Any_data const&, std::_Manager_operation) () from /home/ray/anaconda3/lib/python3.9/site-packages/ray/_raylet.so
#20 0x000074e81f539d07 in std::_Function_base::~_Function_base() () from /home/ray/anaconda3/lib/python3.9/site-packages/ray/_raylet.so
#21 0x000074e81f777c62 in std::_Function_handler<void (), ray::core::CoreWorkerMemoryStore::GetAsync(ray::ObjectID const&, std::function<void (std::shared_ptr<ray::RayObject>)>)::{lambda()#1}>::_M_manager(std::_Any_data&, std::_Any_data const&, std::_Manager_operation) ()
   from /home/ray/anaconda3/lib/python3.9/site-packages/ray/_raylet.so
#22 0x000074e81fa60fdc in std::_Function_handler<void (), instrumented_io_context::post(std::function<void ()>, std::string const&, long)::{lambda()#1}>::_M_manager(std::_Any_data&, std::_Any_data const&, std::_Manager_operation) ()
   from /home/ray/anaconda3/lib/python3.9/site-packages/ray/_raylet.so
#23 0x000074e81fa6194c in boost::asio::detail::completion_handler<std::function<void ()>, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) ()
   from /home/ray/anaconda3/lib/python3.9/site-packages/ray/_raylet.so
#24 0x000074e820127ddb in boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) ()
   from /home/ray/anaconda3/lib/python3.9/site-packages/ray/_raylet.so
#25 0x000074e820129759 in boost::asio::detail::scheduler::run(boost::system::error_code&) () from /home/ray/anaconda3/lib/python3.9/site-packages/ray/_raylet.so
#26 0x000074e820129e62 in boost::asio::io_context::run() () from /home/ray/anaconda3/lib/python3.9/site-packages/ray/_raylet.so
#27 0x000074e81f650451 in ray::core::CoreWorker::RunIOService() () from /home/ray/anaconda3/lib/python3.9/site-packages/ray/_raylet.so
#28 0x000074e81fb6e9a0 in thread_proxy () from /home/ray/anaconda3/lib/python3.9/site-packages/ray/_raylet.so
#29 0x000074e820c77ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#30 0x000074e820d09850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

edoakes commented Feb 22, 2025

The first stack trace should be fixed by #50740. I re-ran the tests many times with this fix and did not see any crashes. However, the fix doesn't seem related to the second stack trace.

Planning to downgrade this issue to a P1 after the linked PR is merged, then follow up and wrap the async callbacks we're passing to C++ in a unique_ptr or shared_ptr to proactively guard against errors like the second stack trace.
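
For illustration, here is a rough sketch of the shared_ptr idea. It is not the actual Ray code: `CallbackState`, `FakeIoContext`, and this `GetAsync` signature are hypothetical stand-ins, and the destruction counter exists only so the sketch can check that the callback state is freed exactly once. The point is that the posted closure co-owns the state, so it stays alive until the callback has run, no matter when the caller drops its reference.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <utility>

// Hypothetical callback state. The external counter records how many times
// the destructor runs so the sketch can verify "destroyed exactly once".
struct CallbackState {
  explicit CallbackState(int* destroyed) : destroyed_(destroyed) {}
  ~CallbackState() { ++*destroyed_; }
  void OnResult(int value) { last_value = value; }
  int last_value = 0;

 private:
  int* destroyed_;
};

// Minimal stand-in for an event loop that runs posted closures later.
class FakeIoContext {
 public:
  void Post(std::function<void()> fn) { queue_.push(std::move(fn)); }
  void RunAll() {
    while (!queue_.empty()) {
      auto fn = std::move(queue_.front());
      queue_.pop();
      fn();
    }
  }

 private:
  std::queue<std::function<void()>> queue_;
};

// Hypothetical async API: the closure holds a shared_ptr to the state, so the
// state cannot be freed while the callback is still pending.
void GetAsync(FakeIoContext& io, std::shared_ptr<CallbackState> state) {
  io.Post([state = std::move(state)] { state->OnResult(42); });
}

// Returns the number of times the state was destroyed, or -1 if it was
// destroyed before the callback had a chance to run.
int RunSharedCallbackDemo() {
  int destroyed = 0;
  FakeIoContext io;
  {
    auto state = std::make_shared<CallbackState>(&destroyed);
    GetAsync(io, state);
  }  // The caller's reference is gone; the pending closure keeps state alive.
  if (destroyed != 0) return -1;
  io.RunAll();  // Runs the callback, then releases the last reference.
  return destroyed;
}
```

Calling `RunSharedCallbackDemo()` from a `main` exercises the path where the caller goes away before the callback fires. A unique_ptr would express single ownership more strictly, but `std::function` requires a copyable callable, which is why this sketch uses shared_ptr.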

edoakes added a commit that referenced this issue Feb 23, 2025
The serve microbenchmark has been sporadically failing due to memory
corruption issues (see the linked issue). One of the tracebacks captured
pointed to the fact that the `deleted_generator_ids_` map was being
accessed concurrently by multiple threads. Fixed by adding a mutex.

Verified that it at least dramatically reduces the frequency of the
crashes.

I've also renamed a few fields for clarity.

## Related issue number

#50802

---------

Signed-off-by: Edward Oakes <[email protected]>
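
As a rough illustration of the fix described in the commit message above (not the actual patch), the sketch below guards a set with a mutex so that inserts from one thread cannot race with reads or clears on another. Frames #19 to #21 of the first stack trace show the container is an absl flat_hash_set of `ray::ObjectID`; `std::unordered_set` and `uint64_t` are swapped in here so the example only needs the standard library. The `DeletedIdTracker` class, its method names, and the draining consumer are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the component that tracks deleted generator IDs.
class DeletedIdTracker {
 public:
  // May be called from any thread, e.g. one running a Python finalizer.
  void MarkDeleted(std::uint64_t id) {
    std::lock_guard<std::mutex> lock(mu_);
    deleted_ids_.insert(id);
  }

  // Hypothetical consumer: returns the pending IDs and clears the set.
  // Without the lock, this could race with a rehash inside insert().
  std::vector<std::uint64_t> Drain() {
    std::lock_guard<std::mutex> lock(mu_);
    std::vector<std::uint64_t> out(deleted_ids_.begin(), deleted_ids_.end());
    deleted_ids_.clear();
    return out;
  }

 private:
  std::mutex mu_;
  std::unordered_set<std::uint64_t> deleted_ids_;  // Guarded by mu_.
};

// Inserts unique IDs from several threads while this thread drains
// concurrently. Returns the total number of IDs drained.
std::size_t RunConcurrentInserts(int num_threads, int ids_per_thread) {
  DeletedIdTracker tracker;
  std::vector<std::thread> writers;
  for (int t = 0; t < num_threads; ++t) {
    writers.emplace_back([&tracker, t, ids_per_thread] {
      for (int i = 0; i < ids_per_thread; ++i) {
        tracker.MarkDeleted(static_cast<std::uint64_t>(t) * ids_per_thread + i);
      }
    });
  }
  std::size_t drained = 0;
  // Drain while the writers may still be inserting.
  for (int i = 0; i < 100; ++i) drained += tracker.Drain().size();
  for (auto& w : writers) w.join();
  drained += tracker.Drain().size();  // Pick up anything left over.
  return drained;
}
```

With the lock in place, every inserted ID is drained exactly once, so `RunConcurrentInserts(4, 1000)` returns 4000. Removing the `lock_guard` lines makes the same program a data race, which is undefined behavior and can surface as heap corruption like the `free(): invalid next size` abort in the first trace.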