Commit af6a1dc
[core] Fix RAY_CHECK failure during shutdown due to plasma store race condition (#55367)
## Why are these changes needed?
Workers crash with a fatal `RAY_CHECK` failure when the plasma store
connection is broken during shutdown, causing the following error:
```
RAY_CHECK failed: PutInLocalPlasmaStore(object, object_id, true) Status not OK: IOError: Broken pipe
```
Stacktrace:
```
core_worker.cc:720 C Check failed: PutInLocalPlasmaStore(object, object_id, true) Status not OK: IOError: Broken pipe
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x141789a) [0x7924dd2c689a] ray::operator<<()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x479) [0x7924dd2c9319] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x95cc8a) [0x7924dc80bc8a] ray::core::CoreWorker::CoreWorker()::{lambda()#13}::operator()()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core11TaskManager27MarkTaskReturnObjectsFailedERKNS_17TaskSpecificationENS_3rpc9ErrorTypeEPKNS5_12RayErrorInfoERKN4absl12lts_2023080213flat_hash_setINS_8ObjectIDENSB_13hash_internal4HashISD_EESt8equal_toISD_ESaISD_EEE+0x679) [0x7924dc868f29] ray::core::TaskManager::MarkTaskReturnObjectsFailed()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core11TaskManager15FailPendingTaskERKNS_6TaskIDENS_3rpc9ErrorTypeEPKNS_6StatusEPKNS5_12RayErrorInfoE+0x416) [0x7924dc86f186] ray::core::TaskManager::FailPendingTask()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x9a90e6) [0x7924dc8580e6] ray::core::NormalTaskSubmitter::RequestNewWorkerIfNeeded()::{lambda()#1}::operator()()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray3rpc14ClientCallImplINS0_23RequestWorkerLeaseReplyEE15OnReplyReceivedEv+0x68) [0x7924dc94aa48] ray::rpc::ClientCallImpl<>::OnReplyReceived()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvvEZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E9_M_invokeERKSt9_Any_data+0x15) [0x7924dc79e285] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd9b4c8) [0x7924dcc4a4c8] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd4648e) [0x7924dcbf548e] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd46906) [0x7924dcbf5906] boost::asio::detail::completion_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f417b) [0x7924dd2a317b] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f5af9) [0x7924dd2a4af9] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f6202) [0x7924dd2a5202] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0x91) [0x7924dc793a61] ray::core::CoreWorker::RunIOService()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xcba0b0) [0x7924dcb690b0] thread_proxy
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7924dde71ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7924ddf03850]
```
Stack trace flow:
1. Task lease request fails ->
`NormalTaskSubmitter::RequestNewWorkerIfNeeded()` callback.
2. Triggers `TaskManager::FailPendingTask()` ->
`MarkTaskReturnObjectsFailed()`.
3. System attempts to store error objects in plasma via
`put_in_local_plasma_callback_`.
4. Plasma connection is broken (raylet/plasma store already shut down).
5. `RAY_CHECK_OK()` in the callback causes fatal crash instead of
graceful handling.
Root Cause:
This is a shutdown ordering race condition:
1. Raylet shuts down first: The raylet stops its IO context
([main_service_.stop()](https://github.com/ray-project/ray/blob/77c5475195e56a26891d88460973198391d20edf/src/ray/object_manager/plasma/store_runner.cc#L146))
which closes plasma store connections.
2. Worker still processes callbacks: Core worker continues processing
pending callbacks on separate threads.
3. Broken connection: When the callback tries to store error objects in
plasma, the connection is already closed.
4. Fatal crash: The `RAY_CHECK_OK()` treats this as an unexpected error
and crashes the process.
Fix:
1. Shutdown-aware plasma operations
- Add `CoreWorker::IsShuttingDown()` method to check shutdown state.
- Skip plasma operations entirely when shutdown is in progress.
- Prevents attempting operations on already-closed connections.
2. Targeted error handling for connection failures
- Replace blanket `RAY_CHECK_OK()` with specific error type checking.
- Handle connection errors (Broken pipe, Connection reset, Bad file
descriptor) as warnings during shutdown scenarios.
- Maintain `RAY_CHECK_OK()` for other error types to catch real issues.
---------
Signed-off-by: Sagar Sumit <[email protected]>
Signed-off-by: Douglas Strodtman <[email protected]>1 parent 22175ae commit af6a1dc
File tree
5 files changed
+376
-39
lines changed- src/ray/core_worker
- tests
5 files changed
+376
-39
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1494 | 1494 | | |
1495 | 1495 | | |
1496 | 1496 | | |
| 1497 | + | |
| 1498 | + | |
| 1499 | + | |
| 1500 | + | |
| 1501 | + | |
| 1502 | + | |
| 1503 | + | |
| 1504 | + | |
| 1505 | + | |
| 1506 | + | |
| 1507 | + | |
| 1508 | + | |
| 1509 | + | |
1497 | 1510 | | |
1498 | 1511 | | |
1499 | 1512 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
423 | 423 | | |
424 | 424 | | |
425 | 425 | | |
426 | | - | |
427 | | - | |
| 426 | + | |
| 427 | + | |
| 428 | + | |
| 429 | + | |
| 430 | + | |
| 431 | + | |
| 432 | + | |
| 433 | + | |
428 | 434 | | |
429 | 435 | | |
430 | 436 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
533 | 533 | | |
534 | 534 | | |
535 | 535 | | |
536 | | - | |
537 | | - | |
538 | | - | |
539 | | - | |
| 536 | + | |
| 537 | + | |
| 538 | + | |
| 539 | + | |
540 | 540 | | |
541 | 541 | | |
542 | 542 | | |
| |||
579 | 579 | | |
580 | 580 | | |
581 | 581 | | |
582 | | - | |
| 582 | + | |
| 583 | + | |
| 584 | + | |
| 585 | + | |
| 586 | + | |
| 587 | + | |
| 588 | + | |
| 589 | + | |
| 590 | + | |
583 | 591 | | |
584 | 592 | | |
585 | 593 | | |
| |||
813 | 821 | | |
814 | 822 | | |
815 | 823 | | |
816 | | - | |
817 | | - | |
818 | | - | |
819 | | - | |
| 824 | + | |
| 825 | + | |
| 826 | + | |
| 827 | + | |
| 828 | + | |
| 829 | + | |
| 830 | + | |
| 831 | + | |
| 832 | + | |
820 | 833 | | |
821 | 834 | | |
822 | 835 | | |
| |||
900 | 913 | | |
901 | 914 | | |
902 | 915 | | |
903 | | - | |
904 | | - | |
905 | | - | |
906 | | - | |
907 | | - | |
908 | | - | |
| 916 | + | |
| 917 | + | |
| 918 | + | |
| 919 | + | |
| 920 | + | |
| 921 | + | |
| 922 | + | |
| 923 | + | |
| 924 | + | |
| 925 | + | |
| 926 | + | |
| 927 | + | |
909 | 928 | | |
| 929 | + | |
| 930 | + | |
| 931 | + | |
| 932 | + | |
| 933 | + | |
| 934 | + | |
| 935 | + | |
| 936 | + | |
| 937 | + | |
| 938 | + | |
| 939 | + | |
910 | 940 | | |
911 | 941 | | |
912 | 942 | | |
913 | 943 | | |
914 | 944 | | |
915 | 945 | | |
916 | | - | |
917 | | - | |
918 | | - | |
919 | | - | |
| 946 | + | |
| 947 | + | |
| 948 | + | |
| 949 | + | |
| 950 | + | |
| 951 | + | |
| 952 | + | |
| 953 | + | |
| 954 | + | |
| 955 | + | |
| 956 | + | |
| 957 | + | |
| 958 | + | |
| 959 | + | |
| 960 | + | |
| 961 | + | |
| 962 | + | |
| 963 | + | |
920 | 964 | | |
921 | 965 | | |
922 | 966 | | |
| |||
1040 | 1084 | | |
1041 | 1085 | | |
1042 | 1086 | | |
1043 | | - | |
1044 | | - | |
1045 | | - | |
1046 | | - | |
| 1087 | + | |
| 1088 | + | |
| 1089 | + | |
| 1090 | + | |
| 1091 | + | |
| 1092 | + | |
| 1093 | + | |
| 1094 | + | |
| 1095 | + | |
| 1096 | + | |
1047 | 1097 | | |
1048 | 1098 | | |
1049 | 1099 | | |
| |||
1454 | 1504 | | |
1455 | 1505 | | |
1456 | 1506 | | |
| 1507 | + | |
| 1508 | + | |
| 1509 | + | |
1457 | 1510 | | |
1458 | | - | |
1459 | | - | |
1460 | | - | |
| 1511 | + | |
| 1512 | + | |
| 1513 | + | |
| 1514 | + | |
| 1515 | + | |
1461 | 1516 | | |
1462 | 1517 | | |
1463 | 1518 | | |
1464 | 1519 | | |
| 1520 | + | |
1465 | 1521 | | |
1466 | | - | |
1467 | | - | |
1468 | | - | |
| 1522 | + | |
| 1523 | + | |
| 1524 | + | |
| 1525 | + | |
| 1526 | + | |
1469 | 1527 | | |
1470 | 1528 | | |
1471 | 1529 | | |
| |||
1488 | 1546 | | |
1489 | 1547 | | |
1490 | 1548 | | |
1491 | | - | |
| 1549 | + | |
| 1550 | + | |
| 1551 | + | |
| 1552 | + | |
| 1553 | + | |
1492 | 1554 | | |
1493 | 1555 | | |
1494 | 1556 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
14 | 14 | | |
15 | 15 | | |
16 | 16 | | |
| 17 | + | |
17 | 18 | | |
18 | 19 | | |
19 | 20 | | |
| |||
25 | 26 | | |
26 | 27 | | |
27 | 28 | | |
| 29 | + | |
28 | 30 | | |
29 | 31 | | |
30 | 32 | | |
| |||
42 | 44 | | |
43 | 45 | | |
44 | 46 | | |
45 | | - | |
| 47 | + | |
46 | 48 | | |
47 | 49 | | |
48 | 50 | | |
| |||
608 | 610 | | |
609 | 611 | | |
610 | 612 | | |
611 | | - | |
612 | | - | |
613 | | - | |
614 | | - | |
615 | | - | |
616 | | - | |
| 613 | + | |
| 614 | + | |
| 615 | + | |
| 616 | + | |
| 617 | + | |
| 618 | + | |
617 | 619 | | |
618 | 620 | | |
619 | 621 | | |
| |||
0 commit comments