
Conversation

@guoyuhong
Contributor

What do these changes do?

Skip the dead nodes when a node is newly connected to the cluster.

The time sequence is not what we expect because one callback runs inside another callback.
In Raylet::RegisterGcs, client_table().Connect is called first, and node_manager_.RegisterGcs(), which contains client_table().RegisterClientAddedCallback, is called later. If the notification_callback passed to client_table().Connect finished before client_table().RegisterClientAddedCallback ran, the logic would be correct. However, notification_callback is invoked from the callback of Append, which means client_table().RegisterClientAddedCallback can actually run ahead of notification_callback. In that case HandleNotification calls client_added_callback_ and unexpectedly tries to connect to dead nodes.

Related issue number

#4140

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12291/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12296/
Test FAILed.

Contributor

@ericl ericl left a comment

Looks great. @hartikainen could you try this out?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12295/
Test FAILed.

@hartikainen
Contributor

Are there instructions somewhere for how to build ray with the new bazel stuff?

@hartikainen
Contributor

Nevermind, after installing bazel, pip install -U --force-reinstall -e ${RAY_PATH}/python worked fine.

@guoyuhong
Contributor Author

@hartikainen Does this fix your case? @ericl It looks like a test is hanging. I need to take a look later.

@hartikainen
Contributor

This does not seem to fix my case. The autoscaling still stops and I see this on the newly started worker:

(softlearning) ubuntu@ray-20190225t132603-c871dc-halfcheetah-v2-worker-1f380a07:~/softlearning$ cat /tmp/ray/session_2019-02-25_21-37-59_2211/logs/*err
external/plasma/cpp/src/plasma/store.cc:1000: Allowing the Plasma store to use up to 3.00686GB of memory.
external/plasma/cpp/src/plasma/store.cc:1030: Starting object store with directory /dev/shm and huge page support disabled
external/plasma/cpp/src/plasma/store.cc:605: Disconnecting client on fd 8
external/plasma/cpp/src/plasma/store.cc:605: Disconnecting client on fd 9
external/plasma/cpp/src/plasma/store.cc:605: Disconnecting client on fd 14
external/plasma/cpp/src/plasma/store.cc:605: Disconnecting client on fd 16
external/plasma/cpp/src/plasma/store.cc:605: Disconnecting client on fd 12
external/plasma/cpp/src/plasma/store.cc:605: Disconnecting client on fd 15
external/plasma/cpp/src/plasma/store.cc:605: Disconnecting client on fd 10
external/plasma/cpp/src/plasma/store.cc:605: Disconnecting client on fd 11
external/plasma/cpp/src/plasma/store.cc:605: Disconnecting client on fd 13
external/plasma/cpp/src/plasma/store.cc:605: Disconnecting client on fd 7
external/plasma/cpp/src/plasma/store.cc:605: Disconnecting client on fd 5
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0225 21:37:59.788295  2216 node_manager.cc:365] [ConnectClient] Trying to connect to client 322e3ffd240cce970c1f26d3fd440c6ded1044e7 at 10.138.0.46:33831
I0225 21:37:59.790350  2216 node_manager.cc:365] [ConnectClient] Trying to connect to client 30a4c1a557b393540deca4dae2b77c01964948be at 10.138.0.38:38231
I0225 21:37:59.790514  2216 node_manager.cc:365] [ConnectClient] Trying to connect to client 1e08883f693c4c4347f5b587fe7984b1e8592c5a at 10.138.0.28:36559
I0225 21:37:59.790668  2216 node_manager.cc:365] [ConnectClient] Trying to connect to client 15bf3759ec10e9969e6eab4991e72ce71d324fb2 at 10.138.0.11:45731
I0225 21:37:59.790783  2216 node_manager.cc:365] [ConnectClient] Trying to connect to client 1a1433f4f55154aa6c9a6a5296663d4bcd15a9dc at 10.138.0.31:41949
I0225 21:37:59.790891  2216 node_manager.cc:365] [ConnectClient] Trying to connect to client 78786070cf0339fdb65ae37741d7e321e75fa213 at 10.138.0.40:45131
I0225 21:37:59.791041  2216 node_manager.cc:365] [ConnectClient] Trying to connect to client 64c0af82e81ae31be85cdf3bb479b1188ff834eb at 10.138.0.27:35763
W0225 21:40:10.463982  2216 node_manager.cc:349] Failed to connect to client 64c0af82e81ae31be85cdf3bb479b1188ff834eb in ClientAdded. TcpConnect returned status: IOError: Connection timed out. This may be caused by trying to connect to a node manager that has failed.
I0225 21:40:10.464051  2216 node_manager.cc:365] [ConnectClient] Trying to connect to client b3d8c929bd90738c3e5f6541acd89af5cb47d592 at 10.138.0.30:41053
I0225 21:40:10.465397  2216 node_manager.cc:365] [ConnectClient] Trying to connect to client a040a4928137f26f096cf9262b6575063ca77278 at 10.138.0.37:39359
I0225 21:40:10.466609  2216 node_manager.cc:365] [ConnectClient] Trying to connect to client c312606ae246e9361d19853722635a92aa9f2464 at 10.138.0.47:41003
W0225 21:40:10.466903  2216 node_manager.cc:349] Failed to connect to client c312606ae246e9361d19853722635a92aa9f2464 in ClientAdded. TcpConnect returned status: IOError: Connection refused. This may be caused by trying to connect to a node manager that has failed.
I0225 21:40:10.466917  2216 node_manager.cc:365] [ConnectClient] Trying to connect to client f952fbd7f606e2d04a7034a2470bef8d2998e618 at 10.138.0.33:39043
I0225 21:40:10.467900  2216 node_manager.cc:365] [ConnectClient] Trying to connect to client 7347e48cd0cf95536f944dc5f6c88f99ff9e44cd at 10.138.0.35:43387
I0225 21:40:10.468988  2216 node_manager.cc:365] [ConnectClient] Trying to connect to client 019b997757fce23ef0d09269adf39bec4d9915a5 at 10.138.0.32:40573
I0225 21:40:10.469877  2216 node_manager.cc:365] [ConnectClient] Trying to connect to client 0221d2de47a83fa50032ccd720fbc5ed980ed057 at 10.138.0.39:32955
I0225 21:40:10.470999  2216 node_manager.cc:365] [ConnectClient] Trying to connect to client a0af384c34e28ac8f712cedf8d07eec57e64979b at 10.138.0.36:42775
W0225 21:40:10.471735  2216 node_manager.cc:414] Received ClientRemoved callback for an unknown client bfd468ed30a27483389a1962102545aedf9f8564.
I0225 21:40:10.472003  2216 node_manager.cc:443] [HeartbeatAdded]: received heartbeat from unknown client id c312606ae246e9361d19853722635a92aa9f2464
W0225 21:40:10.472221  2216 node_manager.cc:245] Last heartbeat was sent 130685 ms ago
I0225 21:40:10.472726  2216 node_manager.cc:365] [ConnectClient] Trying to connect to client 924b5a43a8d35b338b76c6f3264bf8f42c6803da at 10.138.0.50:40501
W0225 21:40:10.476303  2216 node_manager.cc:1318] Submitted task 00000000b6c12312ae8ffa0c5cdc36e7e5f3d38f is already queued and will not be reconstructed. This is most likely due to spurious reconstruction.
W0225 21:40:10.477416  2216 node_manager.cc:1318] Submitted task 00000000e829fa0d2d596384b33124a5ec3a3fa3 is already queued and will not be reconstructed. This is most likely due to spurious reconstruction.
W0225 21:40:10.478828  2216 node_manager.cc:414] Received ClientRemoved callback for an unknown client 64c0af82e81ae31be85cdf3bb479b1188ff834eb.
F0225 21:40:10.484861  2216 node_manager.cc:394]  Check failed: client_id != gcs_client_->client_table().GetLocalClientId() Exiting because this node manager has mistakenly been marked dead by the monitor.
*** Check failure stack trace: ***
*** Aborted at 1551130810 (unix time) try "date -d @1551130810" if you are using GNU date ***
PC: @     0x7f10ed579428 gsignal
*** SIGABRT (@0x3e8000008a8) received by PID 2216 (TID 0x7f10ee5dd740) from PID 2216; stack trace: ***
@     0x7f10ee1c0390 (unknown)
@     0x7f10ed579428 gsignal
@     0x7f10ed57b02a abort
@           0x4d4139 google::logging_fail()
@           0x4d5f7a google::LogMessage::Fail()
@           0x4d702a google::LogMessage::SendToLog()
@           0x4d5ca5 google::LogMessage::Flush()
@           0x4d5ea1 google::LogMessage::~LogMessage()
@           0x4d3391 ray::RayLog::~RayLog()
@           0x454dc5 ray::raylet::NodeManager::ClientRemoved()
@           0x4a6bab ray::gcs::ClientTable::HandleNotification()
@           0x4a74cc _ZNSt17_Function_handlerIFvPN3ray3gcs14AsyncGcsClientERKNS0_8UniqueIDERKSt6vectorI16ClientTableDataTSaIS8_EEEZZNS1_11ClientTable7ConnectERKS8_ENKUlS3_S6_SG_E_clES3_S6_SG_EUlS3_S6_SC_E_E9_M_invokeERKSt9_Any_dataOS3_S6_SC_
@           0x49f156 _ZZN3ray3gcs3LogINS_8UniqueIDE15ClientTableDataE9SubscribeERKS2_S6_RKSt8functionIFvPNS0_14AsyncGcsClientES6_RKSt6vectorI16ClientTableDataTSaISB_EEEERKS7_IFvS9_EEENKUlRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEE_clESV_
@           0x4bc4c0 (anonymous namespace)::ProcessCallback()
@           0x4bd3b5 ray::gcs::SubscribeRedisCallback()
@           0x4c0ff4 redisProcessCallbacks
@           0x4bfd40 RedisAsioClient::handle_read()
@           0x4be908 boost::asio::detail::reactive_null_buffers_op<>::do_complete()
@           0x413555 boost::asio::detail::epoll_reactor::descriptor_state::do_complete()
@           0x413fcc boost::asio::detail::scheduler::run()
@           0x40b2df main
@     0x7f10ed564830 __libc_start_main
@           0x410669 _start
@                0x0 (unknown)
Ray worker pid: 2228
Traceback (most recent call last):
File "/home/ubuntu/ray/python/ray/workers/default_worker.py", line 111, in <module>
ray.worker.global_worker.main_loop()
File "/home/ubuntu/ray/python/ray/worker.py", line 1003, in main_loop
task = self._get_next_task_from_local_scheduler()
File "/home/ubuntu/ray/python/ray/worker.py", line 986, in _get_next_task_from_local_scheduler
task = self.raylet_client.get_task()
File "python/ray/_raylet.pyx", line 244, in ray._raylet.RayletClient.get_task
File "python/ray/_raylet.pyx", line 59, in ray._raylet.check_status
Exception: [RayletClient] Raylet connection closed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/ubuntu/ray/python/ray/workers/default_worker.py", line 118, in <module>
driver_id=None)
File "/home/ubuntu/ray/python/ray/utils.py", line 67, in push_error_to_driver
time.time())
File "python/ray/_raylet.pyx", line 294, in ray._raylet.RayletClient.push_error
File "python/ray/_raylet.pyx", line 59, in ray._raylet.check_status
Exception: [RayletClient] Connection closed unexpectedly.
Ray worker pid: 2222
Traceback (most recent call last):
File "/home/ubuntu/ray/python/ray/workers/default_worker.py", line 111, in <module>
ray.worker.global_worker.main_loop()
File "/home/ubuntu/ray/python/ray/worker.py", line 1003, in main_loop
task = self._get_next_task_from_local_scheduler()
File "/home/ubuntu/ray/python/ray/worker.py", line 986, in _get_next_task_from_local_scheduler
task = self.raylet_client.get_task()
File "python/ray/_raylet.pyx", line 244, in ray._raylet.RayletClient.get_task
File "python/ray/_raylet.pyx", line 59, in ray._raylet.check_status
Exception: [RayletClient] Raylet connection closed.

@guoyuhong
Contributor Author

@hartikainen Sorry about that. I will reproduce it locally and try to fix it.

@ericl ericl self-assigned this Feb 25, 2019
@guoyuhong
Contributor Author

guoyuhong commented Feb 26, 2019

According to https://github.com/ray-project/ray/blob/master/src/ray/raylet/raylet.cc#L71, client_table().Disconnect is called in the destructor of Raylet. However, at that point the main service has already stopped, so the append operation is never actually executed, and there is no entry with is_insertion=false in the client table until the raylet_monitor marks the raylet as dead. This PR moves the client Disconnect call to main.cc and puts the termination of the main service into the callback of Disconnect, which guarantees that the Append operation is executed. In the log of the added simple test there is no longer any "Failed to connect to client" message.
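
For illustration, here is a minimal boost::asio sketch of the shutdown ordering this PR establishes; the names are hypothetical stand-ins, not the actual Raylet code. The event loop is stopped only from inside the disconnect callback, so the final append is guaranteed to run before the loop exits.

```cpp
#include <boost/asio.hpp>
#include <functional>
#include <iostream>

// Minimal sketch (hypothetical names): stop the io_service only from inside
// the disconnect callback, so the final GCS append actually executes before
// the event loop exits.
int main() {
  boost::asio::io_service io;

  // Stand-in for ClientTable::Disconnect: it schedules an asynchronous append
  // and invokes `done` from that append's completion handler.
  auto disconnect = [&io](std::function<void()> done) {
    io.post([done]() {
      std::cout << "appended entry with is_insertion=false" << std::endl;
      done();  // Only now is it safe to tear down the event loop.
    });
  };

  // Terminate the main service in the Disconnect callback rather than stopping
  // it first (which would drop the pending append).
  disconnect([&io]() { io.stop(); });

  io.run();
  return 0;
}
```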

@guoyuhong
Contributor Author

guoyuhong commented Feb 26, 2019

@hartikainen Please test this PR by building Ray the old way with RAY_USE_CMAKE=1 pip install -e python -v. In my environment, plain pip install -e python -v does not seem to work. Also make sure the raylet node is not killed with kill -9; only SIGTERM terminates it gracefully.
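
For context on why only SIGTERM allows a graceful exit: a process built on boost::asio typically registers the signal with an asio signal_set, whose handler can run cleanup (such as the GCS disconnect) before stopping the loop, whereas SIGKILL bypasses any handler. The sketch below uses hypothetical names and is not the actual main.cc.

```cpp
#include <boost/asio.hpp>
#include <csignal>
#include <iostream>

// Minimal sketch (hypothetical names): SIGTERM is delivered to an asio
// signal_set handler, which can run graceful-shutdown logic before stopping
// the event loop. SIGKILL never reaches this handler.
int main() {
  boost::asio::io_service io;
  boost::asio::signal_set signals(io, SIGTERM, SIGINT);

  signals.async_wait([&io](const boost::system::error_code& ec, int signum) {
    if (!ec) {
      std::cout << "received signal " << signum
                << ", shutting down gracefully" << std::endl;
      // Graceful cleanup (e.g. appending the disconnect entry) would go here,
      // stopping `io` only from that operation's completion callback.
      io.stop();
    }
  });

  io.run();
  return 0;
}
```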

@guoyuhong guoyuhong requested a review from ericl February 26, 2019 10:01
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12315/
Test FAILed.

@hartikainen
Contributor

I currently get the following:

2019-02-26 21:00:52,092 ERROR worker.py:1734 -- The node with client ID 41a220789a6b7a5851f56e81526fd89a99bb8d8b has been marked dead because the monitor has missed too many heartbeats from it.
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/softlearning/bin/softlearning", line 11, in <module>
load_entry_point('softlearning', 'console_scripts', 'softlearning')()
File "/home/ubuntu/softlearning/softlearning/scripts/console_scripts.py", line 213, in main
return cli()
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/ubuntu/softlearning/softlearning/scripts/console_scripts.py", line 106, in run_example_cluster_cmd
run_example_cluster(example_module_name, example_argv)
File "/home/ubuntu/softlearning/examples/instrument.py", line 291, in run_example_cluster
scheduler=median_stopping_rule)
File "/home/ubuntu/ray/python/ray/tune/tune.py", line 171, in run_experiments
runner.step()
File "/home/ubuntu/ray/python/ray/tune/trial_runner.py", line 224, in step
next_trial = self._get_next_trial()
File "/home/ubuntu/ray/python/ray/tune/trial_runner.py", line 383, in _get_next_trial
trial = self._scheduler_alg.choose_trial_to_run(self)
File "/home/ubuntu/ray/python/ray/tune/schedulers/trial_scheduler.py", line 89, in choose_trial_to_run
and trial_runner.has_resources(trial.resources)):
File "/home/ubuntu/ray/python/ray/tune/trial_runner.py", line 372, in has_resources
return self.trial_executor.has_resources(resources)
File "/home/ubuntu/ray/python/ray/tune/ray_trial_executor.py", line 290, in has_resources
self._update_avail_resources()
File "/home/ubuntu/ray/python/ray/tune/ray_trial_executor.py", line 269, in _update_avail_resources
resources = ray.global_state.cluster_resources()
File "/home/ubuntu/ray/python/ray/experimental/state.py", line 767, in cluster_resources
clients = self.client_table()
File "/home/ubuntu/ray/python/ray/experimental/state.py", line 404, in client_table
return parse_client_table(self.redis_client)
File "/home/ubuntu/ray/python/ray/experimental/state.py", line 63, in parse_client_table
"Unexpected duplicate removal of client.")
AssertionError: Unexpected duplicate removal of client.
2019-02-26 21:00:52,397 ERROR worker.py:1734 -- The node with client ID f52223b8548f32d771f9f19275755b748814e9d3 has been marked dead because the monitor has missed too many heartbeats from it.
2019-02-26 21:01:11,246 ERROR worker.py:1734 -- The monitor failed with the following error:
Traceback (most recent call last):
File "/home/ubuntu/ray/python/ray/monitor.py", line 383, in <module>
monitor.run()
File "/home/ubuntu/ray/python/ray/monitor.py", line 316, in run
self.update_local_scheduler_map()
File "/home/ubuntu/ray/python/ray/monitor.py", line 258, in update_local_scheduler_map
local_schedulers = self.state.client_table()
File "/home/ubuntu/ray/python/ray/experimental/state.py", line 404, in client_table
return parse_client_table(self.redis_client)
File "/home/ubuntu/ray/python/ray/experimental/state.py", line 63, in parse_client_table
"Unexpected duplicate removal of client.")
AssertionError: Unexpected duplicate removal of client.

One thing I'm not sure about is what signal is used when I stop/preempt a node.

Collaborator

I think this should go in python/ray/tests/test_multi_node_2.py. Did you add the allow_graceful flag for consistency or for some other reason? It doesn't seem to be used anywhere.

Contributor Author

Yes, this flag is only used for this test, which reproduces the problem @hartikainen mentioned. I can check the raylet log to see whether it contains "Failed to connect to client", but that requires allow_graceful=True so that the raylet receives SIGTERM instead of SIGKILL. Actually, even without allow_graceful=True the test still passes because of #2905. The option is only there for reproducing the issue; I can remove the flag once this problem is fixed.
Do you have any comments on adding a callback that closes the main service, to guarantee that the closing entry is added to the GCS?

Contributor Author

@robertnishihara I think the allow_graceful option of ray.tests.cluster_utils should be kept so that the test case works.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12323/
Test FAILed.

@guoyuhong
Contributor Author

@hartikainen It looks like there is a duplicate removal message, which probably comes from raylet_monitor. Previously, the raylet closing message was appended by raylet_monitor. I need to take a look at how to avoid the duplicate.

@guoyuhong
Contributor Author

@hartikainen I modified raylet_monitor so that it does not mark a node as dead when the node has already marked itself dead. Please try again.
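
For illustration, a minimal sketch of the kind of guard described here, with hypothetical names rather than the actual raylet_monitor code: before appending a removal entry, check whether the client has already been marked dead, which avoids the duplicate-removal assertion above.

```cpp
#include <iostream>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for the monitor: it tracks which clients are already
// marked dead and skips appending a second removal entry for them, avoiding
// the "Unexpected duplicate removal of client" assertion.
class FakeMonitor {
 public:
  void MarkDeadIfNeeded(const std::string& client_id) {
    if (dead_clients_.count(client_id) > 0) {
      std::cout << client_id << " already marked dead; skipping" << std::endl;
      return;
    }
    dead_clients_.insert(client_id);
    std::cout << "appending removal entry for " << client_id << std::endl;
  }

 private:
  std::unordered_set<std::string> dead_clients_;
};

int main() {
  FakeMonitor monitor;
  monitor.MarkDeadIfNeeded("41a220789a6b7a5851f56e81526fd89a99bb8d8b");
  monitor.MarkDeadIfNeeded("41a220789a6b7a5851f56e81526fd89a99bb8d8b");  // No-op.
  return 0;
}
```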

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12362/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12363/
Test FAILed.

@hartikainen
Contributor

hartikainen commented Feb 27, 2019

Looks like the stopped nodes now get cleaned up correctly and this solves the immediate issue:

2019-02-27 21:36:00,604 INFO autoscaler.py:625 -- StandardAutoscaler: 9/12 target nodes (0 pending)
2019-02-27 21:36:00,605 INFO autoscaler.py:626 -- LoadMetrics: MostDelayedHeartbeats={'10.138.0.114': 4.0947923
66027832, '10.138.0.110': 3.1498706340789795, '10.138.0.109': 2.6247594356536865, '10.138.0.119': 0.29855275154
11377, '10.138.0.120': 0.29834842681884766}, NodeIdleSeconds=Min=0 Mean=0 Max=4, NumNodesConnected=13, NumNodes
Used=12.5, ResourceUsage=100.0/104.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=4
2019-02-27 21:36:00,605 INFO autoscaler.py:171 -- LoadMetrics: Removed mapping: 10.138.0.109 - 1551303357.98029
28
2019-02-27 21:36:00,605 INFO autoscaler.py:171 -- LoadMetrics: Removed mapping: 10.138.0.114 - 1551303356.5102599
2019-02-27 21:36:00,605 INFO autoscaler.py:171 -- LoadMetrics: Removed mapping: 10.138.0.110 - 1551303357.4551816
2019-02-27 21:36:00,605 INFO autoscaler.py:177 -- LoadMetrics: Removed 3 stale ip mappings: {'10.138.0.109', '10.138.0.114', '10.138.0.110'} not in {'10.138.0.119', '10.138.0.113', '10.138.0.108', '10.138.0.123', '10.138.0.121', '10.138.0.115', '10.138.0.111', '10.138.0.112', '10.138.0.120', '10.138.0.116'}
2019-02-27 21:36:00,605 INFO autoscaler.py:171 -- LoadMetrics: Removed mapping: 10.138.0.109 - {b'GPU': 0.0, b'CPU': 8.0}
2019-02-27 21:36:00,606 INFO autoscaler.py:171 -- LoadMetrics: Removed mapping: 10.138.0.114 - {b'GPU': 0.0, b'CPU': 8.0}
2019-02-27 21:36:00,606 INFO autoscaler.py:171 -- LoadMetrics: Removed mapping: 10.138.0.110 - {b'GPU': 0.0, b'CPU': 8.0}
2019-02-27 21:36:00,606 INFO autoscaler.py:177 -- LoadMetrics: Removed 3 stale ip mappings: {'10.138.0.109', '10.138.0.114', '10.138.0.110'} not in {'10.138.0.119', '10.138.0.113', '10.138.0.108', '10.138.0.123', '10.138.0.121', '10.138.0.115', '10.138.0.111', '10.138.0.112', '10.138.0.120', '10.138.0.116'}
2019-02-27 21:36:00,606 INFO autoscaler.py:171 -- LoadMetrics: Removed mapping: 10.138.0.109 - {b'GPU': 0.0, b'CPU': 0.0}
2019-02-27 21:36:00,606 INFO autoscaler.py:171 -- LoadMetrics: Removed mapping: 10.138.0.114 - {b'GPU': 0.0, b'CPU': 0.0}
2019-02-27 21:36:00,606 INFO autoscaler.py:171 -- LoadMetrics: Removed mapping: 10.138.0.110 - {b'GPU': 0.0, b'CPU': 0.0}
2019-02-27 21:36:00,606 INFO autoscaler.py:177 -- LoadMetrics: Removed 3 stale ip mappings: {'10.138.0.109', '10.138.0.114', '10.138.0.110'} not in {'10.138.0.119', '10.138.0.113', '10.138.0.108', '10.138.0.123', '10.138.0.121', '10.138.0.115', '10.138.0.111', '10.138.0.112', '10.138.0.120', '10.138.0.116'}
2019-02-27 21:36:00,606 INFO autoscaler.py:171 -- LoadMetrics: Removed mapping: 10.138.0.109 - 1551303357.9802928
2019-02-27 21:36:00,606 INFO autoscaler.py:171 -- LoadMetrics: Removed mapping: 10.138.0.114 - 1551303356.5102599
2019-02-27 21:36:00,606 INFO autoscaler.py:171 -- LoadMetrics: Removed mapping: 10.138.0.110 - 1551303357.4551816
2019-02-27 21:36:00,606 INFO autoscaler.py:177 -- LoadMetrics: Removed 3 stale ip mappings: {'10.138.0.109', '10.138.0.114', '10.138.0.110'} not in {'10.138.0.119', '10.138.0.113', '10.138.0.108', '10.138.0.123', '10.138.0.121', '10.138.0.115', '10.138.0.111', '10.138.0.112', '10.138.0.120', '10.138.0.116'}
2019-02-27 21:36:00,607 INFO autoscaler.py:448 -- Ending bringup phase
2019-02-27 21:36:05,680 INFO autoscaler.py:625 -- StandardAutoscaler: 9/9 target nodes (0 pending)
2019-02-27 21:36:05,680 INFO autoscaler.py:626 -- LoadMetrics: MostDelayedHeartbeats={'10.138.0.119': 0.2719385623931885, '10.138.0.120': 0.271834135055542, '10.138.0.112': 0.27173781394958496, '10.138.0.123': 0.27164387702941895, '10.138.0.111': 0.27155065536499023}, NodeIdleSeconds=Min=0 Mean=0 Max=0, NumNodesConnected=10, NumNodesUsed=9.5, ResourceUsage=76.0/80.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2019-02-27 21:36:05,684 INFO autoscaler.py:448 -- Ending bringup phase
2019-02-27 21:36:10,754 INFO autoscaler.py:625 -- StandardAutoscaler: 9/9 target nodes (0 pending)
2019-02-27 21:36:10,755 INFO autoscaler.py:626 -- LoadMetrics: MostDelayedHeartbeats={'10.138.0.119': 0.28159356117248535, '10.138.0.120': 0.2814810276031494, '10.138.0.112': 0.2813866138458252, '10.138.0.123': 0.28122663497924805, '10.138.0.111': 0.2811317443847656}, NodeIdleSeconds=Min=0 Mean=0 Max=0, NumNodesConnected=10, NumNodesUsed=9.5, ResourceUsage=76.0/80.0 b'CPU', 0.0/0.0 b'GPU', TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0

I do see some weird behavior with the autoscaler when a node gets preempted, though. It took almost 30 minutes for autoscaling to resume after the nodes were stopped. Several trials (7 in total) that were still running on healthy nodes failed 1-2 times for some reason. In the tune trial logs, I see:

== Status ==                                                                                        [1630/1847]
Using FIFO scheduling algorithm.
Resources requested: 96/96 CPUs, 0/0 GPUs
Memory usage on this node: 2.6/31.6 GB
Result logdir: /home/ubuntu/ray_results/gym/DClaw3/ScrewV2-v0/2019-02-27T21-13-38-from-state-benchmarks-2
Number of trials: 25 ({'RUNNING': 24, 'PENDING': 1})
PENDING trials:
- id=783859b9-seed=7366:       PENDING, 1 failures: /home/ubuntu/ray_results/gym/DClaw3/ScrewV2-v0/2019-02-27T
21-13-38-from-state-benchmarks-2/id=783859b9-seed=7366_2019-02-27_21-17-420ouunslb/error_2019-02-27_21-45-58.tx
t [ray-20190227t130934-e2860e-dclaw3-screwv2-v0-worker-ddd50bc8 pid=2258], 992 s, 35 iter, 35000 ts
RUNNING trials:
- id=ee8b5a2a-seed=7785:       RUNNING [pid=2407], 1851 s, 58 iter, 58000 ts
- id=f212c5cb-seed=3302:       RUNNING [pid=2404], 1856 s, 58 iter, 58000 ts
- id=a0a24159-seed=7312:       RUNNING, 1 failures: /home/ubuntu/ray_results/gym/DClaw3/ScrewV2-v0/2019-02-27T
21-13-38-from-state-benchmarks-2/id=a0a24159-seed=7312_2019-02-27_21-15-40q_1ae3i_/error_2019-02-27_21-18-54.tx
t [ray-20190227t130934-e2860e-dclaw3-screwv2-v0-worker-8bd0ca40 pid=3182], 1482 s, 56 iter, 56000 ts
- id=b335f8c0-seed=5280:       RUNNING, 1 failures: /home/ubuntu/ray_results/gym/DClaw3/ScrewV2-v0/2019-02-27T
21-13-38-from-state-benchmarks-2/id=b335f8c0-seed=5280_2019-02-27_21-15-40seagyvy1/error_2019-02-27_21-19-52.tx
t [ray-20190227t130934-e2860e-dclaw3-screwv2-v0-worker-243f2b2a pid=3159], 1420 s, 53 iter, 53000 ts
- id=5999d694-seed=9433:       RUNNING, 1 failures: /home/ubuntu/ray_results/gym/DClaw3/ScrewV2-v0/2019-02-27T
21-13-38-from-state-benchmarks-2/id=5999d694-seed=9433_2019-02-27_21-15-40x2o1rwao/error_2019-02-27_22-01-59.tx
t [ray-20190227t130934-e2860e-dclaw3-screwv2-v0-worker-4e710d3a pid=2313], 282 s, 11 iter, 11000 ts
- id=b3cd1c9b-seed=5105:       RUNNING, 2 failures: /home/ubuntu/ray_results/gym/DClaw3/ScrewV2-v0/2019-02-27T
21-13-38-from-state-benchmarks-2/id=b3cd1c9b-seed=5105_2019-02-27_21-15-4043r5rowr/error_2019-02-27_21-53-59.tx
t [ray-20190227t130934-e2860e-dclaw3-screwv2-v0-worker-6242ecb7 pid=2345], 192 s, 7 iter, 7000 ts
- id=249e2e15-seed=1623:       RUNNING, 3 failures: /home/ubuntu/ray_results/gym/DClaw3/ScrewV2-v0/2019-02-27T
21-13-38-from-state-benchmarks-2/id=249e2e15-seed=1623_2019-02-27_21-15-40ppfq63zw/error_2019-02-27_21-57-59.tx
t [ray-20190227t130934-e2860e-dclaw3-screwv2-v0-worker-dee39ffe pid=2334], 130 s, 5 iter, 5000 ts
- id=7506301a-seed=8909:       RUNNING, 1 failures: /home/ubuntu/ray_results/gym/DClaw3/ScrewV2-v0/2019-02-27T
21-13-38-from-state-benchmarks-2/id=7506301a-seed=8909_2019-02-27_21-15-40kt1yuwlz/error_2019-02-27_21-49-59.tx
t [ray-20190227t130934-e2860e-dclaw3-screwv2-v0-worker-dee39ffe pid=2336], 84 s, 3 iter, 3000 ts
- id=2a1b4729-seed=4735:       RUNNING, 1 failures: /home/ubuntu/ray_results/gym/DClaw3/ScrewV2-v0/2019-02-27T
21-13-38-from-state-benchmarks-2/id=2a1b4729-seed=4735_2019-02-27_21-15-40hn3899g2/error_2019-02-27_21-19-32.tx
t [ray-20190227t130934-e2860e-dclaw3-screwv2-v0-worker-8bd0ca40 pid=3183], 1447 s, 55 iter, 55000 ts
... 6 not shown
- id=84edc8fa-seed=5356:       RUNNING [ray-20190227t130934-e2860e-dclaw3-screwv2-v0-worker-6b711761 pid=2277]
, 1519 s, 54 iter, 54000 ts
- id=ee910050-seed=435:        RUNNING [ray-20190227t130934-e2860e-dclaw3-screwv2-v0-worker-45031020 pid=2288]
, 1521 s, 56 iter, 56000 ts
- id=541ebc41-seed=5998:       RUNNING [ray-20190227t130934-e2860e-dclaw3-screwv2-v0-worker-00394824 pid=3199]
, 1564 s, 59 iter, 59000 ts
- id=2b62c416-seed=6321:       RUNNING [ray-20190227t130934-e2860e-dclaw3-screwv2-v0-worker-00394824 pid=3198]
, 1572 s, 59 iter, 59000 ts
- id=b4335c61-seed=8546:       RUNNING [ray-20190227t130934-e2860e-dclaw3-screwv2-v0-worker-6cf033fa pid=3152]
, 1521 s, 60 iter, 60000 ts
- id=03a5adc3-seed=6271:       RUNNING, 1 failures: /home/ubuntu/ray_results/gym/DClaw3/ScrewV2-v0/2019-02-27T
21-13-38-from-state-benchmarks-2/id=03a5adc3-seed=6271_2019-02-27_21-18-299w97mibf/error_2019-02-27_21-20-11.tx
t [ray-20190227t130934-e2860e-dclaw3-screwv2-v0-worker-243f2b2a pid=3155], 1401 s, 52 iter, 52000 ts
- id=c9815b19-seed=2755:       RUNNING [ray-20190227t130934-e2860e-dclaw3-screwv2-v0-worker-091c9fc4 pid=2382]
, 1273 s, 45 iter, 45000 ts
- id=11ceb9e7-seed=5375:       RUNNING [ray-20190227t130934-e2860e-dclaw3-screwv2-v0-worker-091c9fc4 pid=2384]
, 1272 s, 45 iter, 45000 ts
- id=862c795d-seed=8345:       RUNNING [ray-20190227t130934-e2860e-dclaw3-screwv2-v0-worker-4e710d3a pid=2314]
, 1119 s, 56 iter, 56000 ts

This problem is probably unrelated to this pull request, however. Thanks @guoyuhong!

@guoyuhong
Contributor Author

@hartikainen Thanks for the confirmation. I will change this PR according to the comments.

@hartikainen
Contributor

I think it would also make sense to write tests for the two failure cases above.

@guoyuhong
Contributor Author

The PR is ready for review.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12392/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12404/
Test FAILed.

Contributor

@ericl ericl left a comment

Just one question

Contributor

This can happen if there is a race condition, right? Maybe we could just say "Failed to connect to ray node <client_id>. This may be since the node was recently removed."

Contributor Author

I have changed the message according to your comment.
Yes, there are two cases:

  1. The raylet crashes or is killed by SIGKILL and the heartbeat timeout (30 seconds) has not yet expired. This is probably the most common case, since 30 seconds is a long time.
  2. Race condition: the new node has just received the client information and is trying to connect to the existing nodes when the disconnect message arrives. This is very rare, since the time window is very short.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12427/
Test FAILed.

@pcmoritz
Contributor

pcmoritz commented Mar 1, 2019

@guoyuhong Do you know why the Java test failed?

@pcmoritz
Contributor

pcmoritz commented Mar 1, 2019

Jenkins retest this please

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12438/
Test FAILed.

@guoyuhong
Contributor Author

@pcmoritz It looks like the callback function is not called and the raylet socket file is not deleted. I added code to handle this.
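
For illustration, a minimal sketch of a fallback socket cleanup, with a hypothetical path and helper name rather than the actual change: the socket file is removed explicitly so it does not linger when the graceful-disconnect callback is never invoked.

```cpp
#include <cstdio>
#include <iostream>
#include <string>

// Hypothetical helper (not the actual change): remove the raylet socket file
// explicitly so it does not linger when the shutdown callback never runs.
void CleanupSocketFile(const std::string& socket_path) {
  if (std::remove(socket_path.c_str()) == 0) {
    std::cout << "removed socket file " << socket_path << std::endl;
  } else {
    std::cout << "socket file " << socket_path << " was already gone" << std::endl;
  }
}

int main() {
  CleanupSocketFile("/tmp/ray/example_session/sockets/raylet");  // Hypothetical path.
  return 0;
}
```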

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12471/
Test FAILed.

@pcmoritz pcmoritz merged commit 6f46edc into ray-project:master Mar 2, 2019
@guoyuhong guoyuhong deleted the skipDeadNode branch March 3, 2019 07:49