
Conversation

@guoyuhong (Contributor) commented Mar 29, 2019

What do these changes do?

  1. Better handle the Broken Pipe status: don't send messages when the connection status is Broken Pipe. If we keep sending messages on such a connection, the write callbacks will never be called.
  2. Fix test_multi_node_2.py::test_worker_plasma_store_failure by killing the reporter when running in Python 3.
  3. Fix test_component_failures.py::test_plasma_store_failed by reducing the number of nodes from 4 to 2. This test passes in my local environment, but it may be too heavy for the Travis machines.
  4. Fix test_actor.py::test_resource_assignment by refining scheduling_resources.cc. The test fails when 0.9 GPU is already present and 0.2 GPU is returned: the resulting 1.1 GPU makes the check fail. Furthermore, it is not right to compare a double directly against an integer without any tolerance (see the illustration after the stack trace below). The crash in the test is as follows (the actual value of fractional_pair_it->second is 1.1 and the released resource is 0.2):
F0329 16:15:18.325944 207971776 scheduling_resources.cc:261]  Check failed: fractional_pair_it->second <= 1
*** Check failure stack trace: ***
*** Aborted at 1553847318 (unix time) try "date -d @1553847318" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x7fff678a32c6) received by PID 21869 (TID 0x10c6565c0) stack trace: ***
    @     0x7fff6794db5d _sigtramp
    @        0x10c5d1f87 (unknown)
    @     0x7fff6780d6a6 abort
    @        0x10898cfa9 google::logging_fail()
    @        0x10898bcd3 google::LogMessage::SendToLog()
    @        0x10898c3a5 google::LogMessage::Flush()
    @        0x10898c482 google::LogMessage::~LogMessage()
    @        0x108985c35 ray::RayLog::~RayLog()
    @        0x1088ce6f5 ray::raylet::ResourceIds::Release()
    @        0x1088d03a4 ray::raylet::ResourceIdSet::Release()
    @        0x1088a7230 ray::raylet::NodeManager::ProcessDisconnectClientMessage()
    @        0x1088a5d9d ray::raylet::NodeManager::ProcessClientMessage()
    @        0x1088c390c std::__1::__function::__func<>::operator()()
    @        0x10897b446 ray::ClientConnection<>::ProcessMessage()
    @        0x1089822f5 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
    @        0x10888ac48 boost::asio::detail::scheduler::do_run_one()
    @        0x10888a682 boost::asio::detail::scheduler::run()
    @        0x10887f70e main
    @     0x7fff677683d5 start
    @               0x10 (unknown)
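
As an illustration of the comparison issue in item 4, here is a small standalone example (not Ray code; the 0.1/0.2/0.3 values are chosen only to show IEEE rounding) of why an exact comparison on accumulated doubles is fragile, and how a tolerance, like the epsilon check discussed later in this review, behaves:

```cpp
#include <iostream>

int main() {
  // Suppose a worker was granted 0.1 and then 0.2 of a fractional resource
  // whose capacity is 0.3. Releasing both grants and checking the total with
  // an exact comparison fails, even though the bookkeeping is logically fine,
  // because 0.1 + 0.2 == 0.30000000000000004 in IEEE doubles.
  const double capacity = 0.3;
  const double released = 0.1 + 0.2;
  std::cout << std::boolalpha;
  std::cout << "exact check:    " << (released <= capacity) << "\n";  // false

  // A small tolerance absorbs this kind of rounding noise.
  const double kEpsilon = 1e-9;
  std::cout << "tolerant check: " << (released <= capacity + kEpsilon) << "\n";  // true
  return 0;
}
```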

Related issue number

@guoyuhong (Contributor, Author)

I also found that test_stress.py::test_submitting_many_actors_to_one and test_object_manager.py::test_actor_broadcast are too heavy for Travis. I can see the OOM message in these builds:
https://travis-ci.com/ant-tech-alliance/ray/jobs/188826280
https://travis-ci.com/ant-tech-alliance/ray/jobs/188789871

@AmplabJenkins

Can one of the admins verify this patch?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13363/
Test FAILed.

};

if (async_write_broken_pipe_) {
  // The connection is not healthy, and the heartbeat has not timed out yet.
Contributor:

Let's also add a comment here, something like:

Call the handlers directly, because writing messages to a connection with broken-pipe status will result in the callbacks never being called.
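
For context, here is a minimal sketch of the pattern being suggested. It is an illustration only: the member names (async_write_broken_pipe_, async_write_queue_) are borrowed from this discussion, the surrounding class is hypothetical, and the real Ray connection code is more involved:

```cpp
#include <boost/system/error_code.hpp>

#include <functional>
#include <memory>
#include <vector>

// Hypothetical pending write: the completion handler that the caller
// expects to be invoked exactly once.
struct PendingWrite {
  std::function<void(const boost::system::error_code &)> handler;
};

struct Connection {
  bool async_write_broken_pipe_ = false;
  std::vector<std::unique_ptr<PendingWrite>> async_write_queue_;

  void FlushWrites() {
    if (async_write_broken_pipe_) {
      // The socket is already known to be broken, so issuing
      // boost::asio::async_write here would never complete and the handlers
      // below would never fire. Instead, call them directly with an error.
      const auto ec = boost::system::errc::make_error_code(
          boost::system::errc::broken_pipe);
      for (const auto &write : async_write_queue_) {
        write->handler(ec);
      }
      async_write_queue_.clear();
      return;
    }
    // ... otherwise issue boost::asio::async_write as usual ...
  }
};
```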

fractional_pair_it->second -= 1;
if (fractional_pair_it->second < 1e-6) {
  fractional_ids_.erase(fractional_pair_it);
}
Contributor:

Not too sure if this is what we want. @robertnishihara could you also take a look at this part?

Contributor:

After reading more about this, I think this fix is correct.

@raulchen (Contributor) commented Apr 1, 2019

> I also found that test_stress.py::test_submitting_many_actors_to_one and test_object_manager.py::test_actor_broadcast are too heavy for Travis. I can see the OOM message in these builds:
> https://travis-ci.com/ant-tech-alliance/ray/jobs/188826280
> https://travis-ci.com/ant-tech-alliance/ray/jobs/188789871

I'm surprised that bazel is using the most memory. Shouldn't bazel have already finished by the time we start running the tests?

@raulchen (Contributor) commented Apr 1, 2019

> > I also found that test_stress.py::test_submitting_many_actors_to_one and test_object_manager.py::test_actor_broadcast are too heavy for Travis. I can see the OOM message in these builds:
> > https://travis-ci.com/ant-tech-alliance/ray/jobs/188826280
> > https://travis-ci.com/ant-tech-alliance/ray/jobs/188789871
>
> I'm surprised that bazel is using the most memory. Shouldn't bazel have already finished by the time we start running the tests?

It seems that we need to explicitly call bazel shutdown. Hopefully this can free up more memory and mitigate these test failures.

@guoyuhong could you try adding bazel shutdown here?

ray/.travis.yml, line 169 (at commit 0d94f3e)

@guoyuhong (Contributor, Author)

@raulchen bazel shutdown has been added. Let's wait and see.

@guoyuhong (Contributor, Author)

Even though bazel shutdown was called, 3 test cases still raised OOM exceptions in the Linux Travis test.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13405/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13406/
Test FAILed.

@raulchen (Contributor) commented Apr 2, 2019

I also fixed some other problematic tests in #4535. Hopefully CI will be more stable.

raulchen merged commit c2c548b into ray-project:master on Apr 2, 2019
raulchen deleted the fixBrokenPipeCallback branch on Apr 2, 2019 at 09:42
} else {
  RAY_CHECK(fractional_pair_it->second < 1);
  fractional_pair_it->second += fractional_pair_to_return.second;
  RAY_CHECK(fractional_pair_it->second <= 1);
Collaborator:

This should probably be RAY_CHECK(fractional_pair_it->second < 1 + std::numeric_limits<double>::epsilon());

I believe @romilbhardwaj is making this change in #4555.

However, handling the imprecision this way is just a temporary solution. There are too many places where we'd have to add epsilon in order to make things correct. @williamma12 is working on a solution that makes the fractional resource bookkeeping exact (instead of approximate) by using integers instead of doubles (where one integer unit represents 1/1000th of a resource).
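
A minimal sketch of that exact-bookkeeping idea, assuming the 1/1000th-of-a-resource granularity mentioned above; the class here is hypothetical and not the eventual Ray implementation:

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical fixed-point resource quantity: one whole resource is 1000
// units, so acquire/release arithmetic is exact integer arithmetic and no
// epsilon is needed anywhere.
class FractionalResourceQuantity {
 public:
  static constexpr int64_t kUnitsPerResource = 1000;

  explicit FractionalResourceQuantity(double amount)
      : units_(static_cast<int64_t>(amount * kUnitsPerResource + 0.5)) {}

  void Acquire(const FractionalResourceQuantity &other) {
    if (other.units_ > units_) {
      throw std::runtime_error("acquiring more than is available");
    }
    units_ -= other.units_;
  }

  void Release(const FractionalResourceQuantity &other) {
    units_ += other.units_;
    // "At most one whole resource" is now an exact check.
    if (units_ > kUnitsPerResource) {
      throw std::runtime_error("released more than one whole resource");
    }
  }

  double ToDouble() const {
    return static_cast<double>(units_) / kUnitsPerResource;
  }

 private:
  int64_t units_;
};
```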

Contributor:

The problem in this case isn't precision. The problem is that the remaining value and the returned value can add up to more than 1. E.g., returning 0.3 to an existing 0.8 gives 1.1 in total.

Contributor (Author):

Yes, @raulchen is right. From my observation in test_actor.py::test_resource_assignment, a value of 1.1 appears, generated by adding 0.2 to 0.9.

Collaborator:

@raulchen @guoyuhong that shouldn't be possible. If that happens, then it is a bug.

Each fractional resource corresponds to a specific resource ID. Suppose there are two GPUs with IDs ID1 and ID2, and suppose there are two tasks that require 0.2 and 0.9 GPUs.

  • The raylet starts with (ID1, 1), (ID2, 1).
  • The first task gets scheduled, so the raylet has (ID1, 0.8), (ID2, 1), and the first worker has (ID1, 0.2).
  • The second task gets scheduled, so the raylet has (ID1, 0.8), (ID2, 0.1), the first worker has (ID1, 0.2) and the second worker has (ID2, 0.9)
  • The first task finishes, so the raylet has (ID1, 1), (ID2, 0.1), and the second worker has (ID2, 0.9)
  • The second task finishes, so the raylet has (ID1, 1), (ID2, 1)

We won't add the 0.2 and the 0.9 together because they correspond to different IDs. It isn't possible for one worker to have (ID1, 0.2) and another worker to have (ID1, 0.9) because the total quantity of a given ID is 1.

Does my notation make sense?
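
To make the walkthrough concrete, here is a tiny hypothetical simulation of per-ID accounting (the IDs and amounts follow the bullets above; this is not the raylet's actual data structure):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

int main() {
  // available[id] is the fraction of GPU `id` still held by the raylet.
  std::unordered_map<int64_t, double> available = {{1, 1.0}, {2, 1.0}};

  auto acquire = [&](int64_t id, double amount) { available[id] -= amount; };
  auto release = [&](int64_t id, double amount) {
    available[id] += amount;
    // Each ID is a single physical GPU, so its total can never exceed 1:
    // the 0.2 and the 0.9 in this thread live under different IDs.
    assert(available[id] <= 1.0 + 1e-9);
  };

  acquire(1, 0.2);  // first task scheduled:  raylet has (ID1, 0.8), (ID2, 1.0)
  acquire(2, 0.9);  // second task scheduled: raylet has (ID1, 0.8), (ID2, 0.1)
  release(1, 0.2);  // first task finishes:   raylet has (ID1, 1.0), (ID2, 0.1)
  release(2, 0.9);  // second task finishes:  raylet has (ID1, 1.0), (ID2, 1.0)
  return 0;
}
```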

Contributor (Author):

@robertnishihara Thanks for the information. I looked at the code again; the resource_id is an int64_t, not a string, so I may have misunderstood the code. I ran the test several times and cannot reproduce the case of adding 0.2 to 0.9, which is strange.
Sorry for this bad change. Shall I revert it, or will this be fixed in #4555 by @romilbhardwaj?

Collaborator:

Thanks @guoyuhong, we'll fix it in #4555.

@robertnishihara (Collaborator)

Thanks @guoyuhong! I think I've been able to reproduce the test_multi_node_2.py::test_worker_plasma_store_failure failure locally in the past, so I thought that might actually be a real bug, but I'm not sure.
