
Conversation

@guoyuhong (Contributor) commented Mar 29, 2019

What do these changes do?

  1. Better handle the Broken Pipe status: don't send messages when the connection status is Broken Pipe. If we keep sending messages on such a connection, the write callbacks will never be called.
  2. Fix test_multi_node_2.py::test_worker_plasma_store_failure by killing the reporter when running in Python 3.
  3. Fix test_component_failures.py::test_plasma_store_failed by reducing the number of nodes from 4 to 2. This test passes in my local environment, but it may be too heavy for the Travis machines.
  4. Fix test_actor.py::test_resource_assignment by refining scheduling_resources.cc. The test fails when 0.9 GPU is already present and 0.2 GPU is returned: the resulting 1.1 GPU makes the check fail. Furthermore, it is not right to compare a double directly against an integer without any tolerance (see the illustration after the stack trace below). The crash in the test is as follows (the actual value of fractional_pair_it->second is 1.1 and the released resource is 0.2):
F0329 16:15:18.325944 207971776 scheduling_resources.cc:261]  Check failed: fractional_pair_it->second <= 1
*** Check failure stack trace: ***
*** Aborted at 1553847318 (unix time) try "date -d @1553847318" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x7fff678a32c6) received by PID 21869 (TID 0x10c6565c0) stack trace: ***
    @     0x7fff6794db5d _sigtramp
    @        0x10c5d1f87 (unknown)
    @     0x7fff6780d6a6 abort
    @        0x10898cfa9 google::logging_fail()
    @        0x10898bcd3 google::LogMessage::SendToLog()
    @        0x10898c3a5 google::LogMessage::Flush()
    @        0x10898c482 google::LogMessage::~LogMessage()
    @        0x108985c35 ray::RayLog::~RayLog()
    @        0x1088ce6f5 ray::raylet::ResourceIds::Release()
    @        0x1088d03a4 ray::raylet::ResourceIdSet::Release()
    @        0x1088a7230 ray::raylet::NodeManager::ProcessDisconnectClientMessage()
    @        0x1088a5d9d ray::raylet::NodeManager::ProcessClientMessage()
    @        0x1088c390c std::__1::__function::__func<>::operator()()
    @        0x10897b446 ray::ClientConnection<>::ProcessMessage()
    @        0x1089822f5 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
    @        0x10888ac48 boost::asio::detail::scheduler::do_run_one()
    @        0x10888a682 boost::asio::detail::scheduler::run()
    @        0x10887f70e main
    @     0x7fff677683d5 start
    @               0x10 (unknown)
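
As an illustration of the comparison issue in item 4, here is a small standalone example (not Ray code; the 0.1/0.2/0.3 values are chosen only to show IEEE rounding) of why an exact comparison on accumulated doubles is fragile, and how a tolerance, like the epsilon check discussed later in this review, behaves:

```cpp
#include <iostream>

int main() {
  // Suppose a worker was granted 0.1 and then 0.2 of a fractional resource
  // whose capacity is 0.3. Releasing both grants and checking the total with
  // an exact comparison fails, even though the bookkeeping is logically fine,
  // because 0.1 + 0.2 == 0.30000000000000004 in IEEE doubles.
  const double capacity = 0.3;
  const double released = 0.1 + 0.2;
  std::cout << std::boolalpha;
  std::cout << "exact check:    " << (released <= capacity) << "\n";  // false

  // A small tolerance absorbs this kind of rounding noise.
  const double kEpsilon = 1e-9;
  std::cout << "tolerant check: " << (released <= capacity + kEpsilon) << "\n";  // true
  return 0;
}
```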

Related issue number

@guoyuhong (Contributor, Author)

I also found that test_stress.py::test_submitting_many_actors_to_one and test_object_manager.py::test_actor_broadcast are too heavy for Travis. I can see the OOM message in these builds:
https://travis-ci.com/ant-tech-alliance/ray/jobs/188826280
https://travis-ci.com/ant-tech-alliance/ray/jobs/188789871

@AmplabJenkins

Can one of the admins verify this patch?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13363/
Test FAILed.

};

if (async_write_broken_pipe_) {
  // The connection is not healthy, and the heartbeat has not timed out yet.
Contributor:

Let's also add a comment here, something like:

Call the handlers directly, because writing messages to a connection with broken-pipe status will result in the callbacks never being called.
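
For context, here is a minimal sketch of the pattern being suggested. It is an illustration only: the member names (async_write_broken_pipe_, async_write_queue_) are borrowed from this discussion, the surrounding class is hypothetical, and the real Ray connection code is more involved:

```cpp
#include <boost/system/error_code.hpp>

#include <functional>
#include <memory>
#include <vector>

// Hypothetical pending write: the completion handler that the caller
// expects to be invoked exactly once.
struct PendingWrite {
  std::function<void(const boost::system::error_code &)> handler;
};

struct Connection {
  bool async_write_broken_pipe_ = false;
  std::vector<std::unique_ptr<PendingWrite>> async_write_queue_;

  void FlushWrites() {
    if (async_write_broken_pipe_) {
      // The socket is already known to be broken, so issuing
      // boost::asio::async_write here would never complete and the handlers
      // below would never fire. Instead, call them directly with an error.
      const auto ec = boost::system::errc::make_error_code(
          boost::system::errc::broken_pipe);
      for (const auto &write : async_write_queue_) {
        write->handler(ec);
      }
      async_write_queue_.clear();
      return;
    }
    // ... otherwise issue boost::asio::async_write as usual ...
  }
};
```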

fractional_pair_it->second -= 1;
if (fractional_pair_it->second < 1e-6) {
  fractional_ids_.erase(fractional_pair_it);
}
Contributor:

Not too sure if this is what we want. @robertnishihara could you also take a look at this part?

Contributor:

After reading more about this, I think this fix is correct.

@raulchen (Contributor) commented Apr 1, 2019

> I also found that test_stress.py::test_submitting_many_actors_to_one and test_object_manager.py::test_actor_broadcast are too heavy for Travis. I can see the OOM message in these builds:
> https://travis-ci.com/ant-tech-alliance/ray/jobs/188826280
> https://travis-ci.com/ant-tech-alliance/ray/jobs/188789871

I'm surprised that bazel is using the most memory. Shouldn't bazel have already finished by the time we start running the tests?

@raulchen (Contributor) commented Apr 1, 2019

> > I also found that test_stress.py::test_submitting_many_actors_to_one and test_object_manager.py::test_actor_broadcast are too heavy for Travis. I can see the OOM message in these builds:
> > https://travis-ci.com/ant-tech-alliance/ray/jobs/188826280
> > https://travis-ci.com/ant-tech-alliance/ray/jobs/188789871
>
> I'm surprised that bazel is using the most memory. Shouldn't bazel have already finished by the time we start running the tests?

It seems that we need to explicitly call bazel shutdown. Hopefully this can free up more memory and mitigate these test failures.

@guoyuhong could you try adding bazel shutdown here?

ray/.travis.yml, line 169 (at commit 0d94f3e)

@guoyuhong (Contributor, Author)

@raulchen bazel shutdown has been added. Let's wait and see.

@guoyuhong (Contributor, Author)

Even though bazel shutdown was called, 3 test cases still raised OOM exceptions in the Linux Travis test.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13405/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13406/
Test FAILed.

@raulchen (Contributor) commented Apr 2, 2019

I also fixed some other problematic tests in #4535. Hopefully CI will be more stable.

raulchen merged commit c2c548b into ray-project:master on Apr 2, 2019
raulchen deleted the fixBrokenPipeCallback branch on Apr 2, 2019 at 09:42
} else {
  RAY_CHECK(fractional_pair_it->second < 1);
  fractional_pair_it->second += fractional_pair_to_return.second;
  RAY_CHECK(fractional_pair_it->second <= 1);
Collaborator:

This should probably be RAY_CHECK(fractional_pair_it->second < 1 + std::numeric_limits<double>::epsilon());

I believe @romilbhardwaj is making this change in #4555.

However, handling the imprecision this way is just a temporary solution. There are too many places where we'd have to add epsilon in order to make things correct. @williamma12 is working on a solution that makes the fractional resource bookkeeping exact (instead of approximate) by using integers instead of doubles (where one integer unit represents 1/1000th of a resource).
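
A minimal sketch of that exact-bookkeeping idea, assuming the 1/1000th-of-a-resource granularity mentioned above; the class here is hypothetical and not the eventual Ray implementation:

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical fixed-point resource quantity: one whole resource is 1000
// units, so acquire/release arithmetic is exact integer arithmetic and no
// epsilon is needed anywhere.
class FractionalResourceQuantity {
 public:
  static constexpr int64_t kUnitsPerResource = 1000;

  explicit FractionalResourceQuantity(double amount)
      : units_(static_cast<int64_t>(amount * kUnitsPerResource + 0.5)) {}

  void Acquire(const FractionalResourceQuantity &other) {
    if (other.units_ > units_) {
      throw std::runtime_error("acquiring more than is available");
    }
    units_ -= other.units_;
  }

  void Release(const FractionalResourceQuantity &other) {
    units_ += other.units_;
    // "At most one whole resource" is now an exact check.
    if (units_ > kUnitsPerResource) {
      throw std::runtime_error("released more than one whole resource");
    }
  }

  double ToDouble() const {
    return static_cast<double>(units_) / kUnitsPerResource;
  }

 private:
  int64_t units_;
};
```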

Contributor:

The problem in this case isn't precision. The problem is that the remaining value and the returned value can add up to more than 1. E.g., returning 0.3 to an existing 0.8 gives 1.1 in total.

Contributor (Author):

Yes, @raulchen is right. From my observation in test_actor.py::test_resource_assignment, a value of 1.1 appears, generated by adding 0.2 to 0.9.

Collaborator:

@raulchen @guoyuhong that shouldn't be possible. If that happens, then it is a bug.

Each fractional resource corresponds to a specific resource ID. Suppose there are two GPUs with IDs ID1 and ID2, and suppose there are two tasks that require 0.2 and 0.9 GPUs.

  • The raylet starts with (ID1, 1), (ID2, 1).
  • The first task gets scheduled, so the raylet has (ID1, 0.8), (ID2, 1), and the first worker has (ID1, 0.2).
  • The second task gets scheduled, so the raylet has (ID1, 0.8), (ID2, 0.1), the first worker has (ID1, 0.2) and the second worker has (ID2, 0.9)
  • The first task finishes, so the raylet has (ID1, 1), (ID2, 0.1), and the second worker has (ID2, 0.9)
  • The second task finishes, so the raylet has (ID1, 1), (ID2, 1)

We won't add the 0.2 and the 0.9 together because they correspond to different IDs. It isn't possible for one worker to have (ID1, 0.2) and another worker to have (ID1, 0.9) because the total quantity of a given ID is 1.

Does my notation make sense?
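
To make the walkthrough concrete, here is a tiny hypothetical simulation of per-ID accounting (the IDs and amounts follow the bullets above; this is not the raylet's actual data structure):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

int main() {
  // available[id] is the fraction of GPU `id` still held by the raylet.
  std::unordered_map<int64_t, double> available = {{1, 1.0}, {2, 1.0}};

  auto acquire = [&](int64_t id, double amount) { available[id] -= amount; };
  auto release = [&](int64_t id, double amount) {
    available[id] += amount;
    // Each ID is a single physical GPU, so its total can never exceed 1:
    // the 0.2 and the 0.9 in this thread live under different IDs.
    assert(available[id] <= 1.0 + 1e-9);
  };

  acquire(1, 0.2);  // first task scheduled:  raylet has (ID1, 0.8), (ID2, 1.0)
  acquire(2, 0.9);  // second task scheduled: raylet has (ID1, 0.8), (ID2, 0.1)
  release(1, 0.2);  // first task finishes:   raylet has (ID1, 1.0), (ID2, 0.1)
  release(2, 0.9);  // second task finishes:  raylet has (ID1, 1.0), (ID2, 1.0)
  return 0;
}
```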

Contributor (Author):

@robertnishihara Thanks for the information. I looked at the code again; the resource_id is an int64_t, not a string, so I may have misunderstood the code. I ran the test several times and cannot reproduce the case of adding 0.2 to 0.9, which is strange.
Sorry for this bad change. Shall I revert it, or will this be fixed in #4555 by @romilbhardwaj?

Collaborator:

Thanks @guoyuhong, we'll fix it in #4555.

@robertnishihara (Collaborator)

Thanks @guoyuhong! I think I've been able to reproduce the test_multi_node_2.py::test_worker_plasma_store_failure failure locally in the past, so I thought that might actually be a real bug, but I'm not sure.
