Perf fix #2069
Conversation
src/ray/raylet/node_manager.cc
Outdated
The `.count() == 0` check is supposed to be idiomatically equivalent. Did you find your proposed way performs better?
`count()` means we have to look the item up in the map, and `cluster_resource_map_[client_id]` will look it up again. Although `find` is O(1) on average for an `unordered_map`, computing the hash and making the call still have a cost. The fix reduces the lookup from twice to once.
Yes, I know; I said "nevermind" below when I saw you reusing the iterator ;)
src/ray/raylet/node_manager.cc
Outdated
Oh, I see. nvm.
`add_handlers_` and `rem_handlers_` are vectors of function objects. Did you find that switching to a reference gave you a significant performance improvement?
Copying a `std::function` is not cheap: the `std::function` may wrap a lambda, and the lambda may capture variables that have to be copied along with it. You can step through in a debugger to see how a lambda is converted into a `std::function`.
Not only lambdas: `std::bind` also generates a very heavy `std::function` object. It is better to use a reference where possible.
Sounds good. I agree with that and have been pushing internally to use refs and const refs as much as possible... Just curious whether you actually saw performance differences in your testing environment.
I can't get the perf numbers now, because the current code differs too much from our internal branch. I made this change because I ran into a similar problem before.
src/ray/raylet/node_manager.cc
Outdated
No, here I explicitly want to work with a copy (as per the comment on L290), because we're calling RemoveTasks on the local queues below, which will change the contents of scheduled_tasks_. If this were a const ref, we'd be modifying a container we're iterating over.
How about changing the code to:

```cpp
std::vector<TaskID> dispatched_taskIDs;
dispatched_taskIDs.reserve(scheduled_tasks.size());
const auto &local_resources =
    cluster_resource_map_[my_client_id].GetAvailableResources();
for (const auto &task : scheduled_tasks) {
  const auto &task_resources = task.GetTaskSpecification().GetRequiredResources();
  if (!task_resources.IsSubset(local_resources)) {
    // Not enough local resources for this task right now, skip this task.
    continue;
  }
  dispatched_taskIDs.push_back(task.GetTaskSpecification().TaskId());
}
// We have enough resources for this task. Assign task.
// TODO(atumanov): perform the task state/queue transition inside AssignTask.
auto dispatched_tasks = local_queues_.RemoveTasks(dispatched_taskIDs);
for (auto &task : dispatched_tasks) {
  AssignTask(task);
}
```
This way we don't need to copy scheduled_tasks, and we only need to call RemoveTasks once.
src/ray/raylet/node_manager.cc
Outdated
in terms of correctness, it should be fine to batch dispatched ids and then dispatch them. The intention was to dispatch them ASAP, to minimize latency. I'm ok with this change for now.
Test FAILed.
Hey @eric-jj, you should be able to expedite the Travis builds by doing the following to test your PR. You can view the results at https://travis-ci.com/robertnishihara/ray-private-travis/branches.
I can reproduce the issue (stochastically) on Ubuntu with Python 3.5 by running …; will look into it a little.
```cpp
  return;
}
const ClientID &my_client_id = gcs_client_->client_table().GetLocalClientId();
const auto &local_resources =
```
There may be a bug here (or at least a change in behavior). Before, we called AssignTask inside the for loop, which presumably modified local_resources. Now we call AssignTask at the very end, so the check `if (!task_resources.IsSubset(local_resources))` may give different results.
I will try reverting some of this and see if that makes the problem I'm seeing go away.
I will investigate it.
@robertnishihara good catch. That's the reason why I didn't batch the AssignTask calls before.
`this->cluster_resource_map_[my_client_id].Acquire(spec.GetRequiredResources()));` — this line acquires resources when the task is assigned, which makes sure that the locally available resources are updated at each for-loop iteration.
Yes, you are right; I have reverted it.
I have no idea why I can't reopen the pull request after rebasing from the master branch, so I have created a new pull request.