Contributor

@eric-jj eric-jj commented May 16, 2018

What do these changes do?

Related issue number

Contributor

Checking .count() == 0 is supposed to be idiomatically equivalent. Did you find that your proposed approach performs better?

Contributor Author

count() means we have to look up the item in the map, and cluster_resource_map_[client_id] then looks it up again. Although find is O(1) on average for unordered_map, computing the hash and making the function call still have a cost. The fix reduces the number of lookups from two to one.
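A minimal sketch of the difference described above. The Resources struct, the ResourceMap alias, and both function names are hypothetical stand-ins, not the actual Ray types.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a per-client resource record.
struct Resources {
  int cpus = 0;
};

using ResourceMap = std::unordered_map<std::string, Resources>;

// Two lookups: count() hashes the key and probes the table, then
// operator[] hashes and probes again.
int CpusTwoLookups(ResourceMap &map, const std::string &client_id) {
  if (map.count(client_id) == 0) {
    return -1;
  }
  return map[client_id].cpus;
}

// One lookup: find() returns an iterator that is reused for the access.
int CpusOneLookup(const ResourceMap &map, const std::string &client_id) {
  auto it = map.find(client_id);
  if (it == map.end()) {
    return -1;
  }
  return it->second.cpus;
}
```

The find() version also works on a const map, because it can never insert a default-constructed entry the way operator[] does.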

Contributor

yes, I know, I said "nevermind" below, when I saw you reusing the iterator ;)

Contributor

Oh, I see. nvm.

Contributor

add_handlers_ and rem_handlers_ are vectors of function pointers. Did you find that switching to a reference gave you a significant performance improvement?

Contributor Author

Copying a std::function is not that cheap. The std::function may wrap a lambda, and if the lambda captures variables, copying the std::function also copies the captured objects. You can step through in a debugger to see how a lambda is converted to a std::function.

Contributor Author

This applies not only to lambdas: std::bind can also produce a very heavy std::function object, so it is better to use a reference where possible.
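A small sketch of that copy cost, assuming a hypothetical Payload type that counts its own copies. Handler, MakeHandlers, CallByValue, and CallByReference are illustrative names only and do not come from the Ray code.

```cpp
#include <functional>
#include <vector>

// Hypothetical captured object that counts how often it is copied.
struct Payload {
  static int copies;
  int value = 1;
  Payload() = default;
  Payload(const Payload &other) : value(other.value) { ++copies; }
};
int Payload::copies = 0;

using Handler = std::function<int()>;

// Builds one handler whose lambda captures a Payload by value.
std::vector<Handler> MakeHandlers() {
  Payload payload;
  std::vector<Handler> handlers;
  handlers.push_back([payload]() { return payload.value; });
  return handlers;
}

// Iterating by value copies each std::function, and with it the capture.
int CallByValue(const std::vector<Handler> &handlers) {
  int sum = 0;
  for (auto handler : handlers) {
    sum += handler();
  }
  return sum;
}

// Iterating by const reference calls the stored handlers in place.
int CallByReference(const std::vector<Handler> &handlers) {
  int sum = 0;
  for (const auto &handler : handlers) {
    sum += handler();
  }
  return sum;
}
```

Resetting Payload::copies to zero before each call shows that the by-reference loop makes no copies of the capture, while the by-value loop makes at least one per handler.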

Contributor

sounds good, I agree with that and have been pushing internally to use refs and const refs as much as possible... Just curious if you actually saw performance differences in your testing environment.

Contributor Author

I can't get perf numbers right now, because the current code differs too much from our internal branch.
I made this change because I ran into a similar problem before.

Contributor

no, here I explicitly want to work with a copy (as per the comment on L290), because we're calling RemoveTasks on local queues below, which will change the contents of scheduled_tasks_. This will modify a container we're iterating over, if it's a const ref.
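A sketch of why the copy matters here, using a std::list<int> of task ids as a hypothetical stand-in for the real queue. The RemoveTasks helper below is a simplified assumption, not the Ray method of the same name.

```cpp
#include <list>
#include <vector>

// Hypothetical queue operation: removes every listed id from `tasks`.
void RemoveTasks(std::list<int> &tasks, const std::vector<int> &ids) {
  for (int id : ids) {
    tasks.remove(id);
  }
}

// Dispatches the even ids. The loop runs over a snapshot, so RemoveTasks
// can mutate `tasks` without invalidating the iterators the loop is using.
std::vector<int> DispatchEven(std::list<int> &tasks) {
  const std::list<int> snapshot = tasks;  // deliberate copy
  std::vector<int> dispatched;
  for (int id : snapshot) {
    if (id % 2 == 0) {
      RemoveTasks(tasks, {id});
      dispatched.push_back(id);
    }
  }
  return dispatched;
}
```

If snapshot were a const reference to tasks instead, the erase inside the loop would invalidate the iterator that the range-for is currently holding.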

Contributor Author

How about changing the code to the following?

std::vector<TaskID> dispatched_taskIDs;
dispatched_taskIDs.reserve(scheduled_tasks.size());

const auto &local_resources =
    cluster_resource_map_[my_client_id].GetAvailableResources();
for (const auto &task : scheduled_tasks) {
  const auto &task_resources = task.GetTaskSpecification().GetRequiredResources();
  if (!task_resources.IsSubset(local_resources)) {
    // Not enough local resources for this task right now, skip this task.
    continue;
  }
  dispatched_taskIDs.push_back(task.GetTaskSpecification().TaskId());
}
// We have enough resources for these tasks. Assign them.
// TODO(atumanov): perform the task state/queue transition inside AssignTask.
auto dispatched_tasks =
    local_queues_.RemoveTasks(dispatched_taskIDs);
for (auto &task : dispatched_tasks) {
  AssignTask(task);
}

Contributor Author

This way, we don't need to copy scheduled_tasks, and we only need to call RemoveTasks once.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5416/

Contributor

in terms of correctness, it should be fine to batch dispatched ids and then dispatch them. The intention was to dispatch them ASAP, to minimize latency. I'm ok with this change for now.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5419/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5420/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5444/

@robertnishihara
Collaborator

Hey @eric-jj, you should be able to expedite the Travis builds by doing the following

git remote add private-travis [email protected]:robertnishihara/ray-private-travis.git
git push private-travis

to test your PR. You can view the results at https://travis-ci.com/robertnishihara/ray-private-travis/branches.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5446/

@robertnishihara
Collaborator

@eric-jj, it looks like some conflicts were introduced in #2035, can you rebase this?

Note that I pushed a small variable name change.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5462/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5470/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5478/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5480/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5495/

@robertnishihara
Collaborator

runtest.py seems to be failing (consistently) on Travis (in the xray build).

testIdenticalFunctionNames (__main__.APITest) ... Detected environment variable 'RAY_USE_XRAY'.
/home/travis/.local/lib/python3.6/site-packages/ray-0.4.0-py3.6-linux-x86_64.egg/ray/worker.py:1476: ResourceWarning: unclosed <socket.socket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('10.30.0.3', 49994), raddr=('8.8.8.8', 53)>
  node_ip_address = ray.services.get_node_ip_address()
Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:46321 to respond...
/home/travis/.local/lib/python3.6/site-packages/ray-0.4.0-py3.6-linux-x86_64.egg/ray/services.py:537: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/raylogs/redis-2018-05-19_08-23-07-03279.out' mode='a' encoding='UTF-8'>
  "redis-{}".format(i), redirect_output)
/home/travis/.local/lib/python3.6/site-packages/ray-0.4.0-py3.6-linux-x86_64.egg/ray/services.py:537: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/raylogs/redis-2018-05-19_08-23-07-03279.err' mode='a' encoding='UTF-8'>
  "redis-{}".format(i), redirect_output)
Waiting for redis server at 127.0.0.1:54850 to respond...
/home/travis/.local/lib/python3.6/site-packages/ray-0.4.0-py3.6-linux-x86_64.egg/ray/services.py:1330: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/raylogs/redis-0-2018-05-19_08-23-07-02682.out' mode='a' encoding='UTF-8'>
  cleanup=cleanup)
/home/travis/.local/lib/python3.6/site-packages/ray-0.4.0-py3.6-linux-x86_64.egg/ray/services.py:1330: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/raylogs/redis-0-2018-05-19_08-23-07-02682.err' mode='a' encoding='UTF-8'>
  cleanup=cleanup)
Failed to start the UI, you may need to run 'pip install jupyter'.
/home/travis/.local/lib/python3.6/site-packages/ray-0.4.0-py3.6-linux-x86_64.egg/ray/services.py:1696: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/raylogs/monitor-2018-05-19_08-23-08-01091.out' mode='a' encoding='UTF-8'>
  use_raylet=use_raylet)
/home/travis/.local/lib/python3.6/site-packages/ray-0.4.0-py3.6-linux-x86_64.egg/ray/services.py:1696: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/raylogs/monitor-2018-05-19_08-23-08-01091.err' mode='a' encoding='UTF-8'>
  use_raylet=use_raylet)
/home/travis/.local/lib/python3.6/site-packages/ray-0.4.0-py3.6-linux-x86_64.egg/ray/services.py:1696: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/raylogs/log_monitor-2018-05-19_08-23-08-03330.out' mode='a' encoding='UTF-8'>
  use_raylet=use_raylet)
/home/travis/.local/lib/python3.6/site-packages/ray-0.4.0-py3.6-linux-x86_64.egg/ray/services.py:1696: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/raylogs/log_monitor-2018-05-19_08-23-08-03330.err' mode='a' encoding='UTF-8'>
  use_raylet=use_raylet)
/home/travis/.local/lib/python3.6/site-packages/ray-0.4.0-py3.6-linux-x86_64.egg/ray/services.py:1696: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/raylogs/plasma_store_0-2018-05-19_08-23-08-06294.out' mode='a' encoding='UTF-8'>
  use_raylet=use_raylet)
/home/travis/.local/lib/python3.6/site-packages/ray-0.4.0-py3.6-linux-x86_64.egg/ray/services.py:1696: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/raylogs/plasma_store_0-2018-05-19_08-23-08-06294.err' mode='a' encoding='UTF-8'>
  use_raylet=use_raylet)
/home/travis/.local/lib/python3.6/site-packages/ray-0.4.0-py3.6-linux-x86_64.egg/ray/services.py:1696: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/raylogs/plasma_manager_0-2018-05-19_08-23-08-04225.out' mode='a' encoding='UTF-8'>
  use_raylet=use_raylet)
/home/travis/.local/lib/python3.6/site-packages/ray-0.4.0-py3.6-linux-x86_64.egg/ray/services.py:1696: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/raylogs/plasma_manager_0-2018-05-19_08-23-08-04225.err' mode='a' encoding='UTF-8'>
  use_raylet=use_raylet)
/home/travis/.local/lib/python3.6/site-packages/ray-0.4.0-py3.6-linux-x86_64.egg/ray/services.py:1696: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/raylogs/raylet_0-2018-05-19_08-23-08-01016.out' mode='a' encoding='UTF-8'>
  use_raylet=use_raylet)
/home/travis/.local/lib/python3.6/site-packages/ray-0.4.0-py3.6-linux-x86_64.egg/ray/services.py:1696: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/raylogs/raylet_0-2018-05-19_08-23-08-01016.err' mode='a' encoding='UTF-8'>
  use_raylet=use_raylet)
/home/travis/.local/lib/python3.6/site-packages/ray-0.4.0-py3.6-linux-x86_64.egg/ray/services.py:1696: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/raylogs/webui-2018-05-19_08-23-08-04976.out' mode='a' encoding='UTF-8'>
  use_raylet=use_raylet)
/home/travis/.local/lib/python3.6/site-packages/ray-0.4.0-py3.6-linux-x86_64.egg/ray/services.py:1696: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/raylogs/webui-2018-05-19_08-23-08-04976.err' mode='a' encoding='UTF-8'>
  use_raylet=use_raylet)


No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

The build has been terminated

@robertnishihara
Collaborator

I can reproduce the issue (stochastically) on Ubuntu with Python 3.5 by running

RAY_USE_XRAY=1 python test/runtest.py APITest.testIdenticalFunctionNames

will look into it a little.

return;
}
const ClientID &my_client_id = gcs_client_->client_table().GetLocalClientId();
const auto &local_resources =
Collaborator

@atumanov @eric-jj

There may be a bug here (or at least a change in behavior). Before, we called AssignTask inside the for loop, which presumably modified local_resources. However, now we call AssignTask at the very end, so the check if (!task_resources.IsSubset(local_resources)) { may give different results now.

I will try reverting some of this and see if that makes the problem I'm seeing go away.

Contributor Author

I will investigate it.

Contributor

@atumanov atumanov May 20, 2018

@robertnishihara good catch. That's the reason why I didn't batch the AssignTask calls before.
The line this->cluster_resource_map_[my_client_id].Acquire(spec.GetRequiredResources()) acquires resources when the task is assigned. This ensures that the locally available resources are updated on each iteration of the for loop.
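A toy model of the behavior change, assuming each task demands a number of CPUs from a single integer pool. DispatchEagerly and DispatchBatched are hypothetical names, and the integer pool stands in for the real resource set.

```cpp
#include <cstddef>
#include <vector>

// Check and acquire in the same iteration, as the original loop did by
// calling AssignTask inside the loop. Returns the dispatched task indices.
std::vector<int> DispatchEagerly(const std::vector<int> &demands, int available) {
  std::vector<int> dispatched;
  for (std::size_t i = 0; i < demands.size(); ++i) {
    if (demands[i] > available) {
      continue;  // not enough resources left for this task
    }
    available -= demands[i];  // acquire immediately
    dispatched.push_back(static_cast<int>(i));
  }
  return dispatched;
}

// Batched variant: every check sees the initial pool because nothing is
// acquired until after the loop, so the selection can over-commit.
std::vector<int> DispatchBatched(const std::vector<int> &demands, int available) {
  std::vector<int> dispatched;
  for (std::size_t i = 0; i < demands.size(); ++i) {
    if (demands[i] > available) {
      continue;
    }
    dispatched.push_back(static_cast<int>(i));
  }
  return dispatched;
}
```

With demands of {2, 2, 2} and 4 CPUs available, the eager version dispatches tasks 0 and 1, while the batched version selects all three even though only two fit.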

Contributor Author

Yes, you are right. I have reverted it.

Contributor Author

I have no idea why I can't reopen this pull request after rebasing onto the master branch, so I have created a new pull request.

@eric-jj eric-jj closed this May 20, 2018
@eric-jj eric-jj mentioned this pull request May 20, 2018