
Broadcast op cannot be created inside name scope #13

Closed
ppwwyyxx opened this issue Jun 27, 2019 · 10 comments
Labels
enhancement New feature or request

Comments

@ppwwyyxx

Describe the bug
I can run the example synthetic_benchmark.py successfully after fixing some previously reported bugs. However, after making this change:

-bcast_op = bps.broadcast_global_variables(0)
+with tf.name_scope("test"):
+    bcast_op = bps.broadcast_global_variables(0)

And running it again:

python3 byteps/launcher/launch.py python3 byteps/example/tensorflow/synthetic_benchmark.py

It produces the error:

2019-06-27 12:19:26.837903: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
[2019-06-27 12:19:27.271525: F byteps/common/global.cc:265] Check failed: _name_to_cxt.find(name) != _name_to_cxt.end() test/BytePSBroadcast_bn4b_branch2a_gamma_0 is not initialized

It looks like when the broadcast op is created inside a name scope, the tensor is declared in BytePS without the scope prefix, but is then looked up with the prefix, causing a mismatch.
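The mismatch above can be modeled in plain Python (this is an illustrative stand-in, not the real TensorFlow or BytePS API, and the tensor name is a shortened hypothetical one):

```python
from contextlib import contextmanager

# Toy model of TF graph name scoping: a stack of active scope names.
_scope_stack = []

@contextmanager
def name_scope(name):
    _scope_stack.append(name)
    try:
        yield
    finally:
        _scope_stack.pop()

def full_name(op_name):
    # TF joins the active scopes and the op name with "/".
    return "/".join(_scope_stack + [op_name])

# Declared outside any scope (how BytePS registered the tensor)...
declared = {full_name("BytePSBroadcast_gamma_0")}

# ...but looked up from inside the scope (how the op sees its own name).
with name_scope("test"):
    looked_up = full_name("BytePSBroadcast_gamma_0")

assert looked_up == "test/BytePSBroadcast_gamma_0"
assert looked_up not in declared  # the mismatch behind the check failure
```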

@bobzhuyb
Member

This is tricky. I can add a "scope" argument to broadcast_global_variables(), just like how we handle scope in push_pull(). Do you think it's good enough?

In general, we expected users to call broadcast_global_variables() only once at the very beginning, and didn't expect it to be called inside a scope.

@ppwwyyxx
Author

Calling bps.push_pull in a name scope can cause the same issue.
Name scopes are not necessary, but they help me organize the graph. I ran into this issue when porting working Horovod code to BytePS.

May I know why supporting name scopes is tricky? With https://www.tensorflow.org/api_docs/python/tf/Graph#get_name_scope, supporting name scopes seems just as easy as supporting an extra scope argument.

@bobzhuyb
Member

bobzhuyb commented Jun 27, 2019

"Calling bps.push_pull in a name scope can cause the same issue." It should not have an issue if you pass in the scope. Like here -- https://github.com/bytedance/byteps/blob/master/byteps/tensorflow/__init__.py#L161

I looked at get_name_scope() before, but found it must be called on a graph. Is get_default_graph() always correct? I thought there could be multiple graphs. Probably I just don't understand TF well enough...

@bobzhuyb bobzhuyb added the enhancement New feature or request label Jun 27, 2019
@bobzhuyb
Member

bobzhuyb commented Jun 27, 2019

I said it's tricky, because we must make sure the name here https://github.com/bytedance/byteps/blob/master/byteps/tensorflow/ops.cc#L182 is consistent with how the tensors were declared.

Is it possible to have a tensor declared in scope A, but the broadcast op is in another scope B? It seems to me that your case is exactly this.

I agree that if I can get the scope anywhere, I can declare the tensor in scope B instead of A. If you are sure (and I'll double-check) that get_name_scope() always gives the correct result, we can use it to handle the scope problem.

@ppwwyyxx
Author

tf.get_default_graph().get_name_scope() should be safe to use. The default graph is what with tf.name_scope() ultimately operates on, and it is also the graph into which the push_pull/broadcast ops will be added.

The above is only valid in graph mode and I have no experience with eager mode, though.

@bobzhuyb
Member

Okay. We'll make the change and have some tests.

Before it is merged, I hope this won't block you. You just need to pass in the scope here https://github.com/bytedance/byteps/blob/master/byteps/tensorflow/ops.py#L111 as a temporary workaround.

@bobzhuyb
Member

I believe all the bugs you found have been fixed and merged. Hopefully it will work for you without modification this time.

@ppwwyyxx
Author

The same bug still exists. It still fails with:

[2019-06-28 10:04:20.201094: F byteps/common/global.cc:268] Check failed: _name_to_cxt.find(name) != _name_to_cxt.end() test/BytePSBroadcast_bn4e_branch2c_gamma_0 is not initialized

When I set LOG_LEVEL to debug, I can see:

[2019-06-28 10:04:19. 54103: D byteps/common/global.cc:278] Declared tensor testBytePSBroadcast_res5a_branch1_bias_0, declared key (not PS key): 277 rank=1           
[2019-06-28 10:04:19. 61103: D byteps/common/global.cc:278] Declared tensor testBytePSBroadcast_bn5a_branch1_gamma_0, declared key (not PS key): 278 rank=1     
...

So the only mismatch is the "/": the tensor is declared as testBytePSBroadcast_... (scope concatenated without a separator), but looked up as test/BytePSBroadcast_... (with the separator).
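The missing-separator mismatch visible in the logs can be reproduced with two string concatenations (plain Python, tensor name taken from the error message above):

```python
scope = "test"
tensor = "BytePSBroadcast_bn4e_branch2c_gamma_0"

# What the debug log shows was declared: scope joined without a separator.
declared_key = scope + tensor
# What the op looks up at runtime: TF inserts "/" between scope and name.
lookup_key = scope + "/" + tensor

assert declared_key == "testBytePSBroadcast_bn4e_branch2c_gamma_0"
assert lookup_key == "test/BytePSBroadcast_bn4e_branch2c_gamma_0"
assert declared_key != lookup_key  # hence "is not initialized"
```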

@bobzhuyb
Member

You are right. See #24

@bobzhuyb
Member

bobzhuyb commented Jul 2, 2019

We believe that this problem has been resolved. Feel free to reopen it if you have further questions.

@bobzhuyb bobzhuyb closed this as completed Jul 2, 2019
pleasantrabbit pushed a commit that referenced this issue Jul 13, 2020
pleasantrabbit pushed a commit that referenced this issue Nov 3, 2020
* basic worker->server

* basic server->worker

* finish GetSharedMemory()

* bugfix for recvmsg

* fix reserved_context

* fix repeated release

* can run 1v1 with very large partition bytes

* improve env check

* fix GetSharedMemory and clean

* fix seg fault

* add async copy

* fix compile

* join threads when shutdown

* quick fix

* add ipc benchmark

* add ipc benchmark again

* fix ipc benchmark

* fix 2v2