Broadcast op cannot be created inside name scope #13
Comments
This is tricky. I can add a "scope" argument to broadcast_global_variables(), just as we handle the scope in push_pull(). Do you think that is good enough? In general, we expected users to call broadcast_global_variables() only once at the very beginning, and did not expect them to call it inside a scope.
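For illustration, a minimal sketch of what the proposed `scope` argument might look like. This is not the actual BytePS code: the `broadcast` helper below is a stand-in, and the way the scope is consumed is an assumption.

```python
import tensorflow as tf  # TF1 / graph-mode semantics, per this discussion

def broadcast(tensor, root_rank, scope=''):
    # Stand-in for BytePS's per-tensor broadcast op; a real implementation
    # would register the tensor under `scope` + its name with the BytePS core.
    return tensor

def broadcast_global_variables(root_rank, scope=''):
    # The proposed `scope` argument is threaded down so each variable is
    # declared under the same prefix it will later be looked up with,
    # mirroring how push_pull already handles scopes.
    return tf.group(*[tf.assign(var, broadcast(var, root_rank, scope=scope))
                      for var in tf.global_variables()])
```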
Calling bps.push_pull in a name scope can cause the same issue. May I know why supporting the name scope is tricky? With https://www.tensorflow.org/api_docs/python/tf/Graph#get_name_scope, supporting the name scope seems just as easy as supporting an extra scope argument.
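For reference, a small graph-mode example of the API being suggested here:

```python
import tensorflow as tf  # TF1 / graph-mode semantics assumed

with tf.name_scope("outer"):
    with tf.name_scope("inner"):
        # Returns the active prefix of the current default graph,
        # e.g. "outer/inner", without needing an extra argument.
        print(tf.get_default_graph().get_name_scope())
```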
"Calling bps.push_pull in a name scope can cause the same issue." It should not have an issue if you pass in the scope. Like here -- https://github.com/bytedance/byteps/blob/master/byteps/tensorflow/__init__.py#L161 I looked at get_name_scope() before, but I found it must be called from a graph. Is get_default_graph() always correct? I thought there could be multiple graphs. Probably I just don't understand TF well enough... |
I said it's tricky because we must make sure the name here, https://github.com/bytedance/byteps/blob/master/byteps/tensorflow/ops.cc#L182, is consistent with how the tensors were declared. Is it possible to have a tensor declared in scope A while the broadcast op is in another scope B? It seems to me that your case is exactly this. I agree that if I can get the scope anywhere, I can declare the tensor in scope B instead of A. If you are sure (and I'll double-check) that get_name_scope() always gives the correct result, we can use it to handle the scope problem.
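The A-versus-B situation described above, as a small graph-mode sketch:

```python
import tensorflow as tf

with tf.name_scope("A"):
    v = tf.Variable(1.0, name="v")   # op name: "A/v"

with tf.name_scope("B"):
    # Ops created here get a "B/" prefix, so a broadcast op built in this
    # scope would be named under "B/..." while the tensor lives under "A/...".
    w = tf.identity(v, name="v")     # op name: "B/v"

print(v.op.name, w.op.name)          # prints: A/v B/v
```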
The above is only valid in graph mode, though; I have no experience with eager mode.
Okay. We'll make the change and add some tests. Before it is merged, I hope this won't block you: as a temporary workaround, you just need to pass in the scope here: https://github.com/bytedance/byteps/blob/master/byteps/tensorflow/ops.py#L111.
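A minimal sketch of that workaround, assuming the helper at the linked ops.py line is the per-tensor broadcast function and that it accepts a `scope` keyword; both the import path and the keyword are assumptions here, not confirmed API.

```python
import tensorflow as tf
from byteps.tensorflow.ops import broadcast  # import path assumed

with tf.name_scope("train"):
    v = tf.Variable(1.0, name="w")
    # Read the active scope off the graph and pass it in explicitly so the
    # tensor is declared under the same prefix it is later looked up with.
    scope = tf.get_default_graph().get_name_scope()
    bcast = broadcast(v, root_rank=0, scope=scope)
```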
I believe all the bugs you found have been fixed and merged. Hopefully it will work for you without modification this time.
The same bug still exists. It still fails with:
When I set LOG_LEVEL to debug, I can see:
So the only mismatch is a "/".
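One way a lone "/" can sneak into a declared name, assuming naive string joining of the scope and the tensor name; this is illustrative only, not the actual BytePS code, and the tensor name below is made up.

```python
def qualified_name(scope, name):
    # Naive join: an empty scope still contributes a leading "/",
    # producing a one-character mismatch like the one in the debug log.
    return scope + "/" + name

print(qualified_name("", "grad_0"))       # "/grad_0"
print(qualified_name("scope", "grad_0"))  # "scope/grad_0"
```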
You are right. See #24.
We believe that this problem has been resolved. Feel free to reopen it if you have further questions. |
Describe the bug
I can run the example synthetic_benchmark.py successfully after fixing some previously reported bugs. However, after making this change:

And running it again:
It produces the error:
It looks like when the broadcast op is created inside a name scope, the tensor is declared in BytePS without the scope but is then looked up with the scope, causing a mismatch.
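A hypothetical minimal reproduction of the change described above; the scope name "benchmark" and the exact call site are illustrative, not taken from the benchmark script.

```python
import tensorflow as tf
import byteps.tensorflow as bps

bps.init()

# Wrapping the broadcast in a name scope triggers the mismatch: the
# variables were declared without the "benchmark/" prefix, but the
# broadcast op looks them up with it.
with tf.name_scope("benchmark"):
    bcast_op = bps.broadcast_global_variables(0)
```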