Fast data loading feedback (`--load_fast=true`; “RustBoard”) #4784
Summary: We’d like to set `--load_fast=auto` as the default for TensorBoard 2.5. To make that less surprising, we now print an informational message when `--load_fast` is set to `auto` and the data server is actually used. We don’t show it with `--load_fast=true`; if you pass that, we assume that you know what you’re doing. The message looks like:

```
$ tensorboard --logdir /tmp/logs --bind_all --load_fast=auto
2021-03-17 11:41:51.151546: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-03-17 11:41:51.151567: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
NOTE: Using experimental fast data loading logic. To disable, pass "--load_fast=false" and report issues on GitHub. More details: #4784
TensorBoard 2.5.0a0 at http://localhost:6007/ (Press CTRL+C to quit)
```

Test Plan: Run with `--load_fast` set to `false`, `auto`, and `true`, and note that the message only appears when set to `auto`. Then uninstall the data server and run with `auto`, and note that the message does not appear.

wchargin-branch: cli-data-server-message
wchargin-source: ff24dc84b7b225b5351295c45d106f136933997a
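A minimal sketch of that test plan as shell commands (the logdir path is a placeholder, and uninstalling the `tensorboard-data-server` pip package is my assumption for what “uninstall the data server” means):

```sh
# Expect the NOTE only in the middle invocation (auto + data server installed).
tensorboard --logdir /tmp/logs --load_fast=false
tensorboard --logdir /tmp/logs --load_fast=auto
tensorboard --logdir /tmp/logs --load_fast=true

# Remove the data server, then confirm the NOTE no longer appears with auto.
pip uninstall -y tensorboard-data-server
tensorboard --logdir /tmp/logs --load_fast=auto
```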
Hello! Very much interested in this, as we currently maintain a custom entrypoint to make TensorBoard work at all with our data sizes. Unfortunately, I can't get this to work anywhere. Using the latest nightly Docker image I get the following error:
Presumably it tries to bind some port that's already in use by another process; unfortunately it doesn't say which one. Also, it doesn't seem to work with …
@tgolsson: Hi; thank you for your feedback! I hadn’t looked into Docker …
Edit: Fixed in #4804; confirmed fix in Docker nightlies.
Yep. As of #4794, if you use …
This is super helpful feedback; thank you.
With tensorboard-plugin-profile (2.4.0) installed, I'm getting errors in the log:
(They disappear with `--load_fast=false`.)
Hi @brychcy—thanks! Yes, this is true. The profile plugin uses …
Added a note to the “Known issues” section; thank you!
@brychcy: I’ve sent the profiler folks a patch: … Their build appears to be pretty broken, so I’m not sure how long it …
@wchargin Not quite feedback, but I'm wondering if there are any thoughts on multi-directory Rustboard (…)?
@tgolsson: Good question! I was thinking of instead supporting a more …
That is, you could add or remove log directories at runtime without …
Opened #4923 to track this, and would be happy to hear your thoughts.
I am getting a lot of warnings about too many open files -- is there a way to reduce or cap the number of open file descriptors?
I don't have that many runs (~2000), so it shouldn't really be an issue. Using `lsof` to count the number of open FDs shows over 12k being used, compared to <500 in "slow" mode.
In my case, the "slow" mode actually loads files faster since it doesn't run into this issue.
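One knob worth checking here is the per-process open-file limit; a minimal sketch with `ulimit` (the 65536 value is an arbitrary example, and whether a higher limit actually silences these warnings is an assumption):

```sh
# Show the current soft limit on open file descriptors.
ulimit -n

# Raise it for the current shell before launching TensorBoard.
ulimit -n 65536
tensorboard --logdir /path/to/logs --load_fast=true

# Count FDs held by a running TensorBoard process (replace <PID>).
lsof -p <PID> | wc -l
```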
Using … It works fine if I set …
Fast data loading may be causing issues with the profiler: tensorflow/profiler#344 (one of several issues mentioning this problem recently). A possible solution for now is to switch it off with `--load_fast=false`.
Update: try the latest Profiler plugin v2.5 (`tensorboard_plugin_profile` 2.5.0).
You're welcome, happy to help!
Anyone else landing here because they're following instructions from this link regarding using TensorBoard in AzureML?
Closing, as the issue has been resolved after I released tensorboard_plugin_profile 2.5.0.
Ah, we would like to keep this issue open to solicit more feedback on the feature. Reopening.
Hi, would you mind sharing a bit more information? I might be able to help, but I would need to know how to reproduce your issue. (I am replying here because I contributed to a similar issue in the past, but of course it is up to the repo owners to make the decision.) Thank you!
We have a fairly exotic setup, but you might be able to reproduce it by creating a GCE VM with a custom service account that has GCS permissions, then running …
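A rough sketch of that reproduction setup (all names are placeholders; the final invocation was elided above, so pointing TensorBoard at a private GCS logdir with fast loading is my assumption):

```sh
# Create a GCE VM that runs as a custom service account with read access to GCS.
gcloud compute instances create tb-repro \
  --service-account=my-sa@my-project.iam.gserviceaccount.com \
  --scopes=storage-ro

# SSH in, install TensorBoard, and point it at a private GCS logdir.
gcloud compute ssh tb-repro
pip install -U tb-nightly
tensorboard --logdir gs://my-private-bucket/logs --load_fast=true
```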
@samos123 I assume you meant GKE... The error should not be there, as I thought I fixed that. Could you please check which server version you are using? I guess something like …
GKE + Workload Identity would use a similar mechanism, and I would expect to have the same issue. We were using 2.8.0. Could you share the code where the authentication happens with …?
@samos123 sorry for the confusion, I didn't know that the GCE abbreviation exists. The PR was #5939; in particular, it gets a GCP access token using the …
On my Ubuntu 20.04.6 LTS NVIDIA A100 DGX, I cannot get fast loading to work:
That is all that I get.
Does not change it.
Important UPDATE after #6578: As of recent TensorBoard releases, on Ubuntu 20.04 Linux machines where the glibc version is 2.31, the rustboard server will fail to launch, trying to find glibc 2.32–2.34. Ubuntu 22.04 will be fine, as it ships with GLIBC 2.35.

Workaround: On Ubuntu 20.04, …

FYI, how to figure out the GLIBC version on the system:

    $ ldd --version | grep GLIB
    ldd (Ubuntu GLIBC 2.31-0ubuntu9.9) 2.31
    $ cat /etc/lsb-release | grep DESCRIPTION
    DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"

Verifying that the installed data-server binary indeed requires GLIBC 2.34 symbols:

    $ objdump -T $(python -c "from tensorboard_data_server import server_binary; print(server_binary())") | grep GLIBC
    ...
    0000000000000000  DF *UND*  0000000000000000  GLIBC_2.34 pthread_create
    0000000000000000  DF *UND*  0000000000000000  GLIBC_2.34 __libc_start_main

I'd like to kindly ask the TensorBoard team to lower the GLIBC requirement in future releases. I will open an issue if needed. -> #6578
@wookayin: Thanks for flagging. Yes, please open a new issue!
Hello, I'm very excited for this feature, as TensorBoard's speed has been a big pain point so far. BUT when I try to use it, it tells me it's not supported on macOS:
You say it is supported on macOS though, so what's going on here? I've got MacBookPro17,1; Apple M1 chip; macOS Ventura, version 13.4.1; tb-nightly version 2.15.0a20231013; tf-nightly-macos version 2.16.0.dev20231013.
P.S. I've gotten the same results using non-nightly tensorflow-macos & no TensorFlow at all. Also, I followed your instructions exactly to uninstall tensorboard & tb_nightly before reinstalling tb_nightly.
@profPlum: Hazarding a guess:
That's probably your problem. The …
If interested, you can build it yourself easily. I just tested it on my …
If you want to double-check that it's using the data server, you can …
(I don't currently work on TensorBoard, so consider this not an official …)
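For anyone curious, a sketch of building the data server yourself. This assumes the Rust crate still lives under `tensorboard/data/server` in the TensorBoard repo, that the built binary is named `rustboard`, and that the `TENSORBOARD_DATA_SERVER_BINARY` environment variable is how TensorBoard is pointed at a custom binary (an environment variable is mentioned later in this thread, but its exact name is my assumption; verify all of this against the current source):

```sh
# Build the Rust data server from source (crate location is an assumption).
git clone https://github.com/tensorflow/tensorboard
cd tensorboard/tensorboard/data/server
cargo build --release

# Point TensorBoard at the locally built binary (variable and binary names
# are assumptions), then launch with fast loading enabled.
export TENSORBOARD_DATA_SERVER_BINARY="$PWD/target/release/rustboard"
tensorboard --logdir ~/my_logs --load_fast=true
```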
@wchargin Thanks, I appreciate the help! (& I'll let you know if it works)
@wchargin Hi again, I tried your instructions verbatim and it says roughly the same:
But to clarify: did you want me to launch the original (pip) tensorboard again? That point confused me; it is what I did, but I'm not sure if it's what you meant.
P.S. With a fresh install of tb_nightly==2.15.0a20231019 & cargo version 1.73.0 (9c4383fb5 2023-08-26). Also, I got the same results in a Linux Docker container.
@wchargin Hi, I got an issue when running this: It shows a Google interface with: … Could you please guide me with this? Thanks
@Frn1nd0, your issue is unrelated to fast data loading. Instead, you have run into a recent regression in compatibility with Chrome. The Colab team have been investigating. We expect them to keep us updated at the following issue:
@bmd3k Thanks for the clarification, appreciate that! Hope they can fix this soon.
Update: #6578 is fixed; as of TensorBoard 2.15, the GLIBC minimum requirement is 2.29 (compatible with Ubuntu 20.04).
@profPlum: Oops, sorry, I wrote the environment variable wrong: it …
Yes.
Is this still a bug in recent versions? I can repro with version 2.11.2.
I would like to share my experience about how to solve the problem of …
My OS is …
It runs successfully, but when I run the following Python code, it outputs …:

    import tensorboard_data_server
    res = tensorboard_data_server.server_binary()
    print(res)

It seems that the binary of tensorboard-data-server is not installed properly.

    conda install tensorboard
    conda install chardet

After installation, I run the above Python code again, and it successfully outputs the path to the binary of tensorboard-data-server, like …
Finally, I can run …
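If you are on pip rather than conda, a hedged equivalent is to force-reinstall the `tensorboard-data-server` wheel and re-run the same check (whether this fixes the missing binary in every environment is an assumption):

```sh
pip install --force-reinstall tensorboard-data-server

# Should print a path to the data server binary rather than failing.
python -c "import tensorboard_data_server; print(tensorboard_data_server.server_binary())"
```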
2024-04-16 13:41:07.757664: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
It is not usable with Google gcsfuse. See #6790.
This thread is for tracking feedback about TensorBoard’s experimental `--load_fast=true` mode for fast data loading. Typical speedups range from 100× to 400×.
Who should try this: Anyone who’s found TensorBoard’s data loading to be slower than they’d like.
Who shouldn’t try this: Windows users (for now).
Feedback: Feedback form, or reply on this thread.
Try it out
To try this out, please uninstall all copies of TensorBoard and then install the latest version of `tb-nightly`. Then, invoke TensorBoard with the `--load_fast=true` flag. Use TensorBoard as you usually would. It should work the same way, just faster.
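A minimal sketch of those steps with pip (the original post's exact commands are not reproduced here; the logdir path is a placeholder):

```sh
# Remove any existing TensorBoard installs, then grab the nightly build.
pip uninstall -y tensorboard tb-nightly
pip install -U tb-nightly

# Launch with the fast data loading path enabled.
tensorboard --logdir /path/to/logs --load_fast=true
```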
Feedback
You can respond to this anonymous Google Form, or reply on this
thread, or open a new issue. Let us know: did it work? how much faster
was it? any suggestions or requests?
Known issues
We know about these, but please let us know if they matter for you, so
that we can prioritize working on them:
- Some plugins that load data through their own mechanisms do not yet work in this mode (e.g., the profile plugin).
FAQ
What does “data loading” include?
It includes time spent reading files in your logdir. It does not include
time spent painting charts on the frontend.
What is the `--load_fast` flag?
Pass `--load_fast=true` to tell TensorBoard to use a new data loading mechanism, which is generally hundreds of times faster.
Is `--load_fast=true` right for me?
Currently, this mode is supported on Linux and macOS. If you are
interested in using it on other platforms, ping @wchargin and I’ll show
you how to build it.
Most features of TensorBoard are expected to work with the new data
loading mechanism. All standard TensorBoard dashboards (scalars, images,
etc.) should work, and flags like `--reload_interval` should work, too.
You can use logdirs on local disk or on GCS buckets (public or private).
Do I need to have TensorFlow installed?
No.
What’s happening under the hood?
Instead of crawling your logdir in a mixture of Python and C++ code with
a lot of locking, cross-language marshalling, and slow data manipulation
in Python, we read the data in a dedicated subprocess. This program is
written in Rust and is optimized for concurrent reading and serving.
More design details here.
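One rough way to see the dedicated subprocess for yourself (the grep pattern assumes the server binary ships inside the `tensorboard_data_server` Python package, which may not match every install):

```sh
# With TensorBoard running under --load_fast=true, look for the Rust
# data server child process.
ps aux | grep -i tensorboard_data_server | grep -v grep
```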