Improve the performance and scalability of pod viewer #2254

Merged 7 commits on May 24, 2019

qiuminxu
Contributor

  • Motivation for features / changes
    When loading a large trace (~50MB), the pod viewer takes more than 30s to render and crashes the browser when the step or metric id is changed. This change reworks the data format and the rendering logic, reducing rendering and scripting time to ~1s.

  • Technical description of changes

  1. Previously, send/recv links were saved for each replica. That doesn't scale for a 4-way spatial partitioning model on a TPU v3-2048 pod, which has 512 replicas. In data processing (C++), we now save only stats aggregated across all replicas, which reduces the size of the result JSON. Instead of a srcCoreId and dstCoreId for each replica, we save a list of srcCoreIds and dstCoreIds covering all replicas for each channel.
    proto.ts reflects this change.
  2. Another proto change deprecates crsDurationUs and replaces it with allReduceComputeDurationUs and allReduceSyncDurationUs.
  3. Changed details-card.ts, details-card.html and pod-viewer-dashboard.ts to stay backward compatible with traces produced before those proto changes.
  4. Improve TopologyGraph scalability
    a. _computeTopoData is no longer triggered by changes in selectedMetricIdx. When selectedMetricIdx changes, we only change the color of the cards instead of redrawing them.
    b. Remove the channel selection in topology-graph. The user can hover over the channel bars to select a channel.
    c. Previously, all links were drawn up front and their visibility was toggled, which created too many DOM elements and crashed the browser. This change draws only the links of the selected channel id.
  5. Fix the d3 merge and update logic (the selection was not on the right elements), and avoid redrawing stack-bar-chart and topology-graph from scratch.
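
The proto changes in items 1 to 3 and the link filtering in item 4c can be sketched roughly as follows. This is an illustration only, not the code from this PR: the interface names, the `channelId` field, and the helpers `allReduceTotalUs` and `linksForChannel` are hypothetical, and it assumes that `srcCoreIds[i]` pairs with `dstCoreIds[i]` and that the legacy `crsDurationUs` can stand in for the combined all-reduce time on old traces. Only the field names quoted in the description above come from the PR.

```typescript
// Hypothetical record shapes. Old traces carry a single srcCoreId/dstCoreId
// and crsDurationUs; new traces carry srcCoreIds/dstCoreIds and the two
// allReduce*DurationUs fields.
interface ChannelInfo {
  channelId: number;
  srcCoreId?: number;
  dstCoreId?: number;
  srcCoreIds?: number[];
  dstCoreIds?: number[];
}

interface StepBreakdown {
  crsDurationUs?: number;
  allReduceComputeDurationUs?: number;
  allReduceSyncDurationUs?: number;
}

interface Link {
  channelId: number;
  srcCoreId: number;
  dstCoreId: number;
}

// Prefer the new fields; fall back to the deprecated one for old traces.
function allReduceTotalUs(step: StepBreakdown): number {
  const hasNewFields =
    step.allReduceComputeDurationUs !== undefined ||
    step.allReduceSyncDurationUs !== undefined;
  if (hasNewFields) {
    return (step.allReduceComputeDurationUs || 0) +
      (step.allReduceSyncDurationUs || 0);
  }
  return step.crsDurationUs || 0;
}

// Build links only for the selected channel, instead of creating one DOM
// element per link for every channel and toggling visibility.
function linksForChannel(
  channels: ChannelInfo[],
  selectedChannelId: number
): Link[] {
  const links: Link[] = [];
  for (const channel of channels) {
    if (channel.channelId !== selectedChannelId) continue;
    // Normalize old single-id records into one-element lists.
    const srcs = channel.srcCoreIds !== undefined
      ? channel.srcCoreIds
      : channel.srcCoreId !== undefined ? [channel.srcCoreId] : [];
    const dsts = channel.dstCoreIds !== undefined
      ? channel.dstCoreIds
      : channel.dstCoreId !== undefined ? [channel.dstCoreId] : [];
    // Assumption: the two lists are index-aligned.
    const n = Math.min(srcs.length, dsts.length);
    for (let i = 0; i < n; i++) {
      links.push({
        channelId: channel.channelId,
        srcCoreId: srcs[i],
        dstCoreId: dsts[i],
      });
    }
  }
  return links;
}
```

With this shape, the number of link elements to draw depends only on the selected channel rather than on the total number of channels and replicas in the trace.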
  • Screenshots of UI changes

  • Detailed steps to verify changes work correctly (as executed by you)
    bazel run :tensorboard -- --logdir=gs://cloud-tpu-tools-df
    Select the test run and the pod viewer tool. The 'new' host has the new trace; the other hosts have old traces.

  • Alternate designs / implementations considered

@googlebot

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.


@googlebot

CLAs look good, thanks!


@qiuminxu qiuminxu requested a review from stephanwlee May 21, 2019 17:10
@qiuminxu qiuminxu requested a review from stephanwlee May 24, 2019 01:37
@stephanwlee stephanwlee merged commit 2101d87 into tensorflow:master May 24, 2019