[Dashboard] Dashboard basic modules #10303
Conversation
rkooo567
left a comment
LGTM in general (and nice addition of tests :)), but since the PR is really large, I will do another batch of review later. Next time when you submit a PR, please break it down (this one could have been split into 3 PRs, imo).
    nodes = await self._get_nodes()
    ...
    # Get correct node info by state,
    # 1. The node is ALIVE if any ALIVE node info
Can you explain the reason in the comment too?
GetAllNodeInfo returns info for all nodes, including dead ones, so we need a rule for merging entries that share a hostname. For example, suppose there are two node info entries:
- {node id: 1, hostname: example}
- {node id: 2, hostname: example}
Which one do we choose for host example?
For one hostname, there will be a list of node info entries with only two possible cases:
- All node info entries for the hostname are DEAD
- Exactly one node info entry for the hostname is ALIVE
So the rule is:
- Choose a DEAD one if all node info entries for the hostname are DEAD.
- Choose the ALIVE one if there is an ALIVE entry for the hostname.
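A minimal sketch of that rule, assuming each node info entry is a dict with "hostname" and "state" fields (the names are assumptions, not the exact dashboard code):

```python
from collections import defaultdict

def merge_nodes_by_hostname(all_node_info):
    """Pick one representative node info entry per hostname.

    If any entry for a hostname is ALIVE, use it (there is at most one);
    otherwise every entry is DEAD and any of them can be chosen.
    """
    by_hostname = defaultdict(list)
    for node in all_node_info:
        by_hostname[node["hostname"]].append(node)

    merged = {}
    for hostname, nodes in by_hostname.items():
        alive = [n for n in nodes if n["state"] == "ALIVE"]
        # ALIVE entry wins; otherwise fall back to one of the DEAD entries.
        merged[hostname] = alive[0] if alive else nodes[0]
    return merged
```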
It's better to add a timestamp to GcsNodeInfo.
Hmm, I see. So this is due to the case where there could be multiple raylets on a single host. I think this should only happen when we use cluster_utils, right? Or do you know of any other case? It doesn't really make sense to run multiple `ray start` instances on a single node.
If this is only useful for cluster_utils, why don't we just group by hostname + node_id and display both in the dashboard? Is there any issue with this?
If a node fails over many times, there will be a lot of node info entries belonging to one hostname.
I think we can easily hide the second option's downside by filtering DEAD node IDs out of the user-facing APIs.
    {
      node_id: {
        ip: ...,
        host: ...,
        state: ...
      }
    }
And in the frontend, we can just filter out node IDs in the DEAD state when there are entries with the same hostname + IP pair.
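A hedged sketch of that filtering step, written in Python for illustration (the real filtering would live in the front-end; the field names follow the shape above and are assumptions):

```python
def filter_dead_duplicates(nodes_by_id):
    """Drop DEAD entries whose (host, ip) pair also has an ALIVE entry."""
    alive_keys = {
        (info["host"], info["ip"])
        for info in nodes_by_id.values()
        if info["state"] == "ALIVE"
    }
    return {
        node_id: info
        for node_id, info in nodes_by_id.items()
        if info["state"] == "ALIVE" or (info["host"], info["ip"]) not in alive_keys
    }
```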
@rkooo567 I see. Using node id as the key of node info will return the full node info to the front-end, and the front-end will do the filtering. My concern is that the data will be too large if we run many nodes in one cluster. For a DEAD node, the physical node info is useless. If we choose node id as the key, we need some strategies to reduce the data size (see the sketch after this list):
- Only keep GcsNodeInfo data for DEAD nodes.
- Each (IP + hostname) pair keeps a limited number of DEAD nodes, for example the last 5 DEAD nodes.
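A hedged sketch of those two strategies combined (the class, cap value, and field names are assumptions for illustration, not the dashboard's actual data model):

```python
from collections import defaultdict, deque

MAX_DEAD_NODES_PER_HOST = 5  # assumed limit; could be made configurable

class DeadNodeCache:
    """Keep only GcsNodeInfo for DEAD nodes, capped per (ip, hostname) pair."""

    def __init__(self, max_dead=MAX_DEAD_NODES_PER_HOST):
        # deque(maxlen=...) drops the oldest DEAD entry once the cap is hit.
        self._dead = defaultdict(lambda: deque(maxlen=max_dead))

    def add_dead(self, gcs_node_info):
        # gcs_node_info is assumed to be a dict with "ip" and "hostname" keys.
        key = (gcs_node_info["ip"], gcs_node_info["hostname"])
        self._dead[key].append(gcs_node_info)

    def dead_nodes(self):
        return [info for entries in self._dead.values() for info in entries]
```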
I see your concerns (since you guys are running long-running clusters, they can accumulate lots of information). I like the second solution, but we can allow users to configure the limit. So, for regular usage, it could be 0~2, and for cluster_utils cases, it could be something like 10.
Regarding "Only keep GcsNodeInfo data for DEAD nodes": I am probably not familiar with how the data is represented now (I thought this was already the current behavior). Can you explain a little bit more?
The current node info is a mixture of node physical stats (from the reporter agent) and node stats (from the GetNodeStats rpc). If a node is dead, the node physical stats and node stats are unreliable; only GcsNodeInfo is correct.
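A rough sketch of how that mixture might be assembled, assuming dict-shaped inputs (the function and field names here are illustrative, not the actual dashboard code):

```python
def build_node_entry(gcs_node_info, physical_stats=None, node_stats=None):
    """Combine the three sources of node data into one entry.

    For an ALIVE node all three sources are usable; for a DEAD node only
    gcs_node_info is reliable, so the other two are dropped.
    """
    entry = dict(gcs_node_info)
    if gcs_node_info.get("state") == "ALIVE":
        entry["physicalStats"] = physical_stats or {}  # from the reporter agent
        entry["nodeStats"] = node_stats or {}          # from the GetNodeStats rpc
    return entry
```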
mfitton
left a comment
Overall, this looks really good. I had a couple questions, and a couple other pieces of minor feedback, but hopefully it'll be quick to address. Thanks for all your hard work on this @fyrestone
Happy to approve once you address or reply to my comments.
    nodes = await self._get_nodes()
    ...
    # Get correct node info by state,
    # 1. The node is ALIVE if any ALIVE node info
@fyrestone I think it's possible for two machines to have the same hostname even when neither of them is dead. @architkulkarni is currently working on a bug fix for this case in the old dashboard.
    nodes = await self._get_nodes()
    ...
    # Get correct node info by state,
    # 1. The node is ALIVE if any ALIVE node info
Do you think it would make sense to use something like an (IP, hostname) pair as a key to uniquely describe a host?
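For example, something like this (a toy sketch with assumed field names, not the actual dashboard code):

```python
def host_key(node_info):
    # An (ip, hostname) pair keeps two machines that happen to share a
    # hostname from being merged into one entry.
    return (node_info["ip"], node_info["hostname"])

nodes = [
    {"ip": "10.0.0.1", "hostname": "worker", "state": "ALIVE"},
    {"ip": "10.0.0.2", "hostname": "worker", "state": "ALIVE"},
]
nodes_by_host = {host_key(n): n for n in nodes}  # two distinct keys
```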
Co-authored-by: Max Fitton <mfitton@berkeley.edu>
rkooo567
left a comment
I think the only concern I have is how we will group nodes. I prefer to make it support cluster_utils because we will have more distributed data collection we'd like to test in the future (for example, we will remove all GetCoreWorkerStats and replace it with our metrics infra, which is hard to test without cluster_utils).
But I think this is not a hard blocker for this PR (it should probably be the next dashboard task, though).
Maybe another solution is to just refactor our node.py to contain the hostname. That way, we can inject a fake hostname for cluster_utils.
I agree with you. I can make a PR after we find a solution.
mfitton
left a comment
Looks good. It seems to me like the test failures are unrelated, and I agree with Sang that the PR shouldn't be blocked by the node grouping problem.
@fyrestone Please resolve the merge conflict and tag
I have resolved the conflicts, and set the
Thanks again for this PR @fyrestone! It looks really clean, and I am really happy about the testing status. Would you mind pushing a PR for the node grouping next (if you are planning to work on that)?
Thanks. I will create the node grouping PR this week. The
This PR includes 3 basic modules for the new dashboard:
- reporter.py: the profiling RESTful API has been simplified to one `GET /api/launch_profiling`.
- The log module: `GET /log_index` and `GET /logs` from the dashboard, and `GET /logs` from the dashboard agent.
- The node module: `GET /nodes?view=[summary|hostNameList]` and `GET /nodes/{hostname}` (example requests below).
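A minimal sketch of querying the node endpoints, assuming the dashboard is served at the default address http://localhost:8265 and using the requests library (both are assumptions about the local setup, not part of this PR):

```python
import requests

DASHBOARD = "http://localhost:8265"  # assumed default dashboard address

# Summary view of all nodes.
print(requests.get(f"{DASHBOARD}/nodes", params={"view": "summary"}).json())

# Hostname list only.
print(requests.get(f"{DASHBOARD}/nodes", params={"view": "hostNameList"}).json())

# Detail for a single host; "example-host" is a placeholder hostname.
print(requests.get(f"{DASHBOARD}/nodes/example-host").json())
```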
Related issue number

Checks
- I've run `scripts/format.sh` to lint the changes in this PR.