[Dashboard] Dashboard basic modules #10303
Conversation
rkooo567
left a comment
LGTM in general (and nice addition of tests :)), but since the PR is really large, I will do another batch of review later. Next time when you submit a PR, please break it down (this one could have been split into 3 PRs, imo).
    nodes = await self._get_nodes()
    ...
    # Get correct node info by state,
    # 1. The node is ALIVE if any ALIVE node info
Can you explain the reason in the comment too?
GetAllNodeInfo returns info for all nodes, including dead ones, so we need a rule for merging entries that share a hostname. For example, suppose there are two node info entries:
- {node id: 1, hostname: example}
- {node id: 2, hostname: example}
Which one do we choose for host example?
For one hostname, there will be a list of node info entries with only two possible cases:
- All node info entries for the hostname are DEAD
- Exactly one node info entry for the hostname is ALIVE
So the rule is:
- Choose a DEAD one if all node info entries for the hostname are DEAD.
- Choose the ALIVE one if there is an ALIVE entry for the hostname.
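A minimal sketch of that rule, assuming each node info entry is a dict with "hostname" and "state" fields (the names are assumptions, not the exact dashboard code):

```python
from collections import defaultdict

def merge_nodes_by_hostname(all_node_info):
    """Pick one representative node info entry per hostname.

    If any entry for a hostname is ALIVE, use it (there is at most one);
    otherwise every entry is DEAD and any of them can be chosen.
    """
    by_hostname = defaultdict(list)
    for node in all_node_info:
        by_hostname[node["hostname"]].append(node)

    merged = {}
    for hostname, nodes in by_hostname.items():
        alive = [n for n in nodes if n["state"] == "ALIVE"]
        # ALIVE entry wins; otherwise fall back to one of the DEAD entries.
        merged[hostname] = alive[0] if alive else nodes[0]
    return merged
```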
It's better to add a timestamp to GcsNodeInfo.
Hmm, I see. So this is due to the case where there could be multiple raylets on a single host. I think this should only happen when we use cluster_utils, right? Or do you know of any other case? It doesn't really make sense to run multiple `ray start` instances on a single node.
If this is only useful for cluster_utils, why don't we just group by hostname + node_id and display both in the dashboard? Is there any issue with this?
If a node fails over many times, there will be a lot of node info entries belonging to one hostname.
I think we can easily hide the second option's downside by filtering DEAD node IDs out of the user-facing APIs.
    {
      node_id: {
        ip: ...,
        host: ...,
        state: ...
      }
    }
And in the frontend, we can just filter out node IDs in the DEAD state when there are entries with the same hostname + IP pair.
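A hedged sketch of that filtering step, written in Python for illustration (the real filtering would live in the front-end; the field names follow the shape above and are assumptions):

```python
def filter_dead_duplicates(nodes_by_id):
    """Drop DEAD entries whose (host, ip) pair also has an ALIVE entry."""
    alive_keys = {
        (info["host"], info["ip"])
        for info in nodes_by_id.values()
        if info["state"] == "ALIVE"
    }
    return {
        node_id: info
        for node_id, info in nodes_by_id.items()
        if info["state"] == "ALIVE" or (info["host"], info["ip"]) not in alive_keys
    }
```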
@rkooo567 I see. Using node id as the key of node info will return the full node info to the front-end, and the front-end will do the filtering. My concern is that the data will be too large if we run many nodes in one cluster. For a DEAD node, the physical node info is useless. If we choose node id as the key, we need some strategies to reduce the data size (see the sketch after this list):
- Only keep GcsNodeInfo data for DEAD nodes.
- Each (IP + hostname) pair keeps a limited number of DEAD nodes, for example the last 5 DEAD nodes.
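A hedged sketch of those two strategies combined (the class, cap value, and field names are assumptions for illustration, not the dashboard's actual data model):

```python
from collections import defaultdict, deque

MAX_DEAD_NODES_PER_HOST = 5  # assumed limit; could be made configurable

class DeadNodeCache:
    """Keep only GcsNodeInfo for DEAD nodes, capped per (ip, hostname) pair."""

    def __init__(self, max_dead=MAX_DEAD_NODES_PER_HOST):
        # deque(maxlen=...) drops the oldest DEAD entry once the cap is hit.
        self._dead = defaultdict(lambda: deque(maxlen=max_dead))

    def add_dead(self, gcs_node_info):
        # gcs_node_info is assumed to be a dict with "ip" and "hostname" keys.
        key = (gcs_node_info["ip"], gcs_node_info["hostname"])
        self._dead[key].append(gcs_node_info)

    def dead_nodes(self):
        return [info for entries in self._dead.values() for info in entries]
```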
I see your concerns (since you guys are running long-running clusters, they can accumulate lots of information). I like the second solution, but we can allow users to configure the limit. So, for regular usage, it could be 0~2, and for cluster_utils cases, it could be something like 10.
Regarding "Only keep GcsNodeInfo data for DEAD nodes": I am probably not familiar with how the data is represented now (I thought this was already the current behavior). Can you explain a little bit more?
The current node info is a mixture of node physical stats (from the reporter agent) and node stats (from the GetNodeStats rpc). If a node is dead, the node physical stats and node stats are unreliable; only GcsNodeInfo is correct.
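A rough sketch of how that mixture might be assembled, assuming dict-shaped inputs (the function and field names here are illustrative, not the actual dashboard code):

```python
def build_node_entry(gcs_node_info, physical_stats=None, node_stats=None):
    """Combine the three sources of node data into one entry.

    For an ALIVE node all three sources are usable; for a DEAD node only
    gcs_node_info is reliable, so the other two are dropped.
    """
    entry = dict(gcs_node_info)
    if gcs_node_info.get("state") == "ALIVE":
        entry["physicalStats"] = physical_stats or {}  # from the reporter agent
        entry["nodeStats"] = node_stats or {}          # from the GetNodeStats rpc
    return entry
```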
mfitton
left a comment
Overall, this looks really good. I had a couple questions, and a couple other pieces of minor feedback, but hopefully it'll be quick to address. Thanks for all your hard work on this @fyrestone
Happy to approve once you address or reply to my comments.
    nodes = await self._get_nodes()
    ...
    # Get correct node info by state,
    # 1. The node is ALIVE if any ALIVE node info
@fyrestone I think it's possible for two machines to have the same hostname even when neither of them is dead. @architkulkarni is currently working on a bug fix for this case in the old dashboard.
    nodes = await self._get_nodes()
    ...
    # Get correct node info by state,
    # 1. The node is ALIVE if any ALIVE node info
Do you think it would make sense to use something like an (IP, hostname) pair as a key to uniquely describe a host?
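For example, something like this (a toy sketch with assumed field names, not the actual dashboard code):

```python
def host_key(node_info):
    # An (ip, hostname) pair keeps two machines that happen to share a
    # hostname from being merged into one entry.
    return (node_info["ip"], node_info["hostname"])

nodes = [
    {"ip": "10.0.0.1", "hostname": "worker", "state": "ALIVE"},
    {"ip": "10.0.0.2", "hostname": "worker", "state": "ALIVE"},
]
nodes_by_host = {host_key(n): n for n in nodes}  # two distinct keys
```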
Co-authored-by: Max Fitton <mfitton@berkeley.edu>
rkooo567
left a comment
I think the only concern I have is how we will group nodes. I prefer to make it support cluster_utils because we will have more distributed data collection we'd like to test in the future (for example, we will remove all GetCoreWorkerStats and replace it with our metrics infra, which is hard to test without cluster_utils).
But I think this is not a hard blocker for this PR (it should probably be the next dashboard task, though).
Maybe another solution is to just refactor our node.py to contain the hostname. That way, we can inject a fake hostname for cluster_utils.
I agree with you. I can make a PR after we find a solution.
mfitton
left a comment
Looks good. It seems to me like the test failures are unrelated, and I agree with Sang that the PR shouldn't be blocked by the node grouping problem.
@fyrestone Please resolve the merge conflict and tag
I have resolved the conflicts, and set the
Thanks again for this PR @fyrestone! It looks really clean, and I am really happy about the testing status. Would you mind pushing a PR for the node grouping next (if you are planning to work on that)?
Thanks. I will create the node grouping PR this week. The
This PR includes 3 basic modules for the new dashboard:
- reporter.py: the profiling RESTful API has been simplified to one `GET /api/launch_profiling`.
- The log module: `GET /log_index` and `GET /logs` from the dashboard, and `GET /logs` from the dashboard agent.
- The node module: `GET /nodes?view=[summary|hostNameList]` and `GET /nodes/{hostname}` (example requests below).
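A minimal sketch of querying the node endpoints, assuming the dashboard is served at the default address http://localhost:8265 and using the requests library (both are assumptions about the local setup, not part of this PR):

```python
import requests

DASHBOARD = "http://localhost:8265"  # assumed default dashboard address

# Summary view of all nodes.
print(requests.get(f"{DASHBOARD}/nodes", params={"view": "summary"}).json())

# Hostname list only.
print(requests.get(f"{DASHBOARD}/nodes", params={"view": "hostNameList"}).json())

# Detail for a single host; "example-host" is a placeholder hostname.
print(requests.get(f"{DASHBOARD}/nodes/example-host").json())
```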
Related issue number

Checks
- I've run `scripts/format.sh` to lint the changes in this PR.