[Dashboard] Dashboard basic modules #10303
**`bazel/python.bzl`**

```diff
@@ -1,12 +1,22 @@
 # py_test_module_list creates a py_test target for each
 # Python file in `files`
 def py_test_module_list(files, size, deps, extra_srcs, **kwargs):
     for file in files:
         # remove .py
         name = file[:-3]
         native.py_test(
-            name=name,
-            size=size,
-            srcs=extra_srcs+[file],
+            name = name,
+            size = size,
+            srcs = extra_srcs + [file],
             **kwargs
         )
+
+
+def py_test_run_all_subdirectory(include, exclude, extra_srcs, **kwargs):
+    for file in native.glob(include = include, exclude = exclude):
+        print(file)
+        basename = file.rpartition("/")[-1]
+        native.py_test(
+            name = basename[:-3],
+            srcs = extra_srcs + [file],
+            **kwargs
+        )
```
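Outside of Bazel, the target-name derivation both macros rely on (take the file's basename and strip the `.py` suffix) can be sketched in plain Python; the file list here is a hypothetical stand-in for the macro's `files`/`glob()` input.

```python
# Hypothetical glob result standing in for the macro's `files` input.
files = ["modules/log/test_log.py", "tests/test_dashboard.py"]

names = []
for file in files:
    basename = file.rpartition("/")[-1]  # e.g. "test_log.py"
    names.append(basename[:-3])          # strip trailing ".py"

print(names)  # → ['test_log', 'test_dashboard']
```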
**Dashboard `BUILD` file**

```diff
@@ -1,13 +1,19 @@
+load("//bazel:python.bzl", "py_test_run_all_subdirectory")

 # This is a dummy test dependency that causes the above tests to be
 # re-run if any of these files changes.
 py_library(
     name = "dashboard_lib",
-    srcs = glob(["**/*.py"],exclude=["tests/*"]),
+    srcs = glob(
+        ["**/*.py"],
+        exclude = ["tests/*"],
+    ),
 )

-py_test(
-    name = "test_dashboard",
+py_test_run_all_subdirectory(
     size = "small",
-    srcs = glob(["tests/*.py"]),
-    deps = [":dashboard_lib"]
+    include = ["**/test*.py"],
+    exclude = ["modules/test/**"],
+    extra_srcs = [],
     tags = ["exclusive"],
 )
```
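As a rough illustration of what the new `include`/`exclude` patterns select, here is a hedged Python approximation using `fnmatch`. Note this is not Bazel's `glob` (`fnmatch`'s `*` also matches `/`, unlike a Bazel path segment), and the file list is hypothetical.

```python
import fnmatch

# Hypothetical source tree; not the actual dashboard file list.
files = [
    "modules/log/tests/test_log.py",
    "modules/test/test_dummy.py",
    "tests/test_dashboard.py",
]
include = ["**/test*.py"]
exclude = ["modules/test/**"]

def matches(path, patterns):
    # NOTE: fnmatch's "*" crosses "/" too, unlike Bazel glob segments.
    return any(fnmatch.fnmatch(path, p) for p in patterns)

selected = [f for f in files if matches(f, include) and not matches(f, exclude)]
print(selected)
```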
**Dashboard head (`DashboardHead`)**

```diff
@@ -28,7 +28,8 @@ def gcs_node_info_to_dict(message):

 class DashboardHead:
-    def __init__(self, http_host, http_port, redis_address, redis_password):
+    def __init__(self, http_host, http_port, redis_address, redis_password,
+                 log_dir):
         # NodeInfoGcsService
         self._gcs_node_info_stub = None
         self._gcs_rpc_error_counter = 0
```

```diff
@@ -37,6 +38,7 @@ def __init__(self, http_host, http_port, redis_address, redis_password):
         self.http_port = http_port
         self.redis_address = dashboard_utils.address_tuple(redis_address)
         self.redis_password = redis_password
+        self.log_dir = log_dir
         self.aioredis_client = None
         self.aiogrpc_gcs_channel = None
         self.http_session = None
```

```diff
@@ -74,6 +76,22 @@ async def _update_nodes(self):
         while True:
             try:
                 nodes = await self._get_nodes()
+
+                # Get correct node info by state,
+                # 1. The node is ALIVE if any ALIVE node info
```
**Contributor:** Can you explain the reason in the comment too?

**Author:** We choose which one to use per host. For one hostname, there will be a list of node info with only two cases. So, here is the rule:

**Author:** It's better to add a timestamp to

**Contributor:** Hmm, I see. So this is due to the case where there could be multiple raylets on a single host. I think this should be the case only when we use cluster utils. If this is useful only for the cluster utils, why don't we just group by hostname + node_id and display both in the dashboard? Is there any issue with this?

**Author:** If a node fails over many times, there will be a lot of node info belonging to one hostname.

**Contributor:** I think we can hide the second option's con easily by filtering DEAD node IDs out of user-facing APIs.

**Contributor:** And in the frontend, we can just filter out DEAD-state node IDs if there are identical hostname + IP pairs.

**Author:** @rkooo567 I see. Using node ID as the key of node info returns the full node info to the front-end, and the front-end does the filtering. My concern is that the data will be too large if we run many nodes in one cluster. For a DEAD node, the physical node info is useless. If we choose

**Contributor:** I see your concerns (since you guys are running long-running clusters, it can have lots of information). I like the second solution, but we can allow users to configure them. So, for regular usage, it is 0~2, and for cluster_utils cases, it can be like 10. For this, I am probably not familiar with how the data is represented now (because I thought this was the current behavior). Can you explain a little bit more?

**Author:** Current node info is a mixture of
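The grouping rule the thread discusses (keep one node-info entry per hostname, with ALIVE winning over DEAD) can be sketched standalone; the node list below is hypothetical sample data, not real GCS output.

```python
# Hypothetical node info entries for two hosts; host-1 has failed over once.
nodes = [
    {"nodeManagerAddress": "host-1", "state": "DEAD"},
    {"nodeManagerAddress": "host-1", "state": "ALIVE"},
    {"nodeManagerAddress": "host-2", "state": "DEAD"},
]

hostname_to_node_info = {}
for node in nodes:
    choose = hostname_to_node_info.get(node["nodeManagerAddress"])
    if choose is not None and choose["state"] == "ALIVE":
        continue  # an ALIVE entry already wins for this hostname
    hostname_to_node_info[node["nodeManagerAddress"]] = node

result = {h: n["state"] for h, n in hostname_to_node_info.items()}
print(result)  # → {'host-1': 'ALIVE', 'host-2': 'DEAD'}
```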
```diff
+                # of the hostname exists.
+                # 2. The node is DEAD if all node info of the
+                #    hostname are DEAD.
+                hostname_to_node_info = {}
+                for node in nodes:
+                    hostname = node["nodeManagerAddress"]
+                    assert node["state"] in ["ALIVE", "DEAD"]
+                    choose = hostname_to_node_info.get(hostname)
+                    if choose is not None and choose["state"] == "ALIVE":
+                        continue
+                    hostname_to_node_info[hostname] = node
+                nodes = hostname_to_node_info.values()
+
                 self._gcs_rpc_error_counter = 0
                 node_ips = [node["nodeManagerAddress"] for node in nodes]
                 node_hostnames = [
```
@@ -83,13 +101,11 @@ async def _update_nodes(self): | |
| agents = dict(DataSource.agents) | ||
| for node in nodes: | ||
| node_ip = node["nodeManagerAddress"] | ||
| if node_ip not in agents: | ||
| key = "{}{}".format( | ||
| dashboard_consts.DASHBOARD_AGENT_PORT_PREFIX, | ||
| node_ip) | ||
| agent_port = await self.aioredis_client.get(key) | ||
| if agent_port: | ||
| agents[node_ip] = json.loads(agent_port) | ||
| key = "{}{}".format( | ||
| dashboard_consts.DASHBOARD_AGENT_PORT_PREFIX, node_ip) | ||
| agent_port = await self.aioredis_client.get(key) | ||
| if agent_port: | ||
| agents[node_ip] = json.loads(agent_port) | ||
| for ip in agents.keys() - set(node_ips): | ||
| agents.pop(ip, None) | ||
|
|
||
|
|
@@ -99,8 +115,8 @@ async def _update_nodes(self): | |
| dict(zip(node_hostnames, node_ips))) | ||
| DataSource.ip_to_hostname.reset( | ||
| dict(zip(node_ips, node_hostnames))) | ||
| except aiogrpc.AioRpcError as ex: | ||
| logger.exception(ex) | ||
| except aiogrpc.AioRpcError: | ||
| logger.exception("Got AioRpcError when updating nodes.") | ||
| self._gcs_rpc_error_counter += 1 | ||
| if self._gcs_rpc_error_counter > \ | ||
| dashboard_consts.MAX_COUNT_OF_GCS_RPC_ERROR: | ||
|
|
@@ -109,8 +125,8 @@ async def _update_nodes(self): | |
| self._gcs_rpc_error_counter, | ||
| dashboard_consts.MAX_COUNT_OF_GCS_RPC_ERROR) | ||
| sys.exit(-1) | ||
| except Exception as ex: | ||
| logger.exception(ex) | ||
| except Exception: | ||
| logger.exception("Error updating nodes.") | ||
| finally: | ||
| await asyncio.sleep( | ||
| dashboard_consts.UPDATE_NODES_INTERVAL_SECONDS) | ||
|
|
@@ -126,7 +142,7 @@ def _load_modules(self): | |
| c = cls(self) | ||
| dashboard_utils.ClassMethodRouteTable.bind(c) | ||
| modules.append(c) | ||
| logger.info("Loaded {} modules.".format(len(modules))) | ||
| logger.info("Loaded %d modules.", len(modules)) | ||
| return modules | ||
|
|
||
| async def run(self): | ||
|
|
@@ -183,8 +199,8 @@ async def _async_notify(): | |
| co = await dashboard_utils.NotifyQueue.get() | ||
| try: | ||
| await co | ||
| except Exception as e: | ||
| logger.exception(e) | ||
| except Exception: | ||
| logger.exception(f"Error notifying coroutine {co}") | ||
|
|
||
| async def _purge_data(): | ||
| """Purge data in datacenter.""" | ||
|
|
@@ -193,8 +209,8 @@ async def _purge_data(): | |
| dashboard_consts.PURGE_DATA_INTERVAL_SECONDS) | ||
| try: | ||
| await DataOrganizer.purge() | ||
| except Exception as e: | ||
| logger.exception(e) | ||
| except Exception: | ||
| logger.exception("Error purging data.") | ||
|
|
||
| modules = self._load_modules() | ||
|
|
||
|
|
||
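The `_load_modules` hunk switches `logger.info` from `str.format` to %-style arguments, which defers string formatting to the logging framework so it only happens when the record is actually emitted. A small sketch with a hypothetical module list:

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
logger = logging.getLogger("lazy-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

modules = ["stub_module_a", "stub_module_b"]  # hypothetical loaded modules
# "%d" is interpolated inside logging, and only if the record passes the
# level check -- suppressed log calls pay no formatting cost.
logger.info("Loaded %d modules.", len(modules))

print(stream.getvalue())  # → Loaded 2 modules.
```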
**New file: log agent module (`LogAgent`)**

```diff
@@ -0,0 +1,19 @@
+import logging
+import mimetypes
+
+import ray.new_dashboard.utils as dashboard_utils
+
+logger = logging.getLogger(__name__)
+routes = dashboard_utils.ClassMethodRouteTable
+
+
+class LogAgent(dashboard_utils.DashboardAgentModule):
+    def __init__(self, dashboard_agent):
+        super().__init__(dashboard_agent)
+        mimetypes.add_type("text/plain", ".err")
+        mimetypes.add_type("text/plain", ".out")
+        routes.static("/logs", self._dashboard_agent.log_dir, show_index=True)
+
+    async def run(self, server):
+        pass
```
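`LogAgent` registers `.err` and `.out` with the standard `mimetypes` module so the static file handler serves Ray log files as plain text rather than as a binary download. The effect can be checked directly (the filenames here are illustrative):

```python
import mimetypes

mimetypes.add_type("text/plain", ".err")
mimetypes.add_type("text/plain", ".out")

# guess_type() now resolves Ray-style log names to text/plain.
print(mimetypes.guess_type("raylet.err")[0])         # → text/plain
print(mimetypes.guess_type("worker-abc123.out")[0])  # → text/plain
```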
**New file: log head module (`LogHead`)**

```diff
@@ -0,0 +1,69 @@
+import logging
+import mimetypes
+
+import aiohttp.web
+import ray.new_dashboard.utils as dashboard_utils
+from ray.new_dashboard.datacenter import DataSource, GlobalSignals
+
+logger = logging.getLogger(__name__)
+routes = dashboard_utils.ClassMethodRouteTable
+
+
+class LogHead(dashboard_utils.DashboardHeadModule):
+    LOG_URL_TEMPLATE = "http://{ip}:{port}/logs"
+
+    def __init__(self, dashboard_head):
+        super().__init__(dashboard_head)
+        mimetypes.add_type("text/plain", ".err")
+        mimetypes.add_type("text/plain", ".out")
+        routes.static("/logs", self._dashboard_head.log_dir, show_index=True)
+        GlobalSignals.node_info_fetched.append(
+            self.insert_log_url_to_node_info)
+
+    async def insert_log_url_to_node_info(self, node_info):
+        ip = node_info.get("ip")
+        if ip is None:
+            return
+        agent_port = DataSource.agents.get(ip)
+        if agent_port is None:
+            return
+        agent_http_port, _ = agent_port
+        log_url = self.LOG_URL_TEMPLATE.format(ip=ip, port=agent_http_port)
+        node_info["logUrl"] = log_url
+
+    @routes.get("/log_index")
+    async def get_log_index(self, req) -> aiohttp.web.Response:
+        url_list = []
+        for ip, ports in DataSource.agents.items():
+            url_list.append(
+                self.LOG_URL_TEMPLATE.format(ip=ip, port=str(ports[0])))
+        if self._dashboard_head.ip not in DataSource.agents:
+            url_list.append(
+                self.LOG_URL_TEMPLATE.format(
+                    ip=self._dashboard_head.ip,
+                    port=self._dashboard_head.http_port))
+        return aiohttp.web.Response(
+            text=self._directory_as_html(url_list), content_type="text/html")
+
+    @staticmethod
+    def _directory_as_html(url_list) -> str:
```
**Contributor:** I think this rendering logic should be defined on the front-end. I really prefer not shipping HTML to the front-end if it isn't needed, and it is a lot cleaner to write HTML in React on the front-end than to write it in Python without a library. It seems to me like if someone wanted to change this, it would be difficult.

**Author:** This function shows the node log entries as HTML from the log head; the HTML structure is the same as the show_index output for static files from the log agent. The log structure is:

@mxz96102 will provide the front-end of the log module with advanced features.
```diff
+        # Returns the directory's index as HTML.
+        index_of = "Index of logs"
+        h1 = f"<h1>{index_of}</h1>"
+
+        index_list = []
+        for url in sorted(url_list):
+            index_list.append(f'<li><a href="{url}">{url}</a></li>')
+        index_list = "\n".join(index_list)
+        ul = f"<ul>\n{index_list}\n</ul>"
+        body = f"<body>\n{h1}\n{ul}\n</body>"
+
+        head_str = f"<head>\n<title>{index_of}</title>\n</head>"
+        html = f"<html>\n{head_str}\n{body}\n</html>"
+
+        return html
+
+    async def run(self, server):
+        pass
```
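For reference, the `_directory_as_html` logic can be exercised standalone, condensed into one function; the URLs below are hypothetical agent endpoints, not real cluster addresses.

```python
def directory_as_html(url_list):
    # Mirrors the log head's index page: a sorted <ul> of log URLs.
    index_of = "Index of logs"
    items = "\n".join(
        f'<li><a href="{url}">{url}</a></li>' for url in sorted(url_list))
    return (f"<html>\n<head>\n<title>{index_of}</title>\n</head>\n"
            f"<body>\n<h1>{index_of}</h1>\n<ul>\n{items}\n</ul>\n</body>\n</html>")

html = directory_as_html(
    ["http://10.0.0.2:52365/logs", "http://10.0.0.1:52365/logs"])
print("<title>Index of logs</title>" in html)  # → True
```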