Observability Roadmap #30097

alanwguo · 2022-11-08T17:57:01Z

Observability Roadmap

A huge part of developing applications on top of Ray successfully is being able to debug and optimize them. To do that, you must be able to understand the behavior of your Ray applications so you can address any bugs or issues that break or slow them down. The goal of our observability efforts is to provide all the information needed to effectively write, debug, optimize, and monitor Ray applications.

Since the Ray runtime handles much of the low-level system behavior of a Ray application, we’re also in a unique position to provide data about Ray applications out of the box through our State API and Dashboard UI. Ultimately, we believe we can add a ton of value to the Ray experience by providing the most relevant data when you need it, great visualizations to understand that data, and the right set of tools to dig deeper into problems. We’re not alone in that thinking. In fact, one of the most popular talks at Ray Summit 2022 was Ray Observability: Present and Future.

For the observability roadmap, the high-level prioritization is as follows: we build out valuable content first (low-hanging fruit), then make significant usability improvements to our UI, and finally introduce advanced visualizations.

Help us shape the roadmap!

Before we begin, we highly encourage you to provide feedback on our roadmap! Please message us in the Ray Slack in the #dashboard channel or in the dashboard forum at https://discuss.ray.io/c/dashboard/9.

Delivered features

Features from Ray 2.2
Features from Ray 2.3

Ray 2.4

State API Beta

Since the alpha release of the State API in 2.0, we have been collecting feedback from Ray developers. In the beta releases, we continue to improve the State API based on that feedback by exposing the most useful states of Ray resources like actors, tasks, and nodes. We are also stabilizing much of the CLI and the output schemas so that Ray developers can build their own observability tools on top of the State API without worrying about breaking changes.
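As a rough illustration of the kind of tool a stable output schema makes possible, the sketch below parses JSON task records and counts them by state. It assumes the records were exported as JSON (for example with something like `ray list tasks --format json`); the sample payload and the field names `task_id`, `name`, and `state` are illustrative assumptions, not a guaranteed schema.

```python
import json
from collections import Counter

# Hypothetical sample of exported task records; the field names are assumptions.
SAMPLE = """
[
  {"task_id": "a1", "name": "load", "state": "FINISHED"},
  {"task_id": "a2", "name": "train", "state": "RUNNING"},
  {"task_id": "a3", "name": "train", "state": "FAILED"},
  {"task_id": "a4", "name": "train", "state": "RUNNING"}
]
"""


def count_by_state(raw_json: str) -> Counter:
    """Count task records per state, skipping records without a 'state' key."""
    records = json.loads(raw_json)
    return Counter(r["state"] for r in records if "state" in r)


if __name__ == "__main__":
    # Print states from most to least common.
    for state, n in count_by_state(SAMPLE).most_common():
        print(f"{state}: {n}")
```

Because the function only depends on the presence of a `state` key, it would keep working if other fields were added to the records later.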

Please take 5-8 minutes to help us make the Ray State API better by filling out this 📄survey! If you are interested in chatting more, there is also a link at the end of the survey to choose a time slot to ☎️chat with one of us!

Beyond

Some of these items are in the early stages of the design process. Things may change before the final feature is released, but we want you all to know what’s coming so you can provide feedback earlier in the process.

Advanced task drill down visualizations

We are also planning to further improve the advanced task visualizations.

The tracing view lets you view the hierarchy of dependencies for your tasks so you can drill down and understand why the application is behaving as it is. For example, you can see that some tasks are running serially because they depend on another task.

The DAG view displays the relationship between tasks/actors and the execution state over time.
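To make the "running serially because of dependencies" point concrete, here is a minimal sketch that groups tasks into stages that could run in parallel. The task names and the dependency map are hypothetical, and this uses Python's standard `graphlib` module rather than anything from Ray itself.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
DEPS = {
    "load_a": set(),
    "load_b": set(),
    "join": {"load_a", "load_b"},
    "train": {"join"},
    "report": {"train"},
}


def parallel_stages(deps):
    """Group tasks into stages.

    Tasks within one stage have no unmet dependencies and could run
    concurrently; each later stage has to wait for the previous one.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for stable output
        stages.append(ready)
        sorter.done(*ready)
    return stages


if __name__ == "__main__":
    for i, stage in enumerate(parallel_stages(DEPS)):
        print(f"stage {i}: {stage}")
```

In this toy graph only the two load tasks can overlap; `join`, `train`, and `report` form a serial chain, which is the kind of pattern a dependency view is meant to surface.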

Data visualizations

With distributed applications, the usage, storage, and transfer of data are often critical parts of the application. We believe visualizations that help you understand these things will enable users to debug memory crashes and optimize data transfer.

Advanced profiling

We are planning to make it easy to run other advanced profilers, such as memory profilers, GPU profilers, or framework profilers (e.g., PyTorch), against Ray actors, tasks, and workers.

dmatrix · 2022-11-08T20:57:54Z

This is fabulous!

tianlinzx · 2022-11-21T07:19:06Z

This is fabulous!

rkooo567 · 2022-12-15T02:02:34Z

We released Ray 2.2, and the following features have been delivered.

Ray 2.2

Metrics improvements

Metrics give an at-a-glance view of the cluster that helps users detect problems quickly. Ray 2.1 introduced the default metrics graph integration in the dashboard. We’re adding more metrics and improvements to the Dashboard UI, including debugging breakdowns for object store memory allocations, actor state breakdowns, and heap memory usage by Ray component!

Profiling tool

Profiling Python programs is necessary to debug performance issues or memory leaks. However, it has been difficult to profile Ray programs that have hundreds of workers running concurrently.

In Ray 2.2, users can easily run py-spy against all running workers through the Ray dashboard.

Task visualization improvements

Observability starts with understanding what’s going on in the program.

We are adding task-based breakdowns for your Ray jobs. This view lets you see at a glance the tasks with the most errors or the ones that are hanging.
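The idea behind a task-based breakdown can be sketched in a few lines: group task records by name, count states per group, and sort by failure count. The records, field names, and state labels below are hypothetical and only stand in for whatever the dashboard actually aggregates.

```python
from collections import Counter, defaultdict

# Hypothetical task records; names and state labels are illustrative only.
TASKS = [
    {"name": "preprocess", "state": "FINISHED"},
    {"name": "preprocess", "state": "FAILED"},
    {"name": "train", "state": "FAILED"},
    {"name": "train", "state": "FAILED"},
    {"name": "train", "state": "RUNNING"},
    {"name": "evaluate", "state": "PENDING"},
]


def breakdown(tasks):
    """Return (task name, state counts) pairs, most failures first."""
    per_name = defaultdict(Counter)
    for t in tasks:
        per_name[t["name"]][t["state"]] += 1
    # Sort by descending FAILED count; ties are broken alphabetically by name.
    return sorted(per_name.items(), key=lambda kv: (-kv[1]["FAILED"], kv[0]))


if __name__ == "__main__":
    for name, counts in breakdown(TASKS):
        print(name, dict(counts))
```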

Dashboard stability improvements

We continue to improve the stability and scalability of the dashboard. We are going to guarantee stable latency for Dashboard APIs on large-scale clusters while minimizing the performance impact on workloads running in the cluster.

itamarst · 2023-01-09T19:21:56Z

I work on a profiler for Python data processing applications (https://sciagraph.com), including profiling in production. It is currently designed only for jobs with subprocesses; aggregating from a cluster is not possible yet. Perhaps a reasonable integration would be per graph item? I would be happy to talk about that if it's interesting to you.

alanwguo · 2023-01-09T23:08:33Z

@itamarst, that sounds interesting. I'll send an email to you and we can continue the conversation there

TUB-hasib · 2023-01-22T13:03:14Z

Hi, where can I get information about the difference between Ray Serve version 2 and version 3? Also, when will we get version 3 as a stable version?

richardliaw · 2023-01-25T22:17:02Z

When you see v3.0.0, this means you are on the bleeding edge nightly wheels. 3.0.0 won't be released for a long time, but we will release 2.4 and 2.5 next, which are cut off of the 3.0 (master) branch -- you should instead use the stable latest version (2.x).

rkooo567 · 2023-02-27T06:03:20Z

We released Ray 2.3, and the following features have been delivered.

See the Ray 2.3 release blog for more information!

In addition to the two big features described below, Ray 2.3 also includes the following:

Actor detail page
Better task and placement group tables
Job profiling
Ray status page accessible from the job detail page
New metrics (e.g., memory/CPU usage per task/actor group)

Dashboard usability improvements

We’re also looking into a revamp of the dashboard UI to improve the information hierarchy and usability. We are taking a user-journey-driven approach to organizing the dashboard so that developers and infra engineers alike can quickly get to the information they need. This means organizing the dashboard by top-level concepts like jobs, cluster (nodes and autoscaler), and logs; improving navigability so you can quickly click through to the information you need; and adding more visualizations and content so you can dig into more details of your application.

Ray timeline and advanced progress bar

We want to build out more advanced visualizations of the tasks that ran in a Ray application. In particular, we want these visualizations to be valuable after a Ray job has finished (either successfully or with errors).

The timeline view is a higher level view that lets you optimize or debug errors in your job. You can quickly see how long tasks are taking to run in your application and how well the workload is distributed across all the workers in your cluster.
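As a rough sketch of the kind of question a timeline answers, the snippet below sums task durations per worker from trace events. It assumes events shaped like Chrome trace "complete" events (`ph == "X"`, with `ts` and `dur` in microseconds); the sample events and the `pid`/`tid` labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical trace events in Chrome trace "complete event" form.
# ts (start) and dur (duration) are in microseconds.
EVENTS = [
    {"ph": "X", "name": "load", "pid": "node1", "tid": "worker1", "ts": 0, "dur": 2_000_000},
    {"ph": "X", "name": "train", "pid": "node1", "tid": "worker1", "ts": 2_000_000, "dur": 6_000_000},
    {"ph": "X", "name": "load", "pid": "node1", "tid": "worker2", "ts": 0, "dur": 1_000_000},
    {"ph": "M", "name": "process_name", "pid": "node1", "tid": "worker1"},  # metadata, ignored
]


def busy_seconds_per_worker(events):
    """Sum the durations of complete events per (pid, tid) pair, in seconds.

    This is a plain sum, so overlapping events on one worker are counted twice.
    """
    busy = defaultdict(float)
    for e in events:
        if e.get("ph") != "X":
            continue  # skip metadata and other non-duration events
        busy[(e["pid"], e["tid"])] += e["dur"] / 1_000_000
    return dict(busy)


if __name__ == "__main__":
    for worker, seconds in sorted(busy_seconds_per_worker(EVENTS).items()):
        print(worker, f"{seconds:.1f}s")
```

A large gap between the busiest and the idlest worker, as in this sample, is one sign that the workload is not evenly distributed.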

We also want to improve the progress bar, for example by adding conceptual task groups so that progress can be viewed in terms of high-level steps. We also want to make it easier to determine whether an error occurred within the task itself or because a dependency errored.
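One simple way to think about separating a task's own errors from errors caused by a dependency is sketched below. The dependency map and the failed set are hypothetical, and the rule used here (a failure counts as "propagated" if any direct dependency also failed) is only a heuristic, since a task with a failed dependency could in principle also have failed on its own.

```python
# Hypothetical task graph: each task maps to the tasks it directly depends on.
DEPS = {
    "load": [],
    "clean": ["load"],
    "train": ["clean"],
    "plot": ["load"],
}

# Hypothetical set of tasks that ended in an error state.
FAILED = {"clean", "train"}


def classify_failures(deps, failed):
    """Label each failed task as 'root' or 'propagated'.

    'root' means none of its direct dependencies failed;
    'propagated' means at least one direct dependency failed.
    """
    result = {}
    for task in failed:
        dependency_failed = any(d in failed for d in deps.get(task, []))
        result[task] = "propagated" if dependency_failed else "root"
    return result


if __name__ == "__main__":
    for task, kind in sorted(classify_failures(DEPS, FAILED).items()):
        print(task, kind)
```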

scottsun94 · 2023-05-04T05:40:49Z

Here is the Public PRD for Ray Logging which will guide the future improvements to Ray Logging.

Please take a look and leave your feedback.

stale · 2023-10-15T08:14:36Z

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

ericl pinned this issue Nov 8, 2022

Rohan138 unpinned this issue Mar 9, 2023

Rohan138 pinned this issue Mar 10, 2023

richardliaw unpinned this issue Aug 2, 2023

stale bot added the stale label Oct 15, 2023
