Observability Roadmap #30097

Open
alanwguo opened this issue Nov 8, 2022 · 10 comments
Labels
stale The issue is stale. It will be closed within 7 days unless there are further conversation

Comments

@alanwguo
Contributor

alanwguo commented Nov 8, 2022

Observability Roadmap

A huge part of successfully developing applications on top of Ray is being able to debug and optimize them. To do that, you must be able to understand the behavior of your Ray applications so you can address any bugs or issues that break or slow them down. The goal of our observability efforts is to provide all the information needed to effectively write, debug, optimize, and monitor Ray applications.

Since the Ray runtime handles much of the low-level system behavior of a Ray application, we’re also in a unique position to provide data about Ray applications out of the box using our State API and Dashboard UI. Ultimately, we believe we can add a ton of value to the Ray experience by providing the most relevant data when you need it, great visualizations to understand that data, and the right set of tools to dig deeper into problems. We’re not alone in that thinking. In fact, one of the most popular talks at Ray Summit 2022 was Ray Observability: Present and Future.

For the observability roadmap, the high-level prioritization is as follows: we prioritize building out valuable content first (low-hanging fruit), then making significant usability improvements to our UI, and finally introducing advanced visualizations.

Help us shape the roadmap!

Before we begin, we highly encourage you to provide feedback on our roadmap! Please message us in the Ray Slack in the #dashboard channel or in the dashboard forum at https://discuss.ray.io/c/dashboard/9.

Delivered features

Features from Ray 2.2
Features from Ray 2.3

Ray 2.4

State API Beta

Since the alpha release of the State API in 2.0, we have been collecting feedback from Ray developers. In the beta releases, we continue to improve the State API based on user feedback by exposing the most useful states of Ray resources such as actors, tasks, and nodes. We are also stabilizing much of the CLI and the output schemas so that Ray developers can build their own observability tools on top of the State API without worrying about the API changing.
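As a rough illustration of the kind of tool that could sit on top of the State API, the sketch below counts actors by state. The `count_by_state` helper and the sample records are hypothetical. `live_actor_records` assumes the `ray.util.state.list_actors` call found in newer Ray releases; the import path and the return type have changed between versions, so check the documentation for the release you run.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_state(records: Iterable[Mapping[str, str]]) -> Counter:
    """Count records by their 'state' field (hypothetical helper)."""
    return Counter(record["state"] for record in records)


def live_actor_records(limit: int = 100) -> list:
    """Fetch actor records from a running cluster. Requires Ray.

    Assumption: newer releases expose list_actors under ray.util.state and
    return objects with class_name and state attributes. Older releases
    used a different import path and may return plain dicts.
    """
    from ray.util.state import list_actors  # imported lazily on purpose

    return [
        {"class_name": actor.class_name, "state": actor.state}
        for actor in list_actors(limit=limit)
    ]


# Hypothetical sample records, shaped like the output of live_actor_records().
sample = [
    {"class_name": "Trainer", "state": "ALIVE"},
    {"class_name": "Trainer", "state": "DEAD"},
    {"class_name": "ReplayBuffer", "state": "ALIVE"},
]

counts = count_by_state(sample)
print(dict(counts))
```

On a live cluster, `count_by_state(live_actor_records())` would produce the same kind of summary from real data.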

Please take 5-8 minutes to help us improve the Ray State API by filling out this 📄survey! If you are interested in chatting more, there is also a link at the end of the survey to choose a time slot to ☎️chat with one of us!

Beyond

Some of these features are in the early stages of the design process. Things may change before the final feature is released, but we want you all to know what’s coming so you can provide feedback earlier in the process.

Advanced task drill down visualizations

We are also planning to further improve the advanced task visualization.

The tracing view shows the hierarchy of dependencies for your tasks so you can drill down and understand why the application is behaving as it is. For example, you can see that some tasks are running serially because they depend on another task.

image

The DAG view displays the relationship between tasks/actors and the execution state over time.

image

Data visualizations

With distributed applications, the usage, storage, and transfer of data are often a critical part of the application. We believe visualizations that help you understand these things will enable users to debug memory crashes or optimize data transfer.

image
image

Advanced profiling

We are planning to make it easy to run other advanced profilers, such as a memory profiler, GPU profiler, or framework profilers (e.g., PyTorch), against Ray actors/tasks/workers.

@ericl ericl pinned this issue Nov 8, 2022
@dmatrix
Contributor

dmatrix commented Nov 8, 2022

This is fabulous!

@tianlinzx

This is fabulous!

@rkooo567
Contributor

rkooo567 commented Dec 15, 2022

We released Ray 2.2, and the following features have been delivered.

Ray 2.2

Metrics improvements

Metrics give an at-a-glance view of the cluster, which helps users detect problems effectively. Ray 2.1 introduced the default metrics graph integration in the dashboard. We’re adding more metrics and improvements to the Dashboard UI, including debugging breakdowns for object store memory allocations, actor state breakdowns, and heap memory usage by Ray component!

Profiling tool

Profiling Python programs is necessary to debug performance or memory leak issues. However, it has been difficult to profile Ray programs that have 100s of workers running concurrently.

In Ray 2.2, users can easily run py-spy against all running workers through the Ray dashboard.

Screen Shot 2022-11-08 at 9 58 45 AM

image

Task visualization improvements

Observability starts with understanding what is going on in the program.

We are adding task-based breakdowns for your Ray jobs. This view lets you see at a glance which tasks have the most errors and which ones are hanging.

image
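To make the idea of a task-based breakdown concrete, here is a minimal sketch that groups task records by name and state and then ranks task names by failure count. The record shape, the state names used in the sample, and both helper functions are assumptions made for illustration. They are not the dashboard's implementation.

```python
from collections import defaultdict


def breakdown_by_task_name(tasks):
    """Return {task_name: {state: count}} for a list of task records."""
    table = defaultdict(lambda: defaultdict(int))
    for task in tasks:
        table[task["name"]][task["state"]] += 1
    return {name: dict(states) for name, states in table.items()}


def most_failing(breakdown, top=3):
    """Return up to `top` (task_name, failed_count) pairs, most failures first."""
    ranked = sorted(
        breakdown.items(),
        key=lambda item: item[1].get("FAILED", 0),
        reverse=True,
    )
    return [(name, states.get("FAILED", 0)) for name, states in ranked[:top]]


# Hypothetical task records for a single job.
tasks = [
    {"name": "load", "state": "FINISHED"},
    {"name": "load", "state": "FINISHED"},
    {"name": "train", "state": "FAILED"},
    {"name": "train", "state": "FAILED"},
    {"name": "train", "state": "RUNNING"},
    {"name": "eval", "state": "FAILED"},
]

breakdown = breakdown_by_task_name(tasks)
worst = most_failing(breakdown, top=2)
print(breakdown)
print(worst)
```

Grouping by task name rather than by individual task keeps the summary readable even when a job launches thousands of tasks from a handful of functions.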

Dashboard stability improvements

We continue to improve the stability and scalability of the dashboard. We are going to guarantee stable latency for Dashboard APIs on large-scale clusters while minimizing the performance impact on workloads running in the cluster.

@itamarst

itamarst commented Jan 9, 2023

I work on a profiler for Python data processing applications (https://sciagraph.com), including profiling in production. It is currently designed only for jobs with subprocesses; aggregating from a cluster is not possible yet. Perhaps a reasonable integration would be per graph item? I would be happy to talk about that if it's interesting to you.

@alanwguo
Contributor Author

alanwguo commented Jan 9, 2023

@itamarst, that sounds interesting. I'll send you an email and we can continue the conversation there.

@TUB-hasib

Hi, where can I get information about the difference between Ray Serve version 2 and version 3? Also, when will version 3 be available as a stable release?

@richardliaw
Contributor

When you see v3.0.0, this means you are on the bleeding edge nightly wheels. 3.0.0 won't be released for a long time, but we will release 2.4 and 2.5 next, which are cut off of the 3.0 (master) branch -- you should instead use the stable latest version (2.x).

@rkooo567
Contributor

rkooo567 commented Feb 27, 2023

We released Ray 2.3, and the following features have been delivered.

See the Ray 2.3 release blog for more information!

In addition to the two major features described below, Ray 2.3 includes the following:

  • Actor detail page
  • Better task and placement group tables
  • Job profiling
  • Ray status page from the job detail page
  • New metrics (e.g., memory/CPU usage per task/actor group)

Dashboard usability improvements

We’re also looking into a revamp of the dashboard UI to improve the information hierarchy and usability. We are taking a user-journey driven approach of organizing the dashboard so that developers and infra engineers alike can quickly get to the information they need. This means organizing the dashboard by top level concepts like jobs, cluster (nodes and autoscaler) and logs, better navigability so you can quickly click to go to the information you need, and more visualizations and content so you can dig into more details of your application.

image

Ray timeline and advanced progress bar

We want to build out more advanced visualizations of the tasks that ran in a Ray application. In particular, we want these visualizations to be valuable after a Ray job has finished (either successfully or with an error).

image

The timeline view is a higher level view that lets you optimize or debug errors in your job. You can quickly see how long tasks are taking to run in your application and how well the workload is distributed across all the workers in your cluster.
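The workload-distribution question can also be approached numerically. The sketch below sums busy time per worker from timeline-style events and reports how uneven the load is. It assumes events shaped like Chrome Trace Event "complete" events (`ph == "X"` with `pid`, `tid`, and `dur` fields); whether a given timeline export matches that shape is an assumption to verify against your own data. The function names and the sample events are hypothetical.

```python
from collections import defaultdict


def busy_time_per_worker(events):
    """Sum the durations of complete ("X") events per (pid, tid) pair."""
    totals = defaultdict(int)
    for event in events:
        if event.get("ph") == "X":  # skip metadata and other event types
            totals[(event["pid"], event["tid"])] += event["dur"]
    return dict(totals)


def imbalance_ratio(totals):
    """Ratio of the busiest worker's time to the least busy worker's time."""
    values = list(totals.values())
    return max(values) / min(values)


# Hypothetical events: two workers on one node, durations in microseconds.
events = [
    {"ph": "X", "pid": "node-a", "tid": "worker-1", "ts": 0, "dur": 400},
    {"ph": "X", "pid": "node-a", "tid": "worker-1", "ts": 500, "dur": 100},
    {"ph": "X", "pid": "node-a", "tid": "worker-2", "ts": 0, "dur": 50},
    {"ph": "M", "pid": "node-a", "tid": "worker-2", "name": "thread_name"},
]

totals = busy_time_per_worker(events)
ratio = imbalance_ratio(totals)
print(totals)
print(ratio)
```

A ratio close to 1 suggests the work is spread evenly. A large ratio points at a few workers doing most of the work, which is the pattern the timeline view is meant to make visible.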

We also want to improve the progress bar, for example by adding conceptual task groups so that progress can be viewed in terms of high-level steps. We also want to make it easier to determine whether an error occurred within the task itself or because a downstream dependency errored.

image

@Rohan138 Rohan138 unpinned this issue Mar 9, 2023
@Rohan138 Rohan138 pinned this issue Mar 10, 2023
@scottsun94
Contributor

Here is the Public PRD for Ray Logging which will guide the future improvements to Ray Logging.

Please take a look and leave your feedback.

@richardliaw richardliaw unpinned this issue Aug 2, 2023
@stale

stale bot commented Oct 15, 2023

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Oct 15, 2023