[Poc] Use Background pool to get JobInfo from Ray Dashboard #4043
owenowenisme wants to merge 11 commits into ray-project:master
Conversation
Force-pushed from 372c33b to 9596618
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Force-pushed from 38ba83a to 4069033
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Future-Outlier
left a comment
I am thinking that we also need a mechanism to delete the data in the sync.Map when the CR is deleted, right?
You're right!
When the RayCluster dashboard becomes unresponsive or slow, the RayJob controller's reconciliation process continues to invoke AsyncGetJobInfo for the same RayJob Custom Resource (CR), leading to multiple identical function instances accumulating in the taskQueue.
To be more specific, what happens with 1000 concurrent reconciles and 1000 RayJob CRs all using one RayCluster (via cluster selector)?
Can we guarantee the taskQueue never holds duplicate entries for the same RayJob CR?
In the extreme case, this will eventually lead to OOM, right?
Or maybe we should let every RayJob launch a goroutine to query the status asynchronously?
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
```go
r.workerPool.channelContent.Set(jobId, struct{}{})
r.workerPool.taskQueue <- func() {
	jobInfo, err := r.GetJobInfo(ctx, jobId)
	r.workerPool.channelContent.Remove(jobId)
```
We can use `defer r.workerPool.channelContent.Remove(jobId)` here.
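A minimal sketch of the task closure with that `defer` applied, based on the `AsyncGetJobInfo` body shown later in this thread:

```go
r.workerPool.taskQueue <- func() {
	// Clear the in-flight marker however the fetch exits.
	defer r.workerPool.channelContent.Remove(jobId)
	jobInfo, err := r.GetJobInfo(ctx, jobId)
	if err != nil {
		fmt.Printf("AsyncGetJobInfo: error: %v\n", err)
		return
	}
	if jobInfo != nil {
		r.jobInfoMap.Set(jobId, jobInfo)
	}
}
```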
Future-Outlier
left a comment
TODO: We should also put the error in jobInfoMap.
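A sketch of one way to do that, assuming a cache entry type along the lines of the `utiltypes.RayJobCache` referenced later in this PR (the field names and the `storeResult` helper are guesses, and the map's value type would have to change accordingly):

```go
// RayJobCache (hypothetical shape) records either the fetched job info or the
// error from the most recent background fetch, so the reconciler can see both.
type RayJobCache struct {
	JobInfo *utiltypes.RayJobInfo
	Err     error
}

// storeResult is a hypothetical helper the worker task could call so failed
// fetches are cached too, instead of only successful ones.
func (r *RayDashboardClient) storeResult(jobId string, jobInfo *utiltypes.RayJobInfo, err error) {
	if err != nil {
		r.jobInfoMap.Set(jobId, &RayJobCache{Err: err})
		return
	}
	r.jobInfoMap.Set(jobId, &RayJobCache{JobInfo: jobInfo})
}
```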
```go
func (r *RayDashboardClient) AsyncGetJobInfo(ctx context.Context, jobId string) {
	if _, ok := r.workerPool.channelContent.Get(jobId); ok {
		return
	}
	r.workerPool.channelContent.Set(jobId, struct{}{})
	r.workerPool.taskQueue <- func() {
		jobInfo, err := r.GetJobInfo(ctx, jobId)
		r.workerPool.channelContent.Remove(jobId)
		if err != nil {
			fmt.Printf("AsyncGetJobInfo: error: %v\n", err)
			return
		}
		if jobInfo != nil {
			r.jobInfoMap.Set(jobId, jobInfo)
		}
	}
}
```
There's actually an edge case.
Let's assume:
1. The RayJob finalizer deletes the `jobID` item from `jobInfoMap` and `workerPool.channelContent`.
2. The background goroutine pool retrieves the task for `jobID` from `r.workerPool.taskQueue`.
3. The task queries job info using the `jobID` from step 2.
4. The task stores the result from step 3 in `jobInfoMap`.
In this case, we shouldn't store the result.
However, it's hard to handle this edge case, and each entry we store is only around 100 bytes; is it okay not to handle it?
(Let's do the calculation: say we have 100,000 RayJob CRs; the most stale cache we can produce is about 10 MB (100 bytes × 100,000).)
I think the way to handle this edge case is to use another background goroutine that lists all RayJob CRs, checks whether there are any extra keys in jobInfoMap, and deletes them.
I need advice from you two.
I think that is not hard to avoid. We just need to put a placeholder into the map and only update the map if the placeholder exists.
And we also need to clear the jobInfoMap before each job retry and deletion.
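A rough sketch of the placeholder idea on top of the PR's `cmap`-based map (the sentinel value and the `SetIfAbsent`/`Has` calls are illustrative, not the PR's final code):

```go
// Reserve a slot before enqueueing; the finalizer removes this key (together
// with the channelContent entry) when the RayJob CR is deleted or retried.
r.jobInfoMap.SetIfAbsent(jobId, &utiltypes.RayJobInfo{})

r.workerPool.taskQueue <- func() {
	defer r.workerPool.channelContent.Remove(jobId)
	jobInfo, err := r.GetJobInfo(ctx, jobId)
	if err != nil || jobInfo == nil {
		return
	}
	// Only publish if the reservation still exists; if the finalizer already
	// deleted the key, drop the result instead of re-creating a stale entry.
	if r.jobInfoMap.Has(jobId) {
		r.jobInfoMap.Set(jobId, jobInfo)
	}
}
```

There is still a small window between the `Has` check and the `Set`, so a stale entry can slip through; clearing the map on retry/deletion (or the periodic cleanup goroutine mentioned above) would still be needed as a backstop.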
```go
	Scheme     *runtime.Scheme
	Recorder   record.EventRecorder
	JobInfoMap *cmap.ConcurrentMap[string, *utiltypes.RayJobInfo]
```
Could we try not injecting this into the RayJobReconciler? I think it should be an implementation detail of the dashboard client and would be better hidden by it.
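For illustration, a sketch of what hiding it could look like (the field layout and the `GetCachedJobInfo` accessor are assumptions, not this PR's code): the pool and map live on the dashboard client, and the reconciler only goes through exported methods.

```go
type RayDashboardClient struct {
	client       *http.Client
	dashboardURL string

	// Owned by the dashboard client; the reconciler never touches these directly.
	workerPool *WorkerPool
	jobInfoMap cmap.ConcurrentMap[string, *utiltypes.RayJobInfo]
}

// GetCachedJobInfo is a hypothetical accessor the reconciler could call instead
// of holding JobInfoMap itself.
func (r *RayDashboardClient) GetCachedJobInfo(jobId string) (*utiltypes.RayJobInfo, bool) {
	return r.jobInfoMap.Get(jobId)
}
```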
…outine-for-dsashboard-http
# Conflicts:
#	ray-operator/controllers/ray/rayjob_controller.go
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
```go
)

type WorkerPool struct {
	channelContent cmap.ConcurrentMap[string, struct{}]
```
Move this to the dashboard client.
```go
}

// Start launches worker goroutines to consume from queue
func (wp *WorkerPool) Start() {
```

```go
	workers int
}

func NewWorkerPool(taskQueue chan func()) *WorkerPool {
```
```diff
-func NewWorkerPool(taskQueue chan func()) *WorkerPool {
+func NewWorkerPool(workers int) *WorkerPool {
```
Passing a task queue channel is weird. Specifying a worker count is more understandable. You can also make a buffered channel based on the worker count internally.
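A sketch of that shape, reusing the `cmap`-backed fields from this PR (making the queue buffer equal to the worker count is just one possible choice):

```go
func NewWorkerPool(workers int) *WorkerPool {
	return &WorkerPool{
		channelContent: cmap.New[struct{}](),
		// Buffer the queue based on the worker count so enqueues rarely block.
		taskQueue: make(chan func(), workers),
		workers:   workers,
	}
}

// Start launches worker goroutines to consume from the queue.
func (wp *WorkerPool) Start() {
	for i := 0; i < wp.workers; i++ {
		go func() {
			for task := range wp.taskQueue {
				task()
			}
		}()
	}
}
```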
```diff
 type RayDashboardClientInterface interface {
-	InitClient(client *http.Client, dashboardURL string)
+	InitClient(client *http.Client, dashboardURL string, workerPool *WorkerPool, jobInfoMap *cmap.ConcurrentMap[string, *utiltypes.RayJobCache])
```
```diff
-	InitClient(client *http.Client, dashboardURL string, workerPool *WorkerPool, jobInfoMap *cmap.ConcurrentMap[string, *utiltypes.RayJobCache])
+	InitClient(client *http.Client, dashboardURL string)
```
Hide the jobInfoMap and workerPool implementation details.
Is it okay to keep it if the controller does not directly call InitClient?
Currently the controller creates a new DashboardClient on every reconciliation, so if we move the creation of the workerPool, cmap, etc. into InitClient, we would recreate them on every reconciliation, which is not what we want.
The current GetRayDashboardClientFunc acts kind of like a factory.
Couldn't they be created once in the GetRayDashboardClientFunc?
Just like what you did with workerPool.
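A sketch of that, assuming a factory roughly shaped like the existing `GetRayDashboardClientFunc` (the exact signature and field names are assumptions): build the worker pool and map once, and let the returned constructor close over them so every reconciliation reuses the same instances.

```go
func GetRayDashboardClientFunc() func() RayDashboardClientInterface {
	// Created once at manager startup, not per reconciliation.
	workerPool := NewWorkerPool(4) // worker count is an arbitrary example
	workerPool.Start()
	jobInfoMap := cmap.New[*utiltypes.RayJobCache]()

	return func() RayDashboardClientInterface {
		// Each call returns a fresh client, but the pool and map are shared
		// through the closure.
		return &RayDashboardClient{
			workerPool: workerPool,
			jobInfoMap: jobInfoMap,
		}
	}
}
```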
stale cache -> LRU cache
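If the cache moved to an LRU, one option (outside this PR, purely for illustration) is a bounded cache such as `hashicorp/golang-lru/v2`; the snippet below uses a stand-in value type to stay self-contained.

```go
package main

import lru "github.com/hashicorp/golang-lru/v2"

// rayJobInfo stands in for the PR's utiltypes.RayJobInfo in this sketch.
type rayJobInfo struct {
	JobStatus string
}

func main() {
	// A bounded LRU evicts least-recently-used entries on its own, so stale
	// entries for deleted RayJob CRs eventually fall out without a cleanup goroutine.
	cache, err := lru.New[string, *rayJobInfo](1024) // capacity is an arbitrary example
	if err != nil {
		panic(err)
	}
	cache.Add("raysubmit_example", &rayJobInfo{JobStatus: "RUNNING"})
	if info, ok := cache.Get("raysubmit_example"); ok {
		_ = info
	}
}
```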
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Why are these changes needed?
Related issue number
Checks