[Poc] Use Background pool to get JobInfo from Ray Dashboard #4043
owenowenisme wants to merge 11 commits into ray-project:master
Conversation
Force-pushed from 372c33b to 9596618
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Force-pushed from 38ba83a to 4069033
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Future-Outlier
left a comment
I am thinking that we also need a mechanism to delete the data in the sync.Map when the CR is deleted, right?
You're right!
When the RayCluster dashboard becomes unresponsive or slow, the RayJob controller's reconciliation process continues to invoke AsyncGetJobInfo for the same RayJob Custom Resource (CR), leading to multiple identical function instances accumulating in the taskQueue.
To be more specific, what happens with 1000 concurrent reconciles and 1000 RayJob CRs all using one RayCluster (via cluster selector)?
Can we guarantee the taskQueue never holds duplicate entries for the same RayJob CR?
In the extreme case, this will eventually lead to OOM, right?
Or maybe we should let every RayJob launch a goroutine to query the status asynchronously?
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
```go
r.workerPool.channelContent.Set(jobId, struct{}{})
r.workerPool.taskQueue <- func() {
	jobInfo, err := r.GetJobInfo(ctx, jobId)
	r.workerPool.channelContent.Remove(jobId)
```
We can use `defer r.workerPool.channelContent.Remove(jobId)` here.
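A minimal sketch of the task closure with that `defer` applied, based on the `AsyncGetJobInfo` body shown later in this thread:

```go
r.workerPool.taskQueue <- func() {
	// Clear the in-flight marker however the fetch exits.
	defer r.workerPool.channelContent.Remove(jobId)
	jobInfo, err := r.GetJobInfo(ctx, jobId)
	if err != nil {
		fmt.Printf("AsyncGetJobInfo: error: %v\n", err)
		return
	}
	if jobInfo != nil {
		r.jobInfoMap.Set(jobId, jobInfo)
	}
}
```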
Future-Outlier
left a comment
TODO: We should also put the error in jobInfoMap.
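A sketch of one way to do that, assuming a cache entry type along the lines of the `utiltypes.RayJobCache` referenced later in this PR (the field names and the `storeResult` helper are guesses, and the map's value type would have to change accordingly):

```go
// RayJobCache (hypothetical shape) records either the fetched job info or the
// error from the most recent background fetch, so the reconciler can see both.
type RayJobCache struct {
	JobInfo *utiltypes.RayJobInfo
	Err     error
}

// storeResult is a hypothetical helper the worker task could call so failed
// fetches are cached too, instead of only successful ones.
func (r *RayDashboardClient) storeResult(jobId string, jobInfo *utiltypes.RayJobInfo, err error) {
	if err != nil {
		r.jobInfoMap.Set(jobId, &RayJobCache{Err: err})
		return
	}
	r.jobInfoMap.Set(jobId, &RayJobCache{JobInfo: jobInfo})
}
```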
```go
func (r *RayDashboardClient) AsyncGetJobInfo(ctx context.Context, jobId string) {
	if _, ok := r.workerPool.channelContent.Get(jobId); ok {
		return
	}
	r.workerPool.channelContent.Set(jobId, struct{}{})
	r.workerPool.taskQueue <- func() {
		jobInfo, err := r.GetJobInfo(ctx, jobId)
		r.workerPool.channelContent.Remove(jobId)
		if err != nil {
			fmt.Printf("AsyncGetJobInfo: error: %v\n", err)
			return
		}
		if jobInfo != nil {
			r.jobInfoMap.Set(jobId, jobInfo)
		}
	}
}
```
There's actually an edge case.
Let's assume:
1. The RayJob finalizer deletes the `jobID` item from `jobInfoMap` and `workerPool.channelContent`.
2. The background goroutine pool retrieves the task for `jobID` from `r.workerPool.taskQueue`.
3. The task queries job info using the `jobID` from step 2.
4. The task stores the result from step 3 in `jobInfoMap`.
In this case, we shouldn't store the result.
However, it's hard to handle this edge case, and each entry we store is only around 100 bytes; is it okay not to handle it?
(Let's do the calculation: say we have 100,000 RayJob CRs; the most stale cache we can produce is about 10 MB (100 bytes × 100,000).)
I think the way to handle this edge case is to use another background goroutine that lists all RayJob CRs, checks whether there are any extra keys in jobInfoMap, and deletes them.
I need advice from you two.
I think that is not hard to avoid. We just need to put a placeholder into the map and only update the map if the placeholder exists.
And we also need to clear the jobInfoMap before each job retry and deletion.
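A rough sketch of the placeholder idea on top of the PR's `cmap`-based map (the sentinel value and the `SetIfAbsent`/`Has` calls are illustrative, not the PR's final code):

```go
// Reserve a slot before enqueueing; the finalizer removes this key (together
// with the channelContent entry) when the RayJob CR is deleted or retried.
r.jobInfoMap.SetIfAbsent(jobId, &utiltypes.RayJobInfo{})

r.workerPool.taskQueue <- func() {
	defer r.workerPool.channelContent.Remove(jobId)
	jobInfo, err := r.GetJobInfo(ctx, jobId)
	if err != nil || jobInfo == nil {
		return
	}
	// Only publish if the reservation still exists; if the finalizer already
	// deleted the key, drop the result instead of re-creating a stale entry.
	if r.jobInfoMap.Has(jobId) {
		r.jobInfoMap.Set(jobId, jobInfo)
	}
}
```

There is still a small window between the `Has` check and the `Set`, so a stale entry can slip through; clearing the map on retry/deletion (or the periodic cleanup goroutine mentioned above) would still be needed as a backstop.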
```go
	Scheme     *runtime.Scheme
	Recorder   record.EventRecorder
	JobInfoMap *cmap.ConcurrentMap[string, *utiltypes.RayJobInfo]
```
Could we try not injecting this into the RayJobReconciler? I think it should be an implementation detail of the dashboard client and would be better hidden by it.
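For illustration, a sketch of what hiding it could look like (the field layout and the `GetCachedJobInfo` accessor are assumptions, not this PR's code): the pool and map live on the dashboard client, and the reconciler only goes through exported methods.

```go
type RayDashboardClient struct {
	client       *http.Client
	dashboardURL string

	// Owned by the dashboard client; the reconciler never touches these directly.
	workerPool *WorkerPool
	jobInfoMap cmap.ConcurrentMap[string, *utiltypes.RayJobInfo]
}

// GetCachedJobInfo is a hypothetical accessor the reconciler could call instead
// of holding JobInfoMap itself.
func (r *RayDashboardClient) GetCachedJobInfo(jobId string) (*utiltypes.RayJobInfo, bool) {
	return r.jobInfoMap.Get(jobId)
}
```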
…outine-for-dsashboard-http
# Conflicts:
#	ray-operator/controllers/ray/rayjob_controller.go
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
```go
)

type WorkerPool struct {
	channelContent cmap.ConcurrentMap[string, struct{}]
```
Move this to the dashboard client.
```go
}

// Start launches worker goroutines to consume from queue
func (wp *WorkerPool) Start() {
```

```go
	workers int
}

func NewWorkerPool(taskQueue chan func()) *WorkerPool {
```
```diff
-func NewWorkerPool(taskQueue chan func()) *WorkerPool {
+func NewWorkerPool(workers int) *WorkerPool {
```
Passing a task queue channel is weird. Specifying a worker count is more understandable. You can also make a buffered channel based on the worker count internally.
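A sketch of that shape, reusing the `cmap`-backed fields from this PR (making the queue buffer equal to the worker count is just one possible choice):

```go
func NewWorkerPool(workers int) *WorkerPool {
	return &WorkerPool{
		channelContent: cmap.New[struct{}](),
		// Buffer the queue based on the worker count so enqueues rarely block.
		taskQueue: make(chan func(), workers),
		workers:   workers,
	}
}

// Start launches worker goroutines to consume from the queue.
func (wp *WorkerPool) Start() {
	for i := 0; i < wp.workers; i++ {
		go func() {
			for task := range wp.taskQueue {
				task()
			}
		}()
	}
}
```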
```diff
 type RayDashboardClientInterface interface {
-	InitClient(client *http.Client, dashboardURL string)
+	InitClient(client *http.Client, dashboardURL string, workerPool *WorkerPool, jobInfoMap *cmap.ConcurrentMap[string, *utiltypes.RayJobCache])
```
```diff
-	InitClient(client *http.Client, dashboardURL string, workerPool *WorkerPool, jobInfoMap *cmap.ConcurrentMap[string, *utiltypes.RayJobCache])
+	InitClient(client *http.Client, dashboardURL string)
```
Hide the jobInfoMap and workerPool implementation details.
Is it okay to keep it if the controller does not directly call InitClient?
Currently the controller creates a new DashboardClient on every reconciliation, so if we move the creation of the workerPool, cmap, etc. into InitClient, we would recreate them on every reconciliation, which is not what we want.
The current GetRayDashboardClientFunc acts kind of like a factory.
Couldn't they be created once in the GetRayDashboardClientFunc?
Just like what you did with workerPool.
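A sketch of that, assuming a factory roughly shaped like the existing `GetRayDashboardClientFunc` (the exact signature and field names are assumptions): build the worker pool and map once, and let the returned constructor close over them so every reconciliation reuses the same instances.

```go
func GetRayDashboardClientFunc() func() RayDashboardClientInterface {
	// Created once at manager startup, not per reconciliation.
	workerPool := NewWorkerPool(4) // worker count is an arbitrary example
	workerPool.Start()
	jobInfoMap := cmap.New[*utiltypes.RayJobCache]()

	return func() RayDashboardClientInterface {
		// Each call returns a fresh client, but the pool and map are shared
		// through the closure.
		return &RayDashboardClient{
			workerPool: workerPool,
			jobInfoMap: jobInfoMap,
		}
	}
}
```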
stale cache -> LRU cache
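If the cache moved to an LRU, one option (outside this PR, purely for illustration) is a bounded cache such as `hashicorp/golang-lru/v2`; the snippet below uses a stand-in value type to stay self-contained.

```go
package main

import lru "github.com/hashicorp/golang-lru/v2"

// rayJobInfo stands in for the PR's utiltypes.RayJobInfo in this sketch.
type rayJobInfo struct {
	JobStatus string
}

func main() {
	// A bounded LRU evicts least-recently-used entries on its own, so stale
	// entries for deleted RayJob CRs eventually fall out without a cleanup goroutine.
	cache, err := lru.New[string, *rayJobInfo](1024) // capacity is an arbitrary example
	if err != nil {
		panic(err)
	}
	cache.Add("raysubmit_example", &rayJobInfo{JobStatus: "RUNNING"})
	if info, ok := cache.Get("raysubmit_example"); ok {
		_ = info
	}
}
```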
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Why are these changes needed?
Related issue number
Checks