[history server] Web Server + Event Processor #4329
rueian merged 47 commits into ray-project:master
Conversation
Future-Outlier left a comment:
cc @chiayi @KunWuLuan to help review, thank you!
const (
	NIL                                        TaskStatus = "NIL"
	PENDING_ARGS_AVAIL                         TaskStatus = "PENDING_ARGS_AVAIL"
	PENDING_NODE_ASSIGNMENT                    TaskStatus = "PENDING_NODE_ASSIGNMENT"
	PENDING_OBJ_STORE_MEM_AVAIL                TaskStatus = "PENDING_OBJ_STORE_MEM_AVAIL"
	PENDING_ARGS_FETCH                         TaskStatus = "PENDING_ARGS_FETCH"
	SUBMITTED_TO_WORKER                        TaskStatus = "SUBMITTED_TO_WORKER"
	PENDING_ACTOR_TASK_ARGS_FETCH              TaskStatus = "PENDING_ACTOR_TASK_ARGS_FETCH"
	PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY TaskStatus = "PENDING_ACTOR_TASK_ORDERING_OR_CONCURRENCY"
	RUNNING                                    TaskStatus = "RUNNING"
	RUNNING_IN_RAY_GET                         TaskStatus = "RUNNING_IN_RAY_GET"
	RUNNING_IN_RAY_WAIT                        TaskStatus = "RUNNING_IN_RAY_WAIT"
	FINISHED                                   TaskStatus = "FINISHED"
	FAILED                                     TaskStatus = "FAILED"
)
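For context on how status constants like these are typically consumed, here is a minimal, hypothetical helper that classifies a task as terminal. This function is not part of the PR; it only relies on the TaskStatus constants declared above.

```go
// IsTerminal reports whether a task has reached a final state.
// Hypothetical helper, not code from this PR; it uses the TaskStatus
// constants declared in the block above.
func IsTerminal(s TaskStatus) bool {
	switch s {
	case FINISHED, FAILED:
		return true
	default:
		return false
	}
}
```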
LGTM! Just a question about something you mentioned: how does the event processor handle data consistency when multiple replicas are deployed? Currently each pod runs its own historyserver with in-memory state. Won't this cause inconsistent responses depending on which pod handles the request?
todo:
Yes, it will, and this will be solved in the beta version.
I see. Thanks for the tips!
cc @chiayi @KunWuLuan to do a final pass, thank you!
cursor review
LGTM /approve
	Mu sync.RWMutex
}

func (c *ClusterTaskMap) RLock() {
Why do we need these funcs?
Go maps are not thread-safe; concurrent reads and writes cause undefined behavior, so we use locks.
https://go.dev/blog/maps#concurrency
┌─────────────────────┐ ┌─────────────────────┐
│ Event Processor │ │ HTTP Handler │
│ (goroutine 1..N) │ │ (goroutine 1..M) │
└──────────┬──────────┘ └──────────┬──────────┘
│ WRITE │ READ
▼ ▼
┌──────────────────────────────────────────┐
│ ClusterTaskMap (RWMutex) │
│ ┌────────────────────────────────────┐ │
│ │ TaskMap per cluster (Mutex) │ │
│ │ ┌──────────────────────────────┐ │ │
│ │ │ map[taskId] → []Task │ │ │
│ │ └──────────────────────────────┘ │ │
│ └────────────────────────────────────┘ │
└──────────────────────────────────────────┘
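To make the locking scheme in the diagram concrete, here is a minimal Go sketch of the two-level structure: an RWMutex around the per-cluster map, and a per-cluster mutex around the task map. The type and field names follow the diagram and the snippet above; the Put/Get methods, the Task fields, and the map key types are illustrative assumptions, not the exact code in this PR.

```go
package store

import "sync"

// Task is a placeholder for the task attempt data stored per task ID.
type Task struct {
	TaskID string
	Status string
}

// TaskMap holds the tasks of a single cluster behind its own mutex.
type TaskMap struct {
	Mu    sync.Mutex
	Tasks map[string][]Task // taskId -> task attempts
}

// ClusterTaskMap maps cluster name -> *TaskMap behind an RWMutex, so event
// processor goroutines (writers) and HTTP handlers (readers) never touch the
// underlying Go maps concurrently without synchronization.
type ClusterTaskMap struct {
	Mu       sync.RWMutex
	Clusters map[string]*TaskMap
}

// Put is the write path used by the event processor goroutines.
func (c *ClusterTaskMap) Put(cluster string, t Task) {
	c.Mu.Lock()
	tm, ok := c.Clusters[cluster]
	if !ok {
		tm = &TaskMap{Tasks: make(map[string][]Task)}
		c.Clusters[cluster] = tm
	}
	c.Mu.Unlock()

	tm.Mu.Lock()
	tm.Tasks[t.TaskID] = append(tm.Tasks[t.TaskID], t)
	tm.Mu.Unlock()
}

// Get is the read path used by the HTTP handlers.
func (c *ClusterTaskMap) Get(cluster, taskID string) ([]Task, bool) {
	c.Mu.RLock()
	tm, ok := c.Clusters[cluster]
	c.Mu.RUnlock()
	if !ok {
		return nil, false
	}

	tm.Mu.Lock()
	defer tm.Mu.Unlock()
	attempts, ok := tm.Tasks[taskID]
	return attempts, ok
}
```

The top-level RWMutex lets many handler goroutines look up a cluster's TaskMap concurrently, while writers take the exclusive lock only briefly to insert or create entries.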
Co-authored-by: @chiayi chiayiliang327@gmail.com
Co-authored-by: @KunWuLuan kunwuluan@gmail.com
Why are these changes needed?
This web server serves the history server's frontend and fetches data from the event server (processor).
As a follow-up, we should enable autoscaling for the web server using Kubernetes HPA.
https://kubernetes.io/docs/concepts/workloads/autoscaling/horizontal-pod-autoscale/
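As a rough illustration of the "web server serves the frontend and fetches data from the event processor" split described above, here is a minimal Go sketch. The addresses, the ./dist static directory, and the /api/ prefix routing are assumptions for illustration only, not the PR's actual configuration or code.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Assumed address of the event processor service.
	eventProcessor, err := url.Parse("http://event-processor:8081")
	if err != nil {
		log.Fatal(err)
	}

	// Forward dashboard-style API calls to the event processor.
	proxy := httputil.NewSingleHostReverseProxy(eventProcessor)
	http.Handle("/api/", proxy)

	// Serve the history server frontend's static assets.
	http.Handle("/", http.FileServer(http.Dir("./dist")))

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```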
Note: I combined code from branches #4187 and #4253, and then fixed a number of bugs.
Architecture
How to test and develop in your local env
response
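For local testing, one way to exercise the web server is to query one of the endpoints listed in the gap analysis below and inspect the response. A hypothetical Go smoke test is sketched here; the localhost address and port are assumptions about a port-forwarded or locally run web server, not documented values.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Assumed local address of the history server web server.
	resp, err := http.Get("http://localhost:8080/api/jobs")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```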
Related issue number
#3966
#4374
HistoryServer Alpha Milestone Gap Analysis
Summary
API Endpoints (Terminated Clusters)
/clusters
/nodes
/nodes/{node_id}
/events
/api/cluster_status
/api/grafana_health
/api/prometheus_health
/api/data/datasets/{job_id}
/api/serve/applications/
/api/v0/placement_groups/
/api/v0/tasks
/api/v0/tasks/summarize
/api/v0/logs
/api/v0/logs/file
/logical/actors
/logical/actors/{actor_id}
/api/jobs
/api/jobs/{job_id}

Remaining Work (Priority)
/api/jobs, /api/jobs/{job_id}
/events endpoint
/nodes/{node_id}
/api/v0/logs/file
/api/cluster_status
/api/grafana_health, /api/prometheus_health
/api/serve/applications/, /api/v0/placement_groups/

others:
Overall Progress: ~75%