
Scaler Roadmap #60

Open
sharpener6 opened this issue Feb 28, 2025 · 3 comments
Assignees
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments

@sharpener6
Collaborator

sharpener6 commented Feb 28, 2025

This issue describes what the future state of Scaler should be. Getting there will require a lot of work; below is a very simplified example layout:

[Image: simplified example layout of the future Scaler architecture]

@sharpener6 sharpener6 added documentation Improvements or additions to documentation enhancement New feature or request labels Feb 28, 2025
@sharpener6 sharpener6 self-assigned this Feb 28, 2025
@1597463007
Contributor

1597463007 commented Feb 28, 2025

Butler: Worker Orchestration Service

Butler is an optional service that handles worker lifecycle management on behalf of Scaler. Webhooks are used for Scheduler-Butler communication, and anyone can write a Butler service implementation on the platform of their choice without knowing the internal mechanics of Scaler.

Butler Webhook API

  • /request: Scheduler requests additional worker groups. It is up to the Butler to decide how many worker groups to create based on the tag info it receives from the scheduler. A worker group is a collection of workers, normally tied to a common hardware resource such as a container or a physical node.

    • Payload: The scheduler provides a summary of the tasks currently queued, which the Butler can use to start up workers of a specific type:

      {
        "tasks": [
          {
            "tags": [],
            "count": 4716
          },
          {
            "tags": ["gpu:nvidia"],
            "count": 15
          }
        ]
      }
    • Successful response: Butler responds with the worker groups created

      {
        "worker_groups": [
          {
            "worker_group_id": "10ca89ab-c432-4de1-a6b8-b23223af79eb",
            "worker_ids": [
              "W|Linux|15940|ab0ee3df-a8f2-4c8f-b209-f3c71750e4d1",
              "W|Linux|15946|f3f6c893-275b-40fe-977e-486f62eaf552",
              "W|Linux|15942|b89e2510-85d7-4d7b-8052-f44e1c100664"
            ]
          },
          {
            "worker_group_id": "a94fc473-4995-42e8-9706-a969d455805f",
            "worker_ids": [
              "W|Linux|16752|657eaef3-eff5-44fd-904c-3528a00f033e",
              "W|Linux|16757|a1dd651f-db65-43ab-a84b-612e55bedc99",
              "W|Linux|16759|665ce6a1-e076-4c15-8e66-debcbb757cf5"
            ]
          }
        ]
      }
    • On failure: Any non-2XX response

  • /release: Scheduler requests to release a specific worker group. The scheduler will ensure all workers within the worker group are no longer processing tasks before it requests to release the worker group.

    • Payload:

      {
        "worker_group_ids": ["10ca89ab-c432-4de1-a6b8-b23223af79eb"]
      }
    • Successful response:

      {}
    • On failure: Any non-2XX response
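As an illustration of the webhook contract above, a Butler's /request handler might map the scheduler's task summary to new worker groups along these lines. This is only a sketch: `start_worker` and the one-worker-per-1000-queued-tasks policy are hypothetical placeholders, not part of any spec.

```python
import uuid

def handle_request(payload, start_worker):
    """Handle a Butler /request payload.

    `payload` follows the scheduler's task-summary schema above.
    `start_worker` is a hypothetical callback that launches one worker
    for a given tag set and returns its worker id.
    """
    worker_groups = []
    for task_summary in payload["tasks"]:
        group_id = str(uuid.uuid4())
        # Toy policy: one worker per 1000 queued tasks, at least 1, capped at 3.
        n_workers = min(3, max(1, task_summary["count"] // 1000))
        worker_ids = [start_worker(task_summary["tags"]) for _ in range(n_workers)]
        worker_groups.append({"worker_group_id": group_id,
                              "worker_ids": worker_ids})
    # This dict is the body of the 2XX response to the scheduler.
    return {"worker_groups": worker_groups}
```

A real Butler would replace the toy policy with backend-specific provisioning (containers, cloud instances, HPC jobs) behind the same response shape.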

Worker Scaling Policy

Worker scaling policies are still in the research phase, but ideally there should be a one-size-fits-all policy with a single tunable parameter, which we will call "responsiveness":

  • Responsiveness 0: Queue as many tasks as possible.
  • Responsiveness 1: Each task should be sent to an idle worker. If there is no idle worker, request new workers from the Butler.
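One way an intermediate responsiveness value could be interpreted, purely as an assumption since the policy is still being researched, is a linear interpolation between the two extremes:

```python
import math

def workers_to_request(queued_tasks, idle_workers, responsiveness):
    """Number of new workers to request from the Butler.

    responsiveness=0 queues everything (never requests workers);
    responsiveness=1 wants one idle worker per queued task.
    Intermediate values interpolate linearly (an assumption,
    not a finalized policy).
    """
    shortfall = max(0, queued_tasks - idle_workers)
    return math.ceil(responsiveness * shortfall)
```

With 10 queued tasks and 2 idle workers, responsiveness 0 requests nothing, 1 requests 8 workers, and 0.5 requests 4.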

Tag-Based Task Routing

Currently, Scaler treats every worker the same for routing decisions, despite possible differences in the underlying hardware or software. Tags ensure that tasks are routed to the workers best suited to handle them. This is useful for workers with special hardware such as GPUs, and for associating a worker with a user or project.

Example Tags

  • Hardware Tags
    • gpu:nvidia
    • cpu:x86
    • memory:64gb
  • Software Tags
    • python:3.11
    • java:11
  • User/Project Tags
    • user:albert
    • project:test_service

The scheduler will route a task to a worker when the tags on the task are a subset of the tags associated with the worker.
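The subset rule can be sketched in a few lines (a hypothetical helper, not Scaler's actual API; `workers` maps worker ids to their tag lists):

```python
def eligible_workers(task_tags, workers):
    """Return ids of workers whose tag set covers all of the task's tags."""
    required = set(task_tags)
    # set.issubset via <=: a task with no tags matches every worker.
    return [worker_id for worker_id, tags in workers.items()
            if required <= set(tags)]
```

For example, a task tagged ["gpu:nvidia"] matches only workers that advertise gpu:nvidia, while an untagged task matches every worker.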

Object Storage Separation

Currently the scheduler keeps all objects in memory which leads to high memory usage. It is better to decouple object storage from the scheduler, so the scheduler is solely responsible for task scheduling and not object data management. Decoupling the object storage into a separate service will make it possible to reuse existing services such as Redis and Memcached.

Object Storage API

  • put(key, value)
  • get(key)
  • delete(key)
  • list(prefix)
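A minimal sketch of this API as a Python interface, with a dict-backed reference implementation for testing (names are illustrative, not a finalized design):

```python
from abc import ABC, abstractmethod

class ObjectStorage(ABC):
    """The four-operation object storage API proposed above."""

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...

    @abstractmethod
    def list(self, prefix: str) -> list: ...

class InMemoryStorage(ObjectStorage):
    """Dict-backed reference implementation; real backends would wrap
    Redis, Memcached, S3, etc. behind the same interface."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def delete(self, key):
        del self._data[key]

    def list(self, prefix):
        return sorted(k for k in self._data if k.startswith(prefix))
```

The scheduler and workers would program against `ObjectStorage` only, making backends interchangeable.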

Storage Backends

  • In-Memory: Redis, Memcached
  • Persistent / Disk: S3/MinIO, Local Filesystem

Performance Considerations

  • To mitigate possible negative performance effects of using a universal protocol such as HTTP, a storage-backend-specific adapter will be used on the Scheduler and Worker sides to speak the storage backend's native protocol
  • Support object compression
  • Implement a Plasma-like object storage system to keep data close to each worker
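The compression bullet could be layered on top of any backend behind the put/get API; a minimal sketch using zlib (the `storage` argument is assumed to expose that API, and the helper names are illustrative):

```python
import zlib

def put_compressed(storage, key, value, level=6):
    """Compress the object bytes before handing them to the backend."""
    storage.put(key, zlib.compress(value, level))

def get_compressed(storage, key):
    """Fetch and transparently decompress an object."""
    return zlib.decompress(storage.get(key))
```

In practice the compression level would be tunable, and small or already-compressed objects could be stored as-is behind a header flag.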

Alternative Scaler Implementations

Scaler's protocol is language agnostic and can support programming languages other than Python. Scaler scheduler performance can be improved by switching to a high-performance language such as C++/Rust and client/worker implementations can be created to support workflows in other languages.

Additional Compute Backends

We've added support for IBM Spectrum Symphony and we're also looking at supporting more compute backends:

  • Containers (Butler required): Docker, Podman, LXC
  • Cloud (Butler required): AWS EC2, GCP Compute Engine, Azure VMs
  • HPC Platforms: SLURM, PBS, LSF

Note that certain compute backends utilize a Butler to manage workers and support elastic computing.

@sharpener6
Collaborator Author

@1597463007

  1. Please make each bold item a title; each point should be its own section.
  2. Please elaborate on the details of the Butler and the future separation of the scheduler and the Butler: the API between them, broadcasting or bi-directional communication, and the different backends. Do they each need different Butler implementations?
  3. Which parts will differ between backends (which parts in the chart need to be re-implemented to adapt the API)? You might need to further break down the simple chart I posted above; here is the drawio file, feel free to change it.
  4. For tag-based routing, we should not support arithmetic resource calculation; that should be handled by the Butlers.
  5. For object storage, we should first define the object storage API, then the implementations, as you said.

Use more bullet points instead of long sentences, with an explanation/example for each section.

@sharpener6 sharpener6 changed the title Road Map Scaler Road Map Feb 28, 2025
@sharpener6 sharpener6 changed the title Scaler Road Map Scaler Roadmap Feb 28, 2025
@gxuu

gxuu commented Mar 4, 2025

On "Tag-Based Task Routing".

It was mentioned above that:

The scheduler will route a task to a worker when the tags on the task are a subset of the tags associated with the worker.

I propose changing the word "subset" to "maximally intersecting set", meaning we route the task to the worker with the maximum number of matching tags.

This is because:

  • Some tasks don't care about specific hardware at all. They might leave all tags empty, which means that unless we have a default worker waiting, such a task will never be routed.
  • Some hardware info is hard to fetch without privileges. For example, GPU info used to be notoriously hard to fetch without root access. That means workers might not be able to provide all the information a task asks for.

Exact match or blur match?

  • Exact matching means giving the client (the task sender) precise information about our hardware. If that client is compromised, bad things can happen.
  • Exact matching requires the client to know more information than it needs. For example, a task that can run on any Python 3.x should only have to ask for "python:3", but it would fail to match the tag "python:3.11" provided by workers.

Tags priority?

  • Some tags have higher rank than others. If a task requires an NVIDIA card and expects to run with 64 GB of RAM, it is better to route the task to a worker with the card and 32 GB of RAM than to a worker with no card and 64 GB of RAM.
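The combination of maximal intersection and tag priority could be sketched as a weighted-overlap score (hypothetical and illustrative only; `workers` maps worker ids to tag lists, `weights` maps high-rank tags to their importance):

```python
def best_worker(task_tags, workers, weights=None):
    """Pick the worker with the largest (optionally weighted) tag overlap.

    Unweighted tags count as 1, so with no `weights` this is plain
    maximal intersection; giving e.g. "gpu:nvidia" a large weight
    makes the GPU outrank a RAM-only match.
    """
    weights = weights or {}
    required = set(task_tags)

    def score(tags):
        return sum(weights.get(tag, 1) for tag in required & set(tags))

    return max(workers, key=lambda worker_id: score(workers[worker_id]))
```

With the example above, a worker offering gpu:nvidia and memory:32gb beats one offering only memory:64gb once the GPU tag carries a higher weight.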

Finite set or infinite set?

  • Limiting what information a task sender can ask about makes the implementation easier (workers don't have to dynamically fetch system info).
  • With a finite set, workers can send their hardware info to the scheduler when they first connect. If the set is infinite, the scheduler would have to query workers every time a task asks for information the scheduler does not yet possess. Ergo, performance with the finite-set implementation will be better.

Since I have nothing better to do, I am happy to implement this. Please give me some advice on how to tackle these issues.

gxu
