
Scaler Roadmap #60

Open
sharpener6 opened this issue Feb 28, 2025 · 3 comments
Assignees
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments

@sharpener6
Collaborator

sharpener6 commented Feb 28, 2025

This issue describes what the future state of Scaler should be. Getting there will require a lot of work; below is a very simplified example layout:

[Image: simplified example layout of the future Scaler architecture]

@sharpener6 sharpener6 added documentation Improvements or additions to documentation enhancement New feature or request labels Feb 28, 2025
@sharpener6 sharpener6 self-assigned this Feb 28, 2025
@1597463007
Contributor

1597463007 commented Feb 28, 2025

Butler: Worker Orchestration Service

Butler is an optional service that handles worker lifecycle management on behalf of Scaler. Webhooks are used for Scheduler-Butler communication, and anyone can write a Butler service implementation on the platform of their choice without knowing the internal mechanics of Scaler.

Butler Webhook API

  • /request: Scheduler requests additional worker groups. It is up to the Butler to decide how many worker groups to create based on the tag info it receives from the scheduler. A worker group is a collection of workers, normally tied to a common hardware resource such as a container or a physical node.

    • Payload: The scheduler provides a summary of the tasks currently queued, which the Butler can use to start up workers of a specific type:

      {
        "tasks": [
          {
            "tags": [],
            "count": 4716
          },
          {
            "tags": ["gpu:nvidia"],
            "count": 15
          }
        ]
      }
    • Successful response: Butler responds with the worker groups created

      {
        "worker_groups": [
          {
            "worker_group_id": "10ca89ab-c432-4de1-a6b8-b23223af79eb",
            "worker_ids": [
              "W|Linux|15940|ab0ee3df-a8f2-4c8f-b209-f3c71750e4d1",
              "W|Linux|15946|f3f6c893-275b-40fe-977e-486f62eaf552",
              "W|Linux|15942|b89e2510-85d7-4d7b-8052-f44e1c100664"
            ]
          },
          {
            "worker_group_id": "a94fc473-4995-42e8-9706-a969d455805f",
            "worker_ids": [
              "W|Linux|16752|657eaef3-eff5-44fd-904c-3528a00f033e",
              "W|Linux|16757|a1dd651f-db65-43ab-a84b-612e55bedc99",
              "W|Linux|16759|665ce6a1-e076-4c15-8e66-debcbb757cf5"
            ]
          }
        ]
      }
    • On failure: Any non-2XX response

  • /release: Scheduler requests to release a specific worker group. The scheduler will ensure all workers within the worker group are no longer processing tasks before it requests to release the worker group.

    • Payload:

      {
        "worker_group_ids": ["10ca89ab-c432-4de1-a6b8-b23223af79eb"]
      }
    • Successful response:

      {}
    • On failure: Any non-2XX response
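As an illustration of the webhook contract above, a Butler's /request handler might map the scheduler's task summary to new worker groups along these lines. This is only a sketch: `start_worker` and the one-worker-per-1000-queued-tasks policy are hypothetical placeholders, not part of any spec.

```python
import uuid

def handle_request(payload, start_worker):
    """Handle a Butler /request payload.

    `payload` follows the scheduler's task-summary schema above.
    `start_worker` is a hypothetical callback that launches one worker
    for a given tag set and returns its worker id.
    """
    worker_groups = []
    for task_summary in payload["tasks"]:
        group_id = str(uuid.uuid4())
        # Toy policy: one worker per 1000 queued tasks, at least 1, capped at 3.
        n_workers = min(3, max(1, task_summary["count"] // 1000))
        worker_ids = [start_worker(task_summary["tags"]) for _ in range(n_workers)]
        worker_groups.append({"worker_group_id": group_id,
                              "worker_ids": worker_ids})
    # This dict is the body of the 2XX response to the scheduler.
    return {"worker_groups": worker_groups}
```

A real Butler would replace the toy policy with backend-specific provisioning (containers, cloud instances, HPC jobs) behind the same response shape.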

Worker Scaling Policy

Worker scaling policies are still in the research phase, but ideally there should be a one-size-fits-all policy with a single tunable parameter, which we will call "responsiveness":

  • Responsiveness 0: Queue as many tasks as possible.
  • Responsiveness 1: Each task should be sent to an idle worker. If there is no idle worker, request new workers from the Butler.
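One way an intermediate responsiveness value could be interpreted, purely as an assumption since the policy is still being researched, is a linear interpolation between the two extremes:

```python
import math

def workers_to_request(queued_tasks, idle_workers, responsiveness):
    """Number of new workers to request from the Butler.

    responsiveness=0 queues everything (never requests workers);
    responsiveness=1 wants one idle worker per queued task.
    Intermediate values interpolate linearly (an assumption,
    not a finalized policy).
    """
    shortfall = max(0, queued_tasks - idle_workers)
    return math.ceil(responsiveness * shortfall)
```

With 10 queued tasks and 2 idle workers, responsiveness 0 requests nothing, 1 requests 8 workers, and 0.5 requests 4.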

Tag-Based Task Routing

Currently, Scaler treats every worker the same for routing decisions, despite possible differences in the underlying hardware or software. Tags ensure that tasks are routed to the workers best suited to handle them. This is useful for workers with special hardware such as GPUs, and for associating a worker with a user or project.

Example Tags

  • Hardware Tags
    • gpu:nvidia
    • cpu:x86
    • memory:64gb
  • Software Tags
    • python:3.11
    • java:11
  • User/Project Tags
    • user:albert
    • project:test_service

The scheduler will route a task to a worker when the tags on the task are a subset of the tags associated with the worker.
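The subset rule can be sketched in a few lines (a hypothetical helper, not Scaler's actual API; `workers` maps worker ids to their tag lists):

```python
def eligible_workers(task_tags, workers):
    """Return ids of workers whose tag set covers all of the task's tags."""
    required = set(task_tags)
    # set.issubset via <=: a task with no tags matches every worker.
    return [worker_id for worker_id, tags in workers.items()
            if required <= set(tags)]
```

For example, a task tagged ["gpu:nvidia"] matches only workers that advertise gpu:nvidia, while an untagged task matches every worker.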

Object Storage Separation

Currently the scheduler keeps all objects in memory which leads to high memory usage. It is better to decouple object storage from the scheduler, so the scheduler is solely responsible for task scheduling and not object data management. Decoupling the object storage into a separate service will make it possible to reuse existing services such as Redis and Memcached.

Object Storage API

  • put(key, value)
  • get(key)
  • delete(key)
  • list(prefix)
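A minimal sketch of this API as a Python interface, with a dict-backed reference implementation for testing (names are illustrative, not a finalized design):

```python
from abc import ABC, abstractmethod

class ObjectStorage(ABC):
    """The four-operation object storage API proposed above."""

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...

    @abstractmethod
    def list(self, prefix: str) -> list: ...

class InMemoryStorage(ObjectStorage):
    """Dict-backed reference implementation; real backends would wrap
    Redis, Memcached, S3, etc. behind the same interface."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def delete(self, key):
        del self._data[key]

    def list(self, prefix):
        return sorted(k for k in self._data if k.startswith(prefix))
```

The scheduler and workers would program against `ObjectStorage` only, making backends interchangeable.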

Storage Backends

  • In-Memory: Redis, Memcached
  • Persistent / Disk: S3/MinIO, Local Filesystem

Performance Considerations

  • To mitigate possible negative performance effects of using a universal protocol such as HTTP, a storage-backend-specific adapter will be used on the Scheduler and Worker sides to speak the storage backend's native protocol
  • Support object compression
  • Implement a Plasma-like object storage system to keep data close to each worker
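The compression bullet could be layered on top of any backend behind the put/get API; a minimal sketch using zlib (the `storage` argument is assumed to expose that API, and the helper names are illustrative):

```python
import zlib

def put_compressed(storage, key, value, level=6):
    """Compress the object bytes before handing them to the backend."""
    storage.put(key, zlib.compress(value, level))

def get_compressed(storage, key):
    """Fetch and transparently decompress an object."""
    return zlib.decompress(storage.get(key))
```

In practice the compression level would be tunable, and small or already-compressed objects could be stored as-is behind a header flag.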

Alternative Scaler Implementations

Scaler's protocol is language agnostic and can support programming languages other than Python. Scaler scheduler performance can be improved by switching to a high-performance language such as C++/Rust and client/worker implementations can be created to support workflows in other languages.

Additional Compute Backends

We've added support for IBM Spectrum Symphony and we're also looking at supporting more compute backends:

  • Containers (Butler required): Docker, Podman, LXC
  • Cloud (Butler required): AWS EC2, GCP Compute Engine, Azure VMs
  • HPC Platforms: SLURM, PBS, LSF

Note that certain compute backends utilize a Butler to manage workers and support elastic computing.

@sharpener6
Collaborator Author

@1597463007

  1. Please make each bold item a title; each point should be its own section.
  2. Please elaborate on the details of the Butler and the future separation of the scheduler and the Butler: the API between them, broadcasting or bi-directional communication, and the different backends. Do they each need different Butler implementations?
  3. Which parts will differ between backends (which parts in the chart need to be re-implemented to adapt the API)? You might need to further break down the simple chart I posted above; here is the drawio file, feel free to change it.
  4. For tag-based routing, we should not support arithmetic resource calculation; that should be handled by the Butlers.
  5. For object storage, we should first define the object storage API, then the implementations, as you said.

Use more bullet points instead of long sentences, with an explanation/example for each section.

@sharpener6 sharpener6 changed the title Road Map Scaler Road Map Feb 28, 2025
@sharpener6 sharpener6 changed the title Scaler Road Map Scaler Roadmap Feb 28, 2025
@gxuu

gxuu commented Mar 4, 2025

On "Tag-Based Task Routing".

It was mentioned above that:

The scheduler will route a task to a worker when the tags on the task are a subset of the tags associated with the worker.

I propose changing the word "subset" to "maximally intersecting set", meaning we route the task to the worker with the maximum number of matching tags.

This is because:

  • Some tasks don't care about specific hardware at all. They might leave all tags empty, which means that unless we have a default worker waiting, such a task will never be routed.
  • Some hardware info is hard to fetch without privileges. For example, GPU info used to be notoriously hard to fetch without root access. That means workers might not be able to provide all the information a task asks for.

Exact match or blur match?

  • Exact matching means giving the client (the task sender) precise information about our hardware. If that client is compromised, bad things can happen.
  • Exact matching requires the client to know more information than it needs. For example, a task that can run on any Python 3.x should only have to ask for "python:3", but it would fail to match the tag "python:3.11" provided by workers.

Tags priority?

  • Some tags have higher rank than others. If a task requires an NVIDIA card and expects to run with 64 GB of RAM, it is better to route the task to a worker with the card and 32 GB of RAM than to a worker with no card and 64 GB of RAM.
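The combination of maximal intersection and tag priority could be sketched as a weighted-overlap score (hypothetical and illustrative only; `workers` maps worker ids to tag lists, `weights` maps high-rank tags to their importance):

```python
def best_worker(task_tags, workers, weights=None):
    """Pick the worker with the largest (optionally weighted) tag overlap.

    Unweighted tags count as 1, so with no `weights` this is plain
    maximal intersection; giving e.g. "gpu:nvidia" a large weight
    makes the GPU outrank a RAM-only match.
    """
    weights = weights or {}
    required = set(task_tags)

    def score(tags):
        return sum(weights.get(tag, 1) for tag in required & set(tags))

    return max(workers, key=lambda worker_id: score(workers[worker_id]))
```

With the example above, a worker offering gpu:nvidia and memory:32gb beats one offering only memory:64gb once the GPU tag carries a higher weight.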

Finite set or infinite set?

  • Limiting what information a task sender can ask about makes the implementation easier (workers don't have to dynamically fetch system info).
  • With a finite set, workers can send their hardware info to the scheduler when they first connect. If the set is infinite, the scheduler would have to query workers every time a task asks for information the scheduler does not yet possess. Ergo, performance with the finite-set implementation will be better.

Since I have nothing better to do, I am happy to implement this. Please give me some advice on how to tackle these issues.

gxu
