[v0.7.3][CI] Add dispatch job to leverage dynamic devices (#251) #270

Merged: wangxiyuan merged 1 commit into vllm-project:v0.7.3-dev from Yikun:251-dispatch on Mar 7, 2025

@Yikun Yikun commented Mar 7, 2025

Backport: #251

### What this PR does / why we need it?
Add a dispatch job that assigns CI jobs to dynamically selected devices. It works in two stages:

- **Stage 1: Acquire lock**
  Add a dispatch job that uses `lockfile` to acquire a lock and then picks a device number dynamically.
- **Stage 2.1: Launch container with dynamic device**
  Pass the device number via the job output and start the container job on that device.
- **Stage 2.2: Release lock**
  Release the lock once the container job has started.

The dispatch job adds roughly `10s * parallel number + 30s` of extra time, spent waiting for other jobs to launch their containers and release the lock.
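
The lock-then-dispatch flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from this PR: `acquire_lock`, `release_lock`, and `pick_free_device` are hypothetical helpers, the lock is emulated with an exclusively created file (similar in spirit to the `lockfile` command), and the set of busy devices is assumed to be supplied by the caller.

```python
import os
import time


def acquire_lock(path, retry_interval=1.0, max_retries=30):
    """Try to create `path` exclusively; retry while another job holds it.

    Returns True once the lock file was created, False if all retries failed.
    """
    for _ in range(max_retries):
        try:
            # O_EXCL makes creation fail if the file already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            time.sleep(retry_interval)
            continue
        os.close(fd)
        return True
    return False


def release_lock(path):
    """Remove the lock file so the next dispatch job can proceed."""
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass  # Already released; nothing to do.


def pick_free_device(device_count, busy):
    """Return the lowest device index not in `busy`, or None if all are taken."""
    for index in range(device_count):
        if index not in busy:
            return index
    return None
```

In this sketch, Stage 1 would call `acquire_lock` followed by `pick_free_device`, and Stage 2.2 would call `release_lock` after the container has started. For instance, with devices 0 and 1 busy, `pick_free_device(8, {0, 1})` returns `2`, which would correspond to `/dev/davinci2`. Taking the `10s * parallel number + 30s` estimate at face value, three parallel jobs would add about 10 × 3 + 30 = 60 seconds of waiting.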

On the backend, multiple self-hosted runners are set up in separate directories on the same host so that they act as a load balancer:
```
$ pwd
/home/action
$ ll | grep actions
drwx------   6 action action 4096 Mar  7 08:55 actions-runner-01
drwx------   6 action action 4096 Mar  7 08:55 actions-runner-02
drwx------   6 action action 4096 Mar  7 08:55 actions-runner-03
drwx------   6 action action 4096 Mar  7 08:56 actions-runner-04
drwx------   4 action action 4096 Jan 24 22:08 actions-runner-05
drwx------   4 action action 4096 Jan 24 22:08 actions-runner-06
```

```
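# Create the runner user and give it access to Docker
adduser -G docker action
su action
# Python dependencies
pip3 install docker prettytable
# procmail provides the `lockfile` command
sudo yum install procmail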
```
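
The `docker` package installed above suggests that running containers are inspected to find out which devices are already in use. The sketch below is an assumption about how that could look, not the PR's actual script: `busy_device_indices` is a hypothetical helper, and its input is assumed to be a list of dicts shaped like `docker inspect` output, where `HostConfig.Devices` lists the mapped devices by `PathOnHost`.

```python
import re

# Matches numbered device nodes such as /dev/davinci2, but not
# helper nodes such as /dev/davinci_manager.
_DEVICE_RE = re.compile(r"^/dev/davinci(\d+)$")


def busy_device_indices(containers):
    """Collect the device indices mapped into the given containers.

    `containers` is an iterable of dicts shaped like `docker inspect` output.
    """
    busy = set()
    for attrs in containers:
        # "Devices" may be missing or null when no device is mapped.
        devices = attrs.get("HostConfig", {}).get("Devices") or []
        for device in devices:
            match = _DEVICE_RE.match(device.get("PathOnHost", ""))
            if match:
                busy.add(int(match.group(1)))
    return busy
```

With the Docker SDK for Python, the input could plausibly come from `[c.attrs for c in docker.from_env().containers.list()]`. The resulting set is what a dispatcher would exclude when choosing a free device.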

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
- CI passed
- E2E tested manually by triggering 3 jobs in parallel:
  - [1st job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711345757/job/38348309297) dispatched to `/dev/davinci2`
  - [2nd job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711348739/job/38348316250) dispatched to `/dev/davinci3`
  - [3rd job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711351493/job/38348324551) dispatched to `/dev/davinci4`

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
@Yikun Yikun marked this pull request as draft March 7, 2025 12:03
@Yikun Yikun marked this pull request as ready for review March 7, 2025 12:28
@wangxiyuan wangxiyuan merged commit 806235f into vllm-project:v0.7.3-dev Mar 7, 2025