[CI] Add dispatch job to leverage dynamic devices #251

Merged
wangxiyuan merged 1 commit into vllm-project:main from Yikun:lock on Mar 7, 2025

Yikun commented Mar 6, 2025

What this PR does / why we need it?

Add a dispatch job that assigns CI jobs to dynamically selected devices. The flow has two stages, as shown below:
[image]

The dispatch job adds roughly 10s * parallel number + 30s of extra time, spent waiting for other jobs to launch their containers and release the lock.
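
As a worked example of this estimate, the short Python sketch below evaluates the formula for a few job counts. It assumes that "parallel number" means the number of jobs dispatched at the same time; the function name is hypothetical and is not part of the PR.

```python
def dispatch_overhead_seconds(parallel_jobs: int) -> int:
    """Estimated extra wait from the PR description: 10s * parallel number + 30s.

    Assumption: `parallel_jobs` is the number of jobs dispatched in parallel.
    """
    if parallel_jobs < 0:
        raise ValueError("parallel_jobs must be non-negative")
    return 10 * parallel_jobs + 30


for n in (1, 3, 6):
    print(f"{n} parallel job(s): ~{dispatch_overhead_seconds(n)}s extra")
```

Under that reading, the 3 parallel jobs used in the manual test would add about 60s of waiting.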

  • Stage 1: Acquire lock
    Add a dispatch job that uses lockfile to acquire the lock and then selects a device number dynamically.
  • Stage 2.1: Launch container with dynamic device
    Pass the device number through the job output and start the container job on that device.
  • Stage 2.2: Release lock
    Once the container job has started, release the lock.
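
The stages above can be sketched roughly as follows. This is a minimal single-process Python sketch, not the PR's actual workflow: it assumes the lock is an ordinary file that is created atomically (similar in spirit to what the lockfile utility provides), and the function names, device list, busy set, and `start_container` callback are all hypothetical. In the real CI, the lock is taken in the dispatch job and released by a later job once its container has started.

```python
import os
import tempfile
import time


def acquire_lock(path, retries=30, sleep_s=1.0):
    """Create the lock file atomically; retry while another job holds it."""
    for _ in range(retries):
        try:
            # O_CREAT | O_EXCL fails if the file already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except FileExistsError:
            time.sleep(sleep_s)
    return False


def release_lock(path):
    """Remove the lock file; a missing file is not an error."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


def pick_free_device(all_devices, busy_devices):
    """Return the first device not currently in use, or None."""
    for dev in all_devices:
        if dev not in busy_devices:
            return dev
    return None


def dispatch(lock_path, all_devices, busy_devices, start_container,
             retries=30, sleep_s=1.0):
    # Stage 1: acquire the lock, then choose a device while holding it.
    if not acquire_lock(lock_path, retries=retries, sleep_s=sleep_s):
        raise TimeoutError("could not acquire dispatch lock")
    try:
        device = pick_free_device(all_devices, busy_devices)
        if device is None:
            raise RuntimeError("no free device")
        # Stage 2.1: launch the container on the chosen device.
        start_container(device)
        busy_devices.add(device)
    finally:
        # Stage 2.2: release the lock once the container has started
        # (or the attempt has failed).
        release_lock(lock_path)
    return device


if __name__ == "__main__":
    lock = os.path.join(tempfile.mkdtemp(), "dispatch.lock")
    devices = [f"/dev/davinci{i}" for i in range(8)]  # hypothetical device list
    busy = {"/dev/davinci0", "/dev/davinci1"}          # hypothetical busy set
    for job in ("job-1", "job-2", "job-3"):
        print(job, "->", dispatch(lock, devices, busy, lambda d: None))
```

Holding the lock until the container has started matters here because a device only appears busy once a container is using it; releasing earlier would let two jobs pick the same device.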

On the backend, we set up multiple self-hosted runners in separate directories on the same host to act as a load balancer:

$ pwd
/home/action
$ ll | grep actions
drwx------   6 action action 4096 Mar  7 08:55 actions-runner-01
drwx------   6 action action 4096 Mar  7 08:55 actions-runner-02
drwx------   6 action action 4096 Mar  7 08:55 actions-runner-03
drwx------   6 action action 4096 Mar  7 08:56 actions-runner-04
drwx------   4 action action 4096 Jan 24 22:08 actions-runner-05
drwx------   4 action action 4096 Jan 24 22:08 actions-runner-06

adduser -G docker action
su action
pip3 install docker prettytable
sudo yum install procmail

Does this PR introduce any user-facing change?

NO

How was this patch tested?

  • CI passed
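  • E2E tested manually by triggering 3 jobs in parallel:
      • 1st job (https://github.com/vllm-project/vllm-ascend/actions/runs/13711345757/job/38348309297) dispatched to /dev/davinci2
      • 2nd job (https://github.com/vllm-project/vllm-ascend/actions/runs/13711348739/job/38348316250) dispatched to /dev/davinci3
      • 3rd job (https://github.com/vllm-project/vllm-ascend/actions/runs/13711351493/job/38348324551) dispatched to /dev/davinci4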

Yikun changed the title from "Add dispatch job to leverage jobs to dynamic devices" to "[CI] Add dispatch job to leverage jobs to dynamic devices" on Mar 6, 2025
Yikun force-pushed the lock branch 5 times, most recently from 7a1440c to 47cad07, on March 6, 2025 12:31
Yikun force-pushed the lock branch 4 times, most recently from 3e5af83 to fd4197e, on March 7, 2025 00:55
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Yikun marked this pull request as ready for review on March 7, 2025 01:42
Yikun changed the title from "[CI] Add dispatch job to leverage jobs to dynamic devices" to "[CI] Add dispatch job to leverage dynamic devices" on Mar 7, 2025
wangxiyuan merged commit 35cb7b5 into vllm-project:main on Mar 7, 2025
Yikun added a commit to Yikun/vllm-ascend that referenced this pull request Mar 7, 2025
### What this PR does / why we need it?
Add a dispatch job that assigns CI jobs to dynamically selected devices.
The flow has two stages, described below.

The dispatch job adds roughly `10s * parallel number + 30s` of extra
time, spent waiting for other jobs to launch their containers and
release the lock.

- **Stage 1: Acquire lock**
Add a dispatch job that uses `lockfile` to acquire the lock and then
selects a device number dynamically.
- **Stage 2.1: Launch container with dynamic device**
Pass the device number through the job output and start the container
job on that device.
- **Stage 2.2: Release lock**
Once the container job has started, release the lock.

On the backend, we set up multiple self-hosted runners in separate
directories on the same host to act as a load balancer:
```
$ pwd
/home/action
$ ll | grep actions
drwx------   6 action action 4096 Mar  7 08:55 actions-runner-01
drwx------   6 action action 4096 Mar  7 08:55 actions-runner-02
drwx------   6 action action 4096 Mar  7 08:55 actions-runner-03
drwx------   6 action action 4096 Mar  7 08:56 actions-runner-04
drwx------   4 action action 4096 Jan 24 22:08 actions-runner-05
drwx------   4 action action 4096 Jan 24 22:08 actions-runner-06
```

```
adduser -G docker action
su action
pip3 install docker prettytable
sudo yum install procmail
```

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
- CI passed
- E2E tested manually by triggering 3 jobs in parallel:
  - [1st job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711345757/job/38348309297) dispatched to /dev/davinci2
  - [2nd job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711348739/job/38348316250) dispatched to /dev/davinci3
  - [3rd job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711351493/job/38348324551) dispatched to /dev/davinci4

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
wangxiyuan pushed a commit that referenced this pull request Mar 7, 2025
### What this PR does / why we need it?
Backport: #251

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
ttanzhiqiang pushed a commit to ttanzhiqiang/vllm-ascend that referenced this pull request Apr 27, 2025

Yikun commented Jul 7, 2025

See also the follow-up bug fix 663dca7, which removes the lock. @pkking

  post_cleanup:
    name: vLLM Ascend test (post-cleanup)
    needs: [test]
    runs-on: ascend-ci-arm64  # actionlint-ignore: runner-label
    if: always()
    steps:
      - name: Remove dispatch lock if exists
        run: |
          if [ -f "/tmp/dispatch.lock" ]; then
            rm -f "/tmp/dispatch.lock"
          fi

yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 6, 2026