Modal CI #7289
Changes from 67 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,103 @@ | ||
| name: modal-accelerate | ||
|
|
||
| # This CI is running on modal.com's GPUs. | ||
| # | ||
| # It's set up here on github actions and then the cloned repo is sent to modal and everything | ||
| # happens on their hw - see deepspeed/modal_ci/accelerate.py for where the actual vm is loaded, updated and the tests are | ||
| # run. | ||
| # | ||
| # Both files are annotated to show what's important and how one might change or update things if needed. | ||
| # | ||
| # Note that since this is a Required job we can't use the `on.push.paths` file filter - we are using | ||
| # collect-tests job to do the filtering for us so that the job can be skipped and satisfy the | ||
| # Required status for PRs to pass. | ||
| # | ||
|
|
||
|
|
||
| on: | ||
| workflow_dispatch: | ||
| push: | ||
| branches: | ||
| - master | ||
|
|
||
| pull_request: | ||
| paths-ignore: | ||
| - 'docs/**' | ||
| - 'blogs/**' | ||
| - 'deepspeed/inference/v2/**' | ||
| - 'tests/unit/inference/v2/**' | ||
| types: [draft, opened, ready_for_review, synchronize] | ||
| branches: | ||
| - master | ||
|
|
||
| concurrency: | ||
| group: ${{ github.workflow }}-${{ github.ref || github.run_id }} | ||
| cancel-in-progress: true | ||
|
|
||
| jobs: | ||
| collect-tests: | ||
| name: Collect tests to run | ||
| runs-on: ubuntu-latest | ||
| permissions: | ||
| contents: read | ||
| pull-requests: read | ||
| outputs: | ||
| deepspeed: ${{ steps.filter.outputs.deepspeed }} | ||
|
|
||
| steps: | ||
| - name: Checkout repository | ||
| uses: actions/checkout@v4 | ||
| with: | ||
| lfs: true | ||
|
|
||
| - name: Filter changed files | ||
| uses: dorny/paths-filter@v2 | ||
| id: filter | ||
| with: | ||
| token: ${{ secrets.GITHUB_TOKEN }} | ||
| filters: | | ||
| deepspeed: | ||
| - 'deepspeed/**/*.py' | ||
| - '.github/workflows/deepspeed.yml' | ||
| deploy: | ||
| name: DeepSpeedAI CI | ||
| runs-on: ubuntu-latest | ||
| needs: collect-tests | ||
| env: | ||
| # these are created at https://modal.com/settings/deepspeedai/tokens | ||
| # they are then added to the repo's secrets at https://github.com/deepspeedai/deepspeed/settings/secrets/actions | ||
| MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }} | ||
| MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }} | ||
| # this one comes from https://huggingface.co/settings/profile of the bot user | ||
| # and it too is then updated at https://github.com/deepspeedai/deepspeed/settings/secrets/actions | ||
| HF_TOKEN: ${{ secrets.HF_TOKEN }} | ||
|
|
||
| if: needs.collect-tests.outputs.deepspeed == 'true' | ||
| steps: | ||
| - name: Checkout Repository | ||
| uses: actions/checkout@v4 | ||
| with: | ||
| lfs: true | ||
|
|
||
| - name: Install Python | ||
| uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: "3.10" | ||
| cache: 'pip' # caching pip dependencies | ||
|
|
||
| - name: Install build dependencies | ||
| # TODO: Need to determine whether these installs are made redundant by the subsequent modal.Image call. | ||
| run: | | ||
| pip install uv==0.4.0 # much faster than pip | ||
| uv pip install --system modal | ||
| uv pip install --system .[dev,autotuning] | ||
| ds_report | ||
| # time uv pip compile arctic_training/setup.py --extra all -o arctic_training/ci-requirements.txt | ||
| # # add vllm manually to deps since it fails pip compile w/o CUDA_HOME being set in github actions | ||
| # # if changing the version here also change it in setup.py to match | ||
| # echo 'vllm==0.6.2' >> arctic_training/ci-requirements.txt | ||
| - name: Run tests | ||
| run: | | ||
| modal run -m deepspeed.modal_ci.accelerate | ||
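The gating that the workflow comments describe (`collect-tests` decides, `deploy` runs only when the `deepspeed` output is `'true'`) can be sketched in Python. This is an illustrative approximation only: `glob_to_regex` and `filter_outputs` are hypothetical helpers, not part of `dorny/paths-filter`, and the real action's glob matcher may differ in edge cases such as dotfiles, braces, or negation.

```python
import re

# Filter definitions copied from the workflow's `filters:` block.
FILTERS = {
    "deepspeed": ["deepspeed/**/*.py", ".github/workflows/deepspeed.yml"],
}


def glob_to_regex(pattern):
    """Translate a small glob subset into a regex (hypothetical helper).

    Only `**/` (zero or more directories) and `*` (anything except `/`)
    are handled; everything else is matched literally.
    """
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append(r"(?:[^/]+/)*")
            i += 3
        elif pattern[i] == "*":
            out.append(r"[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def filter_outputs(changed_files):
    """Return string outputs like the ones the `deploy` job's `if:` compares."""
    result = {}
    for name, patterns in FILTERS.items():
        regexes = [glob_to_regex(p) for p in patterns]
        hit = any(r.fullmatch(f) for f in changed_files for r in regexes)
        result[name] = "true" if hit else "false"
    return result


def deploy_should_run(changed_files):
    # Mirrors `if: needs.collect-tests.outputs.deepspeed == 'true'`.
    return filter_outputs(changed_files)["deepspeed"] == "true"
```

Under these assumptions, a PR that touches only `docs/index.md` would leave `deploy` skipped, which is what lets a Required check pass without running GPU tests.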
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,103 @@ | ||
| name: modal-torch-latest | ||
|
|
||
| # This CI is running on modal.com's GPUs. | ||
| # | ||
| # It's set up here on github actions and then the cloned repo is sent to modal and everything | ||
| # happens on their hw - see deepspeed/modal_ci/torch_latest.py for where the actual vm is loaded, updated and the tests are | ||
| # run. | ||
| # | ||
| # Both files are annotated to show what's important and how one might change or update things if needed. | ||
| # | ||
| # Note that since this is a Required job we can't use the `on.push.paths` file filter - we are using | ||
| # collect-tests job to do the filtering for us so that the job can be skipped and satisfy the | ||
| # Required status for PRs to pass. | ||
| # | ||
|
|
||
|
|
||
| on: | ||
| workflow_dispatch: | ||
| push: | ||
| branches: | ||
| - master | ||
|
|
||
| pull_request: | ||
| paths-ignore: | ||
| - 'docs/**' | ||
| - 'blogs/**' | ||
| - 'deepspeed/inference/v2/**' | ||
| - 'tests/unit/inference/v2/**' | ||
| types: [draft, opened, ready_for_review, synchronize] | ||
| branches: | ||
| - master | ||
|
|
||
| concurrency: | ||
| group: ${{ github.workflow }}-${{ github.ref || github.run_id }} | ||
| cancel-in-progress: true | ||
|
|
||
| jobs: | ||
| collect-tests: | ||
| name: Collect tests to run | ||
| runs-on: ubuntu-latest | ||
| permissions: | ||
| contents: read | ||
| pull-requests: read | ||
| outputs: | ||
| deepspeed: ${{ steps.filter.outputs.deepspeed }} | ||
|
|
||
| steps: | ||
| - name: Checkout repository | ||
| uses: actions/checkout@v4 | ||
| with: | ||
| lfs: true | ||
|
|
||
| - name: Filter changed files | ||
| uses: dorny/paths-filter@v2 | ||
| id: filter | ||
| with: | ||
| token: ${{ secrets.GITHUB_TOKEN }} | ||
| filters: | | ||
| deepspeed: | ||
| - 'deepspeed/**/*.py' | ||
| - '.github/workflows/deepspeed.yml' | ||
|
|
||
| deploy: | ||
| name: DeepSpeedAI CI | ||
| runs-on: ubuntu-latest | ||
| needs: collect-tests | ||
| env: | ||
| # these are created at https://modal.com/settings/deepspeedai/tokens | ||
| # they are then added to the repo's secrets at https://github.com/deepspeedai/deepspeed/settings/secrets/actions | ||
| MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }} | ||
| MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }} | ||
| # this one comes from https://huggingface.co/settings/profile of the bot user | ||
| # and it too is then updated at https://github.com/deepspeedai/deepspeed/settings/secrets/actions | ||
| HF_TOKEN: ${{ secrets.HF_TOKEN }} | ||
|
|
||
| if: needs.collect-tests.outputs.deepspeed == 'true' | ||
| steps: | ||
| - name: Checkout Repository | ||
| uses: actions/checkout@v4 | ||
| with: | ||
| lfs: true | ||
|
|
||
| - name: Install Python | ||
| uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: "3.10" | ||
| cache: 'pip' # caching pip dependencies | ||
|
|
||
| - name: Install build dependencies | ||
| # TODO: Need to determine whether these installs are redundant by subsequent modal.Image call. | ||
| run: | | ||
| pip install uv==0.4.0 # much faster than pip | ||
| uv pip install --system modal | ||
| uv pip install --system .[dev,1bit,autotuning,deepcompile] | ||
| ds_report | ||
| # time uv pip compile arctic_training/setup.py --extra all -o arctic_training/ci-requirements.txt | ||
| # # add vllm manually to deps since it fails pip compile w/o CUDA_HOME being set in github actions | ||
| # # if changing the version here also change it in setup.py to match | ||
| # echo 'vllm==0.6.2' >> arctic_training/ci-requirements.txt | ||
|
|
||
| - name: Run tests | ||
| run: | | ||
| modal run -m ci.torch_latest |
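The `concurrency` block in both workflows keys runs by `${{ github.workflow }}-${{ github.ref || github.run_id }}` and sets `cancel-in-progress: true`. A minimal Python sketch of that behaviour follows. The function names are hypothetical, the ref value in the usage note is only an example, and the model assumes every earlier run is still in progress when the next one starts.

```python
from typing import Dict, List, Optional, Tuple


def concurrency_group(workflow: str, ref: Optional[str], run_id: int) -> str:
    """Mimic `github.ref || github.run_id`: fall back to run_id if ref is empty."""
    return f"{workflow}-{ref or run_id}"


def cancelled_runs(runs: List[Tuple[str, Optional[str], int]]) -> List[int]:
    """Return ids of runs that a newer run in the same group would cancel.

    `runs` is an ordered list of (workflow, ref, run_id) tuples.
    """
    active: Dict[str, int] = {}
    cancelled: List[int] = []
    for workflow, ref, run_id in runs:
        group = concurrency_group(workflow, ref, run_id)
        if group in active:
            # cancel-in-progress: the older run in this group is cancelled.
            cancelled.append(active[group])
        active[group] = run_id
    return cancelled
```

For example, two pushes to the same PR ref under `modal-torch-latest` share a group, so the first run is cancelled, while a `modal-accelerate` run on the same ref is unaffected because the workflow name is part of the key.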
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| # Copyright (c) DeepSpeed Team. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| # DeepSpeed Team |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,53 @@ | ||
| # Copyright (c) Snowflake. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| # DeepSpeed Team | ||
|
|
||
| from pathlib import Path | ||
|
|
||
| import modal | ||
|
|
||
| ROOT_PATH = Path(__file__).parents[1] | ||
|
|
||
| # yapf: disable | ||
| image = (modal.Image | ||
| .from_registry("pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel", add_python="3.10") | ||
|
||
| .run_commands("apt update && apt install -y libaio-dev") | ||
| .apt_install("git") | ||
| .run_commands("uv pip install --system --compile-bytecode datasets==3.6.0") | ||
| .run_commands( | ||
| "git clone https://github.com/huggingface/accelerate && \ | ||
| uv pip install --system --compile-bytecode ./accelerate[testing]" | ||
| ) | ||
| .run_commands("uv pip install --system --compile-bytecode protobuf") | ||
| .run_commands("uv pip list") | ||
| .pip_install_from_requirements(ROOT_PATH / "requirements/requirements.txt", gpu="any") | ||
| .pip_install_from_requirements(ROOT_PATH / "requirements/requirements-dev.txt", gpu="any") | ||
| .run_commands("pip show deepspeed") | ||
| .add_local_dir(ROOT_PATH / "deepspeed", remote_path="/root/deepspeed") | ||
| .add_local_dir(ROOT_PATH / "accelerator", remote_path="/root/accelerator") | ||
| .add_local_dir(ROOT_PATH / "accelerator", remote_path="/root/deepspeed/accelerator") | ||
| .add_local_dir(ROOT_PATH / "csrc", remote_path="/root/csrc") | ||
| .add_local_dir(ROOT_PATH / "csrc", remote_path="/root/deepspeed/ops/csrc") | ||
| .add_local_dir(ROOT_PATH / "op_builder", remote_path="/root/op_builder") | ||
| .add_local_dir(ROOT_PATH / "op_builder", remote_path="/root/deepspeed/ops/op_builder") | ||
| .add_local_dir(ROOT_PATH / "tests", remote_path="/root/tests") | ||
| .add_local_dir(ROOT_PATH / "ci", remote_path="/root/ci") | ||
| .add_local_dir(ROOT_PATH / "ci", remote_path="/root/deepspeed/ci") | ||
| ) | ||
|
|
||
| app = modal.App("deepspeedai-accelerate-ci", image=image) | ||
|
|
||
| @app.function( | ||
| gpu="l40s:1", | ||
| # gpu="a10g:2", | ||
| # secrets=[modal.Secret.from_local_environ(["HF_TOKEN"])], | ||
| timeout=1800, | ||
| ) | ||
| def pytest(): | ||
| import subprocess | ||
| subprocess.run( | ||
| "pytest -sv /accelerate/tests/deepspeed".split(), | ||
| check=True, | ||
| cwd=ROOT_PATH / ".", | ||
| ) | ||
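`ROOT_PATH` depends on how deep the module sits in the repository: this file uses `parents[1]`, while another version of the file in this PR uses `parents[2]`. A small sketch of the indexing, using hypothetical checkout paths (the `/repo/...` locations are assumptions inferred from the `modal run -m` arguments in the workflows, not paths shown in the diff):

```python
from pathlib import PurePosixPath


def repo_root(script: str, depth: int) -> PurePosixPath:
    """Return the ancestor `depth` levels above the script's own directory.

    parents[0] is the containing directory, parents[1] its parent, and so on.
    """
    return PurePosixPath(script).parents[depth]


# Hypothetical layout 1: the module lives in a top-level `ci/` directory.
root_from_ci = repo_root("/repo/ci/accelerate.py", 1)

# Hypothetical layout 2: the module lives in `deepspeed/modal_ci/`.
root_from_pkg = repo_root("/repo/deepspeed/modal_ci/accelerate.py", 2)
```

Both calls resolve to the same repository root, which is why moving the module one directory deeper requires bumping the index by one.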
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,47 @@ | ||
| # Copyright (c) Snowflake. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| # DeepSpeed Team | ||
|
|
||
| from pathlib import Path | ||
|
|
||
| import modal | ||
|
|
||
| ROOT_PATH = Path(__file__).parents[1] | ||
|
|
||
| # yapf: disable | ||
| image = (modal.Image | ||
| .from_registry("pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel", add_python="3.10") | ||
| .run_commands("apt update && apt install -y libaio-dev") | ||
| .run_commands("uv pip list") | ||
| .pip_install_from_requirements(ROOT_PATH / "requirements/requirements.txt", gpu="any") | ||
| .pip_install_from_requirements(ROOT_PATH / "requirements/requirements-dev.txt", gpu="any") | ||
| .add_local_dir(ROOT_PATH / "deepspeed", remote_path="/root/deepspeed") | ||
| .add_local_dir(ROOT_PATH / "accelerator", remote_path="/root/accelerator") | ||
| .add_local_dir(ROOT_PATH / "accelerator", remote_path="/root/deepspeed/accelerator") | ||
| .add_local_dir(ROOT_PATH / "csrc", remote_path="/root/csrc") | ||
| .add_local_dir(ROOT_PATH / "csrc", remote_path="/root/deepspeed/ops/csrc") | ||
| .add_local_dir(ROOT_PATH / "op_builder", remote_path="/root/op_builder") | ||
| .add_local_dir(ROOT_PATH / "op_builder", remote_path="/root/deepspeed/ops/op_builder") | ||
| .add_local_dir(ROOT_PATH / "tests", remote_path="/root/tests") | ||
| .add_local_dir(ROOT_PATH / "ci", remote_path="/root/ci") | ||
| .add_local_dir(ROOT_PATH / "ci", remote_path="/root/deepspeed/ci") | ||
| ) | ||
|
|
||
|
|
||
| app = modal.App("deepspeedai-torch-latest-ci", image=image) | ||
|
|
||
|
|
||
| @app.function( | ||
| gpu="l40s:2", | ||
| # gpu="a10g:2", | ||
| # secrets=[modal.Secret.from_local_environ(["HF_TOKEN"])], | ||
| timeout=1800, | ||
| ) | ||
| def pytest(): | ||
| import subprocess | ||
| subprocess.run( | ||
| "pytest -n 4 --verbose tests/unit/runtime/zero/test_zero.py tests/unit/runtime/half_precision/test_bf16.py --torch_ver=2.6 --cuda_ver=12.4".split(), | ||
| check=True, | ||
| cwd=ROOT_PATH / ".", | ||
| ) |
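The pytest command above is turned into an argument list with `str.split()`. That works here because no argument contains spaces. As a hedged aside, the sketch below contrasts it with the standard library's `shlex.split`, which would matter only if a quoted argument (the `-k` expression here is a made-up example) were ever added:

```python
import shlex

# The command string from the diff: plain whitespace splitting is enough.
cmd = ("pytest -n 4 --verbose tests/unit/runtime/zero/test_zero.py "
       "tests/unit/runtime/half_precision/test_bf16.py "
       "--torch_ver=2.6 --cuda_ver=12.4")
argv_plain = cmd.split()
argv_shlex = shlex.split(cmd)

# Hypothetical variant with a quoted argument: the two approaches diverge.
quoted = "pytest -k 'zero and not offload'"
quoted_plain = quoted.split()       # breaks the quoted expression apart
quoted_shlex = shlex.split(quoted)  # keeps it as one argument
```

With the command as written in the PR both lists are identical, so this is a robustness note rather than a bug report.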
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| # Copyright (c) DeepSpeed Team. | ||
|
Contributor
why is
Collaborator
Author
I am open to suggestions here. My thinking is that CI assets reside in the package. In this case, it looked like we are transferring logic from
Contributor
Alternatively, you could also do:
Collaborator
Author
Thanks. I will go with |
||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| # DeepSpeed Team | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,49 @@ | ||
| # Copyright (c) Snowflake. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| # DeepSpeed Team | ||
|
|
||
| from pathlib import Path | ||
|
|
||
| import modal | ||
|
|
||
| ROOT_PATH = Path(__file__).parents[2] | ||
|
|
||
| # yapf: disable | ||
| image = (modal.Image | ||
| .from_registry("pytorch/pytorch:2.7.0-cuda12.6-cudnn9-devel", add_python="3.10") | ||
| .run_commands("apt update && apt install -y libaio-dev") | ||
| .apt_install("git") | ||
| .run_commands("uv pip install --system --compile-bytecode datasets==3.6.0") | ||
| .run_commands( | ||
| "git clone https://github.com/huggingface/accelerate && \ | ||
| uv pip install --system --compile-bytecode accelerate[testing]" | ||
| ) | ||
| .run_commands("uv pip install --system --compile-bytecode protobuf") | ||
| .run_commands("uv pip list") | ||
| .pip_install_from_requirements(ROOT_PATH / "requirements/requirements.txt", gpu="any") | ||
| .pip_install_from_requirements(ROOT_PATH / "requirements/requirements-dev.txt", gpu="any") | ||
| .add_local_dir(ROOT_PATH / "accelerator", remote_path="/root/accelerator") | ||
| .add_local_dir(ROOT_PATH / "accelerator", remote_path="/root/deepspeed/accelerator") | ||
| .add_local_dir(ROOT_PATH / "csrc", remote_path="/root/csrc") | ||
| .add_local_dir(ROOT_PATH / "csrc", remote_path="/root/deepspeed/ops/csrc") | ||
| .add_local_dir(ROOT_PATH / "op_builder", remote_path="/root/op_builder") | ||
| .add_local_dir(ROOT_PATH / "op_builder", remote_path="/root/deepspeed/ops/op_builder") | ||
| .add_local_dir(ROOT_PATH / "tests", remote_path="/root/tests") | ||
| ) | ||
|
|
||
| app = modal.App("deepspeedai-accelerate-ci", image=image) | ||
|
|
||
| @app.function( | ||
| gpu="l40s:1", | ||
| # gpu="a10g:2", | ||
| # secrets=[modal.Secret.from_local_environ(["HF_TOKEN"])], | ||
| timeout=1800, | ||
| ) | ||
| def pytest(): | ||
| import subprocess | ||
| subprocess.run( | ||
| "pytest -n 4 --verbose /accelerate/tests/deepspeed".split(), | ||
|
||
| check=True, | ||
| cwd=ROOT_PATH / ".", | ||
| ) | ||
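Each `pytest()` function in this PR passes `check=True` to `subprocess.run`. With that flag a non-zero exit status raises `subprocess.CalledProcessError`, which is presumably what makes the remote function, and in turn the CI job, fail when tests fail. A minimal local sketch, where `run_and_report` is a hypothetical helper and a throwaway Python child process stands in for pytest:

```python
import subprocess
import sys


def run_and_report(argv):
    """Run a command with check=True and return its exit status.

    Returns 0 on success; on failure, returns the code carried by the
    CalledProcessError that check=True raises.
    """
    try:
        subprocess.run(argv, check=True)
    except subprocess.CalledProcessError as err:
        return err.returncode
    return 0


# Stand-ins for a failing and a passing test run.
failing = run_and_report([sys.executable, "-c", "import sys; sys.exit(3)"])
passing = run_and_report([sys.executable, "-c", "pass"])
```

In the PR's functions the exception is not caught, so it propagates out of `pytest()` instead of being converted to a return value as in this sketch.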
@stas00 any thoughts on this concern.
As discussed on Slack, the only thing you need to install on GH Actions is what
`modal` requires to launch itself - it doesn't care about the deepspeed install at this point. So no deepspeed installing.