Add AWS GPU Runner #107
@@ -0,0 +1,151 @@
name: self-hosted-gpu-test
on:
  push:
    branches:
      - master
  workflow_dispatch:
  schedule:
    # weekly tests
    - cron: "0 0 * * SUN"
jobs:
  start-runner:
    name: Start self-hosted EC2 runner
    runs-on: ubuntu-latest
    outputs:
      label: ${{ steps.start-ec2-runner.outputs.label }}
      ec2-instance-id: ${{ steps.start-ec2-runner.outputs.ec2-instance-id }}
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}
      - name: Try to start EC2 runner
        id: start-ec2-runner
        uses: machulav/ec2-github-runner@main
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-04d16a12bbc76ff0b
          ec2-instance-type: g4dn.xlarge
          subnet-id: subnet-0dee8543e12afe0cd # us-east-1a
          security-group-id: sg-0f9809618550edb98
          # iam-role-name: self-hosted-runner # optional, requires additional permissions
          aws-resource-tags: > # optional, requires additional permissions
            [
              {"Key": "Name", "Value": "ec2-github-runner"},
              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
            ]

  do-the-job:
    name: Do the job on the runner
    needs: start-runner # required to start the main job when the runner is ready
    runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the newly created runner
    timeout-minutes: 1200 # 20 hrs
    env:
      HOME: /home/ec2-user
      os: ubuntu-22.04
Review comment: Above it says ubuntu-latest, but here it is ubuntu-22.04. Is this intentional?

Review comment: That is to match the way the env file is chosen: the os value selects devtools/conda-envs/build-${{ env.os }}.yml below.
      cuda-version: "11.7"
      gcc-version: "10.3.*"
      nvcc-version: "11.7"
      python-version: "3.10"
      pytorch-version: "1.12.*"

    defaults:
      run:
        shell: bash -l {0}

    steps:
      - uses: actions/checkout@v3

      - name: "Update the conda environment file"
        uses: cschleiden/replace-tokens@v1
        with:
          tokenPrefix: '@'
          tokenSuffix: '@'
          files: devtools/conda-envs/build-${{ env.os }}.yml
        env:
          CUDATOOLKIT_VERSION: ${{ env.cuda-version }}
          GCC_VERSION: ${{ env.gcc-version }}
          NVCC_VERSION: ${{ env.nvcc-version }}
          PYTORCH_VERSION: ${{ env.pytorch-version }}
      - uses: mamba-org/provision-with-micromamba@main
        name: "Install dependencies with MicroMamba"
        with:
          environment-file: devtools/conda-envs/build-${{ env.os }}.yml
          extra-specs: |
            python==${{ env.python-version }}

Review thread on this step:

Review comment: Switching to micromamba ...

Review comment: Well, here is the diff between the envs: ...

Review comment: My bet is on the ocl-icd package being somehow overridden by the system one.

Review comment: So now do you want me to add pins to the environment yaml? Previously, you didn't want me to do that.

Review comment: @mikemhenry I say if pinning some versions is what it takes, so be it. We can relax these later, after merging. AFAIK I cannot play around with this at the moment. I would start by trying whether installing llvm-openmp is enough. Another option could be to use the OCL library that comes with CUDA.

    -DOPENCL_INCLUDE_DIR=${CONDA_PREFIX}/include \
    -DOPENCL_LIBRARY=${CONDA_PREFIX}/lib/libOpenCL${SHLIB_EXT}

I would really like this merged ASAP so we can also merge #106.
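For reference, a sketch of the two suggestions from the thread above. Both are untested here, and the /usr/local/cuda paths are an assumption about where the CUDA toolkit lives on this AMI rather than something verified in this PR:

```bash
# option 1 (hypothetical): also install llvm-openmp via the micromamba extra specs
#   extra-specs: |
#     python==${{ env.python-version }}
#     llvm-openmp

# option 2 (hypothetical): point CMake at the OpenCL shipped with CUDA instead of conda's,
# keeping the rest of the Configure step's flags unchanged
cmake .. \
  -DOPENCL_INCLUDE_DIR=/usr/local/cuda/include \
  -DOPENCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so
```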
      - name: "List conda packages"
        shell: bash -l {0}
        run: |
          micromamba list
          micromamba info

      - name: "Configure"
        shell: bash -l {0}
        run: |
          mkdir build
          cd build

          SHLIB_EXT=".so"

          cmake .. \
            -DCMAKE_BUILD_TYPE=Release \
            -DCMAKE_INSTALL_PREFIX=${CONDA_PREFIX} \
            -DOPENMM_DIR=${CONDA_PREFIX} \
            -DTorch_DIR=${CONDA_PREFIX}/lib/python${{ env.python-version }}/site-packages/torch/share/cmake/Torch \
            -DNN_BUILD_OPENCL_LIB=ON \
            -DOPENCL_INCLUDE_DIR=${CONDA_PREFIX}/include \
            -DOPENCL_LIBRARY=${CONDA_PREFIX}/lib/libOpenCL${SHLIB_EXT}

      - name: "Build"
        shell: bash -l {0}
        run: |
          cd build
          make -j2 install
          make -j2 PythonInstall

      - name: "List plugins"
        shell: bash -l {0}
        run: |
          export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib/python${{ env.python-version }}/site-packages/torch/lib:${LD_LIBRARY_PATH}"
          python -c "import openmm as mm; print('---Loaded---', *mm.pluginLoadedLibNames, '---Failed---', *mm.Platform.getPluginLoadFailures(), sep='\n')"

      - name: "Run C++ test"
        shell: bash -l {0}
        run: |
          export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib/python${{ env.python-version }}/site-packages/torch/lib:${LD_LIBRARY_PATH}"
          cd build
          ctest --output-on-failure

      - name: "Run Python test"
        shell: bash -l {0}
        run: |
          export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib/python${{ env.python-version }}/site-packages/torch/lib:${LD_LIBRARY_PATH}"
          cd python/tests
          pytest --verbose Test*

  stop-runner:
    name: Stop self-hosted EC2 runner
    needs:
      - start-runner # required to get output from the start-runner job
      - do-the-job # required to wait when the main job is done
    runs-on: ubuntu-latest
    if: ${{ always() }} # required to stop the runner even if the error happened in the previous jobs
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}
      - name: Stop EC2 runner
        uses: machulav/ec2-github-runner@main
        with:
          mode: stop
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          label: ${{ needs.start-runner.outputs.label }}
          ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}
Review discussion on timeout-minutes:

Review comment: Given that it costs money, I would lower this. On my machine the tests take ~1 minute. Maybe 120 minutes should give enough room for compilation, etc.?
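As a minimal sketch of that suggestion (the 120-minute value comes from the comment above, not from the PR itself), the change to the do-the-job job would be:

```yaml
    # hypothetical: cap the GPU job at 2 hours instead of 20 to limit EC2 cost
    timeout-minutes: 120
```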
Review comment: Forgot to take into account the Python tests, which do take a while on the CPU. Although 2 hours should be enough, perhaps we can skip the CPU tests in this runner:
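One possible way to do that, as a sketch only: deselect CPU tests with pytest's -k expression filter. This assumes the Python test ids contain the platform name (for example "CPU" or "Reference"), which is not verified in this PR:

```bash
# hypothetical: run only the GPU-platform tests on the self-hosted runner
cd python/tests
pytest --verbose -k "not CPU and not Reference" Test*
```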
Review comment: Yes, this is a great idea, I didn't think about that. I am trying to think whether there is a case where it is useful to run the CPU tests on this runner... Something else to play with is pytest with xdist, since these boxes are MUCH more powerful than what GHA gives us. So do you think there is any value in running the CPU tests on the GPU runner?
Review comment: I mean, on one hand they are already being run on the normal CI; on the other hand, I can see bugs arising in the CPU version only in a GPU environment, or the other way around. For some examples, see conda-forge/openmm-torch-feedstock#37 or the problems we could not eventually crack on this very PR. The safe thing to do is probably just to give it 3 hours and run every test. Another option could be to run the CPU tests in parallel, which pytest allows AFAIK. Something like:
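A sketch of that option, assuming pytest-xdist is installed into the build environment (it is not part of the env files in this PR):

```bash
# hypothetical: distribute the Python tests across all available cores
micromamba install -y pytest-xdist   # or add pytest-xdist to the conda env file
cd python/tests
pytest --verbose -n auto Test*
```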
Review comment: With -n auto the tests take ~11 minutes to run, so I think it is worth running the CPU tests, since for troubleshooting one of the first things I would want to do is check whether the CPU tests work.