Add AWS GPU Runner #107

Merged · 8 commits · Sep 6, 2023
Changes from 6 commits
151 changes: 151 additions & 0 deletions .github/workflows/self-hosted-gpu-test.yml
@@ -0,0 +1,151 @@
name: self-hosted-gpu-test
on:
push:
branches:
- master
workflow_dispatch:
schedule:
# weekly tests
- cron: "0 0 * * SUN"
jobs:
start-runner:
name: Start self-hosted EC2 runner
runs-on: ubuntu-latest
outputs:
label: ${{ steps.start-ec2-runner.outputs.label }}
ec2-instance-id: ${{ steps.start-ec2-runner.outputs.ec2-instance-id }}
steps:
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v1
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ secrets.AWS_REGION }}
- name: Try to start EC2 runner
id: start-ec2-runner
uses: machulav/ec2-github-runner@main
with:
mode: start
github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
ec2-image-id: ami-04d16a12bbc76ff0b
ec2-instance-type: g4dn.xlarge
subnet-id: subnet-0dee8543e12afe0cd # us-east-1a
security-group-id: sg-0f9809618550edb98
# iam-role-name: self-hosted-runner # optional, requires additional permissions
aws-resource-tags: > # optional, requires additional permissions
[
{"Key": "Name", "Value": "ec2-github-runner"},
{"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
]

do-the-job:
name: Do the job on the runner
needs: start-runner # required to start the main job when the runner is ready
runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the newly created runner
timeout-minutes: 1200 # 20 hrs
Contributor:

Given that it costs money, I would lower this. On my machine the tests take ~1 minute; maybe 120 minutes would give enough room for compilation, etc.?
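A hedged sketch of that suggestion (the exact value is an assumption, not something settled in this thread):

timeout-minutes: 120 # ~2 hrs: headroom for environment setup and compilation on top of the tests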

Contributor:

I forgot to take into account the Python tests, which do take a while on the CPU. Although 2 hours should be enough, perhaps we can skip the CPU tests on this runner:

$ pytest -v -s -k "not Reference and not CPU" Test*py

Collaborator (Author):

Yes, this is a great idea; I didn't think about that. I am trying to think whether there is a case where it is useful to run the CPU tests on this runner... Something else to play with is running pytest with xdist, since these boxes are MUCH more powerful than what GHA gives us.

So do you think there is any value in running the CPU tests on the GPU runner?

Contributor:

I mean, on one hand they are already being run on the normal CI; OTOH I can see bugs arising in the CPU version only in a GPU environment, or the other way around. For examples, see conda-forge/openmm-torch-feedstock#37 or the problems we could not eventually crack on this very PR.
The safe thing to do is probably to just give it 3 hours and run every test.
Another option could be to run the CPU tests in parallel, which pytest allows AFAIK. Something like:

$ pytest -n 4 -v -s -k "Reference or CPU" Test*py &
$ pytest -v -s -k "not Reference and not CPU" Test*py
$ wait

Collaborator (Author):

With -n auto the tests take ~11 minutes to run, so I think it is worth running the CPU tests: for troubleshooting, one of the first things I would want to do is check whether the CPU tests pass.
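For reference, that corresponds roughly to the following invocation (it assumes the pytest-xdist plugin is available in the environment, which this workflow does not install explicitly):

$ pytest -n auto -v Test*py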

env:
HOME: /home/ec2-user
os: ubuntu-22.04
Contributor:

Above it says ubuntu-latest, but here it is ubuntu-22.04; is this intentional?

Collaborator (Author):

That is so the way the env file is chosen, devtools/conda-envs/build-${{ env.os }}.yml, matches how we do the CI on GHA.

For the start-runner block we just need a VM to spin up and start the GPU runner on AWS, so I chose ubuntu-latest: the exact version doesn't really matter and I'd rather not have to worry about it; it should keep working when 24.04 comes out, for example.
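Concretely (illustrative, using the values already set in this file):

os: ubuntu-22.04
# so the later steps resolve the env file to:
# devtools/conda-envs/build-ubuntu-22.04.yml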

cuda-version: "11.7"
gcc-version: "10.3.*"
nvcc-version: "11.7"
python-version: "3.10"
pytorch-version: "1.12.*"


defaults:
run:
shell: bash -l {0}
steps:

- uses: actions/checkout@v3
- name: "Update the conda enviroment file"
uses: cschleiden/replace-tokens@v1
with:
tokenPrefix: '@'
tokenSuffix: '@'
files: devtools/conda-envs/build-${{ env.os }}.yml
env:
CUDATOOLKIT_VERSION: ${{ env.cuda-version }}
GCC_VERSION: ${{ env.gcc-version }}
NVCC_VERSION: ${{ env.nvcc-version }}
PYTORCH_VERSION: ${{ env.pytorch-version }}

- uses: mamba-org/provision-with-micromamba@main
Contributor:

Switching to micromamba is not a solution. It just hides some issue with the dependencies or mamba.

Collaborator (Author):

Well here is the diff between the envs:

2c2
< _openmp_mutex 4.5 2_gnu conda-forge
---
> _openmp_mutex 4.5 2_kmp_llvm conda-forge
33d32
< intel-openmp 2022.1.0 h9e868ea_3769 
69a69
> llvm-openmp 16.0.2 h4dfa4b3_0 conda-forge
72c72
< mkl 2022.1.0 hc2b9512_224 
---
> mkl 2022.2.1 h84fe81f_16997 conda-forge
92c92
< python 3.10.10 he550d4f_0_cpython conda-forge
---
> python 3.10.0 h543edf9_3_cpython conda-forge
103a104
> sqlite 3.40.0 h4ff8645_1 conda-forge
105a107
> tbb 2021.8.0 hf52228f_0 conda-forge

Where > is the working one. So it looks like the sqlite and tbb packages are added, and llvm-openmp was chosen instead of Intel's implementation. I am not sure which pins need to be adjusted. Why only the OpenCL tests would fail with a different mutex flavor, MKL version, and OpenMP implementation, and none of the others, I have no idea.
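For completeness, a listing diff like the one above can be produced with something along these lines (the environment names are hypothetical; this is an assumption about tooling, not necessarily how the diff was generated):

$ diff <(micromamba list -n failing-env) <(micromamba list -n working-env)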

Contributor:

My bet is on the ocl-icd package being somehow overridden by the system one.
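One way to check that hypothesis (generic shell commands, not part of this workflow) would be to compare the ICD loader conda-forge installs with whatever the system/AMI exposes:

$ ls ${CONDA_PREFIX}/lib/libOpenCL*
$ ldconfig -p | grep -i libopencl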

Collaborator (Author):

So now do you want me to add pins to the environment yaml? Previously, you didn't want me to do that.

Contributor:

@mikemhenry I say if pinning some versions is what it takes, so be it. We can relax those later after merging. AFAIK I cannot play around with this at the moment.

I would start by trying whether installing llvm-openmp is enough. Another option could be to use the OpenCL library that comes with CUDA. Maybe skipping these two is enough for CMake to pick the CUDA ones?

-DOPENCL_INCLUDE_DIR=${CONDA_PREFIX}/include \
-DOPENCL_LIBRARY=${CONDA_PREFIX}/lib/libOpenCL${SHLIB_EXT}

I would really like this merged ASAP so we can also merge #106
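A minimal sketch of the llvm-openmp idea, assuming the conda env file can simply take one extra dependency (the entry below is hypothetical and unpinned; a working pin would have to come from the passing environment listed above):

dependencies:
  # ...existing entries...
  - llvm-openmp # prefer the LLVM OpenMP runtime over intel-openmp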

name: "Install dependencies with MicroMamba"
with:
environment-file: devtools/conda-envs/build-${{ env.os }}.yml
extra-specs: |
python==${{ env.python-version }}

- name: "List conda packages"
shell: bash -l {0}
run: |
micromamba list
micromamba info

- name: "Configure"
shell: bash -l {0}
run: |
mkdir build
cd build

SHLIB_EXT=".so"

cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=${CONDA_PREFIX} \
-DOPENMM_DIR=${CONDA_PREFIX} \
-DTorch_DIR=${CONDA_PREFIX}/lib/python${{ env.python-version }}/site-packages/torch/share/cmake/Torch \
-DNN_BUILD_OPENCL_LIB=ON \
-DOPENCL_INCLUDE_DIR=${CONDA_PREFIX}/include \
-DOPENCL_LIBRARY=${CONDA_PREFIX}/lib/libOpenCL${SHLIB_EXT}

- name: "Build"
shell: bash -l {0}
run: |
cd build
make -j2 install
make -j2 PythonInstall

- name: "List plugins"
shell: bash -l {0}
run: |
export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib/python${{ env.python-version }}/site-packages/torch/lib:${LD_LIBRARY_PATH}"
python -c "import openmm as mm; print('---Loaded---', *mm.pluginLoadedLibNames, '---Failed---', *mm.Platform.getPluginLoadFailures(), sep='\n')"

- name: "Run C++ test"
shell: bash -l {0}
run: |
export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib/python${{ env.python-version }}/site-packages/torch/lib:${LD_LIBRARY_PATH}"
cd build
ctest --output-on-failure

- name: "Run Python test"
shell: bash -l {0}
run: |
export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib/python${{ env.python-version }}/site-packages/torch/lib:${LD_LIBRARY_PATH}"
cd python/tests
pytest --verbose Test*

stop-runner:
name: Stop self-hosted EC2 runner
needs:
- start-runner # required to get output from the start-runner job
- do-the-job # required to wait when the main job is done
runs-on: ubuntu-latest
if: ${{ always() }} # required to stop the runner even if the error happened in the previous jobs
steps:
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v1
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ secrets.AWS_REGION }}
- name: Stop EC2 runner
uses: machulav/ec2-github-runner@main
with:
mode: stop
github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
label: ${{ needs.start-runner.outputs.label }}
ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}