19cde92
fix timing (#31)
stas00 Jul 28, 2021
a0bccfe
Update gpt2_tokenization.py
huu4ontocord Jul 29, 2021
ac227a1
Revert "Update gpt2_tokenization.py"
Jul 29, 2021
2a7ee91
use pp engine even for pp=1 (#6) (#34)
stas00 Jul 30, 2021
e7bc518
Revert "use pp engine even for pp=1 (#6) (#34)"
stas00 Jul 31, 2021
2a64b19
Revert "Revert "use pp engine even for pp=1 (#6) (#34)""
stas00 Jul 31, 2021
c3399a7
Create README.md
stas00 Aug 3, 2021
3caf203
Faster preprocessing (#18)
thomasw21 Aug 4, 2021
a7856ca
add a section on how we use deepspeed with Meg
stas00 Aug 4, 2021
0366587
fix the deepspeed example
stas00 Aug 4, 2021
b3aa039
add .bs to the version to help check we are on the right repo/branch
stas00 Aug 4, 2021
e7ac5fd
fix attn_mask (#50)
stas00 Aug 5, 2021
358eac6
chore: update gitignore (#45)
jaketae Aug 5, 2021
b6a2c9e
Group tensorboard metrics (#39)
VictorSanh Aug 5, 2021
e0c6236
rm `(s)` that slipped through
VictorSanh Aug 5, 2021
72f4080
Update requirements.txt (#46)
jaketae Aug 5, 2021
84f8d51
Add LRU cache, add faster tokenization (#37)
huu4ontocord Aug 5, 2021
5521f38
Update README.md (#51)
lintangsutawika Aug 5, 2021
c43e207
chore: add deepspeed as comment
jaketae Aug 5, 2021
6e5f752
Fix pretrain_gpt_single_node example script to have only one occurenc…
thomasw21 Aug 6, 2021
190565d
better comment on TB writer (`is_last_rank`)
VictorSanh Aug 6, 2021
50cb9da
Add GLU variants (#47)
jaketae Aug 8, 2021
febe21d
[microsoft/Megatron-DeepSpeed sync] Commits including 2021-08-09 (#58)
stas00 Aug 10, 2021
b11b2be
use HuggingFace Datasets as source to build Megatron data files (#48)
adammoody Aug 11, 2021
3c6460d
Add test suite (#64)
stas00 Aug 13, 2021
29f0150
fix arg help (#65)
stas00 Aug 13, 2021
128013d
add testing and contribute info
stas00 Aug 15, 2021
60e82e3
fix header
stas00 Aug 15, 2021
ccab405
fix: doc_idx offset when merging indexed dataset files (#66)
adammoody Aug 16, 2021
3343c77
shuffle index list with numpy, scatter list, use file for large lists…
adammoody Aug 17, 2021
8f754d3
fix: exclusive scan computing pointers list (#68)
adammoody Aug 18, 2021
3ab2b3d
- Recompute bin/idx using microsoft/Megatron-DeepSpeed (Not changes)
thomasw21 Aug 18, 2021
34d1076
Add openwebtext1000.jsonl to .gitignore
thomasw21 Aug 18, 2021
7642903
[testing] fixes for pt-1.10 (#71)
stas00 Aug 21, 2021
07a2ba5
Expose GLU activations as arguments (#69)
jaketae Aug 22, 2021
2013106
fix circular import (#72)
stas00 Aug 22, 2021
ed786b5
[codecarbon] integration (#15)
stas00 Aug 25, 2021
6a48314
Check cardon directory is not None (#74)
thomasw21 Aug 26, 2021
2f27e55
[CI] start workflow (#75)
stas00 Aug 26, 2021
2595bd9
[CI] wip (#76)
stas00 Aug 26, 2021
aa319b1
distributed merge of per-rank Megatron data files (#55)
adammoody Aug 26, 2021
937135a
fix test; skip broken test (#79)
stas00 Aug 26, 2021
6d656ec
Add step to download dataset before running the preprocess_data_dist …
adammoody Aug 27, 2021
8cb2d18
dynamically use as many 3d dimensions as possible (#83)
stas00 Aug 27, 2021
d755647
add missing dependencies (#88)
stas00 Sep 1, 2021
41fdd46
[CI] setting up a CI with EC2 backend (#78)
stas00 Sep 1, 2021
6be85ca
[requirements] fix format (#94)
stas00 Sep 10, 2021
c7d8e94
[WIP] [codecarbon] sorting out CC warnings + logger preamble (#80)
stas00 Sep 13, 2021
30e0b8d
Floating-point ops counting and reloading (#40)
TevenLeScao Sep 15, 2021
b8b4797
added comment
TevenLeScao Sep 15, 2021
5162fd2
check whether python3-config is available (#98)
stas00 Sep 15, 2021
55e7332
Prefix lm (#52)
thomasw21 Sep 16, 2021
e28f84c
simplify the CI trigger (#102)
stas00 Sep 16, 2021
f822ef0
Fix model tests (#103)
thomasw21 Sep 16, 2021
709f1af
[tensor comparisons] support pt-1.8, add torch_assert_close (#106)
stas00 Sep 16, 2021
39c3d70
Checkpoint conversion tools (#14) (#109)
stas00 Sep 20, 2021
7dd3a6b
add direct meg-ds to hf format script (#110)
stas00 Sep 20, 2021
846cc32
add direct meg-ds to hf format script (part2) (#111)
stas00 Sep 20, 2021
2495bd8
training with dummy data to verify sampling (#36)
lintangsutawika Sep 21, 2021
d168c1b
update merge_preprocessed_data to use distributed merge (#82)
adammoody Sep 21, 2021
74b8166
make scripts executable
stas00 Sep 21, 2021
1ac6a70
add shebang
stas00 Sep 21, 2021
87f0598
ALiBi Implementation (#101)
ofirpress Sep 24, 2021
8eb0029
[tests] flush std streams (#120)
stas00 Sep 29, 2021
0ec0257
chore: update `.gitignore`
jaketae Sep 30, 2021
a0e6b68
[Feature] Implement sample-ids-to-text extractor (#116)
wade3han Oct 1, 2021
202fd3e
[testing] ensure no lock file is dropped (#122)
stas00 Oct 1, 2021
c146dce
Save tokenizer in conversion script (#128)
jaketae Oct 7, 2021
3586830
fix: only trigger ci on .py file changes (#131)
jaketae Oct 8, 2021
a319a6c
Curriculum learning support (#132)
conglongli Oct 10, 2021
97bdf31
[CL] fix default placement (#133)
stas00 Oct 10, 2021
63539b1
Fix deepspeed prefix-lm (#107)
thomasw21 Oct 10, 2021
5f3c08b
[codecarbon] switch to master (#135)
stas00 Oct 11, 2021
bbe4dea
run on pull_request branch (#141)
stas00 Oct 18, 2021
34140e7
print number of params only on rank 0 (#140)
stas00 Oct 18, 2021
04c6da3
Configure code style formatters (#130)
jaketae Oct 19, 2021
0f7a2bc
[Feature] Porting bitsandbytes to meg-deepspeed (#144)
wade3han Oct 19, 2021
c1b09d4
backward compatibility for new chkpt keys (#147)
stas00 Oct 20, 2021
ce20a7d
fused softmax layer bug fix sync (#151)
stas00 Oct 21, 2021
959a876
Fix curriculum learning support (#134)
conglongli Oct 22, 2021
3e76195
disable codecarbon as it's very unstable (#152)
stas00 Oct 22, 2021
7813714
Fix glu activation (#148)
thomasw21 Oct 22, 2021
4fc9ab5
[Logging] Improve logging mechanism (#154)
thomasw21 Oct 22, 2021
cbbfd7a
Bump minimum version for torch (#156)
thomasw21 Oct 25, 2021
85e3c1f
[tests] fix requirements (#158)
stas00 Oct 25, 2021
1821201
don't save latest_checkpointed_iteration.txt w/ deepspeed (#159)
stas00 Oct 26, 2021
6a9d73b
[testing] fix bnb test skipping (#160)
stas00 Oct 26, 2021
087a7e1
remove useless log line (#161)
stas00 Oct 26, 2021
a55c007
Fix curriculum learning doc (#162)
conglongli Oct 26, 2021
224d7c1
[checkpoint] only one latest file (#164)
stas00 Oct 27, 2021
10b4d42
[CI] fix ci / update packages (#170)
stas00 Oct 29, 2021
0f72501
Update main.yml (#172)
stas00 Oct 29, 2021
7364280
Fix prefix lm offsets (#167)
thomasw21 Oct 29, 2021
54bb7a3
Adding language specific validation sets for Multilingual model train…
hadyelsahar Nov 3, 2021
ed812bd
Fixed merge oversight in tensorboard logs
TevenLeScao Nov 3, 2021
afb3778
simplifying tests
TevenLeScao Nov 3, 2021
2a967d5
Fixed TP > 1 issue with new validation scheme
TevenLeScao Nov 4, 2021
04e2856
Alternative fix to TP > 1 (#178)
thomasw21 Nov 4, 2021
11a2a36
[CI] improvements (#185)
stas00 Nov 9, 2021
5b34a6b
[PrefixLM] Figuring out why prefix lm is doing poorly on short contex…
thomasw21 Nov 10, 2021
0635ea2
[BNB] integrate `StableEmbeding` into `VocabParallelEmbedding` logic …
stas00 Nov 10, 2021
1f678bc
Full seqlen eval for CL+PP (#187)
conglongli Nov 13, 2021
0f425a2
Support skip iteration flag (#177)
jaketae Nov 17, 2021
c73e784
add layernorm in Embedding (#191)
stas00 Nov 18, 2021
d13cbeb
removed regular package for megatron model (#192)
stas00 Nov 19, 2021
b124614
Add eval-only arg (#188)
SaulLu Nov 19, 2021
20e9afc
Delete unnecessary brackets (#197)
SaulLu Nov 19, 2021
28dd4e7
[CI] fix which tests get run (#199)
stas00 Nov 20, 2021
7dd85a6
add missing space (#200)
SaulLu Nov 22, 2021
767eccb
elastic launcher compatible init_process_group (#201)
stas00 Nov 23, 2021
d443ec7
[WIP] dealing with multi-process noise (#193)
stas00 Nov 23, 2021
d1713be
param size printing revamp (#202)
stas00 Nov 23, 2021
f2a5402
[test] `--partition-activations` (#184)
stas00 Nov 24, 2021
a100a75
chore: add tmp directory to `.gitignore` (#205)
jaketae Nov 25, 2021
a390485
Reweighting strat for prefix lm (#190)
thomasw21 Nov 26, 2021
cd06baf
Checking we use fused kernels to compute scaled masked softmax on pre…
thomasw21 Nov 26, 2021
cafe8cc
Revert "Checking we use fused kernels to compute scaled masked softma…
thomasw21 Nov 27, 2021
8e928e9
Fix consumed_valid_samples counting for several valid dataloaders
TevenLeScao Dec 1, 2021
8532df6
[TB] add throughput graphs (#210)
stas00 Dec 7, 2021
30436f9
replay layer_norm_cuda_kernel.cu fixes (#216)
stas00 Dec 10, 2021
94421bf
improve build (#207)
stas00 Dec 13, 2021
96dbac8
[logging] synced print (#217)
stas00 Dec 16, 2021
3cbdb38
fix tflops calculation (#223)
stas00 Jan 5, 2022
5ccf8b6
tflops for CL (#224)
stas00 Jan 5, 2022
b7f8a62
Revert "tflops for CL (#224)" (#225)
stas00 Jan 6, 2022
2b26dca
Fix alibi (#222)
thomasw21 Jan 6, 2022
90138b1
save args to txt file (#218)
bhavitvyamalik Jan 13, 2022
f36a0ff
implement missing --no-load-optim support for deepspeed path (#231)
stas00 Jan 15, 2022
a1b688e
fix tests (#232)
stas00 Jan 15, 2022
9ad0d97
[ds report] less noise (#215)
stas00 Jan 21, 2022
8c2e1da
enable new_style for add_scalar function for faster data format (#237)
abodacs Jan 21, 2022
e4fc19c
fix add_scalar for pt<1.9 (#240)
stas00 Jan 24, 2022
f0a57f0
Fix throughput unit (#241)
janEbert Jan 28, 2022
5812e4e
[TB] log restarts (#234)
stas00 Jan 29, 2022
d0a047d
Alibi Tensor Parallel Fix (#244)
DanielHesslow Feb 1, 2022
c3e4230
implement kill switch (#245)
stas00 Feb 5, 2022
0cbd399
--abort-on-unmet-fused-kernel-constraints (#247)
stas00 Feb 8, 2022
24d72c6
[apex FusedAdam] crash workaround (#249)
stas00 Feb 18, 2022
14e1e3b
Replace approximate formula with exact one for throughput (#251)
deepakn94 Feb 22, 2022
ee49e63
Fix preprocess_data_many_cores to use dtype
thomasw21 Feb 25, 2022
77fcc4e
Fix preprocess_data_many_cores to use dtype
thomasw21 Feb 25, 2022
a5b28a7
Use padded vocab size in preprocessing scripts (#253)
thomasw21 Feb 25, 2022
2d10187
Try to read the data path arguments directly from a file (#254)
thomasw21 Feb 26, 2022
3b227b8
[sync] bf16 (#250)
stas00 Feb 28, 2022
65e96a2
make partition_method configurable (#256)
stas00 Feb 28, 2022
8f5a517
add `pad-vocab-size-to` argument and tests (#255)
SaulLu Mar 1, 2022
4b3a447
deploy elastic error handler (#258)
stas00 Mar 1, 2022
e6598b7
sync the whole Meg-LM fused_kernels sub-tree (#260)
stas00 Mar 7, 2022
7a67ce2
allocate embed norm only on pp0 (#261)
stas00 Mar 7, 2022
59f9a3f
switch to MixedFusedLayerNorm (#262)
stas00 Mar 9, 2022
decebdc
preprocessing from arrow file to load an HF dataset
TevenLeScao Mar 11, 2022
eab76a6
Sorry, last change was meant to a PR. This reverts commit d0fcf4170de…
TevenLeScao Mar 11, 2022
543992e
[kill switch] correct sys.exit (#266)
stas00 Mar 18, 2022
f2e1f03
disable samples-per-dataset, steps-per-dataset, tokens-per-dataset (#…
stas00 Mar 18, 2022
43aea86
[kill switch] fix test (#268)
stas00 Mar 18, 2022
315e21f
[tensorboard] add rename and remove event tools (#269)
stas00 Mar 18, 2022
d8ba7a2
`torch.testing.assert_equal` didn't make it (#273)
stas00 Mar 25, 2022
c16a81e
add stop alarm instructions
stas00 Apr 1, 2022
cf81e3d
add start-fast doc (#278)
stas00 Apr 13, 2022
9214b77
tweak the doc
stas00 Apr 13, 2022
fcdb527
Create CODEOWNERS
TevenLeScao Apr 25, 2022
d236376
Update CODEOWNERS
TevenLeScao Apr 26, 2022
fa8e9b9
Update CODEOWNERS
TevenLeScao Apr 26, 2022
20a0201
Update CODEOWNERS
TevenLeScao Apr 26, 2022
4e16e4a
Fix mixed fused layer norm to mimick nn.LayerNorm for torch>1.11 (#281)
thomasw21 May 3, 2022
56333df
[valid] deadlock workaround (#282)
stas00 May 31, 2022
4cf7e64
Fix tflops glu computation (#283)
Muennighoff Jun 5, 2022
dd53f9d
Fix DS init (#285)
Quentin-Anthony Jun 22, 2022
c384a45
Mlm adaptation (#287)
Jun 27, 2022
0f29406
Fixed MLM dataset arguments(#290)
thomasw21 Jun 27, 2022
7cf6469
Eval harness (#212)
DanielHesslow Jun 28, 2022
3d26047
Merge MLM too fast 2 (#294)
thomasw21 Jun 30, 2022
5d05153
MTF dataset and packing (#293)
thomasw21 Jul 2, 2022
d59ed79
CI fixes (#302)
stas00 Jul 4, 2022
22a31f0
sync layer norms (#272)
stas00 Jul 4, 2022
3a5b327
MTF train script (#295)
thomasw21 Jul 5, 2022
464b45f
Add support for weighted train (#299)
thomasw21 Jul 6, 2022
70e221c
Combine Specs (#304)
Muennighoff Jul 7, 2022
75a2c91
Add bias a weight we need to sync as well (#307)
thomasw21 Jul 7, 2022
5302902
Fix causal attention mask (#306)
thomasw21 Jul 7, 2022
cf56a8b
Create README.md
stas00 Jul 10, 2022
d677a6d
not yet working script
stas00 Jul 10, 2022
fcd61be
hardcode the dtype depending on the model
stas00 Jul 10, 2022
3aa84d6
change the mp based on the world_size
Jul 10, 2022
3b58343
remove hardcoded world_size
stas00 Jul 10, 2022
b694a4f
add bigscience/bigscience-small-testing
stas00 Jul 10, 2022
f2ed3a1
Merge branch 'bloom-inference' of https://github.com/bigscience-works…
Jul 10, 2022
0079ff7
fixes
stas00 Jul 10, 2022
3dfc089
add zero-inference script
stas00 Jul 10, 2022
bf44780
fixes
stas00 Jul 11, 2022
afe9027
fix
stas00 Jul 11, 2022
8b9fe59
working script
stas00 Jul 12, 2022
09544e4
renames
stas00 Jul 12, 2022
f2a5520
fixes
stas00 Jul 12, 2022
0394c07
fix for offline use
stas00 Jul 13, 2022
0d8a99d
add benchmark
stas00 Jul 13, 2022
c5d82a9
add benchmark
stas00 Jul 13, 2022
dd78ac3
update
stas00 Jul 13, 2022
f3548a2
cleanup
stas00 Jul 13, 2022
3c0cc4e
update
stas00 Jul 13, 2022
d7cbbe1
msecs
stas00 Jul 13, 2022
c1bac35
cleanup
stas00 Jul 13, 2022
be61e59
improve
stas00 Jul 13, 2022
62b141f
fix benchmark, add warmup
stas00 Jul 13, 2022
541590e
update
stas00 Jul 13, 2022
aad51bc
fix; thanks Michael Wyatt
stas00 Jul 13, 2022
fd10718
clarify
stas00 Jul 13, 2022
32dd5ca
Merge branch 'bloom-inference' of https://github.com/bigscience-works…
Jul 13, 2022
1127770
add bloom batch-inference script
Jul 13, 2022
595d746
removed the names :-)
Jul 13, 2022
ac9b849
fold the bs functionality from the other script
stas00 Jul 13, 2022
d725153
fix
stas00 Jul 13, 2022
b145858
restore do_sample
stas00 Jul 13, 2022
bd8971f
dump generate args
stas00 Jul 13, 2022
4ceaa57
fix
stas00 Jul 14, 2022
5eee60c
fix
stas00 Jul 14, 2022
70ed13d
support any batchsize
stas00 Jul 14, 2022
256860d
div by bs
stas00 Jul 14, 2022
5fdd563
mul by bs
stas00 Jul 14, 2022
fb5c95c
add cpu_offload; sync scripts
stas00 Jul 14, 2022
fdb42c2
wip
stas00 Jul 14, 2022
d7e661e
improvements
stas00 Jul 15, 2022
71b8675
fixes
stas00 Jul 15, 2022
781993b
fixes
stas00 Jul 15, 2022
6fa6129
add accelerate script
stas00 Jul 15, 2022
29b6b30
fix
stas00 Jul 15, 2022
cc69be6
wip
stas00 Jul 16, 2022
e447bab
wip
stas00 Jul 16, 2022
c2175e4
stats
stas00 Jul 18, 2022
ce0c975
add OnDevice and remove zero-inference (#316)
jeffra Jul 19, 2022
f12a3d0
wip
stas00 Jul 19, 2022
ac3d7cb
rework generate + benchmark
stas00 Jul 19, 2022
fd28fc0
figure out the memory map dynamically
stas00 Jul 19, 2022
0e6562e
bug fix
stas00 Jul 19, 2022
7768629
fix ds-zero-inference wrt device
stas00 Jul 19, 2022
97e6b53
bug fix
stas00 Jul 20, 2022
092a1fd
update
stas00 Jul 20, 2022
118b4ab
update
stas00 Jul 22, 2022
49fe618
add server scripts
Aug 4, 2022
b83b39d
fix bug
Aug 4, 2022
031c0ee
new code
Aug 6, 2022
450eb1f
working code
Aug 7, 2022
fc5d383
fix bug
Aug 7, 2022
39bab5c
update readme
Aug 7, 2022
303a6cf
increase batch size for HF accelerate
Aug 7, 2022
69d5cf2
increase batch size
Aug 7, 2022
d816f59
support dynamic batch size with deepspeed
Aug 7, 2022
73a79d2
drop num tokens
Aug 7, 2022
89dfe23
drop return type
Aug 7, 2022
03abce7
oom
Aug 8, 2022
145 changes: 145 additions & 0 deletions .github/workflows/ci.md
@@ -0,0 +1,145 @@
# CI setup

The CI is set up with GitHub Actions using the on-demand EC2 backend.

This setup currently uses a 4-GPU p3.8xlarge instance, which allows testing tp=2, pp=2.

**Unfortunately this only works for PRs created from non-forked branches**


## The workflow file

The workflow file is at `.github/workflows/main.yml`


```
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Start EC2 runner
        id: start-ec2-runner
        uses: machulav/ec2-github-runner@v2
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0dfaabfa78a779fbc
          ec2-instance-type: p3.8xlarge
          subnet-id: subnet-3502b45e
          security-group-id: sg-e8f46d9d
```

- `ec2-image-id` is the AMI, which has to be created in, or copied to, the `aws-region` the script requests.
- `subnet-id` comes from: https://console.aws.amazon.com/vpc/home?region=us-east-1#subnets:
- `security-group-id` comes from: https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#SecurityGroups:


The workflow was later made fault-tolerant by trying to start the EC2 instance in 3 different availability zones, to cope with situations where EC2 reports it doesn't have the resources to start the desired instance.
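The fallback logic amounts to "try each zone until one works". A minimal sketch of that idea, where `try_start` is a hypothetical stand-in for the real "start EC2 runner" step (here hardcoded to succeed only for the last subnet, to demonstrate the fallback):

```shell
# Try each availability zone's subnet in turn; keep the first success.
subnets="subnet-b7533b96 subnet-a396b2ad subnet-df0f6180"

# Hypothetical stand-in for the real EC2 start step; succeeds only for
# the last subnet so the fallback path is exercised.
try_start() {
  [ "$1" = "subnet-df0f6180" ]
}

started=""
for subnet in $subnets; do
  if try_start "$subnet"; then
    started="$subnet"
    break
  fi
done
echo "runner started in: $started"
```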



## Connect to instance

To pre-install things, connect to the instance manually and install whatever is needed:

1. choose and start an EC2 instance
2. connect to it as `ubuntu`, then `sudo su`, since the runner runs as `root` (I couldn't find a way around this):
```
ssh -l ubuntu -i "~/.ssh/bigscience-aim.pem" [email protected]
```

Once installed, stop the instance.

Then create a new AMI (see below) and update the script using the new AMI.


## Prepare the machine

Steps used to set up fixed software (which won't be installed at test time):

- install cuda:
https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#ubuntu-installation

### install fixed packages

- `torch 1.9.0/cu-11.1`

```
pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
```

- all kinds of prerequisites
```
pip install transformers
wget https://raw.githubusercontent.com/microsoft/DeepSpeed/master/requirements/requirements.txt -O requirements-ds.txt
pip install -r requirements-ds.txt
wget https://raw.githubusercontent.com/bigscience-workshop/Megatron-DeepSpeed/main/requirements.txt -O requirements-ms.txt
pip install -r requirements-ms.txt

```

- apex - needs a hack to deal with mismatching minor CUDA versions (and it takes forever to build), so the following patch was used:

XXX: this no longer works - had to manually patch pytorch to avoid the mismatch failure

```
--- a/setup.py
+++ b/setup.py
@@ -99,6 +99,7 @@ def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
     print(raw_output + "from " + cuda_dir + "/bin\n")

     if (bare_metal_major != torch_binary_major) or (bare_metal_minor != torch_binary_minor):
+        return
         raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
                            "not match the version used to compile Pytorch binaries. " +
                            "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda) +

```
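The check the patch short-circuits just compares major.minor CUDA versions. A plain-shell sketch of the same comparison, with the version strings hardcoded as an example of a mismatch (the real check reads them from `nvcc` and torch):

```shell
# Example values only; apex derives these from nvcc and torch at build time.
bare_metal="11.1"    # CUDA toolkit installed on the machine
torch_binary="11.3"  # CUDA version torch was compiled with

bare_major=${bare_metal%%.*};    bare_minor=${bare_metal#*.}
torch_major=${torch_binary%%.*}; torch_minor=${torch_binary#*.}

if [ "$bare_major" != "$torch_major" ] || [ "$bare_minor" != "$torch_minor" ]; then
  echo "CUDA mismatch: $bare_metal (nvcc) vs $torch_binary (torch)"
fi
```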

Install it (the repo was cloned from `git clone https://github.com/NVIDIA/apex`):

```
cd code/apex
# I copied this script from my setup
./build.sh
```


## make a new AMI image

Once the needed things are installed (and every time anything new is installed), a new AMI must be created (this is like an .iso image snapshot):

1. go to https://us-east-1.console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:
2. choose the instance to create a new image from
3. Actions -> Image and Templates -> Create Image

Make sure it's created in the correct region (the same one used in the script) - or copy it to the right region afterwards.
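The same console steps can also be scripted with the AWS CLI; a rough sketch, where the instance id, image ids, and image names are placeholders:

```shell
# Create an AMI from the (updated) instance.
aws ec2 create-image \
  --instance-id i-0123456789abcdef0 \
  --name "ci-runner-ami-$(date +%Y%m%d)" \
  --region us-east-1

# If the AMI lives in another region, copy it into the one the workflow uses.
aws ec2 copy-image \
  --source-image-id ami-0123456789abcdef0 \
  --source-region us-east-2 \
  --region us-east-1 \
  --name "ci-runner-ami-copy"
```

Both commands print the new AMI id, which then goes into `ec2-image-id`.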

The image can be created while the instance being snapshotted is still running.

Just don't forget to turn the instance off once you have validated that the new image works.

Finally, once created, the workflow needs to be updated to use the new AMI id (key `ec2-image-id`) in `.github/workflows/main.yml`.


## Stop instance alarm

It looks like occasionally the instance doesn't stop and keeps running.

I added a stop alarm that automatically stops the instance after 1h of utilization below 10%, following the exact instructions from:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UsingAlarmActions.html
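For reference, a CLI equivalent of those console steps might look like this (instance id is a placeholder; the alarm stops the instance when average CPU utilization stays below 10% for an hour):

```shell
aws cloudwatch put-metric-alarm \
  --alarm-name "stop-idle-ci-runner" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 3600 \
  --evaluation-periods 1 \
  --threshold 10 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:automate:us-east-1:ec2:stop
```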


## Guides

Setup guide: https://github.com/machulav/ec2-github-runner

Launching an EC2 instance:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html?icmpid=docs_ec2_console

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html

- All available instances: https://aws.amazon.com/ec2/instance-types/
211 changes: 211 additions & 0 deletions .github/workflows/main.yml
@@ -0,0 +1,211 @@
name: Run all tests
on:
  # enable to manually trigger the tests
  workflow_dispatch:
  pull_request:
    paths:
      - "**.py"

jobs:

  # GPU sizes and types that we could use:
  # g4dn.12xlarge 4x 16GB T4   (CC 7.5) (low availability)
  # p3.8xlarge    4x 16GB V100 (CC 7.0) (very low availability)

  # Unfit:
  # g3.16xlarge 4x 8GB Tesla M60 (CC 5.2) (not supported by cuda-11)
  # p2.8xlarge  8x 12GB K80     (CC 3.7, not supported by cuda-11)

  start-runner:
    name: Start self-hosted EC2 runner
    runs-on: ubuntu-latest
    outputs:
      label: ${{ steps.start-ec2-runner.outputs.label }}
      ec2-instance-id: ${{ steps.start-ec2-runner.outputs.ec2-instance-id }}
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      # don't use the following subnets as p3.8xlarge is not supported there:
      # - subnet-06576a4b # us-east-1d
      # - subnet-859322b4 # us-east-1e
      # - subnet-47cfad21 # us-east-1b
      - name: Try to start EC2 runner (a)
        id: try-us-east-1a
        uses: machulav/ec2-github-runner@v2
        continue-on-error: true
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0ad997818d90480f2
          ec2-instance-type: g4dn.12xlarge
          security-group-id: sg-f2a4e2fc
          subnet-id: subnet-b7533b96 # us-east-1c
          aws-resource-tags: > # optional, requires additional permissions
            [
              {"Key": "Name", "Value": "ec2-github-runner"},
              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
            ]
      - name: Try to start EC2 runner (b)
        id: try-us-east-1b
        if: steps.try-us-east-1a.outcome == 'failure'
        uses: machulav/ec2-github-runner@v2
        continue-on-error: true
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0ad997818d90480f2
          ec2-instance-type: g4dn.12xlarge
          security-group-id: sg-f2a4e2fc
          subnet-id: subnet-a396b2ad # us-east-1f
          aws-resource-tags: > # optional, requires additional permissions
            [
              {"Key": "Name", "Value": "ec2-github-runner"},
              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
            ]
      - name: Try to start EC2 runner (c)
        id: try-us-east-1c
        if: steps.try-us-east-1b.outcome == 'failure'
        uses: machulav/ec2-github-runner@v2
        continue-on-error: true
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0ad997818d90480f2
          ec2-instance-type: g4dn.12xlarge
          security-group-id: sg-f2a4e2fc
          subnet-id: subnet-df0f6180 # us-east-1a
          aws-resource-tags: > # optional, requires additional permissions
            [
              {"Key": "Name", "Value": "ec2-github-runner"},
              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
            ]

      - name: Try to start EC2 runner (a-2)
        id: try-us-east-1a-2
        if: steps.try-us-east-1c.outcome == 'failure'
        uses: machulav/ec2-github-runner@v2
        continue-on-error: true
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0ad997818d90480f2
          ec2-instance-type: p3.8xlarge
          security-group-id: sg-f2a4e2fc
          subnet-id: subnet-b7533b96 # us-east-1c
          aws-resource-tags: > # optional, requires additional permissions
            [
              {"Key": "Name", "Value": "ec2-github-runner"},
              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
            ]
      - name: Try to start EC2 runner (b-2)
        id: try-us-east-1b-2
        if: steps.try-us-east-1a-2.outcome == 'failure'
        uses: machulav/ec2-github-runner@v2
        continue-on-error: true
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0ad997818d90480f2
          ec2-instance-type: p3.8xlarge
          security-group-id: sg-f2a4e2fc
          subnet-id: subnet-a396b2ad # us-east-1f
          aws-resource-tags: > # optional, requires additional permissions
            [
              {"Key": "Name", "Value": "ec2-github-runner"},
              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
            ]
      - name: Try to start EC2 runner (c-2)
        id: try-us-east-1c-2
        if: steps.try-us-east-1b-2.outcome == 'failure'
        uses: machulav/ec2-github-runner@v2
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-0ad997818d90480f2
          ec2-instance-type: p3.8xlarge
          security-group-id: sg-f2a4e2fc
          subnet-id: subnet-df0f6180 # us-east-1a
          aws-resource-tags: > # optional, requires additional permissions
            [
              {"Key": "Name", "Value": "ec2-github-runner"},
              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
            ]

      - name: See if any of 3 sub-regions had the resource
        id: start-ec2-runner
        run: |
          if [ "${{ steps.try-us-east-1a.outcome }}" = "success" ]; then
            echo "::set-output name=label::${{ steps.try-us-east-1a.outputs.label }}"
            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-1a.outputs.ec2-instance-id }}"
          fi
          if [ "${{ steps.try-us-east-1b.outcome }}" = "success" ]; then
            echo "::set-output name=label::${{ steps.try-us-east-1b.outputs.label }}"
            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-1b.outputs.ec2-instance-id }}"
          fi
          if [ "${{ steps.try-us-east-1c.outcome }}" = "success" ]; then
            echo "::set-output name=label::${{ steps.try-us-east-1c.outputs.label }}"
            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-1c.outputs.ec2-instance-id }}"
          fi
          if [ "${{ steps.try-us-east-1a-2.outcome }}" = "success" ]; then
            echo "::set-output name=label::${{ steps.try-us-east-1a-2.outputs.label }}"
            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-1a-2.outputs.ec2-instance-id }}"
          fi
          if [ "${{ steps.try-us-east-1b-2.outcome }}" = "success" ]; then
            echo "::set-output name=label::${{ steps.try-us-east-1b-2.outputs.label }}"
            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-1b-2.outputs.ec2-instance-id }}"
          fi
          if [ "${{ steps.try-us-east-1c-2.outcome }}" = "success" ]; then
            echo "::set-output name=label::${{ steps.try-us-east-1c-2.outputs.label }}"
            echo "::set-output name=ec2-instance-id::${{ steps.try-us-east-1c-2.outputs.ec2-instance-id }}"
          fi


  do-the-job:
    name: Do the job on the runner
    needs: start-runner # required to start the main job when the runner is ready
    # need to figure out how to cancel the previous build if a new push was made while the old test is still running
    # concurrency: # cancel previous build on a new push
    #   group: ${{ github.ref }} # https://docs.github.com/en/actions/reference/context-and-expression-syntax-for-github-actions#github-context
    #   cancel-in-progress: true
    runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the newly created runner
    steps:
      - name: NVIDIA-SMI
        run: nvidia-smi

      - name: Checkout
        uses: actions/checkout@v2

      - name: Install Dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt
          pip install pytest-timeout

      - name: Run tests
        run: pytest --timeout=600 tests

  stop-runner:
    name: Stop self-hosted EC2 runner
    needs:
      - start-runner # required to get output from the start-runner job
      - do-the-job # required to wait when the main job is done
    runs-on: ubuntu-latest
    if: ${{ always() }} # required to stop the runner even if an error happened in the previous jobs
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Stop EC2 runner
        uses: machulav/ec2-github-runner@v2
        with:
          mode: stop
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          label: ${{ needs.start-runner.outputs.label }}
          ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}