
Can't use AWS Instance GPU on GITLAB CI and CML-RUNNER #848

Closed
leoitcode opened this issue Dec 17, 2021 · 24 comments
Labels: ci-gitlab, cml-runner, p0-critical (Max priority, ASAP)

Comments

@leoitcode

leoitcode commented Dec 17, 2021

I have this gitlab-ci.yml:

stages:
  - test
  - deploy
  - train

sast:
  stage: test
include:
- template: Security/SAST.gitlab-ci.yml

deploy_job:
  stage: deploy
  when: always
  image: iterativeai/cml:0-dvc2-base1
  script:
    - cml-runner
      --cloud aws
      --cloud-region us-east-1
      --cloud-type g3.4xlarge
      --cloud-hdd-size 64
      --cloud-aws-security-group="cml-runners-sg"
      --labels=cml-runner-gpu
      --idle-timeout=120
train_job:
  stage: train
  when: on_success
  image: iterativeai/cml:0-dvc2-base1-gpu
  tags:
    - cml-runner-gpu
  before_script:
    - pip install poetry
    - poetry --version
    - poetry config virtualenvs.create false
    - poetry install -vv
    - nvdia-smi
  script:
    # DVC Stuff
    - dvc pull
    - dvc repro -m
    - dvc push
    # Report metrics
    - echo "## Metrics" >> report.md
    - echo "\`\`\`json" >> report.md
    - cat metrics/best-meta.json >> report.md
    - echo "\`\`\`" >> report.md
    # Report GPU details
    - echo "## GPU info" >> report.md
    - cat gpu_info.txt >> report.md
    # Send comment
    - cml-send-comment report.md

But the container can't recognize the driver or the GPU; when running nvidia-smi I got the following error:

/usr/bin/bash: line 133: nvdia-smi: command not found

I realized that iterativeai/cml:0-dvc2-base1-gpu can't use the instance GPU. How could I install the NVIDIA drivers and nvidia-docker, and activate the --gpus option for this container?

Thank you
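A minimal check of whether the host can hand the GPU to a container would look something like the following (a sketch; it assumes the NVIDIA driver and nvidia-container-toolkit are already installed on the instance):

# Quick sanity check on the instance itself: can Docker expose the GPU to a container?
# Assumes the NVIDIA driver and nvidia-container-toolkit are installed on the host.
docker run --rm --gpus all iterativeai/cml:0-dvc2-base1-gpu nvidia-smi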

@0x2b3bfa0
Member

👋 Hello, @leoitcode! Can you please connect to the instance through SSH and retrieve some additional information?

$ npm --version
$ cat /var/log/*cloud*
$ nvidia-smi

@leoitcode
Author

Hello

👋 Hello, @leoitcode! Can you please connect to the instance through SSH and retrieve some additional information?

$ npm --version
$ cat /var/log/*cloud*
$ nvidia-smi

I tried to follow the tutorial but I'm having this error:


Unknown argument: --cloud-ssh-private=MY_RSA_KEY

I'm using this command:
cml-runner --cloud aws --cloud-region us-east-1 --cloud-type g3.4xlarge --cloud-hdd-size 64 --cloud-aws-security-group="cml-runners-sg" --labels=cml-runner-gpu --idle-timeout=-1 --cloud-ssh-private="$(cat cml.pem)"

@leoitcode
Author

Just adding the CI job log of the deploy_job step:
deploy_job.txt

and the train_job step:
job_log.txt

@leoitcode changed the title from Can't use AWS Istance GPU on GITLAB CI and CML-RUNNER to Can't use AWS Instance GPU on GITLAB CI and CML-RUNNER Dec 17, 2021
@dacbd
Contributor

dacbd commented Dec 17, 2021

@leoitcode My guess is that it is having trouble parsing the output from cat as an argument with the spaces/dashes, and the like...

🤔 there might be a bug in handling that option, this is the first time I've seen it used. In the meantime, you can probably use the AWS web console to connect to the instance instead of trying to pass your private key.

@leoitcode
Author

@leoitcode My guess is that it is having trouble parsing the output from cat as an argument with the spaces/dashes, and the like...

🤔 there might be a bug in handling that option, this is the first time I've seen it used. In the meantime, you can probably use the AWS web console to connect to the instance instead of trying to pass your private key.

I tried to access it via the AWS Console, but I got the following error:

[screenshot of the connection error]

Some usernames that I tried:
ubuntu
ec2-user
root

@gitdoluquita

gitdoluquita commented Dec 17, 2021

@leoitcode My guess is that it is having trouble parsing the output from cat as an argument with the spaces/dashes, and the like...

🤔 there might be a bug in handling that option, this is the first time I've seen it used. In the meantime, you can probably use the AWS web console to connect to the instance instead of trying to pass your private key.

@dacbd I managed to make it work by adding EOF to my pem file:

<< EOF
-----BEGIN RSA PRIVATE KEY-----
MY PRIVATE KEY HERE
-----END RSA PRIVATE KEY-----
EOF

To be honest I have no idea how this works; I just guessed it might work by looking at what you did here:
iterative/terraform-provider-iterative#232 (comment)

Maybe there is a more elegant way of doing this 😆

@gitdoluquita

@leoitcode and I are working on this together.

I connected to the deployed instance and managed to execute the nvidia-smi command:

ubuntu@ip-10-0-0-224:~$ nvidia-smi
Fri Dec 17 18:41:49 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   43C    P0    37W / 150W |      0MiB /  7618MiB |     92%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

But there is no cuda-smi command:

ubuntu@ip-10-0-0-224:~$ cuda-smi
cuda-smi: command not found

Looking at this line, I thought it might be the problem, but I'm not sure:
https://github.com/iterative/cml/blob/master/src/drivers/gitlab.js#L178

Testing container

Inside the instance, if I do this:
docker run -it iterativeai/cml:0-dvc2-base1-gpu bash
I can't run nvidia-smi, as expected.

But if I do this:
docker run -it --gpus all iterativeai/cml:0-dvc2-base1-gpu bash
It works.

I couldn't find where something like this (--gpus all) is set.

Maybe there is a gitlab-runner config file that should be changed to enable something like this: https://docs.gitlab.com/runner/configuration/gpus.html
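For reference, the GitLab docs linked above enable GPUs for the Docker executor through the gpus setting in the runner's config.toml; a quick way to check whether the runner that cml-runner registered has it (the path below is the usual default, not something CML-specific):

# Per the linked GitLab docs, the Docker executor forwards GPUs when config.toml contains:
#
#   [runners.docker]
#     gpus = "all"
#
# Check whether the registered runner has this set (default config path; adjust if yours differs):
sudo grep -A 5 '\[runners.docker\]' /etc/gitlab-runner/config.toml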

I also noticed that GitHub Actions has this option, as in this snippet:

  run:
    needs: deploy-runner
    runs-on: [self-hosted,cml-runner]
    container: 
      image: docker://iterativeai/cml:0-dvc2-base1-gpu
      options: --gpus all
 
    steps:
    - uses: actions/checkout@v2

I'm attaching the result of cat /var/log/*cloud* too, as @0x2b3bfa0 has asked:
result.log
I just edited one line to remove the GL_TOKEN that gets printed, for security reasons.

Let us know if we can provide any other information.

Thanks.

@dacbd
Contributor

dacbd commented Dec 17, 2021

Wish I could help, but I think we need to wait for the world to rotate back to @0x2b3bfa0's side 🌍

I see some npm install errors; you could try adding a startup script with sudo apt-get update && sudo apt-get upgrade -y && sudo apt-get install build-essential? But I wouldn't hold my breath for that to make it work.

Are you using GitLab CI or GitHub Actions?

@DavidGOrtega added the ci-gitlab, cml-runner, and p0-critical (Max priority, ASAP) labels Dec 18, 2021
@DavidGOrtega
Contributor

DavidGOrtega commented Dec 18, 2021

I connected to the deployed instance and managed to execute the nvidia-smi command:

Having seen that nvidia-smi works, CML should have set up the runner with the nvidia executor automatically:

--docker-runtime "${gpu ? 'nvidia' : ''}" \
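That line corresponds roughly to a gitlab-runner registration like the sketch below; the URL, token, image, and tags are placeholders, and the only point being made is the --docker-runtime nvidia flag:

# Rough sketch of the registration implied by the line above (values are placeholders).
gitlab-runner register \
  --non-interactive \
  --executor docker \
  --docker-image iterativeai/cml:0-dvc2-base1-gpu \
  --docker-runtime nvidia \
  --url https://gitlab.com \
  --registration-token "$REGISTRATION_TOKEN" \
  --tag-list cml-runner-gpu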

@gitdoluquita

Having seen that nvidia-smi works, CML should have set up the runner with the nvidia executor automatically

@DavidGOrtega wouldn't these lines make gpu false if cuda-smi is not working?

cml/src/drivers/gitlab.js

Lines 177 to 181 in e338266

try {
  await exec('cuda-smi');
} catch (err) {
  gpu = false;
}

I think this might be the problem.

If not that what else could I be missing?

Are you using GitLab CI or GitHub Actions?

@dacbd I'm using GitLab CI, with the YAML file that @leoitcode posted in the first comment of this issue.

@0x2b3bfa0
Member

If nvidia-smi works, these lines won't run at all.

@gitdoluquita

If nvidia-smi works, these lines won't run at all.

You're right, my bad.

@gitdoluquita

We still can't make this work. Is there anything else we can try, or any other information, logs, etc. that we can provide?

@dacbd
Contributor

dacbd commented Dec 22, 2021

Just adding the CI job log of the deploy_job step: deploy_job.txt

and the train_job step: job_log.txt

I see nvdia-smi bash line: 125? There looks to be a typo in your job?

@leoitcode
Author

Just adding the CI job log of the deploy_job step: deploy_job.txt
and the train_job step: job_log.txt

I see nvdia-smi bash line: 125? There looks to be a typo in your job?

O.o'' @dacbd, thank you so much... I can't believe we couldn't see it...

@dacbd
Contributor

dacbd commented Dec 22, 2021

No worries, it's safe to say we all do it 🙈

@gitdoluquita

OMG I hate typos! 😳

We were biased because in our first try the GPU didn't work, but now we know it was a configuration problem.

Anyway, I'm so sorry for this and thank you all so much for the patience and attention.

We will close this now 🙈🙈🙈

@gitdoluquita

@leoitcode My guess is that it is having trouble parsing the output from cat as an argument with the spaces/dashes, and the like...
🤔 there might be a bug in handling that option, this is the first time I've seen it used. In the meantime, you can probably use the AWS web console to connect to the instance instead of trying to pass your private key.

@dacbd I managed to make it work by adding EOF to my pem file:

<< EOF
-----BEGIN RSA PRIVATE KEY-----
MY PRIVATE KEY HERE
-----END RSA PRIVATE KEY-----
EOF

To be honest I have no idea how this works; I just guessed it might work by looking at what you did here: iterative/terraform-provider-iterative#232 (comment)

Maybe there is a more elegant way of doing this 😆

What about this one, @dacbd? Can I somehow contribute to this? Is this really a bug, or just an encoding problem on my side?

@DavidGOrtega
Contributor

What about this one, @dacbd? Can I somehow contribute to this?

That would be amazing! You could create a PR in TPI

@dacbd
Contributor

dacbd commented Dec 22, 2021

@gitdoluquita if you are looking for a good thing to try and contribute, I think there is some value here.

I would argue that the way the param was used, --param=$(cat key.pem), was an intuitive use which resulted in an error.

I think this most likely has to do with yargs parsing, and that would have to be an upstream change; however, I suspect the --cloud-ssh-private flag is infrequently used.

I would:

  • A) change that to take something to the effect of cat key.pem | base64, so that the -----, \n, etc. are stripped out (sketched after the footnote below)
  • B) change this to take a file path, e.g. /home/user/cml_runner_key.pem, which is read and used
  • C) add a param to TPI that takes a public key and just adds it to the authorized_keys file for the ubuntu user.1

Footnotes

  1. Something I have thought of adding several times, but I always found an alternative means to debug the instance or solved the problem before it bugged me enough to implement. There would be a terraform field like authorized_key = "ecdsa-sha2-nistp384 AAAAE2VjZHNhLXNoYTItbmlzdHAzODQAAAAIbmlzdHAzODQAAABhBDYd3ssa6L15jQC5bckJ2viWlA1tBygxeWoy3s0S14ZHMxUMfp7u2yqficpHO5b+pjgg7Lz+80Ibw157waTZPM+xbF2/KGqS7aYV0L/R8VbWjVEpzxZEeoxSCwFA1tHWUg==" that basically just does echo [key] >> ~/.ssh/authorized_keys.
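As a sketch of option A (an illustration of the proposal, not current cml-runner behaviour), base64-encoding the key turns it into a single token without the dashes and newlines that trip up argument parsing:

# Option A, sketched: encode the key so the flag value is one token
# (GNU base64; -w0 disables line wrapping). Not current cml-runner behaviour.
KEY_B64="$(base64 -w0 cml.pem)"
# ...the receiving side would then restore the original key with:
echo "$KEY_B64" | base64 -d > cml.pem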

@dacbd
Contributor

dacbd commented Dec 22, 2021

What about this one, @dacbd? Can I somehow contribute to this?

That would be amazing! You could create a PR in TPI

@gitdoluquita, if you want help getting started, poke me on Discord; happy to help get you going on anything if you need it. I think our time zones overlap more 🌎

@casperdcl
Contributor

casperdcl commented Dec 23, 2021

Opened #852 for the SSH key passing issue :) PRs welcome!

@gitdoluquita

What about this one, @dacbd? Can I somehow contribute to this?

That would be amazing! You could create a PR in TPI

@gitdoluquita, if you want help getting started, poke me on Discord; happy to help get you going on anything if you need it. I think our time zones overlap more 🌎

That would be awesome, @dacbd! I'm planning to work on this soon. What is your nickname there? I'm on the DVC Discord server as luccasqdrs.

@dacbd
Contributor

dacbd commented Dec 24, 2021

@gitdoluquita dabarnes
