Can't use AWS Instance GPU on GITLAB CI and CML-RUNNER #848

leoitcode · 2021-12-17T12:18:04Z

I have this gitlab-ci.yml:

stages:
  - test
  - deploy
  - train

sast:
  stage: test
include:
- template: Security/SAST.gitlab-ci.yml

deploy_job:
  stage: deploy
  when: always
  image: iterativeai/cml:0-dvc2-base1
  script:
    - cml-runner
      --cloud aws
      --cloud-region us-east-1
      --cloud-type g3.4xlarge
      --cloud-hdd-size 64
      --cloud-aws-security-group="cml-runners-sg"
      --labels=cml-runner-gpu
      --idle-timeout=120
train_job:
  stage: train
  when: on_success
  image: iterativeai/cml:0-dvc2-base1-gpu
  tags:
    - cml-runner-gpu
  before_script:
    - pip install poetry
    - poetry --version
    - poetry config virtualenvs.create false
    - poetry install -vv
    - nvdia-smi
  script:
    # DVC Stuff
    - dvc pull
    - dvc repro -m
    - dvc push
    # Report metrics
    - echo "## Metrics" >> report.md
    - echo "\`\`\`json" >> report.md
    - cat metrics/best-meta.json >> report.md
    - echo "\`\`\`" >> report.md
    # Report GPU details
    - echo "## GPU info" >> report.md
    - cat gpu_info.txt >> report.md
    # Send comment
    - cml-send-comment report.md

But the container can't recognize the driver or the GPU. Running the nvidia-smi command gave the following error:

/usr/bin/bash: line 133: nvdia-smi: command not found

I concluded that iterativeai/cml:0-dvc2-base1-gpu can't use the instance GPU. How can I install the NVIDIA drivers and nvidia-docker, and enable the --gpus option for this container?

Thank you

0x2b3bfa0 · 2021-12-17T12:37:49Z

👋 Hello, @leoitcode! Can you please connect to the instance through SSH and retrieve some additional information?

$ npm --version
$ cat /var/log/*cloud*
$ nvidia-smi

leoitcode · 2021-12-17T14:24:45Z

Hello

> 👋 Hello, @leoitcode! Can you please connect to the instance through SSH and retrieve some additional information?
> $ npm --version
> $ cat /var/log/*cloud*
> $ nvidia-smi

I tried to follow the tutorial, but I'm getting this error:

Unknown argument: --cloud-ssh-private=MY_RSA_KEY

I'm using this command:
cml-runner --cloud aws --cloud-region us-east-1 --cloud-type g3.4xlarge --cloud-hdd-size 64 --cloud-aws-security-group="cml-runners-sg" --labels=cml-runner-gpu --idle-timeout=-1 --cloud-ssh-private="$(cat cml.pem)"

leoitcode · 2021-12-17T14:56:20Z

Just adding the job log on CI of the deploy_job step:
deploy_job.txt

and the train_job step:
job_log.txt

dacbd · 2021-12-17T16:58:39Z

@leoitcode My guess is that it is having trouble parsing the output from cat as an argument with the spaces/dashes, and the like...

🤔 there might be a bug in handling that option, this is the first time I've seen it used. In the meantime, you can probably use the AWS web console to connect to the instance instead of trying to pass your private key.

leoitcode · 2021-12-17T17:54:57Z

> @leoitcode My guess is that it is having trouble parsing the output from cat as an argument with the spaces/dashes, and the like...
>
> 🤔 there might be a bug in handling that option, this is the first time I've seen it used. In the meantime, you can probably use the AWS web console to connect to the instance instead of trying to pass your private key.

I tried to connect through the AWS Console, but I got the following error:

Some usernames that I tried:
ubuntu
ec2-user
root

gitdoluquita · 2021-12-17T18:29:17Z

> @leoitcode My guess is that it is having trouble parsing the output from cat as an argument with the spaces/dashes, and the like...
>
> 🤔 there might be a bug in handling that option, this is the first time I've seen it used. In the meantime, you can probably use the AWS web console to connect to the instance instead of trying to pass your private key.

@dacbd I managed to make it work by adding EOF to my pem file:

<< EOF
-----BEGIN RSA PRIVATE KEY-----
MY PRIVATE KEY HERE
-----END RSA PRIVATE KEY-----
EOF

To be honest, I have no idea how this works. I just guessed it might help after looking at what you did here:
iterative/terraform-provider-iterative#232 (comment)

Maybe there is a more elegant way of doing this 😆

gitdoluquita · 2021-12-17T19:14:23Z

@leoitcode and I are working together on this.

I connected to the deployed instance and managed to execute the nvidia-smi command:

ubuntu@ip-10-0-0-224:~$ nvidia-smi
Fri Dec 17 18:41:49 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   43C    P0    37W / 150W |      0MiB /  7618MiB |     92%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

But there is no cuda-smi command:

ubuntu@ip-10-0-0-224:~$ cuda-smi
cuda-smi: command not found

Looking at this line, I thought this might be the problem, but I'm not sure:
https://github.com/iterative/cml/blob/master/src/drivers/gitlab.js#L178

Testing container

Inside of the instance, if I do this:
docker run -it iterativeai/cml:0-dvc2-base1-gpu bash
I can't run nvidia-smi, as expected.

But if I do this:
docker run -it --gpus all iterativeai/cml:0-dvc2-base1-gpu bash
It works.

I couldn't find where something like this (--gpus all) is set.

Maybe there is a config file for the gitlab-runner that should be changed for something like this: https://docs.gitlab.com/runner/configuration/gpus.html

I also noticed that GitHub Actions has this option, as in this snippet:

  run:
    needs: deploy-runner
    runs-on: [self-hosted,cml-runner]
    container: 
      image: docker://iterativeai/cml:0-dvc2-base1-gpu
      options: --gpus all
 
    steps:
    - uses: actions/checkout@v2

I'm also attaching the result of cat /var/log/*cloud*, as @0x2b3bfa0 asked:
result.log
I edited one line to remove the GL_TOKEN that is printed there, for security reasons.

Let us know if we can provide any other information.

Thanks.

dacbd · 2021-12-17T20:57:37Z

Wish I could help but I think we need to wait for the world to rotate back to @0x2b3bfa0's side 🌍

I see some npm install errors; you could try adding a startup script with sudo apt-get update && sudo apt-get upgrade -y && sudo apt-get install build-essential, but I wouldn't hold my breath for that to make it work.

Are you using gitlab-ci or github actions?

DavidGOrtega · 2021-12-18T13:38:28Z

> I connected to the deployed instance and managed to execute the nvidia-smi command:

Given that nvidia-smi works, cml should have set up the runner with the nvidia executor automatically:

cml/src/drivers/gitlab.js

Line 204 in e338266

--docker-runtime "${gpu ? 'nvidia' : ''}" \

gitdoluquita · 2021-12-18T18:22:19Z

> Given that nvidia-smi works, cml should have set up the runner with the nvidia executor automatically

@DavidGOrtega wouldn't these lines set gpu to false if cuda-smi is not working?

cml/src/drivers/gitlab.js

Lines 177 to 181 in e338266

    
    try {
      await exec('cuda-smi');
    } catch (err) {
      gpu = false;
    }

I think this might be the problem.

If not that, what else could I be missing?

> Are you using gitlab-ci or github actions?

@dacbd I'm using gitlab-ci, with the yaml file that @leoitcode has posted in the first comment of the issue.

0x2b3bfa0 · 2021-12-18T19:24:49Z

If nvidia-smi works, these lines won't run at all.
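
For anyone following along, the fallback as it is described in this thread can be sketched in shell. This is only my reading of the discussion, not the actual gitlab.js code: the function name detect_docker_runtime and its two probe parameters are hypothetical, and the probes are parameters only so the behaviour can be shown without a GPU.

```shell
# Sketch only: mirrors the fallback described in this thread, not the real gitlab.js.
# $1 and $2 are the primary and fallback probe commands (hypothetical parameters).
detect_docker_runtime() {
  primary="${1:-nvidia-smi}"
  fallback="${2:-cuda-smi}"
  if "$primary" >/dev/null 2>&1; then
    # Primary probe succeeded: the fallback probe is never executed.
    echo "nvidia"
  elif "$fallback" >/dev/null 2>&1; then
    # Primary probe failed but the fallback probe succeeded.
    echo "nvidia"
  else
    # Both probes failed: treat the machine as having no GPU.
    echo ""
  fi
}

# Stand-in probes: `true` always succeeds, `false` always fails.
detect_docker_runtime true false
```

With a working first probe the second one is never reached, which is why a missing cuda-smi does not matter on an instance where nvidia-smi runs.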

gitdoluquita · 2021-12-18T20:52:17Z

> If nvidia-smi works, these lines won't run at all.

You're right, my bad.

gitdoluquita · 2021-12-22T14:34:39Z

We still can't make this work. Is there anything else we can try, or any other information or logs that we can provide?

dacbd · 2021-12-22T15:51:03Z

> Just adding the job log on CI of the deploy_job step: deploy_job.txt
>
> and the train_job step: job_log.txt

I see nvdia-smi at bash line 125. There looks to be a typo in your job?
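
One way to make this kind of mistake easier to spot is to check that the command name resolves before relying on it. A minimal sketch, assuming a POSIX shell; require_cmd is a hypothetical helper, not something CML provides:

```shell
# Hypothetical helper (not part of CML): fail early with a clear message
# when a command name cannot be resolved, for example because of a typo.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "missing command: $1 (check spelling and PATH)" >&2
  return 1
}

# The misspelled name from the job is reported as missing.
require_cmd nvdia-smi || echo "nvdia-smi not found"
```

A "command not found" for a misspelled name looks exactly like the error for a tool that is really absent, so a message that asks to check the spelling can save some time.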

leoitcode · 2021-12-22T16:54:32Z

> > Just adding the job log on CI of the deploy_job step: deploy_job.txt
> > and the train_job step: job_log.txt
>
> I see nvdia-smi at bash line 125. There looks to be a typo in your job?

O.o'' @dacbd thank you so much. I can't believe we didn't see it.

dacbd · 2021-12-22T16:58:55Z

no worries, it's safe to say we all do it 🙈

gitdoluquita · 2021-12-22T17:28:37Z

OMG I hate typos! 😳

We were biased because in our first try the GPU didn't work, but now we know it was a configuration problem.

Anyway, I'm so sorry for this and thank you all so much for the patience and attention.

We will close this now 🙈🙈🙈

gitdoluquita · 2021-12-22T17:30:13Z

> > @leoitcode My guess is that it is having trouble parsing the output from cat as an argument with the spaces/dashes, and the like...
> > 🤔 there might be a bug in handling that option, this is the first time I've seen it used. In the meantime, you can probably use the AWS web console to connect to the instance instead of trying to pass your private key.
>
> @dacbd I managed to make it work by adding EOF to my pem file:
> << EOF
> -----BEGIN RSA PRIVATE KEY-----
> MY PRIVATE KEY HERE
> -----END RSA PRIVATE KEY-----
> EOF
> To be honest, I have no idea how this works. I just guessed it might help after looking at what you did here: iterative/terraform-provider-iterative#232 (comment)
>
> Maybe there is a more elegant way of doing this 😆

What about this one @dacbd ? Can I contribute somehow with this? Is this really a bug, or is it just an encoding problem on my side?

DavidGOrtega · 2021-12-22T18:04:32Z

> What about this one @dacbd ? Can I contribute somehow with this?

That would be amazing! You could create a PR in TPI

dacbd · 2021-12-22T18:07:32Z

@gitdoluquita if you are looking for a good thing to try and contribute, I think there is some value here.

I would argue that how the param was used --param=$(cat key.pem) was an intuitive use, which resulted in an error.

I think that this most likely has to do with yargs parsing and that would have to be an upstream change, however, I suspect that --cloud-ssh-private flag is infrequently used.

I would do one of the following:

A) change that to take something to the effect of cat key.pem | base64 so that the -----, \n, spaces, etc. are stripped out
B) change this to take a file path /home/user/cml_runner_key.pem which is read and used.
C) add a param to TPI that takes a public key and just adds it to the authorized_keys file for the ubuntu user.¹

¹ Something I have thought of adding several times, but I always found an alternative means to debug the instance or solved the problem before it bugged me enough to implement it. There would be a terraform field like authorized_key = "ecdsa-sha2-nistp384 AAAAE2VjZHNhLXNoYTItbmlzdHAzODQAAAAIbmlzdHAzODQAAABhBDYd3ssa6L15jQC5bckJ2viWlA1tBygxeWoy3s0S14ZHMxUMfp7u2yqficpHO5b+pjgg7Lz+80Ibw157waTZPM+xbF2/KGqS7aYV0L/R8VbWjVEpzxZEeoxSCwFA1tHWUg==" that basically just runs echo [key] >> ~/.ssh/authorized_keys
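
To illustrate option A, here is a rough sketch of the base64 round trip. It only shows the encoding idea and is not how cml-runner handles --cloud-ssh-private today. The helper names encode_key and decode_key are hypothetical, the file holds dummy content rather than a real key, and base64 -d assumes GNU coreutils or BusyBox.

```shell
# Sketch of option A: base64 turns a multi-line key into a single token
# made only of letters, digits, '+', '/' and '='.
encode_key() {
  # $1: path to the key file; line wrapping added by base64 is removed.
  base64 < "$1" | tr -d '\n'
}

decode_key() {
  # $1: the single-line base64 string produced by encode_key.
  printf '%s' "$1" | base64 -d
}

# Demo with a dummy file (not a real key).
key_file="$(mktemp)"
printf '%s\n' '-----BEGIN RSA PRIVATE KEY-----' 'MY PRIVATE KEY HERE' '-----END RSA PRIVATE KEY-----' > "$key_file"
encoded="$(encode_key "$key_file")"
decode_key "$encoded"
rm -f "$key_file"
```

The encoded value contains no dashes, spaces, or newlines, so it can be passed as one argument, and the receiving side would decode it before use.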

dacbd · 2021-12-22T18:12:43Z

> > What about this one @dacbd ? Can I contribute somehow with this?
>
> That would be amazing! You could create a PR in TPI

@gitdoluquita if you want help getting started poke me on discord, happy to help get you going on anything if you need it. I think our time zones overlap more 🌎

casperdcl · 2021-12-23T07:44:29Z

Opened #852 for the SSH key passing issue :) PRs welcome!

gitdoluquita · 2021-12-24T14:57:26Z

> > > What about this one @dacbd ? Can I contribute somehow with this?
> >
> > That would be amazing! You could create a PR in TPI
>
> @gitdoluquita if you want help getting started poke me on discord, happy to help get you going on anything if you need it. I think our time zones overlap more 🌎

That would be awesome, @dacbd. I'm planning to work on this soon. What is your nickname there? I'm in the DVC discord server as luccasqdrs.

dacbd · 2021-12-24T15:26:21Z

@gitdoluquita dabarnes

leoitcode changed the title ~~Can't use AWS Istance GPU on GITLAB CI and CML-RUNNER~~ Can't use AWS Instance GPU on GITLAB CI and CML-RUNNER Dec 17, 2021

DavidGOrtega added ci-gitlab cml-runner Subcommand p0-critical Max priority (ASAP) labels Dec 18, 2021

casperdcl mentioned this issue Dec 23, 2021

difficult to set --cloud-ssh-private #852

Closed

casperdcl closed this as completed Dec 23, 2021
