Google Cloud Deployment Fails when Installing the GRID Drivers #21

Open
renanmb opened this issue Nov 28, 2024 · 2 comments


renanmb commented Nov 28, 2024

After the update to Isaac Sim 4.2 and the new Isaac Lab release, I tried to use IsaacAutomator to deploy instances on GCP. My previous deployments with Isaac Sim 4.1 worked, and my AWS deployment with Isaac Sim 4.2 works, so I have reason to believe that Google changed something on their hypervisor: I also have issues installing the NVIDIA drivers by hand, so this problem extends beyond IsaacAutomator. The Ansible roles must be reviewed for GCP, because they no longer install the drivers reliably and they contain some infinite loops that are not accounted for.

Here is the command I used to deploy:

./deploy-gcp --ngc-api-key NGC-KEY --project PROJECT-ID --deployment-name test-gcp-issacsim --isaac --isaac-gpu-count 1 --isaac-instance-type g2-standard-32 --isaac-image nvcr.io/nvidia/isaac-sim:4.2.0 --vnc-password 123456 --zone us-west1-a --oige no --isaaclab v1.3.0

Error message:

TASK [nvidia : GCP / Install GRID driver] *********************************************************************************************
fatal: [34.168.67.158]: FAILED! => {"changed": true, "cmd": "/tmp/nvidia_driver.run --x-module-path=/usr/lib/xorg/modules/drivers --run-nvidia-xconfig  --disable-nouveau --no-questions --silent", "delta": "0:00:13.559158", "end": "2024-11-28 21:26:31.862749", "msg": "non-zero return code", "rc": 1, "start": "2024-11-28 21:26:18.303591", "stderr": "\nERROR: An error occurred while performing the step: \"Building kernel modules\". See /var/log/nvidia-installer.log for details.\n\n\nERROR: An error occurred while performing the step: \"Checking to see whether the nvidia kernel module was successfully built\". See /var/log/nvidia-installer.log for details.\n\n\nERROR: The nvidia kernel module was not created.\n\n\nERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.", "stderr_lines": ["", "ERROR: An error occurred while performing the step: \"Building kernel modules\". See /var/log/nvidia-installer.log for details.", "", "", "ERROR: An error occurred while performing the step: \"Checking to see whether the nvidia kernel module was successfully built\". See /var/log/nvidia-installer.log for details.", "", "", "ERROR: The nvidia kernel module was not created.", "", "", "ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com."], "stdout": "Verifying archive integrity... OK\nUncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.129.03........................", "stdout_lines": ["Verifying archive integrity... OK", "Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.129.03........................"]}

PLAY RECAP ****************************************************************************************************************************
34.168.67.158              : ok=34   changed=25   unreachable=0    failed=1    skipped=23   rescued=0    ignored=0

This issue is mostly related to Ansible trying to install the NVIDIA drivers on GCP in a way that is no longer supported.

In src/ansible/roles/nvidia/tasks/nvidia-driver.gcp.yml, the task named "GCP / Install GRID driver" runs the following command:

./nvidia_driver.run --x-module-path=/usr/lib/xorg/modules/drivers --run-nvidia-xconfig --disable-nouveau --no-questions --silent

When I log into the instance and run the command above by hand, I get the following error:

ubuntu@isaac-test-gcp-issacsim:/tmp$ sudo ./nvidia_driver.run --x-module-path=/usr/lib/xorg/modules/drivers --run-nvidia-xconfig --disable-nouveau --no-questions --silent
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.129.03........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: One or more modprobe configuration files to disable Nouveau are already present at:
         /usr/lib/modprobe.d/nvidia-installer-disable-nouveau.conf,
         /etc/modprobe.d/nvidia-installer-disable-nouveau.conf.  Please be sure you have rebooted your system
         since these files were written.  If you have rebooted, then Nouveau may be enabled for other reasons,
         such as being included in the system initial ramdisk or in your X configuration file.  Please consult
         the NVIDIA driver README and your Linux distribution's documentation for details on how to correctly
         disable the Nouveau kernel driver.


ERROR: An error occurred while performing the step: "Building kernel modules". See
       /var/log/nvidia-installer.log for details.


ERROR: An error occurred while performing the step: "Checking to see whether the nvidia kernel module was
       successfully built". See /var/log/nvidia-installer.log for details.


ERROR: The nvidia kernel module was not created.


ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find
       suggestions on fixing installation problems in the README available on the Linux driver download page at
       www.nvidia.com.
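
The installer output above only points at /var/log/nvidia-installer.log for the actual build failure. As a rough aid for pulling the relevant lines out of that log, here is a minimal Python sketch. The function name `extract_errors` and the assumption that the interesting lines contain the word "error" are illustrative guesses, not something taken from the NVIDIA documentation, and reading the log may require root.

```python
from pathlib import Path

# Path named in the installer output above.
LOG_PATH = Path("/var/log/nvidia-installer.log")


def extract_errors(log_text, context=2):
    """Return lines mentioning an error, each with a few preceding lines of context."""
    lines = log_text.splitlines()
    picked = []
    seen = set()
    for i, line in enumerate(lines):
        # Assumption: failures are flagged by the word "error" (any case).
        if "error" in line.lower():
            for j in range(max(0, i - context), i + 1):
                if j not in seen:
                    seen.add(j)
                    picked.append(lines[j])
    return picked


if __name__ == "__main__":
    if LOG_PATH.exists():
        print("\n".join(extract_errors(LOG_PATH.read_text(errors="replace"))))
    else:
        print(f"{LOG_PATH} not found")
```

The context lines matter because a kernel module build failure is usually explained by the compiler output just before the line that reports it.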

I tried fixing the Ansible role by following the installation instructions in the GCP docs: https://cloud.google.com/compute/docs/gpus/install-grid-drivers#debianubuntu

That installs the drivers, but the deployment is still unable to complete the setup. Other issues arise, one of them being an infinite loop in autorun.yml: xset is unable to open the display, and the task keeps waiting for it.
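
The contents of autorun.yml are not reproduced in this thread, so the following is only a sketch of the general remedy for an unbounded wait: poll `xset q` against a deadline and give up with a failure instead of looping forever. The function names, the `:0` display and the timeout values are assumptions made for illustration.

```python
import os
import subprocess
import time


def display_ready(display=":0"):
    """Return True if `xset q` can open the given X display."""
    env = dict(os.environ, DISPLAY=display)
    try:
        result = subprocess.run(
            ["xset", "q"],
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=5,
        )
    except (OSError, subprocess.TimeoutExpired):
        # xset missing, not executable, or hung: treat as "not ready".
        return False
    return result.returncode == 0


def wait_for_display(probe=display_ready, timeout=120.0, interval=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `probe` until it succeeds or `timeout` seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller would exit non-zero when `wait_for_display()` returns False, so the deployment fails with a clear message rather than hanging.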


renanmb commented Nov 29, 2024

I don't think I can open a PR, so I will post the solution here:

In the Isaac Sim docs the recommended way is to run the script found here:

https://github.com/GoogleCloudPlatform/compute-gpu-installation/blob/main/linux/install_gpu_driver.py

https://docs.omniverse.nvidia.com/isaacsim/latest/installation/install_advanced_cloud_setup_gcp.html

So fixing the Ansible role requires only some minor changes.

In config.py, line 35 needs to include the following:

# gcp driver
# @see https://cloud.google.com/compute/docs/gpus/grid-drivers-table
c["gcp_driver_url"] = (
    # "https://storage.googleapis.com/nvidia-drivers-us-public/GRID/vGPU16.2/NVIDIA-Linux-x86_64-535.129.03-grid.run"
    "https://raw.githubusercontent.com/GoogleCloudPlatform/compute-gpu-installation/main/linux/install_gpu_driver.py"
)

In nvidia-driver.gcp.yml, from line 35 down, I commented out the code that is no longer necessary: the GRID drivers are no longer needed and don't work at all. So we should just run the install script, as recommended in the Isaac Sim docs and in the GCP docs.

# #######################################################
# This method doesnt work anymore on the Google Cloud
# download driver
# - name: GCP / Download GRID driver
#   get_url:
#     url: "{{ gcp_driver_url }}"
#     dest: /tmp/nvidia_driver.run
#     mode: 0755

# - name: GCP / Install GRID driver
#   shell: "/tmp/nvidia_driver.run \
#     --x-module-path=/usr/lib/xorg/modules/drivers \
#     --run-nvidia-xconfig  \
#     --disable-nouveau \
#     --no-questions \
#     --silent"

# #######################################################

- name: GCP / Download script to install GPU driver
  get_url:
    url: "{{ gcp_driver_url }}"
    dest: /tmp/install_gpu_driver.py
    mode: 0755

- name: GCP / Install gpu driver
  shell: "python3 /tmp/install_gpu_driver.py"

# #######################################################
- name: GCP / Enable persistent mode for the driver
  shell: nvidia-smi -pm ENABLED
# #######################################################
# The following commands need to be double checked
# - name: GCP / Copy gridd.conf
#   copy: >
#     src=/etc/nvidia/gridd.conf.template
#     dest=/etc/nvidia/gridd.conf
#     remote_src=true
#     force=no

# - name: GCP / Update GRID config [1]
#   lineinfile:
#     path: /etc/nvidia/gridd.conf
#     line: "{{ item }}"
#     state: present
#   with_items:
#     - "IgnoreSP=FALSE"
#     - "EnableUI=TRUE"

# - name: GCP / Update GRID config [2]
#   lineinfile:
#     path: /etc/nvidia/gridd.conf
#     regexp: "^FeatureType=(.*)$"
#     line: '# FeatureType=\1'
#     state: present
#     backrefs: yes
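
After the install script has run, it may be worth failing the deployment early if no driver is actually loaded. A minimal sketch, assuming `nvidia-smi` is on the PATH once installation succeeds; `parse_driver_versions` and `installed_driver_versions` are hypothetical helper names, not part of IsaacAutomator.

```python
import subprocess


def parse_driver_versions(output):
    """Parse `nvidia-smi --query-gpu=driver_version --format=csv,noheader` output.

    The command prints one driver version per GPU, one per line.
    """
    return [line.strip() for line in output.splitlines() if line.strip()]


def installed_driver_versions():
    """Return the driver version reported for each GPU, or [] if nvidia-smi is missing or fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return []
    return parse_driver_versions(result.stdout)
```

An empty list here would mean the driver did not install or did not load, which is a clearer failure signal than a later step hanging while it waits for the display.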

Additionally, I would like to note that some logic needs to be added to double check the installation and to allow for multiple driver versions, and that the Terraform code might need improvements.
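
One possible reading of "allow for multiple driver versions" is to make the install method a config choice instead of commenting code out. The sketch below follows the style of config.py and uses only the two URLs quoted earlier in this comment. The key names `gcp_driver_sources` and `gcp_driver_method`, the helper `gcp_driver_source`, and the assumption that `c` is a plain dict are all hypothetical and not part of the current IsaacAutomator config.

```python
# Hypothetical extension of config.py; `c` stands in for the config dict used there.
c = {}

c["gcp_driver_sources"] = {
    # GRID .run installer used by the old role (the one reported failing in this issue).
    "grid_run": "https://storage.googleapis.com/nvidia-drivers-us-public/GRID/vGPU16.2/NVIDIA-Linux-x86_64-535.129.03-grid.run",
    # Google's install script, as recommended in the Isaac Sim and GCP docs.
    "install_script": "https://raw.githubusercontent.com/GoogleCloudPlatform/compute-gpu-installation/main/linux/install_gpu_driver.py",
}
c["gcp_driver_method"] = "install_script"


def gcp_driver_source(config):
    """Return (method, url) for the selected install method, or raise on an unknown one."""
    method = config["gcp_driver_method"]
    try:
        return method, config["gcp_driver_sources"][method]
    except KeyError:
        raise ValueError(f"unknown gcp_driver_method: {method!r}") from None
```

The Ansible role could then pick the matching tasks with a `when:` condition on the method, which would keep the old GRID path available for images where it still works.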

@myurasov-nv
Collaborator

@renanmb Thank you so much for reporting and solving this! I will include it asap.
