Attempt to query device count via NVML #14631

Merged
merged 19 commits into from
Sep 22, 2022
Conversation

awaelchli
Contributor

@awaelchli awaelchli commented Sep 9, 2022

What does this PR do?

Redo of #14319
Note: This is an enhancement/improvement of the previous bugfix, not addressing any new bugs.

  • User does not need to set an environment variable
  • Faster: no new processes need to be launched in order to evaluate the functions

This adopts the solution in pytorch/pytorch#84879 for Lightning when running with PyTorch < 1.13.
A minor follow-up is still pending: pytorch/pytorch#85024
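For context, the NVML approach from pytorch/pytorch#84879 can be sketched roughly as follows: the driver library is loaded directly, so the CUDA runtime is never initialized and a later `fork()` stays safe. This is a minimal illustration using `ctypes` and the public NVML symbols (`nvmlInit_v2`, `nvmlDeviceGetCount_v2`, `nvmlShutdown`), not Lightning's exact implementation:

```python
import ctypes


def nvml_device_count() -> int:
    """Return the number of NVIDIA devices via NVML, or 0 if unavailable.

    Loading libnvidia-ml directly avoids initializing the CUDA runtime,
    which would otherwise make a subsequent fork() unsafe.
    """
    try:
        nvml = ctypes.CDLL("libnvidia-ml.so.1")
    except OSError:
        return 0  # NVIDIA driver not installed
    if nvml.nvmlInit_v2() != 0:  # 0 == NVML_SUCCESS
        return 0
    count = ctypes.c_uint()
    rc = nvml.nvmlDeviceGetCount_v2(ctypes.byref(count))
    nvml.nvmlShutdown()
    return count.value if rc == 0 else 0
```

On a machine without an NVIDIA driver this simply returns 0 instead of raising, which is the behavior the fallback path relies on.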

Validated with nightly pytorch 1.13 locally, using this simple script:

import torch
import torch.multiprocessing as mp
from lightning_lite.utilities.imports import _TORCH_GREATER_EQUAL_1_13
from lightning_lite.utilities.device_parser import num_cuda_devices, is_cuda_available


def worker(rank):
    print("successfully forked", rank)
    torch.cuda.set_device(rank)


def run():
    print("torch version", torch.__version__)
    print("greater than 1.13?", _TORCH_GREATER_EQUAL_1_13)

    # Old functions: these initialize CUDA in the parent process,
    # which makes a later fork() unsafe on torch < 1.13.
    torch.cuda.device_count()
    # torch.cuda.is_available()

    # New fork-safe functions (uncomment to test):
    # print("num_cuda_devices:", num_cuda_devices())
    # print("available:", is_cuda_available())

    mp.start_processes(worker, nprocs=2, start_method="fork")


if __name__ == "__main__":
    run()

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

I made sure I had fun coding 🙃

cc @Borda @akihironitta

@awaelchli awaelchli added feature Is an improvement or enhancement accelerator: cuda Compute Unified Device Architecture GPU labels Sep 9, 2022
@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Sep 9, 2022
@awaelchli awaelchli added this to the pl:1.7.x milestone Sep 14, 2022
@awaelchli awaelchli added bug Something isn't working and removed feature Is an improvement or enhancement labels Sep 14, 2022
@awaelchli awaelchli self-assigned this Sep 14, 2022
@awaelchli awaelchli marked this pull request as ready for review September 14, 2022 21:14
Contributor

@carmocca carmocca left a comment

Great that this was fixed!

@mergify mergify bot added the ready PRs ready to be merged label Sep 16, 2022
@codecov

codecov bot commented Sep 21, 2022

Codecov Report

Merging #14631 (3ae5d4e) into master (31788db) will increase coverage by 1%.
The diff coverage is 25%.

❗ Current head 3ae5d4e differs from pull request most recent head ed6628a. Consider uploading reports for the commit ed6628a to get more accurate results

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #14631     +/-   ##
=========================================
+ Coverage      84%      85%     +1%     
=========================================
  Files         395      276    -119     
  Lines       28894    21238   -7656     
=========================================
- Hits        24236    18102   -6134     
+ Misses       4658     3136   -1522     

@awaelchli awaelchli enabled auto-merge (squash) September 22, 2022 09:16
@awaelchli awaelchli disabled auto-merge September 22, 2022 09:19
Labels

  • accelerator: cuda — Compute Unified Device Architecture GPU
  • bug — Something isn't working
  • pl — Generic label for PyTorch Lightning package
  • ready — PRs ready to be merged
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants