
Is there an implementation of methods for querying device memory usage? (e.g. torch.cuda.mem_get_info) #28

Open
RrankPyramid opened this issue Apr 5, 2024 · 3 comments

@RrankPyramid

The Accelerate library includes NPU-related support, but it relies on torch_npu.npu.mem_get_info to query the current device memory usage, and the current version of torch_npu does not provide that function. Are there alternative functions that offer functionality similar to mem_get_info?
Relevant code (from accelerate==0.28.0):

def get_max_memory(max_memory: Optional[Dict[Union[int, str], Union[int, str]]] = None):
    """
    Get the maximum memory available if nothing is passed, converts string to int otherwise.
    """
    import psutil

    if max_memory is None:
        if not (torch.cuda.is_available() or is_npu_available() or is_xpu_available()):
            max_memory = {}

        else:
            # Make sure CUDA is initialized on each GPU to have the right memory info.
            if is_npu_available():
                for i in range(torch.npu.device_count()):
                    _ = torch.tensor(0, device=torch.device("npu", i))
                max_memory = {i: torch.npu.mem_get_info(i)[0] for i in range(torch.npu.device_count())}
            elif is_xpu_available():
                for i in range(torch.xpu.device_count()):
                    _ = torch.tensor(0, device=torch.device("xpu", i))
                max_memory = {i: torch.xpu.max_memory_allocated(i) for i in range(torch.xpu.device_count())}
            else:
                for i in range(torch.cuda.device_count()):
                    _ = torch.tensor([0], device=i)
                max_memory = {i: torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())}
        # allocate everything in the mps device as the RAM is shared
        if is_mps_available():
            max_memory["mps"] = psutil.virtual_memory().available
        else:
            max_memory["cpu"] = psutil.virtual_memory().available
        return max_memory

    for key in max_memory:
        if isinstance(max_memory[key], str):
            max_memory[key] = convert_file_size_to_int(max_memory[key])

    # Need to sort the device by type to make sure that we allocate the gpu first.
    # As gpu/npu/xpu are represented by int, we need to sort them first.
    gpu_devices = [k for k in max_memory.keys() if isinstance(k, int)]
    gpu_devices.sort()
    # check if gpu/npu/xpu devices are available and if not, throw a warning
    if is_npu_available():
        num_devices = torch.npu.device_count()
    elif is_xpu_available():
        num_devices = torch.xpu.device_count()
    else:
        num_devices = torch.cuda.device_count()
    for device in gpu_devices:
        if device >= num_devices or device < 0:
            logger.warning(f"Device {device} is not available, available devices are {list(range(num_devices))}")
    # Add the other devices in the preset order if they are available
    all_devices = gpu_devices + [k for k in ["mps", "cpu", "disk"] if k in max_memory.keys()]
    # Raise an error if a device is not recognized
    for k in max_memory.keys():
        if k not in all_devices:
            raise ValueError(
                f"Device {k} is not recognized, available devices are integers(for GPU/XPU), 'mps', 'cpu' and 'disk'"
            )
    max_memory = {k: max_memory[k] for k in all_devices}

    return max_memory
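
One possible stopgap until mem_get_info is available is to approximate it from the caching-allocator statistics. This is only a sketch, assuming torch_npu mirrors torch.cuda's get_device_properties and memory_reserved APIs; the "free" value it returns is an allocator-level estimate, not the driver-level figure that mem_get_info reports.

import torch
import torch_npu  # noqa: F401 -- importing registers the torch.npu backend

def npu_mem_get_info(device: int = 0):
    """Approximate (free, total) in bytes, like torch.cuda.mem_get_info."""
    if hasattr(torch.npu, "mem_get_info"):  # present on newer builds
        return torch.npu.mem_get_info(device)
    total = torch.npu.get_device_properties(device).total_memory
    # Bytes currently held by the caching allocator; everything else is
    # treated as "free", which overestimates if other processes use the NPU.
    reserved = torch.npu.memory_reserved(device)
    return total - reserved, total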
@yunyiyun

yunyiyun commented Apr 7, 2024

The latest main branch already supports torch_npu.npu.mem_get_info.

@RrankPyramid
Author

@yunyiyun Which version of the main branch is that? The version I installed is the release v5.0.1.1-pytorch2.1.0 published on gitee on March 11, and that version indeed does not have this function. The error is as follows:

>>> import torch
>>> torch.__version__
'2.1.1'
>>> import torch_npu
>>> torch_npu.__version__
'2.1.0.post2'
>>> torch_npu.npu.is_available()
True
>>> torch_npu.npu.mem_get_info()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'torch_npu.npu' has no attribute 'mem_get_info'

@yunyiyun

yunyiyun commented Apr 8, 2024

The currently released versions do not support it yet; you need to build the v2.1.0-6.0.rc1 branch from source.
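
After building that branch, a quick sanity check might look like this (a minimal sketch; the two-tuple return mirrors torch.cuda.mem_get_info, i.e. (free, total) in bytes, which is what the accelerate snippet above indexes with [0]):

import torch_npu

free, total = torch_npu.npu.mem_get_info(0)  # (free, total) in bytes
print(f"free={free / 2**30:.2f} GiB, total={total / 2**30:.2f} GiB")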
