Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compiler: Unified Memory Allocator #2023

Open
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

guaacoelho
Copy link

Hello everyone,

We of SENAI CIMATEC are working in a Unified Memory Allocator to Devito trough CuPy library.

The first results using this new allocator have impressive results when using checkpointing compared to default allocator in GPU.

The impact of performance in our experiment using Overthrust (894x884x299) are close to three times* compared to default allocator in Devito.

With this approach we expect to be able to allocate memory beyond the GPU capacity in the future.

We will open a Draft PR to standardize this allocator with Devito patterns, fix possible bugs and open to community use :)

A version enabling this through External Allocator is in development too and we expect share this soon.

All feedbacks are welcome :)

Thank you all CIMATEC and Devito team, to make this possible 🙂

*All experiments were using Nvidia V100 with 32 GB of memory.

@guaacoelho
Copy link
Author

It was used the version 8.3.0 of Cupy

@speglich
Copy link
Contributor

ToDo:

  • Distributed GPU allocation

@@ -82,7 +82,8 @@ def __del__(self):
# Dask/Distributed context), which may (re)create a Data object
# without going through `__array_finalize__`
return
self._allocator.free(*self._memfree_args)
if self._allocator is not CUPY_ALLOC:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, you should override free instead

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in fact, free is a no-op, so why the need to special-case here?

Copy link
Contributor

@FabioLuporini FabioLuporini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! I left a few comments.

There also are other issues, in particular lack of tests, update to installation, etc. You may want to take a look at this file: https://github.com/devitocodes/devito/blob/master/CONTRIBUTING.md#contributing-to-devito

@@ -15,7 +16,7 @@

__all__ = ['ALLOC_FLAT', 'ALLOC_NUMA_LOCAL', 'ALLOC_NUMA_ANY',
'ALLOC_KNL_MCDRAM', 'ALLOC_KNL_DRAM', 'ALLOC_GUARD',
'default_allocator']
'CUPY_ALLOC', 'default_allocator']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ALLOC_CUPY for homogeneity

class CupyAllocator(MemoryAllocator):

"""
Memory allocator based on ``posix`` functions. The allocated memory is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy-paste docstring

aligned to page boundaries.
"""

is_Posix = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover.....

mem_obj = cp.zeros(size, dtype=cp.float64)
return mem_obj.data.ptr, mem_obj

def free(self, c_pointer):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CuPy frees the memory right? unless we explicitly tell it at some point?

@@ -82,7 +82,8 @@ def __del__(self):
# Dask/Distributed context), which may (re)create a Data object
# without going through `__array_finalize__`
return
self._allocator.free(*self._memfree_args)
if self._allocator is not CUPY_ALLOC:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in fact, free is a no-op, so why the need to special-case here?

@@ -435,6 +437,9 @@ def _map_function_on_high_bw_mem(self, site, obj, storage, devicerm, read_only=F
"""
mmap = self.lang._map_to(obj)

if obj._allocator is CUPY_ALLOC:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this breaks the abstraction in all sort of ways, unfortunately

You can't access the allocator here to steer compilation

what really matters at this point is the _mem_spaceof the object: https://github.com/devitocodes/devito/blob/master/devito/types/basic.py#L39

you shouldn't actually end up here, because GPU-allocated functions shuold have a local mem-space, which in turns naturally prevents them ever enter this point

@guaacoelho guaacoelho changed the title [draft] Unified Memory Allocator compiler: Unified Memory Allocator Feb 9, 2023
@speglich
Copy link
Contributor

@FabioLuporini, we added tests to this allocator, could we proceed with this PR?

@@ -206,6 +207,26 @@ def test_indexing_into_sparse(self):
sf.data[1:-1, 0] = np.arange(8)
assert np.all(sf.data[1:-1, 0] == np.arange(8))

def test_uma_allocation(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a test_external_allocator somewhere in this file. Could you move that test, and this test, within a new class, say TestAllocators?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the PR with this change

@FabioLuporini
Copy link
Contributor

@FabioLuporini, we added tests to this allocator, could we proceed with this PR?

The issue is that CuPy is an optional dependency (shipped by the NVidia SDK I'm guessing?), so if I try to run CI here, it will undoubtedly break.

So, we need conditional imports, a mechanism like this to emit suitable error messages in case one attempts to use the allocator but the allocator isn't available, etc etc

Copy link
Contributor

@FabioLuporini FabioLuporini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on what I said above, a few tweaks are still necessary here

@mloubout
Copy link
Contributor

mloubout commented Mar 8, 2023

Should this be added to the GPU CI and actually tested on the nvidia run?

@guaacoelho
Copy link
Author

@FabioLuporini, we added tests to this allocator, could we proceed with this PR?

The issue is that CuPy is an optional dependency (shipped by the NVidia SDK I'm guessing?), so if I try to run CI here, it will undoubtedly break.

So, we need conditional imports, a mechanism like this to emit suitable error messages in case one attempts to use the allocator but the allocator isn't available, etc etc

We've updated the allocators.py file to use conditional import for cupy. I think this solves this problem.

# try:
# from mpi4py import MPI # noqa
# except ImportError:
# MPI = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover

cls.lib = cp
cls._initialize_shared_memory()
try:
from mpi4py import MPI
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why reimport and not import it from devito.mpi?

from mpi4py import MPI
cls.MPI = MPI
cls._set_device_for_mpi()
except:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

except ImportError other errors should be caught

def _alloc_C_libcall(self, size, ctype):
if not self.available():
raise ImportError("Couldn't initialize cupy or MPI elements of alocation")
mem_obj = self.lib.zeros(size, dtype=self.lib.float64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always float64? I thoiught size was for UInt8

ctype_1d = ctype * size
buf = ctypes.cast(c_pointer, ctypes.POINTER(ctype_1d)).contents
pointer = np.frombuffer(buf, dtype=dtype)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we end up here? the c_pointer is None case is already above

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During the execution in MPI, domain splitting can generate a situation where the allocated data size is zero, as we have observed with Sparse Functions. When this occurs, Cupy returns a pointer with a value of zero. This conditional statement was defined for this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment noting this, until some better solution is around?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will push it, George.

@@ -1,8 +1,10 @@
import pytest
import numpy as np
import cupy as cp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to be added somehow to the test requirements and this step should be decoratred with a skipif(device)

@mloubout
Copy link
Contributor

Can you check if #2171 fixes the install issue

@speglich
Copy link
Contributor

@mloubout, definitely, we will test it. BTW, this docker image worked too, thanks to you.

FROM nvcr.io/nvidia/nvhpc:23.5-devel-cuda12.1-ubuntu22.04 AS devel

RUN apt-get update -y && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        git \
        make \
        wget && \
    rm -rf /var/lib/apt/lists/*

RUN apt-get update -y && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        python3 \
        python3-dev \
        python3-pip \
        python3-setuptools \
        python3-wheel \
        python3.10-venv && \
    rm -rf /var/lib/apt/lists/*

RUN pip3 --no-cache-dir install cupy-cuda12x

@mloubout
Copy link
Contributor

mloubout commented Aug 2, 2023

Ok the nvidia setup has been updated so please:

  • rebase and answer comments
  • add the test to the nvidia test suite
    For the second point you may have to modify the pytest-gpu.yml workflow file to add that extra test to it

def initialize(cls):

try:
import cupy as cp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move try import for cupy at the top

try:
  import cupy as cp
except importerror:
  cp = None

and just check if None here.

MPI should just be imported at the top from devito.mpi

@@ -1473,6 +1475,55 @@ def test_gather_time_function(self):
assert ans == np.array(None)


class TestAllocators(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to test_gpu_common as TestCupyAllocator with an nividia device skip so it's added to GPU CI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants