compiler: Augment code generation capabilities for CUDA/HIP/SYCL support #1828

Merged 60 commits on Feb 14, 2022

Commits (changes from all commits)
88ba836
compiler: Patch place_definitions
FabioLuporini Jan 17, 2022
12f1b97
compiler: Refactor linearize
FabioLuporini Jan 17, 2022
b11e06d
compiler: Enable allocs/frees with IET nodes too
FabioLuporini Jan 17, 2022
72c2977
compiler: Simplify place_casts
FabioLuporini Jan 17, 2022
f558777
compiler: Patch CGen visitor
FabioLuporini Jan 17, 2022
7480305
compiler: Add IMPLICIT Property for implicit iteration spaces
FabioLuporini Jan 18, 2022
1310826
compiler: Add Global symbols
FabioLuporini Jan 18, 2022
ca7207c
compiler: Refactor FSG Operators
FabioLuporini Jan 19, 2022
dcfb96a
compiler: Add forgotten is_Conditional flag
FabioLuporini Jan 19, 2022
4cfc9c2
compiler: Patch efunc visiting for subclasses
FabioLuporini Jan 19, 2022
9ba7c08
compiler: Enable sparse Cluster fusion through finer grain conditionals
FabioLuporini Jan 20, 2022
0a599e5
compiler: Relocate and simplify blocking preprocessing
FabioLuporini Jan 20, 2022
860d562
compiler: Add blockrelax option to tile all sort of loops
FabioLuporini Jan 20, 2022
8656da0
compiler: Introduce and use Uxreplace
FabioLuporini Jan 21, 2022
66f34b8
compiler: Relax uxreplace
FabioLuporini Jan 21, 2022
b821ac8
compiler: Patch derive_transfers to traverse any IET node
FabioLuporini Jan 21, 2022
f3844f1
compiler: Turn memfree definitions into actual IET objects
FabioLuporini Jan 24, 2022
df06e7b
compiler: Drop unused IET flags
FabioLuporini Jan 24, 2022
cee2c06
compiler: Add VOID sympy extension
FabioLuporini Jan 24, 2022
0b71b4f
compiler: Turn alloc definitions into actual IET objects
FabioLuporini Jan 24, 2022
2329077
compiler: Rework and generalize place_casts
FabioLuporini Jan 24, 2022
84b8f45
compiler: Avoid gen of unused symbols when linearizing
FabioLuporini Jan 26, 2022
2ed2cb4
arch: Bump CudaCompiler std to c++14 for use with thrust/cub
FabioLuporini Jan 26, 2022
0bd2aeb
compiler: Support LocalObject C-level init via custom constructor calls
FabioLuporini Jan 26, 2022
e032f48
compiler: Rework compilation of LocalObject
FabioLuporini Jan 26, 2022
289cc9c
compiler: Patch generation of nested scalars
FabioLuporini Jan 27, 2022
d7ae26c
compiler: Simplify place_definitions
FabioLuporini Jan 27, 2022
d4bddad
compiler: Simplify Expression.is_initializable
FabioLuporini Jan 27, 2022
4bd954f
compiler: Rework place_definitions, for good
FabioLuporini Jan 27, 2022
0fd651a
compiler: Rework defines processing
FabioLuporini Jan 28, 2022
13d45a0
compiler: Separate out objects allocation from data allocations
FabioLuporini Jan 28, 2022
1e45ff3
compiler: Change SparseFunction guarding to maximize Cluster fusion
FabioLuporini Jan 28, 2022
e375ba9
example: Update expected output
FabioLuporini Jan 31, 2022
a60dc75
compiler: Postpone and rework TILABLE analysis
FabioLuporini Jan 31, 2022
48af866
compiler: Patch forgotten openmp-offloading handler
FabioLuporini Jan 31, 2022
8290ed8
gpu: Raise warning in case of CUDA runtime / driver incompat
FabioLuporini Jan 31, 2022
fa94460
compiler: Patch ThreadedProdder reconstruction
FabioLuporini Feb 1, 2022
b5b60aa
examples: Update expected output
FabioLuporini Feb 1, 2022
c78f022
compiler: Make Operator collect only the true input dimensions
FabioLuporini Feb 2, 2022
e238022
compiler: Support blocking with numeric step size
FabioLuporini Feb 2, 2022
954d982
compiler: Add IterationSpace.promote
FabioLuporini Feb 2, 2022
f7b1f0b
compiler: Patch ScheduleTree construction
FabioLuporini Feb 2, 2022
860f135
compiler: Add forgotten Uxreplace handlers
FabioLuporini Feb 2, 2022
2e9d8f5
compiler: Drop unused IMPLICIT property
FabioLuporini Feb 3, 2022
1c5cecd
misc: Fix comments and docstrings
FabioLuporini Feb 3, 2022
248d1d8
compiler: Use _visit instead of visit for homogeneity
FabioLuporini Feb 3, 2022
a584ebd
compiler: Patch FindSymbols
FabioLuporini Feb 4, 2022
ad8aabd
compiler: Simplify FindSymbols
FabioLuporini Feb 4, 2022
308eb3b
compiler: Patch estimate_cost (now distinguishes INT ops correctly)
FabioLuporini Feb 7, 2022
c6578e5
misc: Do not emit itershape of 0-cost sections
FabioLuporini Feb 7, 2022
69b146d
compiler: Add IterationSpace.switch
FabioLuporini Feb 10, 2022
1f28853
compiler: Implement integer block shapes
FabioLuporini Feb 8, 2022
853350b
gpu: Adjust to new par-tile format
FabioLuporini Feb 10, 2022
108814f
examples: Update expected output
FabioLuporini Feb 10, 2022
979202e
api: Extend par-tile opt-option
FabioLuporini Feb 10, 2022
d2ece83
compiler: Hotfix detect_accesses
FabioLuporini Feb 11, 2022
633c4a5
compiler: Polish blocking pass
FabioLuporini Feb 11, 2022
846bff0
compiler: Refactor UnboundedMultiTuple
FabioLuporini Feb 11, 2022
55c6352
misc: Polish leftovers
FabioLuporini Feb 11, 2022
220fe2a
compiler: Refactor VExpanded
FabioLuporini Feb 11, 2022
28 changes: 28 additions & 0 deletions devito/arch/archinfo.py
@@ -13,6 +13,7 @@
from devito.tools import as_tuple, all_equal, memoized_func

__all__ = ['platform_registry', 'get_cpu_info', 'get_gpu_info', 'get_nvidia_cc',
'check_cuda_runtime',
'Platform', 'Cpu64', 'Intel64', 'Amd', 'Arm', 'Power', 'Device',
'NvidiaDevice', 'AmdDevice',
'INTEL64', 'SNB', 'IVB', 'HSW', 'BDW', 'SKX', 'KNL', 'KNL7210', # Intel
@@ -354,6 +355,33 @@ def get_nvidia_cc():
return 10*cc_major.value + cc_minor.value


@memoized_func
def check_cuda_runtime():
    libnames = ('libcudart.so', 'libcudart.dylib', 'cudart.dll')
    for libname in libnames:
        try:
            cuda = ctypes.CDLL(libname)
        except OSError:
            continue
        else:
            break
    else:
        warning("Unable to check compatibility of NVidia driver and runtime")
        return

    driver_version = ctypes.c_int()
    runtime_version = ctypes.c_int()

    if cuda.cudaDriverGetVersion(ctypes.byref(driver_version)) == 0 and \
       cuda.cudaRuntimeGetVersion(ctypes.byref(runtime_version)) == 0:
        driver_version = driver_version.value
        runtime_version = runtime_version.value
        if driver_version < runtime_version:
            warning("The NVidia driver (v%d) on this system may not be compatible "
                    "with the CUDA runtime (v%d)" % (driver_version, runtime_version))
    else:
        warning("Unable to check compatibility of NVidia driver and runtime")


@memoized_func
def lscpu():
try:
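For reference, the CUDA runtime packs a version into a single integer as 1000*major + 10*minor (so 11.2 is reported as 11020), and the check above warns when the driver reports an older version than the runtime. A CUDA-free sketch of that comparison (the helper names are illustrative, not part of the patch):

```python
def decode_cuda_version(v):
    """Split CUDA's packed version integer (1000*major + 10*minor)."""
    return v // 1000, (v % 1000) // 10

def driver_may_be_incompatible(driver_version, runtime_version):
    """Mirror the patch's check: flag a driver older than the runtime."""
    return driver_version < runtime_version

# A CUDA 11.2 runtime atop an 11.0-era driver should trigger the warning
print(decode_cuda_version(11020))                 # (11, 2)
print(driver_may_be_incompatible(11000, 11020))   # True
print(driver_may_be_incompatible(11020, 11000))   # False
```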
11 changes: 9 additions & 2 deletions devito/arch/compiler.py
@@ -12,7 +12,8 @@
from codepy.jit import compile_from_string
from codepy.toolchain import GCCToolchain

from devito.arch import AMDGPUX, NVIDIAX, M1, SKX, POWER8, POWER9, get_nvidia_cc
from devito.arch import (AMDGPUX, NVIDIAX, M1, SKX, POWER8, POWER9, get_nvidia_cc,
check_cuda_runtime)
from devito.exceptions import CompilationError
from devito.logger import debug, warning, error
from devito.parameters import configuration
@@ -495,10 +496,16 @@ def __init__(self, *args, **kwargs):
        self.cflags.remove('-std=c99')
        self.cflags.remove('-Wall')
        self.cflags.remove('-fPIC')
-       self.cflags += ['-std=c++11', '-Xcompiler', '-fPIC']
+       self.cflags += ['-std=c++14', '-Xcompiler', '-fPIC']

        self.src_ext = 'cu'

        # NOTE: not sure where we should place this. It definitely needs
        # to be executed once to warn the user in case there's a CUDA/driver
        # mismatch that would cause the program to run, but likely producing
        # garbage, since the CUDA kernel behaviour would be undefined
        check_cuda_runtime()

    def __lookup_cmds__(self):
        self.CC = 'nvcc'
        self.CXX = 'nvcc'
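Since `check_cuda_runtime` is decorated with `@memoized_func`, the probe (and any warning it emits) runs at most once per process even if `CudaCompiler.__init__` executes repeatedly. A rough illustration of that property, using the stdlib's `lru_cache` as a stand-in for devito's memoizer:

```python
from functools import lru_cache

calls = []

@lru_cache(maxsize=None)  # stand-in for devito's @memoized_func
def check_runtime_once():
    calls.append(1)  # pretend this probes libcudart and maybe warns
    return "checked"

# Instantiating many compilers triggers the probe only once
for _ in range(5):
    check_runtime_once()
print(len(calls))  # 1
```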
62 changes: 25 additions & 37 deletions devito/core/cpu.py
@@ -1,6 +1,6 @@
from functools import partial

from devito.core.operator import CoreOperator, CustomOperator
from devito.core.operator import CoreOperator, CustomOperator, ParTile
from devito.exceptions import InvalidOperator
from devito.passes.equations import collect_derivatives
from devito.passes.clusters import (Lift, blocking, buffering, cire, cse,
@@ -23,6 +23,17 @@ class Cpu64OperatorMixin(object):
3 => "blocks", "sub-blocks", and "sub-sub-blocks", ...
"""

BLOCK_EAGER = True
"""
Apply loop blocking as early as possible, and in particular prior to CIRE.
"""

BLOCK_RELAX = False
"""
If set to True, bypass the compiler heuristics that prevent loop blocking in
situations where the performance impact might be detrimental.
"""

CIRE_MINGAIN = 10
"""
Minimum operation count reduction for a redundant expression to be optimized
@@ -84,7 +95,11 @@ def _normalize_kwargs(cls, **kwargs):
# Blocking
o['blockinner'] = oo.pop('blockinner', False)
o['blocklevels'] = oo.pop('blocklevels', cls.BLOCK_LEVELS)
o['blockeager'] = oo.pop('blockeager', cls.BLOCK_EAGER)
o['blocklazy'] = oo.pop('blocklazy', not o['blockeager'])
Reviewer (Contributor): Is it really necessary to have both an option and its negation?

Author (Contributor): Probably not. I made it this way because, in theory, you could have both modes: with some extra machinery, eager blocking for some loops and lazy blocking for others. But I admit we don't really have use cases at the moment, so if you prefer I can drop one.

o['blockrelax'] = oo.pop('blockrelax', cls.BLOCK_RELAX)
o['skewing'] = oo.pop('skewing', False)
o['par-tile'] = ParTile(oo.pop('par-tile', False), default=16)

# CIRE
o['min-storage'] = oo.pop('min-storage', False)
@@ -172,7 +187,8 @@ def _specialize_clusters(cls, clusters, **kwargs):
clusters = Lift().process(clusters)

# Blocking to improve data locality
-        clusters = blocking(clusters, options)
+        if options['blockeager']:
+            clusters = blocking(clusters, sregistry, options)

# Reduce flops
clusters = extract_increments(clusters, sregistry)
@@ -186,6 +202,10 @@ def _specialize_clusters(cls, clusters, **kwargs):
# Reduce flops
clusters = cse(clusters, sregistry)

# Blocking to improve data locality
Reviewer (Contributor): Nitpick: in the docstring, I would rename "blocking" to "loop blocking".

if options['blocklazy']:
clusters = blocking(clusters, sregistry, options)

return clusters

@classmethod
@@ -228,6 +248,8 @@ class Cpu64FsgOperator(Cpu64AdvOperator):
Operator with performance optimizations tailored "For small grids" ("Fsg").
"""

BLOCK_EAGER = False

@classmethod
def _normalize_kwargs(cls, **kwargs):
kwargs = super()._normalize_kwargs(**kwargs)
@@ -238,40 +260,6 @@ def _normalize_kwargs(cls, **kwargs):

return kwargs

@classmethod
@timed_pass(name='specializing.Clusters')
def _specialize_clusters(cls, clusters, **kwargs):
options = kwargs['options']
platform = kwargs['platform']
sregistry = kwargs['sregistry']

# Optimize MultiSubDomains
clusters = optimize_msds(clusters)

# Toposort+Fusion (the former to expose more fusion opportunities)
clusters = fuse(clusters, toposort=True)

# Hoist and optimize Dimension-invariant sub-expressions
clusters = cire(clusters, 'invariants', sregistry, options, platform)
clusters = Lift().process(clusters)

# Reduce flops (potential arithmetic alterations)
clusters = extract_increments(clusters, sregistry)
clusters = cire(clusters, 'sops', sregistry, options, platform)
clusters = factorize(clusters)
clusters = optimize_pows(clusters)

# The previous passes may have created fusion opportunities
clusters = fuse(clusters)

# Reduce flops (no arithmetic alterations)
clusters = cse(clusters, sregistry)

# Blocking to improve data locality
clusters = blocking(clusters, options)

return clusters


class Cpu64CustomOperator(Cpu64OperatorMixin, CustomOperator):

@@ -299,7 +287,7 @@ def callback(f):

return {
'buffering': lambda i: buffering(i, callback, sregistry, options),
'blocking': lambda i: blocking(i, options),
'blocking': lambda i: blocking(i, sregistry, options),
'factorize': factorize,
'fission': fission,
'fuse': lambda i: fuse(i, options=options),
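The `blockeager`/`blocklazy` pair simply toggles whether loop blocking runs before or after the flop-reducing passes (CIRE, factorization, CSE), which is what distinguishes `Cpu64AdvOperator` from `Cpu64FsgOperator` above. A minimal sketch of that control flow, with pass names as stand-ins for the real cluster passes:

```python
def run_pipeline(options):
    """Trace the cluster-pass order implied by `blockeager`/`blocklazy`."""
    trace = []
    if options['blockeager']:
        trace.append('blocking')           # block early, prior to CIRE
    trace += ['cire', 'factorize', 'cse']  # flop-reducing passes
    if options['blocklazy']:
        trace.append('blocking')           # block late, as in Cpu64FsgOperator
    return trace

eager = {'blockeager': True, 'blocklazy': False}  # Cpu64AdvOperator default
lazy = {'blockeager': False, 'blocklazy': True}   # Cpu64FsgOperator (BLOCK_EAGER = False)
print(run_pipeline(eager))  # ['blocking', 'cire', 'factorize', 'cse']
print(run_pipeline(lazy))   # ['cire', 'factorize', 'cse', 'blocking']
```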
29 changes: 24 additions & 5 deletions devito/core/gpu.py
@@ -2,7 +2,7 @@

import numpy as np

from devito.core.operator import CoreOperator, CustomOperator
from devito.core.operator import CoreOperator, CustomOperator, ParTile
from devito.exceptions import InvalidOperator
from devito.passes.equations import collect_derivatives
from devito.passes.clusters import (Lift, Streaming, Tasker, blocking, buffering,
@@ -26,6 +26,17 @@ class DeviceOperatorMixin(object):
3 => "blocks", "sub-blocks", and "sub-sub-blocks", ...
"""

BLOCK_EAGER = True
"""
Apply loop blocking as early as possible, and in particular prior to CIRE.
"""

BLOCK_RELAX = False
"""
If set to True, bypass the compiler heuristics that prevent loop blocking in
situations where the performance impact might be detrimental.
"""

CIRE_MINGAIN = 10
"""
Minimum operation count reduction for a redundant expression to be optimized
@@ -67,6 +78,9 @@ def _normalize_kwargs(cls, **kwargs):
# Blocking
o['blockinner'] = oo.pop('blockinner', True)
o['blocklevels'] = oo.pop('blocklevels', cls.BLOCK_LEVELS)
o['blockeager'] = oo.pop('blockeager', cls.BLOCK_EAGER)
o['blocklazy'] = oo.pop('blocklazy', not o['blockeager'])
o['blockrelax'] = oo.pop('blockrelax', cls.BLOCK_RELAX)
o['skewing'] = oo.pop('skewing', False)

# CIRE
@@ -78,7 +92,7 @@
o['cire-schedule'] = oo.pop('cire-schedule', cls.CIRE_SCHEDULE)

# GPU parallelism
o['par-tile'] = oo.pop('par-tile', False) # Control tile parallelism
o['par-tile'] = ParTile(oo.pop('par-tile', False), default=(32, 4))
o['par-collapse-ncores'] = 1 # Always collapse (meaningful if `par-tile=False`)
o['par-collapse-work'] = 1 # Always collapse (meaningful if `par-tile=False`)
o['par-chunk-nonaffine'] = oo.pop('par-chunk-nonaffine', cls.PAR_CHUNK_NONAFFINE)
@@ -161,8 +175,9 @@ def _specialize_clusters(cls, clusters, **kwargs):
clusters = cire(clusters, 'invariants', sregistry, options, platform)
clusters = Lift().process(clusters)

-        # Loop tiling
-        clusters = blocking(clusters, options)
+        # Blocking to define thread blocks
+        if options['blockeager']:
+            clusters = blocking(clusters, sregistry, options)

# Reduce flops
clusters = extract_increments(clusters, sregistry)
@@ -176,6 +191,10 @@ def _specialize_clusters(cls, clusters, **kwargs):
# Reduce flops
clusters = cse(clusters, sregistry)

# Blocking to define thread blocks
if options['blocklazy']:
clusters = blocking(clusters, sregistry, options)

return clusters

@classmethod
Expand Down Expand Up @@ -245,7 +264,7 @@ def callback(f):

return {
'buffering': lambda i: buffering(i, callback, sregistry, options),
'blocking': lambda i: blocking(i, options),
'blocking': lambda i: blocking(i, sregistry, options),
'tasking': Tasker(runs_on_host).process,
'streaming': Streaming(reads_if_on_host).process,
'factorize': factorize,
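On GPUs the blocked loops map onto thread blocks, which is why this backend defaults `par-tile` to `(32, 4)` rather than the CPU's `16`. As a hypothetical illustration (not devito code), the number of thread blocks per dimension follows from a ceiling division of the iteration-space shape by the tile shape:

```python
import math

def launch_grid(shape, tile):
    """Thread blocks per dimension for a given iteration space and par-tile."""
    return tuple(math.ceil(n / t) for n, t in zip(shape, tile))

print(launch_grid((512, 512), (32, 4)))  # (16, 128)
print(launch_grid((500, 500), (32, 4)))  # (16, 125) -- partial tiles round up
```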
73 changes: 71 additions & 2 deletions devito/core/operator.py
@@ -1,12 +1,16 @@
from collections.abc import Iterable

from devito.core.autotuning import autotune
from devito.exceptions import InvalidOperator
from devito.logger import warning
from devito.parameters import configuration
from devito.operator import Operator
from devito.tools import as_tuple, timed_pass
from devito.tools import as_tuple, is_integer, timed_pass
from devito.types import NThreads

__all__ = ['CoreOperator', 'CustomOperator']
__all__ = ['CoreOperator', 'CustomOperator',
# Optimization options
'ParTile']


class BasicOperator(Operator):
@@ -208,3 +212,68 @@ def _specialize_iet(cls, graph, **kwargs):
passes_mapper['linearize'](graph)

return graph


# Wrappers for optimization options


class OptOption(object):
    pass


class ParTileArg(tuple):

    def __new__(cls, items, shm=0, tag=None):
        obj = super().__new__(cls, items)
        obj.shm = shm
        obj.tag = tag
        return obj


class ParTile(tuple, OptOption):

    def __new__(cls, items, default=None):
        if not items:
            return None
        elif isinstance(items, bool):
            if not default:
                raise ValueError("Expected `default` value, got None")
            items = (ParTileArg(as_tuple(default)),)
        elif isinstance(items, tuple):
            if not items:
                raise ValueError("Expected at least one value")

            # Normalize to tuple of ParTileArgs

            x = items[0]
            if is_integer(x):
                # E.g., (32, 4, 8)
                items = (ParTileArg(items),)

            elif isinstance(x, Iterable):
                if not x:
                    raise ValueError("Expected at least one value")

                try:
Reviewer (Contributor): That's a lot of nested try/except. Isn't there a simpler way, e.g. a recursion or something like `make_tile(i) for i in x`?

Author (Contributor): I'm sure it's possible, but here the maximum depth is fixed (3), so I think explicit is OK.

                    y = items[1]
                    if is_integer(y):
                        # E.g., ((32, 4, 8), 1)
                        # E.g., ((32, 4, 8), 1, 'tag')
                        items = (ParTileArg(*items),)
                    else:
                        try:
                            # E.g., (((32, 4, 8), 1), ((32, 4, 4), 2))
                            # E.g., (((32, 4, 8), 1, 'tag0'), ((32, 4, 4), 2, 'tag1'))
                            items = tuple(ParTileArg(*i) for i in items)
                        except TypeError:
                            # E.g., ((32, 4, 8), (32, 4, 4))
                            items = tuple(ParTileArg(i) for i in items)
                except IndexError:
                    # E.g., ((32, 4, 8),)
                    items = (ParTileArg(x),)
            else:
                raise ValueError("Expected int or tuple, got %s instead" % type(x))
        else:
            raise ValueError("Expected bool or tuple, got %s instead" % type(items))

        return super().__new__(cls, items)
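The accepted `par-tile` formats can be exercised with a self-contained re-sketch of the normalization above. This is a condensed stand-in, not the devito class: `is_integer` replaces `devito.tools.is_integer`, and the fully nested forms are handled by a single try/except rather than the explicit three-level chain.

```python
from collections.abc import Iterable

def is_integer(v):
    # Stand-in for devito.tools.is_integer
    return isinstance(v, int) and not isinstance(v, bool)

class ParTileArg(tuple):
    """A tile shape plus optional shared-memory hint and tag."""
    def __new__(cls, items, shm=0, tag=None):
        obj = super().__new__(cls, items)
        obj.shm = shm
        obj.tag = tag
        return obj

def normalize_par_tile(items, default=None):
    """Condensed rendition of ParTile.__new__'s normalization rules."""
    if not items:
        return None
    if isinstance(items, bool):
        # `par-tile=True` selects the backend default (e.g. (32, 4) on GPUs)
        return (ParTileArg(tuple(default)),)
    x = items[0]
    if is_integer(x):
        return (ParTileArg(items),)      # e.g. (32, 4, 8)
    if len(items) > 1 and is_integer(items[1]):
        return (ParTileArg(*items),)     # e.g. ((32, 4, 8), 1, 'tag')
    out = []
    for i in items:
        try:
            out.append(ParTileArg(*i))   # e.g. ((32, 4, 8), 1)
        except TypeError:
            out.append(ParTileArg(i))    # plain shape, e.g. (32, 4, 8)
    return tuple(out)

print(normalize_par_tile(True, default=(32, 4)))  # ((32, 4),)
print(normalize_par_tile((32, 4, 8)))             # ((32, 4, 8),)
t, = normalize_par_tile(((32, 4, 8), 1, 'tag'))
print(tuple(t), t.shm, t.tag)                     # (32, 4, 8) 1 tag
```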