compiler: Augment code generation capabilities for CUDA/HIP/SYCL support #1828

FabioLuporini · 2022-01-28T15:36:38Z

This PR:

Extends the par-tile opt-option and enables its use for integer-sized loop blocking
Significantly rewrites and simplifies place_definitions and place_casts, while making them much more general
Refactors and simplifies the loop blocking pass
Refactors, enhances, and fixes estimate_cost (the fix is about the counting of integer arithmetic)
Misc minor improvements nearly everywhere
Fixes the FindSymbols visitor (see test test_unsubstituted_indexeds)

codecov · 2022-01-28T16:00:00Z

Codecov Report

Merging #1828 (220fe2a) into master (e2321f4) will decrease coverage by 0.00%.
The diff coverage is 92.77%.

@@            Coverage Diff             @@
##           master    #1828      +/-   ##
==========================================
- Coverage   89.54%   89.54%   -0.01%     
==========================================
  Files         209      209              
  Lines       34517    34841     +324     
  Branches     5212     5258      +46     
==========================================
+ Hits        30908    31197     +289     
- Misses       3117     3152      +35     
  Partials      492      492

Impacted Files	Coverage Δ
devito/ir/clusters/cluster.py	`96.63% <ø> (ø)`
devito/passes/iet/engine.py	`93.10% <ø> (-0.12%)`	⬇️
devito/types/dimension.py	`93.01% <ø> (+0.34%)`	⬆️
tests/conftest.py	`91.21% <ø> (ø)`
tests/test_docstrings.py	`100.00% <ø> (ø)`
tests/test_gpu_openmp.py	`98.43% <ø> (ø)`
devito/arch/archinfo.py	`46.31% <11.11%> (-1.58%)`	⬇️
devito/arch/compiler.py	`55.05% <33.33%> (-0.17%)`	⬇️
devito/ir/support/space.py	`87.21% <46.66%> (-1.83%)`	⬇️
devito/core/gpu.py	`95.56% <76.92%> (-1.33%)`	⬇️
... and 53 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

devito/passes/clusters/blocking.py

review-notebook-app · 2022-01-31T15:29:32Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

FabioLuporini · 2022-02-01T14:18:14Z

devito/ir/clusters/analysis.py

@@ -157,49 +155,3 @@ def _callback(self, clusters, d, prefix):
        accesses = [a for a in scope.accesses if not a.is_scalar]
        if all(a.is_regular and a.affine_if_present(d._defines) for a in accesses):
            return AFFINE
-
-
-class Tiling(Detector):


Note for reviewers: this was moved inside the blocking pass.

FabioLuporini · 2022-02-01T14:22:04Z

devito/passes/iet/definitions.py

-        size = "sizeof(%s%s)" % (obj._C_typedata, shape)
-        alloc = c.Statement(self.lang['alloc-host'](obj._C_name,
-                                                    obj._data_alignment, size))
+        memptr = VOID(Byref(obj._C_symbol), '**')


Note for reviewers: at last (!) these aren't cgen objects anymore, but rather first-class IET nodes. So now the visitors pick them up...

devito/passes/iet/definitions.py

georgebisbas

I've done a first pass and would like to look again. I like this tiling restructuring.

devito/ir/iet/nodes.py

devito/passes/clusters/blocking.py

devito/passes/iet/linearization.py

examples/compiler/03_iet-A.ipynb

examples/performance/00_overview.ipynb

mloubout

Some comments. Looks like a nice cleanup

mloubout · 2022-02-02T13:08:35Z

devito/core/cpu.py

@@ -84,6 +95,9 @@ def _normalize_kwargs(cls, **kwargs):
        # Blocking
        o['blockinner'] = oo.pop('blockinner', False)
        o['blocklevels'] = oo.pop('blocklevels', cls.BLOCK_LEVELS)
+        o['blockeager'] = oo.pop('blockeager', cls.BLOCK_EAGER)
+        o['blocklazy'] = oo.pop('blocklazy', not o['blockeager'])


Is it really necessary to have both an option and its negation?

Probably not. I made it this way because, in theory, you could then have both modes. We would need some extra machinery, but one could have eager blocking for some loops and lazy blocking for others. I admit we don't really have use cases ATM, so if you prefer I can drop one.
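
For readers following the thread, here is a minimal sketch of how the two options in the hunk above interact. It only mirrors the two `pop` lines from the diff; the function name `normalize_blocking` and the default value of `BLOCK_EAGER` are assumptions made for illustration, not Devito's actual code.

```python
# Hypothetical default; the real class-level value is not shown in this PR page
BLOCK_EAGER = True


def normalize_blocking(**kwargs):
    """Mirror the two `pop` lines from the diff hunk above."""
    oo = dict(kwargs)
    o = {}
    o['blockeager'] = oo.pop('blockeager', BLOCK_EAGER)
    # By default `blocklazy` is just the negation of `blockeager` ...
    o['blocklazy'] = oo.pop('blocklazy', not o['blockeager'])
    # ... but passing both explicitly allows the "both modes" case
    return o


print(normalize_blocking())                  # lazy derived from eager
print(normalize_blocking(blockeager=False))  # flips both
print(normalize_blocking(blockeager=True, blocklazy=True))
```

With a single option, the third call could not be expressed, which is the trade-off discussed in the reply.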

mloubout · 2022-02-02T13:09:31Z

devito/core/gpu.py

@@ -176,6 +191,10 @@ def _specialize_clusters(cls, clusters, **kwargs):
        # Reduce flops
        clusters = cse(clusters, sregistry)

+        # Blocking to define thread blocks
+        if options['blocklazy']:
+            clusters = blocking(clusters, options)


I still think it would be nice to have this cpu/gpu core part merged, as it is very similar, but I understand it may be a pain.

Agreed. I think we're getting there, slowly, PR after PR. With this PR for example we're dropping the explicit advanced-fsg pipeline. In the past we've dropped others.

Right... theoretically, as far as we move to exploring the impact of the optimization order, we are dropping pipelines... theoretically, always...

mloubout · 2022-02-02T13:10:04Z

devito/ir/clusters/algorithms.py

@@ -17,10 +17,12 @@
 __all__ = ['clusterize']


-def clusterize(exprs):
+def clusterize(exprs, **kwargs):


options=None? Or do you plan to add more kwargs?

the caller passes in all of the kwargs directly, so this way makes it work seamlessly

clusterize(exprs, options=None, **kwargs): maybe then?

yes, changing this now
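
A small sketch of the signature agreed on above, assuming a caller that forwards all of its keyword arguments. The bodies are toy stand-ins (the names `compile_pipeline`, `sregistry` and `platform` are used only as example arguments); they show the calling convention, not what Devito's `clusterize` does.

```python
def clusterize(exprs, options=None, **kwargs):
    # Toy stand-in: `options` is named explicitly, everything else the
    # caller forwards is accepted and ignored
    options = options or {}
    return {'exprs': list(exprs),
            'options': dict(options),
            'ignored': sorted(kwargs)}


def compile_pipeline(exprs, **kwargs):
    # The caller passes all of its kwargs straight through
    return clusterize(exprs, **kwargs)


out = compile_pipeline(['eq0', 'eq1'], options={'blocklazy': True},
                       sregistry=None, platform='cpu')
print(out['options'], out['ignored'])
```

Naming `options` in the signature documents the one argument the function relies on, while `**kwargs` keeps the pass-through call working.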

mloubout · 2022-02-02T13:10:53Z

devito/ir/iet/nodes.py

    is_Section = False
    is_HaloSpot = False
    is_ExpressionBundle = False
-    is_ParallelIteration = False
-    is_ParallelBlock = False


Always nice to see stuff disappear.

devito/ir/iet/nodes.py

mloubout · 2022-02-02T13:15:08Z

devito/ir/iet/visitors.py

@@ -455,7 +485,7 @@ def visit_Operator(self, o):
                prefix = ' '.join(i.root.prefix + (i.root.retval,))
                esigns.append(c.FunctionDeclaration(c.Value(prefix, i.root.name),
                                                    self._args_decl(i.root.parameters)))
-                efuncs.extend([i.root.ccode, blankline])
+                efuncs.extend([self.visit(i.root), blankline])


visit or _visit?

I'll make it _visit for homogeneity

mloubout · 2022-02-02T13:15:24Z

devito/ir/iet/visitors.py

-            else:
-                v = i
+        for i in o.children:
+            v = self.visit(i)


visit or _visit?

I'll make it _visit for homogeneity

mloubout · 2022-02-02T13:16:54Z

devito/ir/support/space.py

-        sub_dims = [i.parent for v in self.sub_iterators.values() for i in v]
-        return filter_ordered(self.intervals.dimensions + sub_dims)
+        sub_dims = flatten(i._defines for v in self.sub_iterators.values() for i in v)
+        return filter_ordered(self.itdimensions + list(sub_dims))


isn't flatten already a list?

right, dropping the list()

mloubout · 2022-02-02T13:22:02Z

tests/test_operator.py

+        s = Symbol(name='s', dtype=grid.dtype)
+
+        eqns = [Eq(s, 0),
+                Eq(s, s + f + 1)]


So does that mean norm can be simplified in our builtins, or do we still need that size-1 Function?

Not really, what makes you think that?

georgebisbas · 2022-02-04T13:57:26Z

devito/core/cpu.py

+    situations where the performance impact might be detrimental.
+    """
+
+    BLOCK_STEP = None


Should this skip autotuning?

devito/core/cpu.py

georgebisbas · 2022-02-04T14:00:11Z

devito/core/gpu.py

@@ -176,6 +191,10 @@ def _specialize_clusters(cls, clusters, **kwargs):
        # Reduce flops
        clusters = cse(clusters, sregistry)

+        # Blocking to define thread blocks
+        if options['blocklazy']:
+            clusters = blocking(clusters, options)


Right... theoretically, as far as we move to exploring the impact of the optimization order, we are dropping pipelines... theoretically, always...

devito/passes/clusters/blocking.py

devito/ir/support/space.py

georgebisbas · 2022-02-10T15:26:10Z

devito/passes/clusters/blocking.py

+        d = prefix[-1].dim
+
+        for c in clusters:
+            # PARALLEL* and AFFINE are necessaary conditions


typo in necessary

devito/passes/clusters/blocking.py

georgebisbas · 2022-02-10T15:27:34Z

devito/passes/clusters/blocking.py

+            if is_inner and not self.inner:
+                return clusters
+
+            # Heuristic: TILABLE not worth it if not within SEQUENTIAL Dimension


Heuristic: TILABLE is not worth it if not within a SEQUENTIAL Dimension

georgebisbas · 2022-02-10T15:35:39Z

devito/passes/clusters/blocking.py

+        base = self.sregistry.make_name(prefix=d.name)
+
+        if self.generator:
+            # An explicit integer step has been supplied


Something is missing here? Sounds weird?

devito/core/operator.py

devito/passes/clusters/blocking.py

georgebisbas

Some nitpicking-level comments and questions.

devito/arch/compiler.py

georgebisbas · 2022-02-11T11:47:26Z

devito/core/cpu.py

@@ -186,6 +202,10 @@ def _specialize_clusters(cls, clusters, **kwargs):
        # Reduce flops
        clusters = cse(clusters, sregistry)

+        # Blocking to improve data locality


Nitpicking: In the docstring, I would rename blocking as loop blocking

devito/core/gpu.py

devito/core/operator.py

devito/ir/clusters/algorithms.py

georgebisbas · 2022-02-11T12:13:53Z

devito/passes/iet/linearization.py

    functions = sorted(functions, key=lambda f: len(f.dimensions), reverse=True)

+    # `functions_unseen` are all Functions that `iet` may need to linearize
+    # that have not been seen while processing other IETs


that -> and?

yes, changing

georgebisbas · 2022-02-11T12:14:16Z

devito/passes/iet/linearization.py

+    # `functions_unseen` are all Functions that `iet` may need to linearize
+    # that have not been seen while processing other IETs
+    functions_unseen = [f for f in functions if f not in cache]
+
    # Find unique sizes (unique -> minimize necessary registers)


necessary -> required/needed?

what changes?

np, nit picking

georgebisbas · 2022-02-11T12:25:31Z

tests/test_linearize.py

-def test_strides_forwarding():
+def test_unsubstituted_indexeds():
+    """
+    This issue emerged in the context of PR #1828, after the introduction


So we are all good with this, right? Initially I thought it was more severe.

georgebisbas · 2022-02-11T12:26:14Z

tests/test_linearize.py

+    op0 = Operator(eq)
+    op1 = Operator(eq, opt=('advanced', {'linearize': True}))
+
+    # NOTE: we compare the numerical output eventually, but truly the most


NOTE: Eventually we compare....

georgebisbas · 2022-02-11T12:26:42Z

tests/test_linearize.py

+    linearize(graph, mode=True, sregistry=SymbolRegistry())
+
+    # Despite `a` is passed via `a.indexed`, and since it's an Array (which
+    # have symbolic shape), we expect the stride exprs to be placed in `bar`,


which has ?

mloubout

Some minor comments but looks good to me

mloubout · 2022-02-11T12:44:04Z

devito/core/operator.py

+                if not x:
+                    raise ValueError("Expected at least one value")
+
+                try:


That's a lot of nested try/catch, isn't there a simpler way? With a recursion, or something like make_tile(I) for I in x?

I'm sure it's possible, but here the MAX depth is fixed (3), so I think explicit is OK
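
To make the alternative concrete, here is a rough sketch of what a recursive version could look like. This is not the code in `devito/core/operator.py`; the helper name `as_int_tree`, the accepted input shapes, and the reading of "MAX depth 3" as three levels of nesting are all assumptions.

```python
MAX_DEPTH = 3  # Assumption: taken from the "MAX depth is fixed (3)" reply


def as_int_tree(x, depth=0):
    """Recursively turn `x` into an int or into nested tuples of ints."""
    try:
        items = tuple(x)
    except TypeError:
        # Not iterable, so it must be a single integer-like value
        return int(x)
    if depth >= MAX_DEPTH:
        raise ValueError("Nesting deeper than %d levels" % MAX_DEPTH)
    if not items:
        raise ValueError("Expected at least one value")
    return tuple(as_int_tree(i, depth + 1) for i in items)


print(as_int_tree(4))
print(as_int_tree((4, 8)))
print(as_int_tree([[4, 8], [16]]))
```

The recursion replaces one `try` per nesting level with a single `try`, at the price of making the depth limit an explicit check rather than something visible in the code structure.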

mloubout · 2022-02-11T12:45:46Z

devito/ir/clusters/algorithms.py

@@ -17,10 +17,12 @@
 __all__ = ['clusterize']


-def clusterize(exprs):
+def clusterize(exprs, **kwargs):


clusterize(exprs, options=None, **kwargs): maybe then?

mloubout · 2022-02-11T12:46:26Z

devito/ir/iet/nodes.py

@@ -254,6 +263,8 @@ def __init__(self, name, arguments=None, retobj=None, is_indirect=False):
        self.arguments = as_tuple(arguments)
        self.retobj = retobj
        self.is_indirect = is_indirect
+        self.cast = cast


not private?

we don't do private in IET node classes. They're buried deep inside the compiler, and since everything is immutable, we're loosely avoiding _private + property

(not saying it's better this way -- just pointing out why it evolved into this, naturally, over time)

mloubout · 2022-02-11T12:57:43Z

devito/symbolics/inspection.py

+        if expr.exp.is_Number:
+            if expr.exp < 0:
+                flops += estimate_values['div']
+            elif expr.exp == 0 or expr.exp == 1:


next one correct for 1 as well.

? this is an or right? wdym by "next one"?

mloubout · 2022-02-11T12:58:59Z

devito/symbolics/inspection.py

+@_estimate_cost.register(Function)
+def _(expr, estimate):
+    if q_routine(expr):
+        flops, _ = zip(*[_estimate_cost(a, estimate) for a in expr.args])


I like it, just wondering if people usually consider these for flops/OI/... measures. For example, does VTune count index arithmetic as flops?

integer arithmetic, including array indexing, is never counted as flops

the trigonometric (and sibling) functions... no, here we return pure estimates, but such estimates are only used by CIRE. What devito tells the user is the "flatten" operation count, i.e. one flop per operation, irrespective of whether it's a div, a mul, or a sin...

this is in practice extremely reliable because divs and trigonometric tend to be hoisted out of the inner loops, so...

advisor uses INTOPS, counts them separately, you also have the option to add flops and intops to global ops
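
The distinction discussed in this thread (integer arithmetic, including array indexing, kept apart from floating-point operations, with one flop per operation) can be illustrated with a standalone sketch. It does not use Devito's `estimate_cost`, which works on symbolic expressions; instead it walks a plain Python expression with the standard `ast` module (Python 3.9 or later assumed). The function name `count_ops` and the rule that array loads yield floating-point values are assumptions.

```python
import ast


def count_ops(source, int_names=()):
    """Return (flops, intops) for a Python arithmetic expression."""
    int_names = set(int_names)
    counts = {'flops': 0, 'intops': 0}

    def is_int(node):
        # Visit `node`, update the counters, report if it is integer-valued
        if isinstance(node, ast.Constant):
            return isinstance(node.value, int)
        if isinstance(node, ast.Name):
            return node.id in int_names
        if isinstance(node, ast.UnaryOp):
            return is_int(node.operand)
        if isinstance(node, ast.Tuple):
            for e in node.elts:
                is_int(e)
            return False
        if isinstance(node, ast.Subscript):
            # Index arithmetic is counted as intops; the loaded value is
            # assumed to be floating point
            is_int(node.slice)
            return False
        if isinstance(node, ast.BinOp):
            left, right = is_int(node.left), is_int(node.right)
            integer = left and right and not isinstance(node.op, ast.Div)
            # Flat count: one operation each, whatever the operator is
            counts['intops' if integer else 'flops'] += 1
            return integer
        raise ValueError("Unsupported syntax: %s" % ast.dump(node))

    is_int(ast.parse(source, mode='eval').body)
    return counts['flops'], counts['intops']


print(count_ops("u[i + 1] * c + u[i - 1]", int_names=['i']))  # (2, 2)
print(count_ops("2*i + j", int_names=['i', 'j']))             # (0, 2)
```

In the first call the two index computations are reported as integer operations only, so the flop count reflects just the multiplication and the addition on floating-point data.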

mloubout · 2022-02-11T13:02:04Z

devito/tools/data_structures.py

+    UnboundedMultiTuple((1, 2), (3, 4))
+    >>> ub.iter()
+    >>> ub
+    UnboundedMultiTuple((1, 2)*, (3, 4))


nitpicking: shouldn't the * be before the tuple? This looks like the tip has passed the first tuple

yes, fixing

mloubout · 2022-02-11T13:04:03Z

devito/tools/data_structures.py

+        self.curiter = None
+
+    def __repr__(self):
+        items = []


items = list(self.nitems) insert(items, "*", self.tip-1)

you're right, improved as per your suggestion
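
As an illustration of the `__repr__` convention settled on here (the `*` placed before the tuple the tip points at), below is a toy container. It is not Devito's `UnboundedMultiTuple`; the class name `TippedTuples`, the `tip` attribute starting at -1, and the wrap-around in `iter` are assumptions made to keep the sketch self-contained.

```python
class TippedTuples:
    """Hypothetical container of tuples with a movable "tip"."""

    def __init__(self, *items):
        self.items = tuple(tuple(i) for i in items)
        self.tip = -1  # No tuple selected yet

    def iter(self):
        # Move the tip to the next tuple, wrapping around at the end
        self.tip = (self.tip + 1) % len(self.items)

    def __repr__(self):
        items = [str(i) for i in self.items]
        if self.tip >= 0:
            # Mark the current tuple by prefixing it with `*`
            items[self.tip] = '*' + items[self.tip]
        return '%s(%s)' % (type(self).__name__, ', '.join(items))


ub = TippedTuples((1, 2), (3, 4))
print(ub)   # TippedTuples((1, 2), (3, 4))
ub.iter()
print(ub)   # TippedTuples(*(1, 2), (3, 4))
```

Prefixing the marker reads as "the tip is at this tuple", whereas a trailing `*` can be misread as "the tip has moved past it", which was the concern raised in the comment.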

georgebisbas

Not much more to add

devito/arch/compiler.py

devito/passes/clusters/blocking.py

georgebisbas · 2022-02-11T15:34:05Z

devito/passes/iet/linearization.py

+    # `functions_unseen` are all Functions that `iet` may need to linearize
+    # that have not been seen while processing other IETs
+    functions_unseen = [f for f in functions if f not in cache]
+
    # Find unique sizes (unique -> minimize necessary registers)


np, nit picking

georgebisbas · 2022-02-11T15:37:03Z

devito/symbolics/inspection.py

+@_estimate_cost.register(Function)
+def _(expr, estimate):
+    if q_routine(expr):
+        flops, _ = zip(*[_estimate_cost(a, estimate) for a in expr.args])


advisor uses INTOPS, counts them separately, you also have the option to add flops and intops to global ops

georgebisbas · 2022-02-11T15:40:23Z

examples/performance/00_overview.ipynb

@@ -713,7 +713,7 @@
      "  START_TIMER(section0)\n",
      "  #pragma omp parallel num_threads(nthreads)\n",
      "  {\n",
-      "    const int tid = omp_get_thread_num();\n",
+      "    const int tid = omp_get_thread_num();;\n",


This? reminder ;;

FabioLuporini added the compiler label Jan 28, 2022

FabioLuporini requested review from mloubout and georgebisbas January 28, 2022 15:36

FabioLuporini force-pushed the admit-cuda-2 branch from 10d7c13 to c75152e Compare January 31, 2022 15:29

FabioLuporini force-pushed the admit-cuda-2 branch 2 times, most recently from aca3717 to 5d9b0e5 Compare January 31, 2022 18:18

FabioLuporini force-pushed the admit-cuda-2 branch from 6636d53 to d991678 Compare February 2, 2022 10:44

georgebisbas requested changes Feb 2, 2022

mloubout reviewed Feb 2, 2022

FabioLuporini force-pushed the admit-cuda-2 branch 2 times, most recently from 3e89c1a to aa391ae Compare February 3, 2022 14:25

georgebisbas requested changes Feb 4, 2022

FabioLuporini force-pushed the admit-cuda-2 branch 3 times, most recently from 17e6b05 to 7bce275 Compare February 7, 2022 14:51

FabioLuporini and others added 9 commits February 10, 2022 12:40

compiler: Patch place_definitions

88ba836

compiler: Refactor linearize

12f1b97

compiler: Enable allocs/frees with IET nodes too

b11e06d

compiler: Simplify place_casts

72c2977

compiler: Patch CGen visitor

f558777

compiler: Add IMPLICIT Property for implicit iteration spaces

7480305

compiler: Add Global symbols

1310826

compiler: Refactor FSG Operators

ca7207c

compiler: Add forgotten is_Conditional flag

dcfb96a

FabioLuporini and others added 12 commits February 10, 2022 12:40

compiler: Add IterationSpace.promote

954d982

compiler: Patch ScheduleTree construction

f7b1f0b

compiler: Add forgotten Uxreplace handlers

860f135

compiler: Drop unused IMPLICIT property

2e9d8f5

misc: Fix comments and docstrings

1c5cecd

compiler: Use _visit instead of visit for homogeneity

248d1d8

compiler: Patch FindSymbols

a584ebd

compiler: Simplify FindSymbols

ad8aabd

compiler: Patch estimate_cost (now distinguishes INT ops correctly)

308eb3b

misc: Do not emit itershape of 0-cost sections

c6578e5

compiler: Add IterationSpace.switch

69b146d

compiler: Implement integer block shapes

1f28853

FabioLuporini force-pushed the admit-cuda-2 branch from 14a7b88 to 1f28853 Compare February 10, 2022 11:48

FabioLuporini added 2 commits February 10, 2022 14:31

gpu: Adjust to new par-tile format

853350b

examples: Update expected output

108814f

georgebisbas requested changes Feb 10, 2022

api: Extend par-tile opt-option

979202e

FabioLuporini force-pushed the admit-cuda-2 branch from c1d4f29 to 22341f1 Compare February 11, 2022 10:24

compiler: Hotfix detect_accesses

d2ece83

FabioLuporini force-pushed the admit-cuda-2 branch from 22341f1 to d2ece83 Compare February 11, 2022 10:42

compiler: Polish blocking pass

633c4a5

georgebisbas reviewed Feb 11, 2022

georgebisbas requested changes Feb 11, 2022

mloubout approved these changes Feb 11, 2022

FabioLuporini added 2 commits February 11, 2022 15:29

compiler: Refactor UnboundedMultiTuple

846bff0

misc: Polish leftovers

55c6352

georgebisbas approved these changes Feb 11, 2022

compiler: Refactor VExpanded

220fe2a

FabioLuporini merged commit 41ee245 into master Feb 14, 2022

FabioLuporini deleted the admit-cuda-2 branch February 14, 2022 08:31
