Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compiler: Augment code generation capabilities for CUDA/HIP/SYCL support #1828

Merged
merged 60 commits into from
Feb 14, 2022

Conversation

FabioLuporini
Copy link
Contributor

@FabioLuporini FabioLuporini commented Jan 28, 2022

This PR:

  • Extend the par-tile opt-option and enables its use for integer-sized loop blocking
  • Significantly rewrites and simplifies place_definitions and place_casts, while making them much more general
  • Refactors and simplifies the loop blocking pass
  • Refactors, enhances, and fixes estimate_cost (the fix is about the counting of integer arithmetic)
  • Misc minor improvements nearly everywhere
  • Fixes the FindSymbols visitor (see test test_unsubstituted_indexeds)

@codecov
Copy link

codecov bot commented Jan 28, 2022

Codecov Report

Merging #1828 (220fe2a) into master (e2321f4) will decrease coverage by 0.00%.
The diff coverage is 92.77%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1828      +/-   ##
==========================================
- Coverage   89.54%   89.54%   -0.01%     
==========================================
  Files         209      209              
  Lines       34517    34841     +324     
  Branches     5212     5258      +46     
==========================================
+ Hits        30908    31197     +289     
- Misses       3117     3152      +35     
  Partials      492      492              
Impacted Files Coverage Δ
devito/ir/clusters/cluster.py 96.63% <ø> (ø)
devito/passes/iet/engine.py 93.10% <ø> (-0.12%) ⬇️
devito/types/dimension.py 93.01% <ø> (+0.34%) ⬆️
tests/conftest.py 91.21% <ø> (ø)
tests/test_docstrings.py 100.00% <ø> (ø)
tests/test_gpu_openmp.py 98.43% <ø> (ø)
devito/arch/archinfo.py 46.31% <11.11%> (-1.58%) ⬇️
devito/arch/compiler.py 55.05% <33.33%> (-0.17%) ⬇️
devito/ir/support/space.py 87.21% <46.66%> (-1.83%) ⬇️
devito/core/gpu.py 95.56% <76.92%> (-1.33%) ⬇️
... and 53 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e2321f4...220fe2a. Read the comment docs.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@FabioLuporini FabioLuporini force-pushed the admit-cuda-2 branch 2 times, most recently from aca3717 to 5d9b0e5 Compare January 31, 2022 18:18
@@ -157,49 +155,3 @@ def _callback(self, clusters, d, prefix):
accesses = [a for a in scope.accesses if not a.is_scalar]
if all(a.is_regular and a.affine_if_present(d._defines) for a in accesses):
return AFFINE


class Tiling(Detector):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note for reviewers: this was moved inside the blocking pass.

size = "sizeof(%s%s)" % (obj._C_typedata, shape)
alloc = c.Statement(self.lang['alloc-host'](obj._C_name,
obj._data_alignment, size))
memptr = VOID(Byref(obj._C_symbol), '**')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for reviewers: at last (!) these aren't cgen objects anymore, but rather first-class IET nodes. So now the visitors pick them up...

Copy link
Contributor

@georgebisbas georgebisbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had the first pass, would like to look again, I liked this tiling restructuring.

devito/ir/iet/nodes.py Outdated Show resolved Hide resolved
devito/ir/iet/nodes.py Show resolved Hide resolved
devito/passes/clusters/blocking.py Show resolved Hide resolved
devito/passes/clusters/blocking.py Show resolved Hide resolved
devito/passes/clusters/blocking.py Show resolved Hide resolved
devito/passes/iet/linearization.py Outdated Show resolved Hide resolved
examples/compiler/03_iet-A.ipynb Show resolved Hide resolved
examples/performance/00_overview.ipynb Outdated Show resolved Hide resolved
examples/performance/00_overview.ipynb Outdated Show resolved Hide resolved
examples/performance/00_overview.ipynb Show resolved Hide resolved
Copy link
Contributor

@mloubout mloubout left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments. Looks like a nice cleanup

@@ -84,6 +95,9 @@ def _normalize_kwargs(cls, **kwargs):
# Blocking
o['blockinner'] = oo.pop('blockinner', False)
o['blocklevels'] = oo.pop('blocklevels', cls.BLOCK_LEVELS)
o['blockeager'] = oo.pop('blockeager', cls.BLOCK_EAGER)
o['blocklazy'] = oo.pop('blocklazy', not o['blockeager'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IS this really needed to have an option and its negated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not. So, I made it this way because then in theory you could have both modes... we would need some extra machinery, but one could have eager blocking for some loops and lazy blocking for other, but yeah, I admit we don't really have use cases ATM, so if you prefer I can drop one

@@ -176,6 +191,10 @@ def _specialize_clusters(cls, clusters, **kwargs):
# Reduce flops
clusters = cse(clusters, sregistry)

# Blocking to define thread blocks
if options['blocklazy']:
clusters = blocking(clusters, options)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still think would be nice to have this cpu/gpu core [art merged as it is very similar but I understand may be a pain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I think we're getting there, slowly, PR after PR. With this PR for example we're dropping the explicit advanced-fsg pipeline. In the past we've dropped others.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right...theoretically as far as we move to exploring the optimization order impact we are dropping pipelines...theoretically always,...

@@ -17,10 +17,12 @@
__all__ = ['clusterize']


def clusterize(exprs):
def clusterize(exprs, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

options=None? Or do you plan to add more kwargs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the caller passes in all of the kwargs directly, so this way makes it work seamlessly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clusterize(exprs, options=None, **kwargs): maybe then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, changing this now

is_Section = False
is_HaloSpot = False
is_ExpressionBundle = False
is_ParallelIteration = False
is_ParallelBlock = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always nice to see stuff disappear nice.

devito/ir/iet/nodes.py Show resolved Hide resolved
@@ -455,7 +485,7 @@ def visit_Operator(self, o):
prefix = ' '.join(i.root.prefix + (i.root.retval,))
esigns.append(c.FunctionDeclaration(c.Value(prefix, i.root.name),
self._args_decl(i.root.parameters)))
efuncs.extend([i.root.ccode, blankline])
efuncs.extend([self.visit(i.root), blankline])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

visit or _visit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make it _visit for homogeneity

else:
v = i
for i in o.children:
v = self.visit(i)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

visit or _visit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make it _visit for homogeneity

sub_dims = [i.parent for v in self.sub_iterators.values() for i in v]
return filter_ordered(self.intervals.dimensions + sub_dims)
sub_dims = flatten(i._defines for v in self.sub_iterators.values() for i in v)
return filter_ordered(self.itdimensions + list(sub_dims))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't flatten already a list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, dropping the list()

s = Symbol(name='s', dtype=grid.dtype)

eqns = [Eq(s, 0),
Eq(s, s + f + 1)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So does that mean norm can be simplified in our bultins or do we still need that size 1 Function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really, what makes you think that ?

@FabioLuporini FabioLuporini force-pushed the admit-cuda-2 branch 2 times, most recently from 3e89c1a to aa391ae Compare February 3, 2022 14:25
situations where the performance impact might be detrimental.
"""

BLOCK_STEP = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this skip autotuning?

devito/core/cpu.py Show resolved Hide resolved
@@ -176,6 +191,10 @@ def _specialize_clusters(cls, clusters, **kwargs):
# Reduce flops
clusters = cse(clusters, sregistry)

# Blocking to define thread blocks
if options['blocklazy']:
clusters = blocking(clusters, options)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right...theoretically as far as we move to exploring the optimization order impact we are dropping pipelines...theoretically always,...

devito/passes/clusters/blocking.py Show resolved Hide resolved
@FabioLuporini FabioLuporini force-pushed the admit-cuda-2 branch 3 times, most recently from 17e6b05 to 7bce275 Compare February 7, 2022 14:51
devito/ir/support/space.py Show resolved Hide resolved
d = prefix[-1].dim

for c in clusters:
# PARALLEL* and AFFINE are necessaary conditions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in necessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, fixing

devito/passes/clusters/blocking.py Show resolved Hide resolved
if is_inner and not self.inner:
return clusters

# Heuristic: TILABLE not worth it if not within SEQUENTIAL Dimension
Copy link
Contributor

@georgebisbas georgebisbas Feb 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heuristic: TILABLE is not worth it if not within a SEQUENTIAL Dimension

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixing

base = self.sregistry.make_name(prefix=d.name)

if self.generator:
# An explicit integer step has been supplied
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something is missing here? Sounds weird?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing

Copy link
Contributor

@georgebisbas georgebisbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nitpicking-level comments and questions.

devito/arch/compiler.py Show resolved Hide resolved
@@ -186,6 +202,10 @@ def _specialize_clusters(cls, clusters, **kwargs):
# Reduce flops
clusters = cse(clusters, sregistry)

# Blocking to improve data locality
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicking: In the docstring, I would rename blocking as loop blocking

devito/core/gpu.py Show resolved Hide resolved
devito/core/operator.py Show resolved Hide resolved
devito/ir/clusters/algorithms.py Show resolved Hide resolved
functions = sorted(functions, key=lambda f: len(f.dimensions), reverse=True)

# `functions_unseen` are all Functions that `iet` may need to linearize
# that have not been seen while processing other IETs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that - > and ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, changing

# `functions_unseen` are all Functions that `iet` may need to linearize
# that have not been seen while processing other IETs
functions_unseen = [f for f in functions if f not in cache]

# Find unique sizes (unique -> minimize necessary registers)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

necessary - > required/needed ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what changes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np, nit picking

def test_strides_forwarding():
def test_unsubstituted_indexeds():
"""
This issue emerged in the context of PR #1828, after the introduction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we are all good with this right? Initially I thought it was more severe ?

op0 = Operator(eq)
op1 = Operator(eq, opt=('advanced', {'linearize': True}))

# NOTE: we compare the numerical output eventually, but truly the most
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: Eventually we compare....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

linearize(graph, mode=True, sregistry=SymbolRegistry())

# Despite `a` is passed via `a.indexed`, and since it's an Array (which
# have symbolic shape), we expect the stride exprs to be placed in `bar`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which has ?

Copy link
Contributor

@mloubout mloubout left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments but looks good to me

if not x:
raise ValueError("Expected at least one value")

try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a lot of nested try/catch isn't there an simpler way? WIth like a recursion or something like make_tile(I) for I in x

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure it's possible, but here the MAX depth is fixed (3), so I think explicit is OK

@@ -17,10 +17,12 @@
__all__ = ['clusterize']


def clusterize(exprs):
def clusterize(exprs, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clusterize(exprs, options=None, **kwargs): maybe then?

@@ -254,6 +263,8 @@ def __init__(self, name, arguments=None, retobj=None, is_indirect=False):
self.arguments = as_tuple(arguments)
self.retobj = retobj
self.is_indirect = is_indirect
self.cast = cast
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not private?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't do private in IET node classes. They're buried deep inside the compiler, and since everything is immutable, we're loosely avoiding _private + property

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not saying it's better this way -- just pointing out why it evolved into this, naturally, over time)

if expr.exp.is_Number:
if expr.exp < 0:
flops += estimate_values['div']
elif expr.exp == 0 or expr.exp == 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

next one correct for 1 as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

? this is an or right? wdym by "next one"?

@_estimate_cost.register(Function)
def _(expr, estimate):
if q_routine(expr):
flops, _ = zip(*[_estimate_cost(a, estimate) for a in expr.args])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it, just wondering if people usually consider these for flops/OI/.... measures. Like does vtune consider indices flops?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

integer aritmethic, including array indexing, is never counted as flops

the trigonometric (and siblings) functions... no, here we return pure estimates, but such estimates are only used by CIRE. What devito tells the user is the "flatten" operation count, ie one flop per operation, irrespective of whether it's a div, a mul, or a sin...

this is in practice extremely reliable because divs and trigonometric tend to be hoisted out of the inner loops, so...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

advisor uses INTOPS, counts them separately, you also have the option to add flops and intops to global ops

UnboundedMultiTuple((1, 2), (3, 4))
>>> ub.iter()
>>> ub
UnboundedMultiTuple((1, 2)*, (3, 4))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpicking: shouldn't the * be before the tuple? THis looks like the tip passed the first tuple

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, fixing

self.curiter = None

def __repr__(self):
items = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

items = list(self.nitems)
insert(items, "*", self.tip-1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right, improved as per your suggestion

Copy link
Contributor

@georgebisbas georgebisbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not much more to add

devito/arch/compiler.py Show resolved Hide resolved
devito/passes/clusters/blocking.py Show resolved Hide resolved
# `functions_unseen` are all Functions that `iet` may need to linearize
# that have not been seen while processing other IETs
functions_unseen = [f for f in functions if f not in cache]

# Find unique sizes (unique -> minimize necessary registers)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np, nit picking

@_estimate_cost.register(Function)
def _(expr, estimate):
if q_routine(expr):
flops, _ = zip(*[_estimate_cost(a, estimate) for a in expr.args])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

advisor uses INTOPS, counts them separately, you also have the option to add flops and intops to global ops

@@ -713,7 +713,7 @@
" START_TIMER(section0)\n",
" #pragma omp parallel num_threads(nthreads)\n",
" {\n",
" const int tid = omp_get_thread_num();\n",
" const int tid = omp_get_thread_num();;\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This? reminder ;;

@FabioLuporini FabioLuporini merged commit 41ee245 into master Feb 14, 2022
@FabioLuporini FabioLuporini deleted the admit-cuda-2 branch February 14, 2022 08:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants