compiler: Augment code generation capabilities for CUDA/HIP/SYCL support #1828
Codecov Report
@@ Coverage Diff @@
## master #1828 +/- ##
==========================================
- Coverage 89.54% 89.54% -0.01%
==========================================
Files 209 209
Lines 34517 34841 +324
Branches 5212 5258 +46
==========================================
+ Hits 30908 31197 +289
- Misses 3117 3152 +35
Partials 492 492
10d7c13
to
c75152e
Compare
aca3717
to
5d9b0e5
Compare
@@ -157,49 +155,3 @@ def _callback(self, clusters, d, prefix):
        accesses = [a for a in scope.accesses if not a.is_scalar]
        if all(a.is_regular and a.affine_if_present(d._defines) for a in accesses):
            return AFFINE


class Tiling(Detector):

Note for reviewers: this was moved inside the blocking pass.
size = "sizeof(%s%s)" % (obj._C_typedata, shape)
alloc = c.Statement(self.lang['alloc-host'](obj._C_name,
                                            obj._data_alignment, size))
memptr = VOID(Byref(obj._C_symbol), '**')

Note for reviewers: at last (!) these aren't cgen objects anymore, but rather first-class IET nodes. So now the visitors pick them up...
6636d53
to
d991678
Compare
I've done a first pass and would like to look again. I like this tiling restructuring.

Some comments. Looks like a nice cleanup.
@@ -84,6 +95,9 @@ def _normalize_kwargs(cls, **kwargs):
        # Blocking
        o['blockinner'] = oo.pop('blockinner', False)
        o['blocklevels'] = oo.pop('blocklevels', cls.BLOCK_LEVELS)
        o['blockeager'] = oo.pop('blockeager', cls.BLOCK_EAGER)
        o['blocklazy'] = oo.pop('blocklazy', not o['blockeager'])

Is it really necessary to have both an option and its negation?

Probably not. I made it this way because then, in theory, you could have both modes. We would need some extra machinery, but one could have eager blocking for some loops and lazy blocking for others. I admit we don't really have use cases at the moment, so if you prefer I can drop one.
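
To make the discussion above concrete, here is a minimal standalone sketch of how the two options interact under the normalization shown in the diff. The helper name `normalize_blocking_options` and the default value for `block_eager_default` are assumptions for illustration only; the real code lives in `_normalize_kwargs` and reads `cls.BLOCK_EAGER`, whose value is not shown in this thread.

```python
def normalize_blocking_options(user_opts, block_eager_default=True):
    """Hypothetical stand-in for the two `blockeager`/`blocklazy` lines under review."""
    oo = dict(user_opts)  # copy, so the caller's dict is not mutated by pop()
    o = {}
    # `block_eager_default` plays the role of cls.BLOCK_EAGER (assumed value)
    o['blockeager'] = oo.pop('blockeager', block_eager_default)
    # Unless set explicitly, lazy blocking is simply the negation of eager blocking
    o['blocklazy'] = oo.pop('blocklazy', not o['blockeager'])
    return o


# Default: only eager blocking is enabled
print(normalize_blocking_options({}))
# Turning eager off implicitly turns lazy on
print(normalize_blocking_options({'blockeager': False}))
# Both can be requested at once, which is the "both modes" case mentioned above
print(normalize_blocking_options({'blockeager': True, 'blocklazy': True}))
```

Keeping two options only pays off if the third case (both enabled) is ever meaningful; otherwise a single boolean would carry the same information.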
devito/core/gpu.py
Outdated
@@ -176,6 +191,10 @@ def _specialize_clusters(cls, clusters, **kwargs):
        # Reduce flops
        clusters = cse(clusters, sregistry)

        # Blocking to define thread blocks
        if options['blocklazy']:
            clusters = blocking(clusters, options)

I still think it would be nice to have the CPU and GPU core parts merged, as they are very similar, but I understand it may be a pain.

Agreed. I think we're getting there, slowly, PR after PR. With this PR, for example, we're dropping the explicit advanced-fsg pipeline. In the past we've dropped others.

Right... theoretically, as we move toward exploring the impact of the optimization order, we keep dropping pipelines... theoretically always...
devito/ir/clusters/algorithms.py
Outdated
@@ -17,10 +17,12 @@
 __all__ = ['clusterize']


-def clusterize(exprs):
+def clusterize(exprs, **kwargs):

options=None? Or do you plan to add more kwargs?

The caller passes in all of the kwargs directly, so this way it works seamlessly.

clusterize(exprs, options=None, **kwargs): maybe then?

Yes, changing this now.
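
A small sketch of the signature agreed on above, showing why `options=None, **kwargs` still works when the caller forwards its whole kwargs dict. The body and the names in `compiler_kwargs` are placeholders made up for illustration; the real `clusterize` builds Clusters and is not reproduced here.

```python
def clusterize(exprs, options=None, **kwargs):
    """Placeholder body: only reports what it received."""
    options = options or {}
    # Any extra keyword the caller forwards (e.g. a symbol registry) lands in
    # **kwargs and is ignored here instead of raising a TypeError
    return {'nexprs': len(exprs), 'options': options, 'ignored': sorted(kwargs)}


# Hypothetical kwargs dict, forwarded wholesale by the caller
compiler_kwargs = {'options': {'blocklevels': 1}, 'sregistry': object(), 'platform': 'cpu'}
result = clusterize(['eq0', 'eq1'], **compiler_kwargs)
print(result['options'], result['ignored'])
```

Naming `options` explicitly documents the one keyword the function actually consumes, while `**kwargs` keeps the call site unchanged.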
    is_Section = False
    is_HaloSpot = False
    is_ExpressionBundle = False
    is_ParallelIteration = False
    is_ParallelBlock = False

Always nice to see stuff disappear.
devito/ir/iet/visitors.py
Outdated
@@ -455,7 +485,7 @@ def visit_Operator(self, o):
 prefix = ' '.join(i.root.prefix + (i.root.retval,))
 esigns.append(c.FunctionDeclaration(c.Value(prefix, i.root.name),
                                     self._args_decl(i.root.parameters)))
-efuncs.extend([i.root.ccode, blankline])
+efuncs.extend([self.visit(i.root), blankline])

visit or _visit?

I'll make it _visit for homogeneity.
devito/ir/iet/visitors.py
Outdated
else:
    v = i
for i in o.children:
    v = self.visit(i)

visit or _visit?

I'll make it _visit for homogeneity.
devito/ir/support/space.py
Outdated
-sub_dims = [i.parent for v in self.sub_iterators.values() for i in v]
-return filter_ordered(self.intervals.dimensions + sub_dims)
+sub_dims = flatten(i._defines for v in self.sub_iterators.values() for i in v)
+return filter_ordered(self.itdimensions + list(sub_dims))

Isn't flatten already a list?

Right, dropping the list().
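
The point of the exchange above, sketched with stand-ins: if `flatten` already returns a list, wrapping its result in `list()` is redundant. Both helpers below are simplified re-implementations written for this example, not Devito's own `flatten` and `filter_ordered`, and the dimension names are made up.

```python
def flatten(nested):
    """Stand-in: flatten nested lists/tuples (or a generator of them) into a list."""
    out = []
    for item in nested:
        if isinstance(item, (list, tuple)):
            out.extend(flatten(item))
        else:
            out.append(item)
    return out


def filter_ordered(items):
    """Stand-in: drop duplicates while preserving first-seen order."""
    seen = set()
    return [i for i in items if not (i in seen or seen.add(i))]


itdimensions = ['x', 'y']                  # assumed to be a list, as the `+` in the diff suggests
sub_defines = [('x', 'xb'), ('y', 'yb')]   # hypothetical `_defines` of each sub-iterator

sub_dims = flatten(d for d in sub_defines)  # a generator goes in, a list comes out
dims = filter_ordered(itdimensions + sub_dims)  # no list() wrapper needed
print(dims)
```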
s = Symbol(name='s', dtype=grid.dtype)

eqns = [Eq(s, 0),
        Eq(s, s + f + 1)]

So does that mean norm can be simplified in our builtins, or do we still need that size-1 Function?

Not really, what makes you think that?
3e89c1a
to
aa391ae
Compare
devito/core/cpu.py
Outdated
    situations where the performance impact might be detrimental.
    """

    BLOCK_STEP = None

Should this skip autotuning?
17e6b05
to
7bce275
Compare
14a7b88
to
1f28853
Compare
devito/passes/clusters/blocking.py
Outdated
d = prefix[-1].dim

for c in clusters:
    # PARALLEL* and AFFINE are necessaary conditions

Typo in "necessary".

OK, fixing.
devito/passes/clusters/blocking.py
Outdated
if is_inner and not self.inner:
    return clusters

# Heuristic: TILABLE not worth it if not within SEQUENTIAL Dimension

Heuristic: TILABLE is not worth it if not within a SEQUENTIAL Dimension

Fixing.
devito/passes/clusters/blocking.py
Outdated
base = self.sregistry.make_name(prefix=d.name)

if self.generator:
    # An explicit integer step has been supplied

Is something missing here? It sounds weird.

Fixing.
c1d4f29
to
22341f1
Compare
22341f1
to
d2ece83
Compare
Some nitpicking-level comments and questions.
@@ -186,6 +202,10 @@ def _specialize_clusters(cls, clusters, **kwargs):
        # Reduce flops
        clusters = cse(clusters, sregistry)

        # Blocking to improve data locality

Nitpicking: in the docstring, I would rename "blocking" to "loop blocking".
devito/passes/iet/linearization.py
Outdated
functions = sorted(functions, key=lambda f: len(f.dimensions), reverse=True)

# `functions_unseen` are all Functions that `iet` may need to linearize
# that have not been seen while processing other IETs

that -> and ?

Yes, changing.
# `functions_unseen` are all Functions that `iet` may need to linearize
# that have not been seen while processing other IETs
functions_unseen = [f for f in functions if f not in cache]

# Find unique sizes (unique -> minimize necessary registers)

necessary -> required/needed ?

What changes?

No problem, just nitpicking.
def test_strides_forwarding():
def test_unsubstituted_indexeds():
    """
    This issue emerged in the context of PR #1828, after the introduction

So we are all good with this, right? Initially I thought it was more severe.
tests/test_linearize.py
Outdated
op0 = Operator(eq)
op1 = Operator(eq, opt=('advanced', {'linearize': True}))

# NOTE: we compare the numerical output eventually, but truly the most

NOTE: Eventually we compare....

OK
linearize(graph, mode=True, sregistry=SymbolRegistry())

# Despite `a` is passed via `a.indexed`, and since it's an Array (which
# have symbolic shape), we expect the stride exprs to be placed in `bar`,

which has ?

Some minor comments, but looks good to me.
if not x:
    raise ValueError("Expected at least one value")

try:

That's a lot of nested try/except. Isn't there a simpler way? With recursion, or something like make_tile(i) for i in x?

I'm sure it's possible, but here the max depth is fixed (3), so I think explicit is OK.
@@ -254,6 +263,8 @@ def __init__(self, name, arguments=None, retobj=None, is_indirect=False):
        self.arguments = as_tuple(arguments)
        self.retobj = retobj
        self.is_indirect = is_indirect
        self.cast = cast

Not private?

We don't do private in IET node classes. They're buried deep inside the compiler, and since everything is immutable, we're loosely avoiding _private + property.

(Not saying it's better this way, just pointing out why it evolved into this, naturally, over time.)
if expr.exp.is_Number:
    if expr.exp < 0:
        flops += estimate_values['div']
    elif expr.exp == 0 or expr.exp == 1:

The next one is correct for 1 as well.

? This is an "or", right? What do you mean by "next one"?
@_estimate_cost.register(Function)
def _(expr, estimate):
    if q_routine(expr):
        flops, _ = zip(*[_estimate_cost(a, estimate) for a in expr.args])

I like it, just wondering whether people usually consider these for flops/OI/... measures. For example, does VTune count indices as flops?

Integer arithmetic, including array indexing, is never counted as flops.
As for the trigonometric (and sibling) functions: no, here we return pure estimates, but such estimates are only used by CIRE. What Devito tells the user is the "flattened" operation count, i.e. one flop per operation, irrespective of whether it's a div, a mul, or a sin.
In practice this is extremely reliable, because divs and trigonometric functions tend to be hoisted out of the inner loops.

Advisor uses INTOPS and counts them separately; you also have the option to add flops and intops to global ops.
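
The difference between the flat operation count and the weighted estimate discussed above can be sketched on a toy expression tree. Everything here is illustrative: the tuple-based tree, the `count` helper, and the weights are assumptions (only the existence of a `'div'` entry in `estimate_values` is visible in the diff), and none of it mirrors the actual `_estimate_cost` implementation.

```python
# Hypothetical weights; the real `estimate_values` table is not shown in this thread
WEIGHTS = {'add': 1, 'mul': 1, 'div': 5, 'sin': 50}


def count(expr, weights=None):
    """
    Count operations in a nested tuple `(op, *args)`; leaves are plain strings.
    With `weights=None` every operation costs 1 (the flat count reported to the
    user); with a weights table each operation costs its estimated weight.
    """
    if not isinstance(expr, tuple):
        return 0  # a symbol or a number costs nothing
    op, *args = expr
    cost = weights[op] if weights else 1
    return cost + sum(count(a, weights) for a in args)


# a*sin(b) + c/d
expr = ('add', ('mul', 'a', ('sin', 'b')), ('div', 'c', 'd'))

flat = count(expr)               # one flop per operation
weighted = count(expr, WEIGHTS)  # estimate usable by a cost model
print(flat, weighted)
```

The flat count stays meaningful as long as the expensive operations end up outside the inner loops, which is the argument made in the reply above.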
devito/tools/data_structures.py
Outdated
UnboundedMultiTuple((1, 2), (3, 4))
>>> ub.iter()
>>> ub
UnboundedMultiTuple((1, 2)*, (3, 4))

Nitpicking: shouldn't the * be before the tuple? This looks like the tip has passed the first tuple.

Yes, fixing.
devito/tools/data_structures.py
Outdated
        self.curiter = None

    def __repr__(self):
        items = []

items = list(self.nitems)
insert(items, "*", self.tip-1)

You're right, improved as per your suggestion.
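
For illustration, a self-contained sketch of a `__repr__` that puts the `*` marker before the tuple the tip currently points to, as requested in the nitpick above. The class `MultiTupleSketch` is hypothetical and much simpler than `UnboundedMultiTuple`; in particular, how the real class advances or wraps its tip is not shown in this thread and is assumed here.

```python
class MultiTupleSketch:
    """Hypothetical, simplified container of tuples with a movable tip."""

    def __init__(self, *items):
        self.items = [tuple(i) for i in items]
        self.tip = -1  # -1 means iteration has not started yet (assumption)

    def iter(self):
        """Move the tip to the next tuple."""
        self.tip += 1

    def __repr__(self):
        parts = [repr(i) for i in self.items]
        # Prefix (rather than suffix) the current tuple with the marker
        if 0 <= self.tip < len(parts):
            parts[self.tip] = '*' + parts[self.tip]
        return "%s(%s)" % (type(self).__name__, ", ".join(parts))


ub = MultiTupleSketch((1, 2), (3, 4))
print(ub)
ub.iter()
print(ub)
```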
Not much more to add
@@ -713,7 +713,7 @@
 " START_TIMER(section0)\n",
 " #pragma omp parallel num_threads(nthreads)\n",
 " {\n",
-" const int tid = omp_get_thread_num();\n",
+" const int tid = omp_get_thread_num();;\n",

This? Reminder: ;;
This PR:
- ... par-tile opt-option and enables its use for integer-sized loop blocking
- ... place_definitions and place_casts, while making them much more general
- ... (test_unsubstituted_indexeds)