// *** Tensor Expressions ***
//
// This tutorial covers the basics of NNC's tensor expressions, shows the
// basic APIs for working with them, and outlines how they are used in the
// overall TorchScript compilation pipeline. This doc is permanently a "work in
// progress", since NNC is under active development and things change fast.
//
// This tutorial's code is compiled as part of the standard PyTorch build, and
// the executable can be found in `build/bin/tutorial_tensorexpr`.
//
// *** What is NNC ***
//
// NNC stands for Neural Net Compiler. It is a component of the TorchScript
// JIT and it performs on-the-fly code generation for kernels, which are often
// a combination of multiple aten (torch) operators.
//
// When the JIT interpreter executes a TorchScript model, it automatically
// extracts subgraphs from the TorchScript IR graph for which specialized code
// can be JIT generated. This usually improves performance as the 'combined'
// kernel created from the subgraph could avoid unnecessary memory traffic that
// is unavoidable when the subgraph is interpreted as-is, operator by operator.
// This optimization is often referred to as 'fusion'. Relatedly, the process of
// finding and extracting subgraphs suitable for NNC code generation is done by
// a JIT pass called 'fuser'.
//
// *** What is TE ***
//
// TE stands for Tensor Expressions. TE is a commonly used approach for
// compiling kernels performing tensor (~matrix) computation. The idea behind it
// is that operators are represented as a mathematical formula describing what
// computation they do (as TEs) and then the TE engine can perform mathematical
// simplification and other optimizations using those formulas and eventually
// generate executable code that would produce the same results as the original
// sequence of operators, but more efficiently.
//
// NNC's design and implementation of TE was heavily inspired by Halide and TVM
// projects.
#include <iostream>
#include <string>
#include <torch/csrc/jit/tensorexpr/eval.h>
#include <torch/csrc/jit/tensorexpr/expr.h>
#include <torch/csrc/jit/tensorexpr/ir.h>
#include <torch/csrc/jit/tensorexpr/ir_printer.h>
#include <torch/csrc/jit/tensorexpr/loopnest.h>
#include <torch/csrc/jit/tensorexpr/stmt.h>
#include <torch/csrc/jit/tensorexpr/tensor.h>
using namespace torch::jit::tensorexpr;
int main(int argc, char* argv[]) {
// Memory management for tensor expressions is currently done with memory
// arenas. That is, whenever an object is created it registers itself in an
// arena and the object is kept alive as long as the arena is alive. When the
// arena gets destructed, it deletes all objects registered in it.
//
// The easiest way to set up a memory arena is to use the `KernelScope` class -
// it is a resource guard that creates a new arena on construction and restores
// the previously set arena on destruction.
//
// We will create a kernel scope here, and thus we'll set up a memory arena for
// the entire tutorial.
KernelScope kernel_scope;
std::cout << "*** Structure of tensor expressions ***" << std::endl;
{
// A tensor expression is a tree of expressions. Each expression has a type,
// and that type defines what sub-expressions the current expression has. For
// instance, an expression of type 'Mul' would have a type 'kMul' and two
// sub-expressions: LHS and RHS. Each of these two sub-expressions could also
// be a 'Mul' or some other expression.
//
// Let's construct a simple TE:
Expr* lhs = new IntImm(5);
Expr* rhs = new Var("x", kInt);
Expr* mul = new Mul(lhs, rhs);
std::cout << "Tensor expression: " << *mul << std::endl;
// Prints: Tensor expression: 5 * x
// Here we created an expression representing a 5*x computation, where x is
// an int variable.
// Another, probably more convenient, way to construct tensor expressions is
// to use so-called expression handles (as opposed to raw expressions like we
// did in the previous example). Expression handles overload common operations
// and allow us to express the same semantics in a more natural way:
ExprHandle l = 1;
ExprHandle r = Var::make("x", kInt);
ExprHandle m = l * r;
std::cout << "Tensor expression: " << *m.node() << std::endl;
// Prints: Tensor expression: 1 * x
// In a similar fashion we could construct arbitrarily complex expressions
// using mathematical and logical operations, casts between various data
// types, and a bunch of intrinsics.
ExprHandle a = Var::make("a", kInt);
ExprHandle b = Var::make("b", kFloat);
ExprHandle c = Var::make("c", kFloat);
ExprHandle x = ExprHandle(5) * a + b / (sigmoid(c) - 3.0f);
std::cout << "Tensor expression: " << *x.node() << std::endl;
// Prints: Tensor expression: float(5 * a) + b / ((sigmoid(c)) - 3.f)
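// As a small hedged illustration (this sketch assumes Cast::make from the
// headers included above takes a dtype and an expression; the names 'a_float'
// and 'y' are purely illustrative), an explicit cast could be constructed
// like so:
ExprHandle a_float = Cast::make(kFloat, a);
ExprHandle y = a_float * b;
std::cout << "Tensor expression: " << *y.node() << std::endl;
// Expected to print something like: float(a) * b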
// The ultimate purpose of tensor expressions is to optimize tensor
// computations, and in order to represent accesses to tensor data, there is a
// special kind of expression - a load.
// To construct a load we need two pieces: the base and the indices. The
// base of a load is a Buf expression, which could be thought of as a
// placeholder similar to Var, but with dimensions info.
//
// Let's construct a simple load:
BufHandle A("A", {ExprHandle(64), ExprHandle(32)}, kInt);
ExprHandle i = Var::make("i", kInt), j = Var::make("j", kInt);
ExprHandle load = Load::make(A.dtype(), A, {i, j}, /* mask= */ 1);
std::cout << "Tensor expression: " << *load.node() << std::endl;
// Prints: Tensor expression: A[i, j]
}
std::cout << "*** Tensors, Functions, and Placeholders ***" << std::endl;
{
// A tensor computation is represented by objects of the Tensor class and
// consists of the following pieces:
// - a domain, which is specified by a Buf expression
// - an expression (or several expressions if we want to perform several
// independent computations over the same domain) for its elements, as a
// function of indices
//
// TODO: Update this section once Tensor/Function cleanup is done
std::vector<const Expr*> dims = {
new IntImm(64), new IntImm(32)}; // IntImm stands for Integer Immediate
// and represents an integer constant
// Next we need to create arguments. The arguments are Vars, and they play the
// role of placeholders. The computation that the tensor describes will use
// these arguments.
const Var* i = new Var("i", kInt);
const Var* j = new Var("j", kInt);
std::vector<const Var*> args = {i, j};
// Now we can define the body of the tensor computation using these
// arguments.
Expr* body = new Mul(i, j);
// Finally, we pass all these pieces together to the Tensor constructor:
Tensor* X = new Tensor("X", dims, args, body);
std::cout << "Tensor computation: " << *X << std::endl;
// Prints: Tensor computation: Tensor X(i[64], j[32]) = i * j
// Similarly to how handles provide a more convenient way of constructing
// Exprs, Tensors also have a more convenient API for construction. It is
// based on the Compute API, which takes a name, dimensions, and a lambda
// specifying the computation body:
Tensor* Z = Compute(
"Z",
{{64, "i"}, {32, "j"}},
[](const VarHandle& i, const VarHandle& j) { return i / j; });
std::cout << "Tensor computation: " << *Z << std::endl;
// Prints: Tensor computation: Tensor Z(i[64], j[32]) = i / j
// Tensors might access other tensors and external placeholders in their
// expressions. This can be done like so:
Placeholder P("P", kFloat, {64, 32});
Tensor* R = Compute(
"R",
{{64, "i"}, {32, "j"}},
[&](const VarHandle& i, const VarHandle& j) {
return Z->call(i, j) * P.load(i, j);
});
std::cout << "Tensor computation: " << *R << std::endl;
// Prints: Tensor computation: Tensor R(i[64], j[32]) = Z(i, j) * P[i, j]
// Placeholders could be thought of as external tensors, i.e. tensors for
// which we don't have an element expression. In other words, for a `Tensor` we
// know an expression specifying how its elements can be computed (a
// mathematical formula). For external tensors, or placeholders, we don't have
// such an expression. They should be thought of as inputs coming to us from
// the outside - we can only load data from them.
//
// Also note that we use 'call' to construct an access to an element of a
// Tensor, and we use 'load' for accessing elements of an external tensor
// through its Placeholder. This is an implementation detail and could be
// changed in the future.
// TODO: Show how reductions are represented and constructed
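// As a preliminary hedged sketch toward that TODO (the Reduce and Sum helpers
// and this particular overload are assumptions of the sketch and are not
// exercised by this tutorial), a row-wise sum over the placeholder P might
// look like:
//
// Tensor* S = Reduce("S", {{64, "i"}}, Sum(), P, {{32, "j"}});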
}
std::cout << "*** Loopnests and Statements ***" << std::endl;
{
// Creating a tensor expression is the first step toward generating executable
// code for it. The next step is to represent it as a loop nest and apply
// various loop transformations in order to get an optimal implementation.
// In Halide's or TVM's terms the first step was to define the algorithm of
// computation (what to compute?) and now we are getting to the schedule of
// the computation (how to compute?).
//
// Let's create a simple tensor expression and construct a loop nest for it.
Placeholder A("A", kFloat, {64, 32});
Placeholder B("B", kFloat, {64, 32});
Tensor* X = Compute(
"X",
{{64, "i"}, {32, "j"}},
[&](const VarHandle& i, const VarHandle& j) {
return A.load(i, j) + B.load(i, j);
});
Tensor* Y = Compute(
"Y",
{{64, "i"}, {32, "j"}},
[&](const VarHandle& i, const VarHandle& j) {
return sigmoid(X->call(i, j));
});
std::cout << "Tensor computation X: " << *X
<< "Tensor computation Y: " << *Y << std::endl;
// Prints:
// Tensor computation X: Tensor X(i[64], j[32]) = (A[i, j]) + (B[i, j])
// Tensor computation Y: Tensor Y(i[64], j[32]) = sigmoid(X(i, j))
// Creating a loop nest is quite simple: we just need to specify the output
// tensors of our computation, and the LoopNest object will automatically pull
// in all tensor dependencies:
LoopNest loopnest({Y});
// The IR used in LoopNest is based on tensor statements, represented by the
// `Stmt` class. Statements are used to specify the loop nest structure, and
// to take a sneak peek at them, let's print out what we got right after
// creating our LoopNest object:
std::cout << *loopnest.root_stmt() << std::endl;
// Prints:
// {
// for (int i = 0; i < 64; i++) {
// for (int j = 0; j < 32; j++) {
// X[i, j] = (A[i, j]) + (B[i, j]);
// }
// }
// for (int i_1 = 0; i_1 < 64; i_1++) {
// for (int j_1 = 0; j_1 < 32; j_1++) {
// Y[i_1, j_1] = sigmoid(X(i_1, j_1));
// }
// }
// }
// To introduce statements, let's first look at their three main types (in
// fact, there are more than three, but the other types are easy to understand
// once the overall structure is clear):
// 1) Block
// 2) For
// 3) Store
//
// A `Block` statement is simply a list of other statements.
// A `For` is a statement representing one axis of computation. It contains
// an index variable (Var), boundaries of the axis (start and end - both are
// `Expr`s), and a `Block` statement body.
// A `Store` represents an assignment to a tensor element. It contains a Buf
// representing the target tensor, a list of expressions for indices of the
// element, and the value to be stored, which is an arbitrary expression.
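// As a minimal hand-construction sketch (the helper signatures used here -
// Store::make, For::make, Block::make - are assumed to accept these arguments;
// the names 'k', 'C', 'store_c', and 'loop_c' are purely illustrative), a tiny
// loop writing into a one-dimensional buffer could be built manually:
VarHandle k("k", kInt);
BufHandle C("C", {ExprHandle(16)}, kInt);
Stmt* store_c = Store::make(C, {k}, ExprHandle(2) * k, /* mask= */ 1);
Stmt* loop_c = For::make(k, 0, 16, Block::make({store_c}));
std::cout << *loop_c << std::endl;
// Expected to print something like:
// for (int k = 0; k < 16; k++) {
// C[k] = 2 * k;
// }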
// Once we've constructed the loop nest, we can apply various transformations
// to it. To begin with, let's inline the computation of X into the computation
// of Y and see what happens to our statements.
loopnest.computeInline(loopnest.getLoopBodyFor(X));
std::cout << *loopnest.root_stmt() << std::endl;
// Prints:
// {
// for (int i = 0; i < 64; i++) {
// for (int j = 0; j < 32; j++) {
// Y[i, j] = sigmoid((A[i, j]) + (B[i, j]));
// }
// }
// }
//
// As you can see, the first two loops have disappeared and the expression
// for X[i,j] has been inserted into the Y[i,j] computation.
// Loop transformations can be composed, so we can do something else with
// our loop nest now. Let's split the inner loop with a factor of 9, for
// instance.
std::vector<For*> loops = loopnest.getLoopStmtsFor(Y);
For* j_outer;
For* j_inner;
For* j_tail;
int split_factor = 9;
loopnest.splitWithTail(
loops[1], // loops[0] is the outer loop, loops[1] is inner
split_factor,
&j_outer, // These are handles that we would be using for
&j_inner, // further transformations
&j_tail);
std::cout << *loopnest.root_stmt() << std::endl;
// Prints:
// {
// for (int i = 0; i < 64; i++) {
// for (int j_outer = 0; j_outer < (32 - 0) / 9; j_outer++) {
// for (int j_inner = 0; j_inner < 9; j_inner++) {
// Y[i, j_outer * 9 + j_inner] = sigmoid((A[i, j_outer * 9 + ...
// }
// }
// for (int j_tail = 0; j_tail < (32 - 0) % 9; j_tail++) {
// Y[i, j_tail + ((32 - 0) / 9) * 9] = sigmoid((A[i, j_tail + ...
// }
// }
// }
// TODO: List all available transformations
// TODO: Show how statements can be constructed manually
}
std::cout << "*** Codegen ***" << std::endl;
{
// The ultimate goal of tensor expressions is to provide a mechanism for
// executing a given computation in the fastest possible way. So far we've
// looked at how to describe what computation we're interested in, but we
// haven't looked at how to actually execute it. Everything we've been dealing
// with so far was just symbols with no actual data associated with them; in
// this section we will look at how to bridge that gap.
// Let's start by constructing a simple computation for us to work with:
Placeholder A("A", kInt, {64, 32});
Placeholder B("B", kInt, {64, 32});
Tensor* X = Compute(
"X",
{{64, "i"}, {32, "j"}},
[&](const VarHandle& i, const VarHandle& j) {
return A.load(i, j) + B.load(i, j);
});
// And let's lower it to a loop nest, as we did in the previous section:
LoopNest loopnest({X});
std::cout << *loopnest.root_stmt() << std::endl;
// Prints:
// {
// for (int i = 0; i < 64; i++) {
// for (int j = 0; j < 32; j++) {
// X[i, j] = (A[i, j]) + (B[i, j]);
// }
// }
// }
// Now imagine that we have two actual 64x32 tensors that we want to sum
// together. How do we pass those tensors to the computation, and how do we
// carry it out?
//
// The Codegen object is aimed at providing exactly that functionality.
// Codegen is an abstract class, and concrete codegens are derived from it.
// Currently, we have three codegens:
// 1) Simple Evaluator,
// 2) LLVM Codegen for CPU,
// 3) CUDA Codegen.
// In this example we will be using the Simple Evaluator, since it's available
// everywhere.
// To create a codegen, we need to provide the statement - it specifies the
// computation we want to perform - and a list of placeholders and tensors
// used in the computation. The latter part is crucial since that's the only
// way for the codegen to correlate symbols in the statement with the actual
// data arrays that we will pass in when we actually perform the computation.
//
// Let's create a Simple IR Evaluator codegen for our computation:
SimpleIREvaluator ir_eval(loopnest.root_stmt(), {A, B, X});
// We are using the simplest codegen, and in it almost no work is done at the
// construction step. Real codegens such as CUDA and LLVM perform compilation
// during that stage, so that when we're about to run the computation
// everything is ready.
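// As a hedged sketch only (not exercised by this tutorial; it assumes an
// LLVM-enabled PyTorch build and that LLVMCodeGen is constructed from a
// statement and buffer arguments the same way as the SimpleIREvaluator above),
// an LLVM-based codegen could be created like this:
//
// #include <torch/csrc/jit/tensorexpr/llvm_codegen.h>
// LLVMCodeGen llvm_codegen(loopnest.root_stmt(), {A, B, X});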
// Let's now create some inputs and run our computation with them:
std::vector<int> data_A(64 * 32, 3); // This will be the input A
std::vector<int> data_B(64 * 32, 5); // This will be the input B
std::vector<int> data_X(64 * 32, 0); // This will be used for the result
// Now let's invoke our codegen to perform the computation on our data. We
// need to provide as many arguments as there were placeholders and tensors at
// codegen construction time. Positions in these lists define how the real data
// arrays passed to the call (these arguments are referred to as 'CallArg's in
// our codebase) correspond to the symbols (placeholders and tensors) used in
// the tensor expressions we constructed (these are referred to as
// 'BufferArg's).
// Thus, we will provide three arguments: data_A, data_B, and data_X. data_A
// contains data for the placeholder A, data_B for the placeholder B, and
// data_X will hold the contents of tensor X.
ir_eval(data_A, data_B, data_X);
// Let's print one of the elements from each array to verify that the
// computation did happen:
std::cout << "A[10] = " << data_A[10] << std::endl
<< "B[10] = " << data_B[10] << std::endl
<< "X[10] = A[10] + B[10] = " << data_X[10] << std::endl;
// Prints:
// A[10] = 3
// B[10] = 5
// X[10] = A[10] + B[10] = 8
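// As a plain C++ sanity check (no NNC APIs involved; this is just an
// illustrative addition), we can also verify that every element of the result
// holds the expected value:
bool all_correct = true;
for (int idx = 0; idx < 64 * 32; idx++) {
all_correct = all_correct && (data_X[idx] == data_A[idx] + data_B[idx]);
}
std::cout << "All elements correct: " << (all_correct ? "yes" : "no")
<< std::endl;
// Expected to print: All elements correct: yes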
}
// TODO: Show how TorchScript IR is translated to TE
return 0;
}