Conversation


@LeiWang1999 LeiWang1999 commented Feb 13, 2024

To optimize i4_to_f16 decoding, we can use advanced hardware instructions to do fast type conversion and alleviate the cost of decoding; in TVM, we can do that via tensorize.

To tensorize decoding, this PR extends the Call handling of ir_comparator, which is necessary because the decode block contains call expressions.

Moreover, the comparator currently simplifies the lhs expr, but the tensor intrin descs are not simplified; this inconsistency makes the comparison fail.
See PR #14108.

For example, we provide a test case for this situation:

from tvm.script import tir as T


def test_tensorize_arith_simplification():
    # fmt: off
    @T.prim_func
    def decode_i4s_to_int32_to_f16():
        B_decode_local = T.alloc_buffer((16384, 16384), "float16", scope="local")
        B_local = T.alloc_buffer((16384, 2048), "int32", scope="local")
        for ax0_0 in T.thread_binding(8192, thread="blockIdx.x"):
            for ax0_1 in T.thread_binding(2, thread="threadIdx.y"):
                for ax1_0 in range(32):
                    for ax1_1 in T.thread_binding(64, thread="threadIdx.x"):
                        for ax0, ax1 in T.grid(1, 8):
                            with T.block("B_decode_local"):
                                v0 = T.axis.spatial(16384, ax0_0 * 2 + ax0_1 + ax0)
                                v1 = T.axis.spatial(16384, ax1_0 * 512 + ax1_1 * 8 + ax1)
                                T.reads(B_local[v0, v1 // 8])
                                T.writes(B_decode_local[v0, v1])
                                B_decode_local[v0, v1] = T.Cast("float16", T.shift_right(T.shift_left(T.bitwise_and(T.shift_right(B_local[v0, v1 // 8], v1 % 8 * 4), 15), 28), 28))
    # fmt: on

The desc's indices should be simplified from [v1 // 8] and [v1 % 8] to [0] and [v1] to match the simplified lhs expr.
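As a quick sanity check of the decode expression in the test case above, here is a pure-Python sketch (the helper names `to_i32` and `decode_i4` are illustrative inventions, not part of this PR) that emulates the int32 shift-based sign extension and confirms it recovers signed 4-bit values from a packed int32:

```python
def to_i32(x):
    """Reinterpret the low 32 bits of x as a signed int32."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def decode_i4(storage, pos):
    """Emulate Cast("float16", shift_right(shift_left(bitwise_and(
    shift_right(storage, pos * 8 % 8 ... ), 15), 28), 28)) without the cast:
    extract nibble `pos`, then sign-extend via shift-left/shift-right."""
    nib = (to_i32(storage) >> (pos * 4)) & 15   # the 4-bit field
    return to_i32(nib << 28) >> 28              # arithmetic shift sign-extends

# Pack eight signed 4-bit values into one int32 word and decode them back.
values = [1, -1, 7, -8, 0, 3, -4, 5]
storage = 0
for i, v in enumerate(values):
    storage |= (v & 0xF) << (4 * i)

decoded = [decode_i4(storage, i) for i in range(8)]
assert decoded == values
```

This mirrors what the tensorized intrinsic must compute: the `<< 28 >> 28` pair turns the raw nibble 15 into -1, 8 into -8, and so on.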

To simplify the tensor intrin's desc, we wrap and reuse tir::transform::Simplify to support simplification of a single stmt.
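The index simplification this relies on can be illustrated with a toy model (the tuple-based expression "IR", the `simplify` function, and the `bounds` dict are all illustrative inventions, not TVM APIs): an expression-level rewrite is applied recursively, using known variable ranges, so that `v1 // 8` and `v1 % 8` collapse when `v1` is known to lie in `[0, 8)`:

```python
# Toy expression IR: tuples like ("floordiv", ("var", "v1"), 8).
# `bounds` maps a variable name to an exclusive upper bound (lower bound 0).

def simplify(expr, bounds):
    """Recursively simplify floordiv/floormod of a bounded variable."""
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [simplify(a, bounds) for a in args]
    if op in ("floordiv", "floormod"):
        base, divisor = args
        if (isinstance(base, tuple) and base[0] == "var"
                and isinstance(divisor, int)
                and base[1] in bounds and bounds[base[1]] <= divisor):
            # With 0 <= v < divisor: v // divisor == 0 and v % divisor == v.
            return 0 if op == "floordiv" else base
    return (op, *args)

# The desc indices from this PR: with v1 iterating over [0, 8),
# [v1 // 8] simplifies to [0] and [v1 % 8] simplifies to [v1].
bounds = {"v1": 8}
assert simplify(("floordiv", ("var", "v1"), 8), bounds) == 0
assert simplify(("floormod", ("var", "v1"), 8), bounds) == ("var", "v1")
```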

@LeiWang1999 LeiWang1999 marked this pull request as ready for review February 13, 2024 13:59

LeiWang1999 commented Feb 13, 2024

cc @Hzfengsy and @Lunderberg, looks like PR #13299 provides a stmt_simplify declaration but does not provide an implementation.


tqchen commented Feb 13, 2024

cc @vinx13


class TensorIntrinSimplifier : public arith::IRMutatorWithAnalyzer {
 public:
  static PrimFunc Apply(PrimFunc func, arith::Analyzer* analyzer) {
Contributor

Instead of simplifying the body of the PrimFunc, can we instead simplify the entire PrimFunc? That way, dynamic expressions that are used in shapes are exposed to the analyzer as non-negative. (e.g. Using buffer of shape [n,m] implies that n >= 0 && m >= 0.)
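To illustrate this point with a minimal sketch (the `Analyzer` class below is a hypothetical model, not TVM's arith::Analyzer): once the analyzer learns n >= 0 and m >= 0 from a buffer shape [n, m], rewrites like max(n, 0) -> n become provable, whereas for an unconstrained variable the expression must be kept as-is:

```python
# Minimal model of an analyzer tracking non-negativity facts derived
# from buffer shapes, as suggested in the review comment above.

class Analyzer:
    def __init__(self):
        self.nonneg = set()  # variables proven >= 0

    def bind_shape(self, *dims):
        # A buffer of shape [n, m] implies n >= 0 and m >= 0.
        self.nonneg.update(dims)

    def simplify_max0(self, var):
        # max(var, 0) -> var is only sound when var is proven >= 0.
        return var if var in self.nonneg else ("max", var, 0)

ana = Analyzer()
ana.bind_shape("n", "m")
assert ana.simplify_max0("n") == "n"              # proven non-negative
assert ana.simplify_max0("k") == ("max", "k", 0)  # unknown: keep max
```

Simplifying only the body would drop the shape context, so such rewrites would be missed.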

Contributor Author

@LeiWang1999 LeiWang1999 Feb 14, 2024

As you mentioned in #13299 (comment), perhaps it's better to simplify at the prim_func level. I chose to implement a stmt simplifier because it may be more useful; the rationale is that a stmt is more fine-grained.

Moreover, in the context of a tensor desc in a tensorize schedule, the prim_func typically contains a single block without dynamic symbolic shapes, so I think a stmt simplifier is enough for this issue. But we can implement a prim_func one as well; should we keep both the stmt and primfunc simplifiers, or maintain only one of them?

Contributor

> As you mentioned in #13299 (comment), perhaps it's better to simplify at the prim_func level. I chose to implement a stmt simplifier because it may be more useful; the rationale is that a stmt is more fine-grained.

Good point. Thinking on it again in the morning, I think we should avoid having a simplify function for tir::Stmt altogether, precisely because it is more fine-grained: its existence would encourage simplifications to be performed on specific statements, even though those statements might not be the outermost.

> But we can implement a prim_func one as well; should we keep both the stmt and primfunc simplifiers, or maintain only one of them?

I think having a simplifier for a PrimFunc would be better, because it encourages developers to simplify with the full context of a statement. The functionality already exists here, and would just need a wrapper function to expose StmtSimplifier::Apply.

Contributor Author

Hi @Lunderberg, I think this PR is ready for review.


tqchen commented Mar 4, 2024

@vinx13 @spectrometerHBH do you mind taking a look at this PR?

@vinx13 vinx13 merged commit 7b7677f into apache:main Mar 8, 2024
Lunderberg pushed a commit to Lunderberg/tvm that referenced this pull request Mar 12, 2024
* support tensorize with simplified and call expr

* replace stmt simplifier with primfunc simplifier

* lint fix

* lint:remove white space

* lint: remove white space

* cpp lint fix

* lint: resolve include

* clang format lint fix
thaisacs pushed a commit to thaisacs/tvm that referenced this pull request Apr 3, 2024
LeiWang1999 pushed a commit to LeiWang1999/tvm that referenced this pull request May 14, 2024