[Bug] TIR tensorization breaks when one dimension is a unit iterator #16566

### Expected behavior

Tensorization works and replaces part of the schedule with my custom instruction.

### Actual behavior

Tensorization throws an error from `CompareBufferRegion` reporting a buffer region min mismatch. I think that during tensorization the schedule is simplified to prune inner unit iterators (see [this function here](https://github.com/apache/tvm/blob/685355e2c7f4ae98342dadb6b4b6119066d8c305/src/tir/schedule/primitive/blockize_tensorize.cc#L254)), so the inner iteration variable is set to zero. However, if the intrinsic itself has a dimension that is a unit iterator, the same simplification is not applied to it, and the comparison fails because the two sides are no longer equivalent.
I am aware that intrinsics with unit iterators are not the most common use case, but the equivalent feature existed in TE-based scheduling, so it would be nice if it still worked with TIR.
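
To make the suspected mismatch concrete, here is a toy model in plain Python. It does not use TVM and is not TVM's implementation: `prune_unit_iters` and `region_mins_match` are hypothetical helpers, and the "region mins" are just tuples of loop-variable names. It only illustrates how pruning unit iterators on one side of a structural comparison can make otherwise identical regions differ.

```python
# Toy model only: hypothetical helpers, not TVM APIs.
# A region min is either an int constant or the name of a loop variable.

def prune_unit_iters(mins, extents):
    """Replace every loop variable whose extent is 1 with the constant 0."""
    return tuple(
        0 if isinstance(m, str) and extents[m] == 1 else m
        for m in mins
    )

def region_mins_match(lhs, rhs):
    """Structural comparison: no further simplification is attempted."""
    return lhs == rhs

# Loop extents from the reproducer: dim_I = 1 makes `i` a unit iterator.
extents = {"i": 1, "j": 512, "k": 1024}

# Region min of C[vii, vjj] on the schedule side, after pruning.
schedule_mins = prune_unit_iters(("i", "j"), extents)

# The intrinsic description is assumed to be left as written.
intrin_mins = ("i", "j")

mismatch = not region_mins_match(schedule_mins, intrin_mins)

# Applying the same pruning to the description restores equivalence.
fixed = region_mins_match(schedule_mins, prune_unit_iters(intrin_mins, extents))
```

In this model the comparison fails only because one side was simplified. That is the behavior I suspect, not something I have confirmed in the TVM source.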

### Environment

Rocky Linux 

### Steps to reproduce

```python
import tvm
from tvm import te
from tvm.script import tir as T

dim_I = 1
dim_K = 1024
dim_J = 512

inp_shape =  (dim_I, dim_K)
wght_shape = (dim_K, dim_J)
out_shape =  (dim_I, dim_J)

ins_dtype = "int8"
out_dtype = "int8"

inp = te.placeholder(inp_shape, dtype=ins_dtype, name="a_in")
wght = te.placeholder(wght_shape, dtype=ins_dtype, name="b_in")
rk = te.reduce_axis((0, dim_K), name="k")

res = te.compute(
    out_shape,
    lambda i, j: te.sum(
        inp[i, rk].astype(out_dtype) * wght[rk, j].astype(out_dtype),
        axis=[rk],
    ),
    name="res",
    tag="dense",
)

func = te.create_prim_func([inp, wght, res])
sch = tvm.tir.Schedule(func)

def get_intrin_gemm(
    dim_i: int,
    dim_k: int,
    dim_j: int,
):
    @T.prim_func
    def matmul_desc(a: T.handle, b:T.handle, c:T.handle, ) -> None:
        A = T.match_buffer(a, (dim_i, dim_k), "int8", offset_factor=1,)
        B = T.match_buffer(b, (dim_k, dim_j), "int8", offset_factor=1,)
        C = T.match_buffer(c, (dim_i, dim_j), "int8", offset_factor=1,)

        with T.block("root"):
            T.reads(C[0:dim_i, 0:dim_j], A[0:dim_i, 0:dim_k], B[0:dim_k, 0:dim_j])
            T.writes(C[0:dim_i, 0:dim_j])
            for i, k, j in T.grid(dim_i, dim_k, dim_j):
                with T.block(""):
                    vii, vjj, vkk = T.axis.remap("SSR", [i, j, k])
                    C[vii, vjj] = C[vii, vjj] + T.cast(A[vii, vkk], ins_dtype) * T.cast(B[vkk, vjj], ins_dtype)

    @T.prim_func
    def matmul_impl(a: T.handle, b:T.handle, c:T.handle, ) -> None:
        A = T.match_buffer(a, (dim_i, dim_k), "int8", offset_factor=1,)
        B = T.match_buffer(b, (dim_k, dim_j), "int8", offset_factor=1,)
        C = T.match_buffer(c, (dim_i, dim_j), "int8", offset_factor=1,)

        with T.block("root"):
            T.reads(A[0:dim_i, 0:dim_k], B[0:dim_k, 0:dim_j], C[0:dim_i, 0:dim_j],)
            T.writes(C[0:dim_i, 0:dim_j])
            T.evaluate(
                T.call_extern("computer_function_extern",
                            dtype="")
            )
    return matmul_desc, matmul_impl

desc, impl = get_intrin_gemm(dim_I, dim_K, dim_J)

res_block = sch.get_block("res")
i, j, k = sch.get_loops(res_block)
sch.reorder(i,k,j)
sch.decompose_reduction(res_block, i)

tvm.tir.TensorIntrin.register("matmul_intrin", desc, impl)
sch.tensorize(i, "matmul_intrin")
```

### Triage

* needs-triage
* tir:schedule

