Expected behavior
The best reduce_sum implementation found by Ansor should perform similarly to torch.sum.
Actual behavior
The best reduce_sum implementation found by Ansor is more than 30x slower than torch.sum.
Environment
TVM version: 0.12.0 release
NVCC: 11.0
Steps to reproduce



Triage