Accuracy regression on INT8 Models #8712
Seeing numerical differences of 5-17 in several INT8 models that were previously passing in iree-samples. Ran on x86 with dylib.
Comments
MLIR files attached. |
Do you know when it started? Also, can you attach the log and flagfile? |
It started failing at this release: https://github.com/google/iree/releases/tag/candidate-20220325.87 |
Thank you! I'll try to see if there are suspicious commits between the releases. |
Actually, |
Okay, good to know this, thank you! |
This file includes MLIR and inputs. |
Note: the second number of the output should be … I'm going to bisect commits... |
Checking out f10b758, I'm able to get the expected results. |
It starts failing at this integration: 9240f1c. I'll try to narrow it down to the dispatch level next week. |
Got a smaller repro. It is a matmul+generic case. To repro:
$ iree-translate -iree-mlir-to-vm-bytecode-module --iree-hal-target-backends=dylib-llvm-aot ~/repro.mlir -o /tmp/a.vmfb
$ iree-run-module --module_file=/tmp/a.vmfb --flagfile=$HOME/flagfile
Expected results: …
Actual output: …
|
Looking into the IRs, it looks like there is an overflow in the IR:
%10 = arith.select %9, %c550829555712_i64, %c548682072064_i64 : i64
%11 = arith.extsi %8 : i32 to i64
%12 = arith.muli %11, %c1092190289_i64 : i64
@rsuderman I highly suspect that this comes from some apply_scale ops. I remember that you had some changes recently. Would you like to verify whether this issue will be addressed by your recent TOSA changes?
#map0 = affine_map<(d0, d1) -> (d1)>
#map1 = affine_map<(d0, d1) -> (d0, d1)>
#map2 = affine_map<(d0, d1) -> ()>
module {
func @main(%arg0: tensor<?x1280xi8>, %arg1: tensor<1280x1001xi8>, %arg2: tensor<1001xi32>, %arg3: tensor<1001xi32>) -> tensor<?x1001xf32> {
%cst = arith.constant 0.0775590464 : f32
%cst_0 = arith.constant -5.300000e+01 : f32
%c127_i32 = arith.constant 127 : i32
%c-128_i32 = arith.constant -128 : i32
%c-53_i32 = arith.constant -53 : i32
%c40_i64 = arith.constant 40 : i64
%c1092190289_i64 = arith.constant 1092190289 : i64
%c548682072064_i64 = arith.constant 548682072064 : i64
%c550829555712_i64 = arith.constant 550829555712 : i64
%c0_i32 = arith.constant 0 : i32
%c0 = arith.constant 0 : index
%0 = tensor.dim %arg0, %c0 : tensor<?x1280xi8>
%1 = linalg.init_tensor [%0, 1001] : tensor<?x1001xf32>
%2 = linalg.init_tensor [%0, 1001] : tensor<?x1001xi32>
%3 = linalg.fill ins(%c0_i32 : i32) outs(%2 : tensor<?x1001xi32>) -> tensor<?x1001xi32>
%4 = linalg.matmul ins(%arg0, %arg1 : tensor<?x1280xi8>, tensor<1280x1001xi8>) outs(%3 : tensor<?x1001xi32>) -> tensor<?x1001xi32>
%5 = linalg.generic {indexing_maps = [#map0, #map1, #map0, #map2, #map1], iterator_types = ["parallel", "parallel"]} ins(%arg2, %4, %arg3, %c-128_i32 : tensor<1001xi32>, tensor<?x1001xi32>, tensor<1001xi32>, i32) outs(%1 : tensor<?x1001xf32>) {
^bb0(%arg4: i32, %arg5: i32, %arg6: i32, %arg7: i32, %arg8: f32):
%6 = arith.muli %arg6, %arg7 : i32
%7 = arith.subi %arg5, %6 : i32
%8 = arith.addi %arg4, %7 : i32
%9 = arith.cmpi sge, %8, %c0_i32 : i32
%10 = arith.select %9, %c550829555712_i64, %c548682072064_i64 : i64
%11 = arith.extsi %8 : i32 to i64
%12 = arith.muli %11, %c1092190289_i64 : i64
%13 = arith.addi %12, %10 : i64
%14 = arith.shrsi %13, %c40_i64 : i64
%15 = arith.trunci %14 : i64 to i32
%16 = arith.addi %15, %c-53_i32 : i32
%17 = arith.cmpi slt, %16, %c-128_i32 : i32
%18 = arith.select %17, %c-128_i32, %16 : i32
%19 = arith.cmpi slt, %c127_i32, %16 : i32
%20 = arith.select %19, %c127_i32, %18 : i32
%21 = arith.trunci %20 : i32 to i8
%22 = arith.sitofp %21 : i8 to f32
%23 = arith.subf %22, %cst_0 : f32
%24 = arith.mulf %23, %cst : f32
linalg.yield %24 : f32
} -> tensor<?x1001xf32>
return %5 : tensor<?x1001xf32>
}
} |
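For reference, the %9–%15 sequence above is a 64-bit fixed-point rescale. Here is a minimal Python sketch of what it computes, reconstructed purely from the constants in the IR (an illustration, not IREE code):

# Rescale implemented by %9..%15: multiplier 1092190289, shift 40,
# and a sign-dependent rounding bias of (1 << 39) +/- (1 << 30),
# i.e. the 550829555712 / 548682072064 operands of the arith.select.
def rescale(x: int) -> int:
    bias = (1 << 39) + (1 << 30) if x >= 0 else (1 << 39) - (1 << 30)
    # The product alone can exceed 32 bits for modest accumulator
    # values, which is why the IR widens %8 to i64 first.
    return (x * 1092190289 + bias) >> 40  # %15 then truncates to i32; %17..%20 clamp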
I read the wrong IRs; I'll take the overflow claim back and read it again... |
Reading #8731 (comment), I think this issue can be addressed in the next integration? We may want to add a verification pass to detect whether unsupported types are involved, though. My guess is that the i64 constant is demoted to i32 with a weird value in this case. |
Yep - if you want to run on Vulkan (or small systems) you'll need to ensure there are no i64 values that actually need 64 bits. |
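A hypothetical illustration of the "demoted to i32 with a weird value" guess, assuming a plain two's-complement truncation of the i64 rounding constant from the earlier IR:

# Only the low 32 bits of the rounding bias would survive a demotion:
print(550829555712 % (1 << 32))  # 1073741824, i.e. the bias collapses to 1 << 30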
Still have correctness issues after bumping the MLIR. The final dispatch changes: there is an apply_scale op in the dispatch now. Here is the new repro: repro.zip
#map0 = affine_map<(d0, d1) -> (d1)>
#map1 = affine_map<(d0, d1) -> (d0, d1)>
#map2 = affine_map<(d0, d1) -> ()>
module {
func @main(%arg0: tensor<?x1280xi8>, %arg1: tensor<1280x1001xi8>, %arg2: tensor<1001xi32>, %arg3: tensor<1001xi32>) -> tensor<?x1001xf32> {
%c0_i32 = arith.constant 0 : i32
%c1092190289_i32 = arith.constant 1092190289 : i32
%c40_i8 = arith.constant 40 : i8
%c-53_i32 = arith.constant -53 : i32
%c-128_i32 = arith.constant -128 : i32
%c127_i32 = arith.constant 127 : i32
%cst = arith.constant -5.300000e+01 : f32
%cst_0 = arith.constant 0.0775590464 : f32
%c0 = arith.constant 0 : index
%0 = tensor.dim %arg0, %c0 : tensor<?x1280xi8>
%1 = linalg.init_tensor [%0, 1001] : tensor<?x1001xf32>
%2 = linalg.init_tensor [%0, 1001] : tensor<?x1001xi32>
%3 = linalg.fill ins(%c0_i32 : i32) outs(%2 : tensor<?x1001xi32>) -> tensor<?x1001xi32>
%4 = linalg.matmul ins(%arg0, %arg1 : tensor<?x1280xi8>, tensor<1280x1001xi8>) outs(%3 : tensor<?x1001xi32>) -> tensor<?x1001xi32>
%5 = linalg.generic {indexing_maps = [#map0, #map1, #map0, #map2, #map1], iterator_types = ["parallel", "parallel"]} ins(%arg2, %4, %arg3, %c-128_i32 : tensor<1001xi32>, tensor<?x1001xi32>, tensor<1001xi32>, i32) outs(%1 : tensor<?x1001xf32>) {
^bb0(%arg4: i32, %arg5: i32, %arg6: i32, %arg7: i32, %arg8: f32):
%6 = arith.muli %arg6, %arg7 : i32
%7 = arith.subi %arg5, %6 : i32
%8 = arith.addi %arg4, %7 : i32
%9 = "tosa.apply_scale"(%8, %c1092190289_i32, %c40_i8) {double_round = true} : (i32, i32, i8) -> i32
%10 = arith.addi %9, %c-53_i32 : i32
%11 = arith.cmpi slt, %10, %c-128_i32 : i32
%12 = arith.select %11, %c-128_i32, %10 : i32
%13 = arith.cmpi slt, %c127_i32, %10 : i32
%14 = arith.select %13, %c127_i32, %12 : i32
%15 = arith.trunci %14 : i32 to i8
%16 = arith.sitofp %15 : i8 to f32
%17 = arith.subf %16, %cst : f32
%18 = arith.mulf %17, %cst_0 : f32
linalg.yield %18 : f32
} -> tensor<?x1001xf32>
return %5 : tensor<?x1001xf32>
}
}
It looks like there are correctness issues in the final dispatch. The results are different if there are apply_scale ops. @rsuderman could you help take a look at what's happening? |
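For context, here is a hedged reference sketch of tosa.apply_scale with double_round = true, based on my reading of the TOSA semantics rather than IREE's actual lowering. With multiplier = 1092190289 and shift = 40 it reproduces the i64 constants seen in the expanded IR from the first repro:

# apply_scale does a 64-bit multiply; for shift > 31, "double round"
# nudges the bias by 1 << 30 toward the sign of the input.
def apply_scale(value: int, multiplier: int, shift: int) -> int:
    bias = 1 << (shift - 1)
    if shift > 31:
        bias += (1 << 30) if value >= 0 else -(1 << 30)
    return (value * multiplier + bias) >> shift  # the result is then clamped to i32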
I took a quick look and found that there is a stack buffer. The stack buffer is needed because it is a quantized matmul and it's not vectorized. I have a local "always-vectorize" patch which fixes the issue. This makes me suspect the issue happens in bufferization, because there won't be a stack buffer with "always-vectorize". However, the IR looks good to me. I don't know the root cause yet. |
I know the issue now. It looks like an upstream bug. The CSE pass (after tileAndFuse) hoists the linalg.fill out of the loop. The left-hand side is correct, and the right-hand side is incorrect: the initialization no longer happens right before the matmul. @gysit @matthias-springer any ideas on how to fix it? (I'll send out the "always-vectorize" patch, which would fix the correctness issue, but the upstream bug still needs to be fixed.) |
That is an interesting case! Let me share my initial understanding. It looks like there is more running than just CSE; I would guess the change is a combination of canonicalization and hoisting. Additionally, it seems that the change is actually correct, but it triggers a problem in bufferization afterwards. What canonicalization does is fold the extract_slice of the init_tensor into a smaller init_tensor. After that folding, the code snippet no longer depends on the inner loop, so the fill hoists out of the inner loop. That seems correct? It seems, though, that bufferization afterwards bufferizes the matmul output / fill result in place (@matthias-springer and I discussed this, and @matthias-springer already has a smaller repro). Fixing this requires extending the bufferization analysis and likely requires a larger change. |
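To make the failure mode concrete, here is a minimal NumPy sketch of the pattern described above: a reused in-place accumulator whose zero-fill is hoisted out of the tile loop. All names are made up for illustration; this is not the actual IREE codegen.

import numpy as np

def tiled_matmul(a, b, tile=2, hoist_fill=False):
    m, n = a.shape[0], b.shape[1]
    out = np.zeros((m, n), dtype=a.dtype)
    acc = np.empty((tile, n), dtype=a.dtype)  # one scratch tile, reused in place
    if hoist_fill:
        acc[:] = 0  # hoisted fill: runs once, not once per tile
    for i in range(0, m, tile):
        if not hoist_fill:
            acc[:] = 0  # correct placement: re-initialize right before the matmul
        acc += a[i:i+tile] @ b
        out[i:i+tile] = acc
    return out

a = np.arange(8, dtype=np.int64).reshape(4, 2)
b = np.ones((2, 3), dtype=np.int64)
print(np.array_equal(tiled_matmul(a, b), a @ b))                   # True
print(np.array_equal(tiled_matmul(a, b, hoist_fill=True), a @ b))  # False: stale values leak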
I'm working on a fix for the bufferization. I have some idea of how to fix it. Will keep you posted. |
I have a potential fix: https://reviews.llvm.org/D123791 |
Verified that it's fixed in #8888. |