Integrate LLVM at llvm/llvm-project@7752fec6 #18177

Merged
merged 3 commits into from
Aug 12, 2024

hanhanW
Contributor

@hanhanW hanhanW commented Aug 9, 2024

Revert commits:


github-actions bot commented Aug 9, 2024

Abbreviated Benchmark Summary

@ commit a7057b33695a46ea551c12173d37e9f4bf451106 (vs. base 5a48912c52f65ead960bec9bfde5a836d1b02ab2)

Data-Tiling Comparison Table

| Name | No-DT (baseline) | DT-Only | DT-UK |
| --- | --- | --- | --- |
| BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 723.910 (1.0X) | 272.932 (2.7X) | 222.729 (3.3X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 6.999 (1.0X) | 9.338 (0.7X) | 8.566 (0.8X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 35.859 (1.0X) | 36.458 (1.0X) | 34.246 (1.0X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 5.864 (1.0X) | 10.984 (0.5X) | 5.049 (1.2X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 9.168 (1.0X) | 8.514 (1.1X) | 8.531 (1.1X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 11.110 (1.0X) | 9.028 (1.2X) | 8.971 (1.2X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 12.037 (1.0X) | 15.503 (0.8X) | 14.040 (0.9X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 33.396 (1.0X) | 65.118 (0.5X) | 61.113 (0.5X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 33.701 (1.0X) | 65.543 (0.5X) | 61.493 (0.5X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 69.078 (1.0X) | 134.004 (0.5X) | 64.305 (1.1X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 4.949 (1.0X) | 5.316 (0.9X) | 4.640 (1.1X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 3.774 (1.0X) | 5.342 (0.7X) | 4.948 (0.8X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 5.907 (1.0X) | 9.579 (0.6X) | 5.469 (1.1X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 2.867 (1.0X) | 3.411 (0.8X) | 2.796 (1.0X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 8.490 (1.0X) | 11.003 (0.8X) | 9.955 (0.9X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.787 (1.0X) | 1.396 (0.6X) | 0.655 (1.2X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 4.207 (1.0X) | 5.912 (0.7X) | 5.333 (0.8X) |
| matmul_256x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 7.540 (1.0X) | 7.563 (1.0X) | 7.584 (1.0X) |
| matmul_256x256x2048_i8_i8_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 6.574 (1.0X) | 13.283 (0.5X) | 1.810 (3.6X) |
| BertForMaskedLMTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 216.668 (1.0X) | 138.930 (1.6X) | 108.981 (2.0X) |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 32.187 (1.0X) | 36.191 (0.9X) | 30.060 (1.1X) |
| EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 273.498 (1.0X) | 259.691 (1.1X) | 229.542 (1.2X) |
| EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 26.880 (1.0X) | 51.381 (0.5X) | 13.059 (2.1X) |
| GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 70.287 (1.0X) | 39.724 (1.8X) | 40.306 (1.7X) |
| GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 88.835 (1.0X) | 42.469 (2.1X) | 41.889 (2.1X) |
| MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 79.770 (1.0X) | 78.321 (1.0X) | 59.040 (1.4X) |
| MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 178.544 (1.0X) | 247.326 (0.7X) | 186.457 (1.0X) |
| MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 180.599 (1.0X) | 252.246 (0.7X) | 190.727 (0.9X) |
| MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 519.970 (1.0X) | 1087.049 (0.5X) | 243.736 (2.1X) |
| MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 24.627 (1.0X) | 22.565 (1.1X) | 17.840 (1.4X) |
| MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 12.006 (1.0X) | 14.754 (0.8X) | 11.562 (1.0X) |
| MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 21.647 (1.0X) | 42.272 (0.5X) | 11.930 (1.8X) |
| MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 2.769 (1.0X) | 3.299 (0.8X) | 2.725 (1.0X) |
| MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 34.201 (1.0X) | 39.500 (0.9X) | 31.543 (1.1X) |
| PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.715 (1.0X) | 1.297 (0.6X) | 0.580 (1.2X) |
| PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 17.762 (1.0X) | 23.615 (0.8X) | 19.548 (0.9X) |
| matmul_1x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.055 (1.0X) | 0.055 (1.0X) | 0.055 (1.0X) |
| matmul_1x256x2048_i8_i8_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] | 0.044 (1.0X) | 0.226 (0.2X) | 0.022 (2.0X) |
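
The factor in parentheses in each cell appears to be the baseline (No-DT) value divided by that column's value, so a factor above 1.0X means the variant is faster than the baseline. The comment does not state the unit, so the sketch below treats the numbers as unitless latencies; the function name `speedup` is made up for illustration and is not part of the benchmark tooling.

```python
def speedup(baseline_latency: float, variant_latency: float) -> float:
    """Return baseline / variant, i.e. how many times faster the variant is.

    Assumption: lower values are better (latency-like), which matches the
    factors shown in the table but is not stated in the comment itself.
    """
    return baseline_latency / variant_latency


if __name__ == "__main__":
    # BertLargeTF, 30-thread row: No-DT baseline vs. DT-Only and DT-UK.
    baseline = 723.910
    for column, latency in {"DT-Only": 272.932, "DT-UK": 222.729}.items():
        print(f"{column}: {speedup(baseline, latency):.1f}X")
```

Read this way, a row such as DeepLabV3_fp32 (8-thread) at 0.7X for DT-Only is slower with data tiling than without it.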

No improved or regressed benchmarks 🏖️

No improved or regressed compilation metrics 🏖️


@hanhanW changed the title from "Integrate LLVM at llvm/llvm-project@f4fb7358" to "Integrate LLVM at llvm/llvm-project@7752fec6" on Aug 9, 2024
@hanhanW
Contributor Author

hanhanW commented Aug 9, 2024

  FAILED: tests/e2e/tensor_ops/check_rocm_hip_pack.mlir_module.vmfb /home/esaimana/actions-runner/_work/iree/iree/build-tests/tests/e2e/tensor_ops/check_rocm_hip_pack.mlir_module.vmfb 
  cd /home/esaimana/actions-runner/_work/iree/iree/build-tests/tests/e2e/tensor_ops && /home/esaimana/actions-runner/_work/iree/iree/.venv/bin/iree-compile --output-format=vm-bytecode --mlir-print-op-on-diagnostic=false --iree-hal-target-backends=rocm --iree-rocm-target-chip=gfx1100 /home/esaimana/actions-runner/_work/iree/iree/tests/e2e/tensor_ops/pack.mlir -o check_rocm_hip_pack.mlir_module.vmfb --iree-hal-executable-object-search-path=\"/home/esaimana/actions-runner/_work/iree/iree/build-tests\"
  error: Illegal instruction detected: Operand has incorrect register class.

It looks like an LLVM failure. Looking into it.

@hanhanW
Contributor Author

hanhanW commented Aug 9, 2024

Confirmed locally that llvm/llvm-project#102198 is the cause. I'll try to create a repro and report it upstream.

@hanhanW
Contributor Author

hanhanW commented Aug 9, 2024

To reproduce it in IREE, save the MLIR below as ~/repro.mlir and run:

iree-compile --output-format=vm-bytecode --iree-hal-target-backends=rocm --iree-rocm-target-chip=gfx1100 --iree-rocm-bc-dir=/opt/rocm ~/repro.mlir -o /tmp/z.vmfb

func.func private @generate_2D_source(%height : index, %width : index) -> tensor<?x?xi32> {
  %init_source = tensor.empty(%height, %width) : tensor<?x?xi32>
  %source = linalg.generic {
      indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>],
      iterator_types = ["parallel", "parallel"]}
      outs(%init_source : tensor<?x?xi32>) {
    ^bb0(%b0 : i32):
      %outer = linalg.index 0 : index
      %inner = linalg.index 1 : index
      %strided = arith.muli %outer, %width : index
      %linearized = arith.addi %inner, %strided : index
      %linearized_i32 = arith.index_cast %linearized : index to i32
      linalg.yield %linearized_i32 : i32
  } -> tensor<?x?xi32>
  // This blocks the fusion for inputs and testing ops.
  %0 = util.optimization_barrier %source : tensor<?x?xi32>
  %1 = flow.tensor.tie_shape %0 : tensor<?x?xi32>{%height, %width}
  return %1 : tensor<?x?xi32>
}

func.func @static_pack_pad_transpose_outer_dims_large() {
  %height = arith.constant 100 : index
  %width = arith.constant 250 : index
  %0 = call @generate_2D_source(%height, %width) : (index, index) -> tensor<?x?xi32>
  %source = tensor.cast %0 : tensor<?x?xi32> to tensor<100x250xi32>
  %padding_value = arith.constant 42 : i32

  %init_pack = tensor.empty() : tensor<16x4x32x16xi32>
  %pack = tensor.pack %source padding_value(%padding_value : i32)
      outer_dims_perm = [1, 0] inner_dims_pos = [0, 1] inner_tiles = [32, 16] into %init_pack
      : tensor<100x250xi32> -> tensor<16x4x32x16xi32>

  %pad = tensor.pad %source low[0, 0] high[28, 6] {
    ^bb0(%b0 : index, %b1 : index):
      tensor.yield %padding_value : i32
  } : tensor<100x250xi32> to tensor<128x256xi32>
  %reshape = tensor.expand_shape %pad [[0, 1], [2, 3]] output_shape [4, 32, 16, 16] : tensor<128x256xi32> into tensor<4x32x16x16xi32>
  %init_transpose = tensor.empty() : tensor<16x4x32x16xi32>
  %transpose = linalg.transpose
    ins(%reshape : tensor<4x32x16x16xi32>)
    outs(%init_transpose : tensor<16x4x32x16xi32>)
    permutation = [2, 0, 1, 3]

  check.expect_eq(%pack, %transpose) : tensor<16x4x32x16xi32>
  return
}
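
The shapes in the repro can be checked by hand. The sketch below is only shape arithmetic for the pack and for the pad/expand/transpose reference path, not an implementation of either op, and the helper names are made up for illustration.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def high_padding(src, inner_dims_pos, inner_tiles):
    """Padding needed on the high side so each tiled dim is a tile multiple."""
    pad = [0] * len(src)
    for pos, tile in zip(inner_dims_pos, inner_tiles):
        pad[pos] = ceil_div(src[pos], tile) * tile - src[pos]
    return pad


def pack_shape(src, inner_dims_pos, inner_tiles, outer_dims_perm):
    """Result shape of a pack: permuted outer (tile-count) dims, then tiles."""
    outer = list(src)
    for pos, tile in zip(inner_dims_pos, inner_tiles):
        outer[pos] = ceil_div(src[pos], tile)
    outer = [outer[p] for p in outer_dims_perm]
    return outer + list(inner_tiles)


def transpose_shape(shape, permutation):
    """Result dim i of a transpose takes its size from input dim permutation[i]."""
    return [shape[p] for p in permutation]


if __name__ == "__main__":
    src = [100, 250]
    print(high_padding(src, [0, 1], [32, 16]))
    print(pack_shape(src, [0, 1], [32, 16], [1, 0]))
    print(transpose_shape([4, 32, 16, 16], [2, 0, 1, 3]))
```

Both paths arrive at 16x4x32x16, and the high padding of [28, 6] matches the `tensor.pad` in the test, which is why `check.expect_eq` can compare the two results directly.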

@hanhanW
Contributor Author

hanhanW commented Aug 9, 2024

Here is the LLVM module before it is translated to ISA: https://gist.github.com/hanhanW/b806883680dd028ed82943b84483c2b8

It fails in

translateModuleToISA(*moduleCopy.get(), *targetMachine);

which calls the translateModuleToISA function:

static std::string translateModuleToISA(llvm::Module &module,
                                        llvm::TargetMachine &targetMachine) {
  std::string targetISA;
  {
    llvm::raw_string_ostream stream(targetISA);
    llvm::buffer_ostream pstream(stream);
    llvm::legacy::PassManager codegenPasses;
    targetMachine.addPassesToEmitFile(codegenPasses, pstream, nullptr,
                                      llvm::CodeGenFileType::AssemblyFile);
    codegenPasses.run(module);
  }
  return targetISA;
}

@MaheshRavishankar @kuhar how do we reproduce the failure using LLVM tools? I think we can ask for a revert if we can provide such a repro to the author.
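
One possible approach, offered as a sketch and not as something confirmed in this thread: since translateModuleToISA only runs the codegen pipeline and emits assembly, feeding the dumped module to `llc` may hit the same error. The snippet below assumes the module from the gist has been saved locally as `module.ll` (a hypothetical filename) and that an `llc` built from the same LLVM commit is on `PATH`. The `-mtriple` flag may be redundant if the module already carries a target triple, and IREE may set codegen options that this invocation does not replicate.

```python
import shutil
import subprocess


def llc_command(ir_path: str, chip: str = "gfx1100", out_path: str = "repro.s"):
    """Build an llc invocation that emits AMDGPU assembly for the given IR."""
    return [
        "llc",
        "-mtriple=amdgcn-amd-amdhsa",  # assumed triple for the ROCm target
        f"-mcpu={chip}",               # chip taken from the failing command
        "-filetype=asm",               # assembly output, like AssemblyFile
        ir_path,
        "-o",
        out_path,
    ]


def run_repro(ir_path: str) -> int:
    """Run llc if it is available and return its exit code, else return -1."""
    if shutil.which("llc") is None:
        print("llc not found on PATH; command would be:")
        print(" ".join(llc_command(ir_path)))
        return -1
    result = subprocess.run(llc_command(ir_path), capture_output=True, text=True)
    print(result.stderr)
    return result.returncode


if __name__ == "__main__":
    run_repro("module.ll")  # hypothetical local copy of the gist contents
```

If this does reproduce the "Operand has incorrect register class" error, the IR file plus the single `llc` command line would be a self-contained report for the upstream author.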

@hanhanW hanhanW merged commit 76eb9c1 into main Aug 12, 2024
54 checks passed
@hanhanW hanhanW deleted the integrates/llvm-20240809 branch August 12, 2024 17:10
2 participants