NHWC miopenOpTensor Add incorrect results #2860

Open
ShanoToni opened this issue Mar 29, 2024 · 3 comments

While trying to do an NHWC conv + bias I encountered incorrect results. Some investigation showed that the miopenOpTensor Add operation produces incorrect results for the NHWC format (it works correctly for NCHW).
The MIOpen library used is part of the rocm-5.6.1 package.
I could not find a specification of the supported formats for miopenOpTensor.
Is this a known/expected limitation?
Additionally, while looking for similar issues, #2001 seemed related: it points out that NHWC does not work correctly and that reordering the tensor data would be the workaround.
However, #2828 shows that such reorders do not work either.
What would the solution be? Is there a workaround for this?
I have attached a reproducer showing this behavior:

#define __HIP_PLATFORM_AMD__
#include <hip/hip_runtime.h> // needed for hipMalloc / hipMemcpy / hipDeviceSynchronize
#include <miopen/miopen.h>

#include <iostream>
#include <vector>

int main(int argc, char *argv[]) {
  miopenHandle_t handle;

  miopenCreate(&handle);

  auto dt = miopenDataType_t::miopenFloat;
  auto dt_size = sizeof(float);

  miopenTensorDescriptor_t desc_a = {};
  miopenTensorDescriptor_t desc_b = {};
  miopenTensorDescriptor_t desc_bias = {};

  miopenCreateTensorDescriptor(&desc_a);
  miopenCreateTensorDescriptor(&desc_b);
  miopenCreateTensorDescriptor(&desc_bias);

  int a_dims[4] = {1, 2, 1, 5};      // logical NCHW dims
  int a_strides[4] = {10, 1, 10, 2}; // C has stride 1, W has stride 2 (= C) -> NHWC memory layout
  int b_strides[4] = {10, 5, 5, 1};  // packed NCHW memory layout

  int bias_dims[4] = {1, 2, 1, 1};    // per-channel bias, NCHW dims
  int bias_strides[4] = {2, 1, 2, 2}; // C has stride 1, matching the NHWC layout of a

  int ndims = 4;

  miopenSetTensorDescriptor(desc_a, dt, ndims, a_dims, a_strides);
  miopenSetTensorDescriptor(desc_b, dt, ndims, a_dims, b_strides);
  miopenSetTensorDescriptor(desc_bias, dt, ndims, bias_dims, bias_strides);

  void *a_dev;
  void *b_dev;
  void *bias_dev;

  auto a_size = a_dims[0] * a_dims[1] * a_dims[2] * a_dims[3] * dt_size;
  auto bias_size =
      bias_dims[0] * bias_dims[1] * bias_dims[2] * bias_dims[3] * dt_size;

  hipMalloc(&a_dev, a_size);
  hipMalloc(&b_dev, a_size);
  hipMalloc(&bias_dev, bias_size);

  std::vector<float> host_a(a_size / sizeof(float), 1.0f);
  std::vector<float> host_bias(bias_size / sizeof(float), 1.f);

  hipMemcpy(a_dev, host_a.data(), a_size, hipMemcpyHostToDevice);
  hipMemcpy(bias_dev, host_bias.data(), bias_size, hipMemcpyHostToDevice);

  float bias_alpha = 0.0f; // alpha1: scales A (a_dev); zero, so the A term is dropped
  float alpha2 = 1.0f;     // alpha2: scales B (bias_dev)
  float bias_beta = 1.0f;  // beta: accumulates into C, which aliases a_dev
  // Run op ADD: C = op(alpha1 * A, alpha2 * B) + beta * C = 0 + bias + a, so every element should be 2
  auto status =
      miopenOpTensor(handle, miopenTensorOpAdd, &bias_alpha, desc_a, a_dev,
                     &alpha2, desc_bias, bias_dev, &bias_beta, desc_a, a_dev);

  if (status != miopenStatusSuccess) {
    std::cout << "\n Failure to launch OpTensor! \n";
  }

  hipDeviceSynchronize();

  // Print out results of OpAdd
  std::vector<float> host_a_out(host_a.size());
  std::cout << "Expected output is 2 for each IDX \n\n";
  hipMemcpy(host_a_out.data(), a_dev, a_size, hipMemcpyDeviceToHost);
  std::cout << "NHWC OpAdd result before transforming to NCHW \n";
  for (size_t i = 0; i < host_a_out.size(); ++i) {
    std::cout << "IDX:" << i << " " << host_a_out[i] << ", \n";
  }
  std::cout << "\n";

  // Run TransformTensor
  auto alpha = 1.0f;
  auto beta = 0.0f;
  miopenTransformTensor(handle, &alpha, desc_a, a_dev, &beta, desc_b, b_dev);

  // Print out results of TransformTensor(output of OpAdd but in NCHW)
  std::vector<float> host_b_out(host_a.size());
  hipMemcpy(host_b_out.data(), b_dev, a_size, hipMemcpyDeviceToHost);
  std::cout << "Transformed to NCHW \n";
  for (size_t i = 0; i < host_b_out.size(); ++i) {
    std::cout << "IDX:" << i << " " << host_b_out[i] << ", \n";
  }
  std::cout << "\n";
}
@ShanoToni (Author)

Additionally, here is the log from running the reproducer:

MIOpen(HIP): Info [get_device_name] Raw device name: gfx90a:sramecc+:xnack-
MIOpen(HIP): Info [Handle] stream: 0, device_id: 0
MIOpen(HIP): Info2 [GetKernels] 0 kernels for key: OpTensorFwdBias "1-1-0-4096-512-256"
MIOpen(HIP): Info2 [AddKernel] Key: OpTensorFwdBias "1-1-0-4096-512-256"
MIOpen(HIP): Info2 [GetLibPath] Lib Path: /opt/rocm/5.6.1/lib/libMIOpen.so.1.0.50601
MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file 
MIOpen(HIP): Info [KernDb] database not present
MIOpen(HIP): Info2 [SQLiteBase] Initializing user database file /home/antonmitkov/.cache/miopen/2.20.0.a3a8c84f9-dirty/gfx90a68.ukdb
MIOpen(HIP): Info2 [KernDb] Database created successfully
MIOpen(HIP): Info2 [LoadBinary] Loading binary for: MIOpenTensorKernels.cl.o; args:  -DMIOPEN_TYPE=float -DMAX_NUM_WG=4096 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP16x4=0 -DMIOPEN_USE_FP16x8=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DMIOPEN_TENSOR_OP=miopenAdd -DUSE_FWD_BIAS -mcpu=gfx90a
MIOpen(HIP): Info2 [Prepare] SELECT kernel_blob, kernel_hash, uncompressed_size FROM kern_db WHERE (kernel_name = 'MIOpenTensorKernels.cl.o') AND (kernel_args = ' -DMIOPEN_TYPE=float -DMAX_NUM_WG=4096 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP16x4=0 -DMIOPEN_USE_FP16x8=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DMIOPEN_TENSOR_OP=miopenAdd -DUSE_FWD_BIAS -mcpu=gfx90a');
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 4.68422 ms
MIOpen(HIP): Info2 [LoadBinary] Successfully loaded binary for: MIOpenTensorKernels.cl.o; args:  -DMIOPEN_TYPE=float -DMAX_NUM_WG=4096 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP16x4=0 -DMIOPEN_USE_FP16x8=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DMIOPEN_TENSOR_OP=miopenAdd -DUSE_FWD_BIAS -mcpu=gfx90a
MIOpen(HIP): Info2 [run] kernel_name = OpTensorFwdBias, global_work_dim = { 512, 1, 1 }, local_work_dim = { 256, 1, 1 }
Expected output is 2 for each IDX 

NHWC OpAdd result before transforming to NCHW 
IDX:0 2, 
IDX:1 2, 
IDX:2 2, 
IDX:3 2, 
IDX:4 2, 
IDX:5 2, 
IDX:6 1, 
IDX:7 1, 
IDX:8 1, 
IDX:9 1, 

MIOpen(HIP): Info2 [GetKernels] 0 kernels for key: SubTensorOpWithTransform1d "transform 1x10"
MIOpen(HIP): Info2 [AddKernel] Key: SubTensorOpWithTransform1d "transform 1x10"
MIOpen(HIP): Info2 [LoadBinary] Loading binary for: MIOpenSubTensorOpWithTransformKernel.cl.o; args: -DSUBTENSOR_OP_WITH_SCALAR=SUBTENSOR_OP_WITH_SCALAR_MAD -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP16x4=0 -DMIOPEN_USE_FP16x8=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DWORK_LENGTH_0=16 -mcpu=gfx90a
MIOpen(HIP): Info2 [Prepare] SELECT kernel_blob, kernel_hash, uncompressed_size FROM kern_db WHERE (kernel_name = 'MIOpenSubTensorOpWithTransformKernel.cl.o') AND (kernel_args = '-DSUBTENSOR_OP_WITH_SCALAR=SUBTENSOR_OP_WITH_SCALAR_MAD -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP16x4=0 -DMIOPEN_USE_FP16x8=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DWORK_LENGTH_0=16 -mcpu=gfx90a');
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 4.53736 ms
MIOpen(HIP): Info2 [LoadBinary] Successfully loaded binary for: MIOpenSubTensorOpWithTransformKernel.cl.o; args: -DSUBTENSOR_OP_WITH_SCALAR=SUBTENSOR_OP_WITH_SCALAR_MAD -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP16x4=0 -DMIOPEN_USE_FP16x8=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DWORK_LENGTH_0=16 -mcpu=gfx90a
MIOpen(HIP): Info2 [run] kernel_name = SubTensorOpWithTransform1d, global_work_dim = { 16, 1, 1 }, local_work_dim = { 16, 1, 1 }
Transformed to NCHW 
IDX:0 2, 
IDX:1 2, 
IDX:2 2, 
IDX:3 2, 
IDX:4 2, 
IDX:5 2, 
IDX:6 1, 
IDX:7 1, 
IDX:8 1, 
IDX:9 1, 

@CAHEK7 (Contributor) commented Mar 29, 2024

As far as I know, MIOpen tensor operations do not support a non-unit stride for the last dimension.
Moreover, OpTensorFwdBias (seen in the logs) assumes that for the NCHW layout the W stride is always 1 and the H stride always equals W; only the N and C strides can be changed.
There is a generic version of the kernel, OpTensorFwdBiasGeneric, but it only allows changing the H stride, while the W stride must still be 1.

This means that changing the layout via strides alone is not possible; only an explicit transpose operation can help, where you physically move C to the last dimension.
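
One way to read the reorder workaround mentioned in the issue description is to physically transform the NHWC data into a packed NCHW buffer, apply the tensor op there, and transform back. Below is a minimal sketch of that idea, reusing the handle, descriptors, and buffers from the reproducer above; desc_nchw, desc_bias_nchw, nchw_strides, bias_nchw_strides, and a_nchw_dev are names introduced here for illustration, and the sketch assumes miopenTransformTensor performs the NHWC <-> NCHW reorder correctly, which #2828 calls into question.

  // Describe the same logical {1, 2, 1, 5} tensor with packed NCHW strides,
  // the layout the OpTensorFwdBias kernel expects (W stride == 1).
  int nchw_strides[4] = {10, 5, 5, 1};
  // The bias needs no physical reorder: with H == W == 1, packed strides
  // {2, 1, 1, 1} address the same two floats as the original descriptor.
  int bias_nchw_strides[4] = {2, 1, 1, 1};

  miopenTensorDescriptor_t desc_nchw, desc_bias_nchw;
  miopenCreateTensorDescriptor(&desc_nchw);
  miopenCreateTensorDescriptor(&desc_bias_nchw);
  miopenSetTensorDescriptor(desc_nchw, dt, ndims, a_dims, nchw_strides);
  miopenSetTensorDescriptor(desc_bias_nchw, dt, ndims, bias_dims, bias_nchw_strides);

  void *a_nchw_dev;
  hipMalloc(&a_nchw_dev, a_size);

  float one = 1.0f, zero = 0.0f;

  // 1) Physically reorder the NHWC data into the packed NCHW buffer.
  miopenTransformTensor(handle, &one, desc_a, a_dev, &zero, desc_nchw, a_nchw_dev);

  // 2) Add the bias in the packed NCHW layout: C = 1*A + 1*B + 0*C, with C aliasing A.
  miopenOpTensor(handle, miopenTensorOpAdd, &one, desc_nchw, a_nchw_dev,
                 &one, desc_bias_nchw, bias_dev, &zero, desc_nchw, a_nchw_dev);

  // 3) Reorder the result back to the original NHWC layout.
  miopenTransformTensor(handle, &one, desc_nchw, a_nchw_dev, &zero, desc_a, a_dev);

Whether this actually produces correct results on rocm-5.6.1 still depends on the TransformTensor behavior reported in #2828; the extra round trip also costs two additional kernel launches and a temporary buffer.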

I'm wondering why the library does not report the unsupported layout.
(Tensor operations have some other flaws as well; for example, some of them could be sped up by a factor of up to 100.)

@atamazov (Contributor) commented Apr 11, 2024

@CAHEK7

I'm wondering why the library does not report the unsupported layout.

This is a bug, I suspect. Can you please set bug correctness and urgency_high on this ticket?

/cc @junliume @JehandadKhan
