NHWC miopenOpTensor Add incorrect results #2860

Open
ShanoToni opened this issue Mar 29, 2024 · 3 comments

While trying to do an NHWC conv + bias I encountered incorrect results. Some investigation showed that the miopenOpTensor Add operation produces incorrect results for the NHWC format (it works correctly for NCHW).
The MIOpen library used is part of the rocm-5.6.1 package.
I could not find a specification of the supported formats for miopenOpTensor.
Is this a known/expected limitation?
Additionally, while looking for similar issues, #2001 seemed related: it points out that NHWC does not work correctly and that reordering the tensor data would be the workaround.
However, #2828 shows that such reorders do not work either.
What would the solution be? Is there a workaround for this?
I have attached a reproducer showing this behavior:

#define __HIP_PLATFORM_AMD__
#include <hip/hip_runtime.h> // needed for hipMalloc / hipMemcpy / hipDeviceSynchronize
#include <miopen/miopen.h>

#include <iostream>
#include <vector>

int main(int argc, char *argv[]) {
  miopenHandle_t handle;

  miopenCreate(&handle);

  auto dt = miopenDataType_t::miopenFloat;
  auto dt_size = sizeof(float);

  miopenTensorDescriptor_t desc_a = {};
  miopenTensorDescriptor_t desc_b = {};
  miopenTensorDescriptor_t desc_bias = {};

  miopenCreateTensorDescriptor(&desc_a);
  miopenCreateTensorDescriptor(&desc_b);
  miopenCreateTensorDescriptor(&desc_bias);

  int a_dims[4] = {1, 2, 1, 5};      // logical NCHW dims
  int a_strides[4] = {10, 1, 10, 2}; // C has stride 1, W has stride 2 (= C) -> NHWC memory layout
  int b_strides[4] = {10, 5, 5, 1};  // packed NCHW memory layout

  int bias_dims[4] = {1, 2, 1, 1};    // per-channel bias, NCHW dims
  int bias_strides[4] = {2, 1, 2, 2}; // C has stride 1, matching the NHWC layout of a

  int ndims = 4;

  miopenSetTensorDescriptor(desc_a, dt, ndims, a_dims, a_strides);
  miopenSetTensorDescriptor(desc_b, dt, ndims, a_dims, b_strides);
  miopenSetTensorDescriptor(desc_bias, dt, ndims, bias_dims, bias_strides);

  void *a_dev;
  void *b_dev;
  void *bias_dev;

  auto a_size = a_dims[0] * a_dims[1] * a_dims[2] * a_dims[3] * dt_size;
  auto bias_size =
      bias_dims[0] * bias_dims[1] * bias_dims[2] * bias_dims[3] * dt_size;

  hipMalloc(&a_dev, a_size);
  hipMalloc(&b_dev, a_size);
  hipMalloc(&bias_dev, bias_size);

  std::vector<float> host_a(a_size / sizeof(float), 1.0f);
  std::vector<float> host_bias(bias_size / sizeof(float), 1.f);

  hipMemcpy(a_dev, host_a.data(), a_size, hipMemcpyHostToDevice);
  hipMemcpy(bias_dev, host_bias.data(), bias_size, hipMemcpyHostToDevice);

  float bias_alpha = 0.0f; // alpha1: scales A (a_dev); zero, so the A term is dropped
  float alpha2 = 1.0f;     // alpha2: scales B (bias_dev)
  float bias_beta = 1.0f;  // beta: accumulates into C, which aliases a_dev
  // Run op ADD: C = op(alpha1 * A, alpha2 * B) + beta * C = 0 + bias + a, so every element should be 2
  auto status =
      miopenOpTensor(handle, miopenTensorOpAdd, &bias_alpha, desc_a, a_dev,
                     &alpha2, desc_bias, bias_dev, &bias_beta, desc_a, a_dev);

  if (status != miopenStatusSuccess) {
    std::cout << "\n Failure to launch OpTensor! \n";
  }

  hipDeviceSynchronize();

  // Print out results of OpAdd
  std::vector<float> host_a_out(host_a.size());
  std::cout << "Expected output is 2 for each IDX \n\n";
  hipMemcpy(host_a_out.data(), a_dev, a_size, hipMemcpyDeviceToHost);
  std::cout << "NHWC OpAdd result before transforming to NCHW \n";
  for (size_t i = 0; i < host_a_out.size(); ++i) {
    std::cout << "IDX:" << i << " " << host_a_out[i] << ", \n";
  }
  std::cout << "\n";

  // Run TransformTensor
  auto alpha = 1.0f;
  auto beta = 0.0f;
  miopenTransformTensor(handle, &alpha, desc_a, a_dev, &beta, desc_b, b_dev);

  // Print out results of TransformTensor(output of OpAdd but in NCHW)
  std::vector<float> host_b_out(host_a.size());
  hipMemcpy(host_b_out.data(), b_dev, a_size, hipMemcpyDeviceToHost);
  std::cout << "Transformed to NCHW \n";
  for (size_t i = 0; i < host_b_out.size(); ++i) {
    std::cout << "IDX:" << i << " " << host_b_out[i] << ", \n";
  }
  std::cout << "\n";
}
@ShanoToni (Author)

Additionally, here is the log from running the reproducer:

MIOpen(HIP): Info [get_device_name] Raw device name: gfx90a:sramecc+:xnack-
MIOpen(HIP): Info [Handle] stream: 0, device_id: 0
MIOpen(HIP): Info2 [GetKernels] 0 kernels for key: OpTensorFwdBias "1-1-0-4096-512-256"
MIOpen(HIP): Info2 [AddKernel] Key: OpTensorFwdBias "1-1-0-4096-512-256"
MIOpen(HIP): Info2 [GetLibPath] Lib Path: /opt/rocm/5.6.1/lib/libMIOpen.so.1.0.50601
MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file 
MIOpen(HIP): Info [KernDb] database not present
MIOpen(HIP): Info2 [SQLiteBase] Initializing user database file /home/antonmitkov/.cache/miopen/2.20.0.a3a8c84f9-dirty/gfx90a68.ukdb
MIOpen(HIP): Info2 [KernDb] Database created successfully
MIOpen(HIP): Info2 [LoadBinary] Loading binary for: MIOpenTensorKernels.cl.o; args:  -DMIOPEN_TYPE=float -DMAX_NUM_WG=4096 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP16x4=0 -DMIOPEN_USE_FP16x8=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DMIOPEN_TENSOR_OP=miopenAdd -DUSE_FWD_BIAS -mcpu=gfx90a
MIOpen(HIP): Info2 [Prepare] SELECT kernel_blob, kernel_hash, uncompressed_size FROM kern_db WHERE (kernel_name = 'MIOpenTensorKernels.cl.o') AND (kernel_args = ' -DMIOPEN_TYPE=float -DMAX_NUM_WG=4096 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP16x4=0 -DMIOPEN_USE_FP16x8=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DMIOPEN_TENSOR_OP=miopenAdd -DUSE_FWD_BIAS -mcpu=gfx90a');
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 4.68422 ms
MIOpen(HIP): Info2 [LoadBinary] Successfully loaded binary for: MIOpenTensorKernels.cl.o; args:  -DMIOPEN_TYPE=float -DMAX_NUM_WG=4096 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP16x4=0 -DMIOPEN_USE_FP16x8=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DMIOPEN_TENSOR_OP=miopenAdd -DUSE_FWD_BIAS -mcpu=gfx90a
MIOpen(HIP): Info2 [run] kernel_name = OpTensorFwdBias, global_work_dim = { 512, 1, 1 }, local_work_dim = { 256, 1, 1 }
Expected output is 2 for each IDX 

NHWC OpAdd result before transforming to NCHW 
IDX:0 2, 
IDX:1 2, 
IDX:2 2, 
IDX:3 2, 
IDX:4 2, 
IDX:5 2, 
IDX:6 1, 
IDX:7 1, 
IDX:8 1, 
IDX:9 1, 

MIOpen(HIP): Info2 [GetKernels] 0 kernels for key: SubTensorOpWithTransform1d "transform 1x10"
MIOpen(HIP): Info2 [AddKernel] Key: SubTensorOpWithTransform1d "transform 1x10"
MIOpen(HIP): Info2 [LoadBinary] Loading binary for: MIOpenSubTensorOpWithTransformKernel.cl.o; args: -DSUBTENSOR_OP_WITH_SCALAR=SUBTENSOR_OP_WITH_SCALAR_MAD -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP16x4=0 -DMIOPEN_USE_FP16x8=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DWORK_LENGTH_0=16 -mcpu=gfx90a
MIOpen(HIP): Info2 [Prepare] SELECT kernel_blob, kernel_hash, uncompressed_size FROM kern_db WHERE (kernel_name = 'MIOpenSubTensorOpWithTransformKernel.cl.o') AND (kernel_args = '-DSUBTENSOR_OP_WITH_SCALAR=SUBTENSOR_OP_WITH_SCALAR_MAD -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP16x4=0 -DMIOPEN_USE_FP16x8=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DWORK_LENGTH_0=16 -mcpu=gfx90a');
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 4.53736 ms
MIOpen(HIP): Info2 [LoadBinary] Successfully loaded binary for: MIOpenSubTensorOpWithTransformKernel.cl.o; args: -DSUBTENSOR_OP_WITH_SCALAR=SUBTENSOR_OP_WITH_SCALAR_MAD -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP16x4=0 -DMIOPEN_USE_FP16x8=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DWORK_LENGTH_0=16 -mcpu=gfx90a
MIOpen(HIP): Info2 [run] kernel_name = SubTensorOpWithTransform1d, global_work_dim = { 16, 1, 1 }, local_work_dim = { 16, 1, 1 }
Transformed to NCHW 
IDX:0 2, 
IDX:1 2, 
IDX:2 2, 
IDX:3 2, 
IDX:4 2, 
IDX:5 2, 
IDX:6 1, 
IDX:7 1, 
IDX:8 1, 
IDX:9 1, 

@CAHEK7 (Contributor) commented Mar 29, 2024

As far as I know, MIOpen tensor operations do not support a non-unit stride for the last dimension.
Moreover, OpTensorFwdBias (seen in the logs) assumes that for the NCHW layout the W stride is always 1 and the H stride always equals W; only the N and C strides can be changed.
There is a generic version of the kernel, OpTensorFwdBiasGeneric, but it only allows changing the H stride, while the W stride must still be 1.

This means that changing the layout via strides alone is not possible; only an explicit transpose operation can help, where you physically move C to the last dimension.
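
One way to read the reorder workaround mentioned in the issue description is to physically transform the NHWC data into a packed NCHW buffer, apply the tensor op there, and transform back. Below is a minimal sketch of that idea, reusing the handle, descriptors, and buffers from the reproducer above; desc_nchw, desc_bias_nchw, nchw_strides, bias_nchw_strides, and a_nchw_dev are names introduced here for illustration, and the sketch assumes miopenTransformTensor performs the NHWC <-> NCHW reorder correctly, which #2828 calls into question.

  // Describe the same logical {1, 2, 1, 5} tensor with packed NCHW strides,
  // the layout the OpTensorFwdBias kernel expects (W stride == 1).
  int nchw_strides[4] = {10, 5, 5, 1};
  // The bias needs no physical reorder: with H == W == 1, packed strides
  // {2, 1, 1, 1} address the same two floats as the original descriptor.
  int bias_nchw_strides[4] = {2, 1, 1, 1};

  miopenTensorDescriptor_t desc_nchw, desc_bias_nchw;
  miopenCreateTensorDescriptor(&desc_nchw);
  miopenCreateTensorDescriptor(&desc_bias_nchw);
  miopenSetTensorDescriptor(desc_nchw, dt, ndims, a_dims, nchw_strides);
  miopenSetTensorDescriptor(desc_bias_nchw, dt, ndims, bias_dims, bias_nchw_strides);

  void *a_nchw_dev;
  hipMalloc(&a_nchw_dev, a_size);

  float one = 1.0f, zero = 0.0f;

  // 1) Physically reorder the NHWC data into the packed NCHW buffer.
  miopenTransformTensor(handle, &one, desc_a, a_dev, &zero, desc_nchw, a_nchw_dev);

  // 2) Add the bias in the packed NCHW layout: C = 1*A + 1*B + 0*C, with C aliasing A.
  miopenOpTensor(handle, miopenTensorOpAdd, &one, desc_nchw, a_nchw_dev,
                 &one, desc_bias_nchw, bias_dev, &zero, desc_nchw, a_nchw_dev);

  // 3) Reorder the result back to the original NHWC layout.
  miopenTransformTensor(handle, &one, desc_nchw, a_nchw_dev, &zero, desc_a, a_dev);

Whether this actually produces correct results on rocm-5.6.1 still depends on the TransformTensor behavior reported in #2828; the extra round trip also costs two additional kernel launches and a temporary buffer.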

I'm wondering why the library does not report the unsupported layout.
(Tensor operations have some other flaws as well; for example, some of them could be sped up by a factor of up to 100.)

@atamazov (Contributor) commented Apr 11, 2024

@CAHEK7

I'm wondering why the library does not report the unsupported layout.

This is a bug, I suspect. Can you please set bug correctness and urgency_high on this ticket?

/cc @junliume @JehandadKhan
