Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reordering dimensions using miopenTransformTensor is broken #2828

Open
t4c1 opened this issue Mar 21, 2024 · 8 comments
Open

Reordering dimensions using miopenTransformTensor is broken #2828

t4c1 opened this issue Mar 21, 2024 · 8 comments
Assignees

Comments

@t4c1
Copy link

t4c1 commented Mar 21, 2024

Attaching two reproducers:

  • The first one creates 2-D tensors with miopenSetTensorDescriptor. Attempting to change the order of dimensions copies the data, but it is not reordered.
  • The second one creates 4-D tensors with miopenSetNdTensorDescriptorWithLayout and attempts to change order from NCHW to NHWC. In this case no data seems to be copied to the second tensor.

Reproducer 1:

#define  __HIP_PLATFORM_AMD__
#include <miopen/miopen.h>

#include <vector>
#include <iostream>

int main(int argc, char* argv[]) {
    miopenHandle_t handle;

    miopenCreate(&handle);

    auto dt = miopenDataType_t::miopenFloat;
    auto dt_size = sizeof(float);

    constexpr int ndims = 2;
    int src_dims[ndims]={2, 2};
    int dst_dims[ndims]={2, 2};
    int src_strides[ndims] = {1, 2};
    int dst_strides[ndims] = {2, 1};

    constexpr int n_tensors = 2;
    miopenTensorDescriptor_t tensor_descs[n_tensors] = {};
    miopenCreateTensorDescriptor(&tensor_descs[0]);
    miopenCreateTensorDescriptor(&tensor_descs[1]);
    miopenSetTensorDescriptor(tensor_descs[0], dt, ndims, src_dims, src_strides);
    miopenSetTensorDescriptor(tensor_descs[1], dt, ndims, dst_dims, dst_strides);

    const float alpha = 1.f, beta = 0.f;
    bool is_training = true;

    size_t ws_size_miopen = 0;
    size_t x_size_bytes = 0;

    void * src_dev;
    void * dst_dev;
    hipMalloc(&src_dev, src_dims[0] * src_dims[1] * dt_size);
    hipMalloc(&dst_dev, dst_dims[0] * dst_dims[1] * dt_size);
    std::vector<float> src = {1,2,3,4};
    hipMemcpy(src_dev, src.data(), 4 * dt_size, hipMemcpyDefault);

    miopenTransformTensor(handle, &alpha, tensor_descs[0], src_dev, &beta, tensor_descs[1], dst_dev);

    hipDeviceSynchronize();

    std::vector<float> dst(4);
    hipMemcpy(dst.data(), dst_dev, 4 * dt_size, hipMemcpyDefault);
    std::cout << "actual result: ";
    for(float f : dst){
        std::cout << f << ", "; // prints 1, 2, 3, 4 - no reordering was done
    }
    std::cout << std::endl;
    std::cout << "expected result: ";
    std::vector<float> expected_dst = {1, 3, 2, 4};
    for(float f : expected_dst){
        std::cout << f << ", ";
    }
    std::cout << std::endl;
}

Reproducer 2:

#define  __HIP_PLATFORM_AMD__
#include <miopen/miopen.h>

#include <vector>
#include <iostream>

int main(int argc, char* argv[]) {
    miopenHandle_t handle;

    miopenCreate(&handle);

    auto dt = miopenDataType_t::miopenFloat;
    auto dt_size = sizeof(float);

    constexpr int ndims = 4;
    int src_dims[ndims]={1, 2, 1, 2};
    int dst_dims[ndims]={1, 2, 1, 2};

    constexpr int n_tensors = 2;
    miopenTensorDescriptor_t tensor_descs[n_tensors] = {};
    miopenCreateTensorDescriptor(&tensor_descs[0]);
    miopenCreateTensorDescriptor(&tensor_descs[1]);

    miopenSetNdTensorDescriptorWithLayout(tensor_descs[0], dt, miopenTensorNCHW, src_dims, ndims);
    miopenSetNdTensorDescriptorWithLayout(tensor_descs[1], dt, miopenTensorNHWC, dst_dims, ndims);

    const float alpha = 1.f, beta = 0.f;
    bool is_training = true;

    size_t ws_size_miopen = 0;
    size_t x_size_bytes = 0;

    void * src_dev;
    void * dst_dev;
    hipMalloc(&src_dev, src_dims[0] * src_dims[1] * dt_size);
    hipMalloc(&dst_dev, dst_dims[0] * dst_dims[1] * dt_size);
    std::vector<float> src = {1,2,3,4};
    hipMemcpy(src_dev, src.data(), 4 * dt_size, hipMemcpyDefault);

    miopenTransformTensor(handle, &alpha, tensor_descs[0], src_dev, &beta, tensor_descs[1], dst_dev);

    hipDeviceSynchronize();

    std::vector<float> dst(4);
    hipMemcpy(dst.data(), dst_dev, 4 * dt_size, hipMemcpyDefault);
    std::cout << "actual result: ";
    for(float f : dst){
        std::cout << f << ", "; // prints 0, 0, 0, 0 - no copying was done
    }
    std::cout << std::endl;
    std::cout << "expected result: ";
    std::vector<float> expected_dst = {1, 3, 2, 4};
    for(float f : expected_dst){
        std::cout << f << ", ";
    }
    std::cout << std::endl;
}
@atamazov
Copy link
Contributor

atamazov commented Apr 10, 2024

@t4c1 Thanks for clear instructions. Please let us know the importance of this issue for you; that would help us to prioritize our work.

Related ticket: #1704

/cc @averinevg @junliume @shurale-nkn @CAHEK7 @JehandadKhan @xinlipn @technicalgrp89

@ShanoToni
Copy link

Hi @atamazov replying in lieu of @t4c1, this error is currently blocking us as we do not have a viable workaround.
Any prioritization of this would be very much appreciated, thank you.

@atamazov
Copy link
Contributor

@junliume @JehandadKhan Can you please set the appropriate priority label and assign this to an engineer.

@t4c1
Copy link
Author

t4c1 commented Jul 26, 2024

Checking up on this - has there been any progress?

@atamazov
Copy link
Contributor

atamazov commented Aug 5, 2024

@t4c1 Sorry for delays. It seems like miopenTransformTensor works incorrectly if destination buffer contains junk (which is weird because beta = 0). Please try if hipMemset(dst_dev, 0, dst_size) (prior miopenTransformTensor) helps, while I am continuing working on this.

@atamazov
Copy link
Contributor

atamazov commented Aug 5, 2024

@ShanoToni please see previous comment

@atamazov
Copy link
Contributor

atamazov commented Aug 6, 2024

@ShanoToni @t4c1 Reproducer 1 works without problems if workaround (mentioned at #2828 (comment)) is applied.

Reproducer 2 contains bugs:

hipMalloc(&src_dev, src_dims[0] * src_dims[1] * dt_size); // Bug: src_dims[2] and [3] are not observed
                                                          // -> space for only 2 elements allocated.

as a result, both hipMemcpy calls failed. Failures are not caught because return codes aren't checked.

Simplest fix:

#define EC(expr)                                                                   \
    do                                                                             \
    {                                                                              \
        auto rc = (expr);                                                          \
        if(rc != 0)                                                                \
        {                                                                          \
            std::cerr << "HIP call failed: '" #expr "', rc = " << rc << std::endl; \
        }                                                                          \
    } while(false)
...
    const auto nelem =
        std::accumulate(std::begin(src_dims), std::end(src_dims), 1ULL, std::multiplies<size_t>());
    const auto bufsize = nelem * DT_SIZE;
...
    EC(hipMalloc(&src_dev, bufsize));

With these fixes (plus workaround) reproducer 2 also works correctly.

Please expect PR that eliminates the need for a workaround that cleans output buffer.

/cc @junliume

@atamazov
Copy link
Contributor

atamazov commented Aug 8, 2024

@ShanoToni @t4c1

Please expect PR that eliminates the need for a workaround that cleans output buffer.

#3184 is merged in. Please check that reproducers work correctly with our develop branch and close the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants