
Rs/conv immediate #3045

Closed · wants to merge 18 commits into from

Conversation

@randyspauldingamd (Contributor) commented Jun 11, 2024

Ready for review

  • I decided to keep the TYPED_TEST. It's a bit clunky due to gtest limitations and somewhat limited in scalability, but compared to the alternatives it isn't all that bad, and it will be a useful example in the future.
  • It would be nice to add miopenTranspose mode, but the CPU code causes core dumps in this mode. I wasn't able to find the cause in the time available, so this is left for future work.

@CAHEK7 (Contributor) left a comment

I suggested some simplifications.

But first of all, direction and layout must not be template parameters: that slows down compilation and bloats the code. So the biggest problem is not passing multiple template arguments into TYPED_TEST, but mixing the value-parametrized TEST_P and type-parametrized TYPED_TEST approaches.

The typed approach does not scale at all: it can handle only a few dozen hand-written cases, while we may need far more and will want to generate them.
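For reference, here is a minimal sketch (with hypothetical fixture, case, and generator names) of the two styles being contrasted:

// Type-parametrized: every data-type/direction/layout combination needs its own entry
// in a ::testing::Types<...> list, so the case set is fixed at compile time.
template <typename TypeDefs>
class ConvTypedTest : public ::testing::Test
{
};
TYPED_TEST_SUITE(ConvTypedTest, ImplementationTypes); // ImplementationTypes: a ::testing::Types<...> list
TYPED_TEST(ConvTypedTest, Run) { /* TypeParam carries data type, direction, layout */ }

// Value-parametrized: direction and layout stay runtime values, so cases can be
// generated, filtered, and combined without new template instantiations.
class ConvValueTest : public ::testing::TestWithParam<ConvTestCase>
{
};
TEST_P(ConvValueTest, Run) { /* GetParam() carries the full configuration */ }
INSTANTIATE_TEST_SUITE_P(Smoke, ConvValueTest, ::testing::ValuesIn(GenerateConvCases()));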

@CAHEK7 (Contributor) commented Jun 11, 2024

I would suggest not mixing type-parametrized and value-parametrized tests in our test suite without a really strong reason - we've already got a quite complex construction, so let's not overcomplicate it further.

Most of the tests rely on value parametrization, and if we need different types it can be solved manually, or I can provide some macros - but in that case we can still use the standardized test declaration and all the power of value-parametrized tests, like passing a list of test cases or even combining and mixing cases.

@JehandadKhan may I also suggest pausing any testing-related activities until we finalize the test design documentation and create some sort of plan for how we will fix the current gtests, port the legacy remains to gtest, and add new tests. I expect another round of quite significant changes in the infrastructure.

@CAHEK7 (Contributor) commented Jul 28, 2024

May I ask you to update the test according to the document (either by updating the document, or the test, or even both)

@CAHEK7 (Contributor) left a comment

I still have doubts that we have to introduce one more way to run such tests, especially if there is no ability to run some particular configuration.
If you decide to go this way, don't forget to describe this approach in the documentation and add some recommendations on why and when we should use TYPED_TEST instead of the TEST_P approach with types:

class Test_Bwd_Mha_F32 : public Test_Bwd_Mha<float>

Comment on lines +50 to +69
// produces the cartesian product of the tuples in A and two copies of the elements of B
// all elements must be the same type
template <typename T, typename... Ts, typename = SameType<T, Ts...>>
inline auto cartesian_product_abb(const std::vector<std::tuple<Ts...>>& A, const std::vector<T>& B)
    -> std::vector<decltype(std::tuple_cat(A[0], std::make_tuple(B[0], B[0])))>
{
    auto C = std::vector<decltype(std::tuple_cat(A[0], std::make_tuple(B[0], B[0])))>{};
    for(auto a : A)
    {
        for(auto b : B)
        {
            for(auto b2 : B)
            {
                C.push_back(std::tuple_cat(a, std::make_tuple(b, b2)));
            }
        }
    }

    return C;
}
@CAHEK7 (Contributor) commented Jul 28, 2024

Everything in this file looks like a reinvention of testing::Combine. I'm not sure we have to do this, though it could be useful if we extend it with filtering and limiting capabilities.
But even in that case we can always post-process the result of testing::Combine, unless there are so many cases that testing::Combine itself would fail.
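For comparison, a rough sketch of producing the same A x B x B space with the stock generators (the fixture name and the value vectors are illustrative, assuming the fixture's parameter type is a matching std::tuple):

// base_cases and image_sizes are assumed to be std::vector<int>; testing::Combine
// yields the full cartesian product, one separately runnable test per element.
INSTANTIATE_TEST_SUITE_P(ConvImmediate,
                         ConvCombineTest, // hypothetical TEST_P fixture
                         testing::Combine(testing::ValuesIn(base_cases),
                                          testing::ValuesIn(image_sizes),   // first copy of B
                                          testing::ValuesIn(image_sizes))); // second copy of B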

Contributor Author:

There was some massive overhead using TEST_P and testing::Combine which I did not want to investigate. This is approximately 8x faster (about 8 seconds vs. about 60).

Contributor:

If TYPED_TEST is 8x faster, then why are we still using TEST_P? It is probably worth investigating the problem and proposing a better solution.
Is it testing::Combine? If not, then you can use testing::Combine for the cartesian product and this code is not needed.
Is it TEST_P and its internal infrastructure? If not, then your cartesian product should be used in every test.
Or maybe it's the test structure and resource management, and you can give some recommendations on how to do it better.

Contributor:

Also, what's the reasoning behind two copies of B?


// converts a vector of a type to a vector of tuples of the type
template <typename T>
inline std::vector<std::tuple<T>> vec_to_tuple(std::vector<T> in)
Contributor:

Suggested change:
- inline std::vector<std::tuple<T>> vec_to_tuple(std::vector<T> in)
+ inline std::vector<std::tuple<T>> vec_to_tuple(const std::vector<T>& in)

Passing std::vector by value looks suspicious. Is it intended?

Contributor Author:

It was not intended. I think it's the same with clang (due to inline), but yours is better anyway (clearer, more portable).

Contributor:

I think it's the same with clang (due to inline)

It is not, and there is no chance of that kind of optimization even for perfectly inlined code - the std::vector copy constructor may throw, alters global state, and has side effects, so the compiler can't remove the memory allocation.

Comment on lines +116 to +158
static std::vector<int> get_layout_lengths(int n, int c, std::vector<int>& dims)
{
    auto ret = std::vector<int>{n, c};
    ret.insert(ret.end(), dims.cbegin(), dims.cend());

    return ret;
}

static std::vector<int>
get_strides(std::vector<int>& lens, int dims, miopenTensorLayout_t tensor_layout)
{
    std::vector<int> strides;
    std::string layout_default = miopen::tensor_layout_get_default(dims + 2);
    std::string layout_string  = miopen::TensorDescriptor::GetLayoutStr(tensor_layout);

    miopen::tensor_layout_to_strides(lens, layout_default, layout_string, strides);

    constexpr int min_stride_multiplier = 1;
    constexpr int max_stride_multiplier = 4;

    auto c = prng::gen_A_to_B(min_stride_multiplier, max_stride_multiplier);
    for(auto& v : strides)
    {
        // cppcheck-suppress useStlAlgorithm
        v = v * c;
    }

    return strides;
}

static miopenTensorDescriptor_t init_tensor_descriptor(miopenDataType_t type,
                                                       const std::vector<int>& lens,
                                                       const std::vector<int>& strides)
{
    miopenTensorDescriptor_t desc;

    EXPECT_TRUE(miopenCreateTensorDescriptor(&desc) == miopenStatusSuccess);
    EXPECT_TRUE(
        miopenSetTensorDescriptor(desc, type, lens.size(), lens.data(), strides.data()) ==
        miopenStatusSuccess);

    return desc;
}
Contributor:

It does not depend on T, but each distinct type will generate its own set of these functions.
Though it's not a big deal.
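One possible way to address this, sketched with a hypothetical base-class name (the existing fixture layout may not allow it as-is), is to hoist the type-independent helpers into a non-template base so they are compiled only once:

class ReferenceConvHelpers // hypothetical non-template base
{
protected:
    static std::vector<int> get_layout_lengths(int n, int c, const std::vector<int>& dims)
    {
        std::vector<int> ret{n, c};
        ret.insert(ret.end(), dims.cbegin(), dims.cend());
        return ret;
    }
    // get_strides() and init_tensor_descriptor() could move here in the same way.
};

template <class Types>
class ReferenceConvBase : public ReferenceConvHelpers
{
    // only the genuinely type-dependent code stays in the template
};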

Contributor Author:

Interesting. I've never had a reason to be concerned about bloat in test code. I would personally prefer this over something like an unnamed namespace unless it was a ridiculous size difference. But that's just me :)

Contributor:

The tests take more compilation time than the library itself, so we should think about this too, as well as about wall-clock execution time.

Btw, an unnamed namespace and a static member function are not equivalent.
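A brief illustration of that last remark, with placeholder bodies:

#include <vector>

namespace {
// Internal linkage: a single copy per translation unit, shared by every fixture.
std::vector<int> make_lengths() { return {1, 2, 3}; }
} // anonymous namespace

template <class T>
struct Fixture
{
    // A distinct definition is instantiated for every T the fixture is used with,
    // which is the "each distinct type generates its own set" point above.
    static std::vector<int> make_lengths() { return {1, 2, 3}; }
};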


tensor<T> check{};
tensor<T> host;
miopen::TensorDescriptor descriptor;
Contributor:

I'm not sure a separate miopen::TensorDescriptor is required, since tensor<T> already includes one:

miopen::TensorDescriptor desc;

Contributor Author:

Oops, I missed this one. This was left over from an earlier revision that reused buffers. I think you're right, I'll try to make it go away.

Comment on lines +172 to +173
check = tensor<T>{descriptor.GetLengths(), descriptor.GetStrides()};
check.data = handle.Read<T>(ddata, check.data.size());
Contributor:

It's not efficient, since we allocate twice as much memory as we need.
I'd suggest using handle.ReadTo to read directly into the already allocated vector. Currently Read<T> creates a new vector and replaces the previously allocated one.
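A rough sketch of that suggestion; the exact ReadTo signature is assumed here (destination pointer, device buffer, byte count) and should be checked against the Handle API before use:

check = tensor<T>{descriptor.GetLengths(), descriptor.GetStrides()};
// Read into the buffer that tensor<T> already owns instead of allocating a second one.
handle.ReadTo(check.data.data(), ddata, check.data.size() * sizeof(T)); // assumed signature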

Contributor Author:

Nice, I didn't know about ReadTo. The time difference wasn't noticeable, but it still gives me a warm and fuzzy feeling.

Contributor:

Sure, it's not a problem for tiny tensors, just as replacing std::vector<T> with const std::vector<T>& in vec_to_tuple won't show any measurable difference.

Since you've provided an abstraction, it would be nice for it to be a good abstraction without obvious pitfalls, or those pitfalls should be visible and clearly documented.


static std::vector<int> get_batch_size() { return {1, 2}; }

static bool same(std::vector<int> v) { return std::equal(v.begin() + 1, v.end(), v.begin()); }
Contributor:

Passing std::vector by value looks suspicious.
Also, there is UB if v is empty.
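A minimal fix sketch for both points, should the helper be kept:

// Take the vector by const reference and treat empty (and single-element) input as
// trivially "same", which avoids the begin() + 1 past-the-end UB for an empty vector.
static bool same(const std::vector<int>& v)
{
    return v.empty() || std::equal(v.begin() + 1, v.end(), v.begin());
}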

Contributor Author:

It's no longer in use, I'll delete it.

Comment on lines +272 to +297
for(auto test_case : test_cases)
{
    int c, hi, wi, fy, fx, py, px;
    std::tie(c, hi, wi, fy, fx, py, px) = test_case;

    int n = get_batch_size()[prng::gen_canonical<size_t>()];
    int g = get_group_size()[prng::gen_canonical<size_t>()];
    int k = get_channel_size()[prng::gen_canonical<size_t>()];

    int sy = get_stride_dilation_size()[prng::gen_canonical<size_t>()];
    int sx = get_stride_dilation_size()[prng::gen_canonical<size_t>()];

    int dy = get_stride_dilation_size()[prng::gen_canonical<size_t>()];
    int dx = get_stride_dilation_size()[prng::gen_canonical<size_t>()];

    int ho = conv_out_size(hi, py, dy, fy, sy);
    int wo = conv_out_size(wi, px, dx, fx, sx);

    if(fy > hi || fx > wi || (fy - 1) < py || (fx - 1) < px || ho <= 0 || wo <= 0 ||
       c % g != 0 || k % g != 0)
        continue;
    if((fx == 3 && fy == 5) || (fx == 5 && fy == 3))
        continue;

    f(n, {wi, hi}, c, k, {fy, fx}, {py, px}, {sy, sx}, {dy, dx}, g);
}
@CAHEK7 (Contributor) commented Jul 28, 2024

The problem with this approach is that we don't have separate test cases, so if something fails in the middle, we have to rerun all of the cases.

We already have a bunch of similar problems in the tests, where we misuse testing::Values instead of testing::ValuesIn; let's not create another one.
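An illustration of the Values/ValuesIn distinction being referenced here (the fixture names and the cases vector are hypothetical):

// Assuming std::vector<ConvTestCase> cases = ...;
// One test whose single parameter is the whole vector - a failure in the middle
// cannot be isolated, filtered, or rerun on its own:
INSTANTIATE_TEST_SUITE_P(AllInOne, ConvBulkTest, testing::Values(cases));

// One test per element - each case is reported, filtered, and rerun individually:
INSTANTIATE_TEST_SUITE_P(PerCase, ConvCaseTest, testing::ValuesIn(cases));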

Contributor Author:

Valid point, but the overhead associated with those test cases is horrible for a test space this vast (it was over 80k test cases at one point). I usually add custom instrumentation anyway, so adding a filter or custom data to isolate the bad ones before rebuilding is trivial.

Contributor:

Sure, not a big deal. May I ask you to investigate that horrible overhead and provide some recommendations on when each approach should be used and why?

Contributor Author:

That investigation is far from trivial, so (just to be clear) it will not be part of this PR. Another team member pointed this out not long after I saw it, so there may already be a ticket (@BrianHarrisonAMD maybe?). If not, I'll make a ticket and we can discuss priority during planning.

Contributor:

I don't think I ever added anything.

Are you saying that the testing::Combine macros are leading to the delay when executing tests through make check?
I think we see this delay even if they should be skipped.

Oddly, I don't notice any delay when executing the tests through the executables that are generated.
I would have expected them to behave the same.

Contributor:

Then let's postpone this PR until that investigation has been done. We can't just do various things at random; we've already got a test broken due to this.

@randyspauldingamd (Contributor, Author) left a comment

misclick

@randyspauldingamd (Contributor, Author) commented:

I still have doubts that we have to introduce one more way to run such tests, especially if there is no ability to run some particular configuration.

There are some great discussion points here--I'm super busy for the next couple of weeks, but I would be interested in putting a meeting together at some point.

If you only have a single type parameter, I think a strong argument can be made that TYPED_TEST is cleaner. Here's another example I looked at:

https://stackoverflow.com/a/64048808

It doesn't work in this case--and I also see a strong argument against the ugliness of this multi-parameter solution. In this case, though, the wall-clock improvement wins (as I pointed out in another comment).

@randyspauldingamd (Contributor, Author) commented:

May I ask you to update the test according to the document (either by updating the document, or the test, or even both)

Cool, I had heard some discussion about this but I wasn't aware that it was live. I think there are some interesting discussions to come, so I'll hold off on the documentation for a while. I will see if I can fix the naming tomorrow though.

Comment on lines +58 to +60
template <typename T>
auto gen_value =
[](auto... is) { return static_cast<T>(prng::gen_A_to_B(RAND_INTEGER_MIN, RAND_INTEGER_MAX)); };
Contributor:

Mostly just curious, but is there a reason to use a templated lambda variable instead of a method for the generator?

Contributor:

If you pass is into the .generate() method, it supplies the input lambda with indexes based on the tensor's dimensions. The prng:: generator does not take any arguments, so we have to wrap it in a lambda that ignores the input indexes and just returns a random value on each call.
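A small usage sketch of what that wrapping buys, assuming lens and strides are defined elsewhere:

// .generate() invokes the functor once per element, passing that element's indices;
// gen_value ignores them and simply draws a fresh random value each time.
tensor<float> input{lens, strides};
input.generate(gen_value<float>);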

@BrianHarrisonAMD (Contributor) commented Jul 30, 2024

Oh sorry, I meant I just found it surprising that gen_value was a lambda variable instead of a method like:

template <typename T>
T GenValue(auto... is){ return static_cast<T>(prng::gen_A_to_B(RAND_INTEGER_MIN, RAND_INTEGER_MAX)); }

Contributor Author:

This was existing code, and I don't have an opinion. I went with the conservation of effort approach and simply relocated it.

Comment on lines +490 to +522
class ReferenceConv2d : public ReferenceConvBase<Types>
{
protected:
    void run() { this->iterate_conv_2d(this->run_conv_nd); }
};

// clang-format off
#define MakeTypeDefs(layout) \
    TypeDefs<miopen::conv::Direction::Forward, float, float, layout>, \
    TypeDefs<miopen::conv::Direction::Forward, float16, float16, layout>, \
    TypeDefs<miopen::conv::Direction::Forward, bfloat16, bfloat16, layout>, \
    TypeDefs<miopen::conv::Direction::Forward, int8_t, int32_t, layout>, \
    TypeDefs<miopen::conv::Direction::Forward, int8_t, float, layout>, \
    \
    TypeDefs<miopen::conv::Direction::BackwardData, float, float, layout>, \
    TypeDefs<miopen::conv::Direction::BackwardData, float16, float16, layout>, \
    TypeDefs<miopen::conv::Direction::BackwardData, bfloat16, bfloat16, layout>, \
    \
    TypeDefs<miopen::conv::Direction::BackwardWeights, float, float, layout>, \
    TypeDefs<miopen::conv::Direction::BackwardWeights, float16, float16, layout>, \
    TypeDefs<miopen::conv::Direction::BackwardWeights, bfloat16, bfloat16, layout>

// clang-format on

using Implementations2d =
    ::testing::Types<MakeTypeDefs(miopenTensorNCHW), MakeTypeDefs(miopenTensorNHWC)>;

TYPED_TEST_CASE(ReferenceConv2d, Implementations2d);

TYPED_TEST(ReferenceConv2d, Test) { this->run(); }

template <class Types>
struct ReferenceConv3d : ReferenceConvBase<Types>
Contributor:

Do these tests cover the same functionality as gpu_reference_kernel.cpp?

If yes, then should we remove the gpu_reference_kernel tests with this change?

Contributor Author:

Do these tests cover the same functionality as gpu_reference_kernel.cpp?

If yes, then should we remove the gpu_reference_kernel tests with this change?

Yes, this is a direct replacement/port to gtest. I've just kept the original for reference temporarily.

@amberhassaan (Contributor) left a comment

A couple of comments. Overall, I agree with @CAHEK7 that we should investigate the source of the overhead and then arrive at a solution.

Comment on lines +50 to +69
// produces the cartesian product of the tuples in A and two copies of the elements of B
// all elements must be the same type
template <typename T, typename... Ts, typename = SameType<T, Ts...>>
inline auto cartesian_product_abb(const std::vector<std::tuple<Ts...>>& A, const std::vector<T>& B)
-> std::vector<decltype(std::tuple_cat(A[0], std::make_tuple(B[0], B[0])))>
{
auto C = std::vector<decltype(std::tuple_cat(A[0], std::make_tuple(B[0], B[0])))>{};
for(auto a : A)
{
for(auto b : B)
{
for(auto b2 : B)
{
C.push_back(std::tuple_cat(a, std::make_tuple(b, b2)));
}
}
}

return C;
}
Contributor:

Also, what's the reasoning behind two copies of B?

{
    auto lens    = get_layout_lengths(_n, _c, _dims);
    auto strides = get_strides(lens, _dims.size(), _tensor_layout);
    descriptor   = miopen::TensorDescriptor{miopen_type<T>{}, lens, strides};
Contributor:

as @CAHEK7 suggested, it's better to use tensor<T> than TensorDescriptor.

Contributor Author:

Yes, please refer to my reply below.

Contributor Author:

On the 'two copies of B' question: I was just replicating the original test cases, which use repeated values of h and w for the image, filter, and pad sizes. It ends up running two squares and two rectangles of each. For example, the image sizes tested are [9x9], [14x9], [9x14], and [14x14].

@junliume (Collaborator) commented Aug 1, 2024

@randyspauldingamd can we follow up and avoid a hanging PR? Thanks!
