
Rs/conv immediate #3045

Closed · wants to merge 18 commits into from

Conversation

@randyspauldingamd (Contributor) commented Jun 11, 2024

Ready for review

  • I decided to keep the TYPED_TEST. It's a bit clunky due to gtest limitations and somewhat limited in scalability, but compared to the alternatives it isn't all that bad, and it will be a useful example in the future.
  • It would be nice to add miopenTranspose mode, but the CPU code causes core dumps in this mode. I wasn't able to find the cause in the time available, so this is left for future work.

@CAHEK7 (Contributor) left a comment

I suggested some simplifications.

But first of all, direction and layout must not be template parameters: that slows down compilation and bloats the code. So the biggest problem is not passing multiple template arguments into TYPED_TEST, but mixing the value-parametrized TEST_P and type-parametrized TYPED_TEST approaches.

The typed approach does not scale at all: it can handle only a few dozen hand-written cases, while we may need far more and will want to generate them.
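For reference, here is a minimal sketch (with hypothetical fixture, case, and generator names) of the two styles being contrasted:

// Type-parametrized: every data-type/direction/layout combination needs its own entry
// in a ::testing::Types<...> list, so the case set is fixed at compile time.
template <typename TypeDefs>
class ConvTypedTest : public ::testing::Test
{
};
TYPED_TEST_SUITE(ConvTypedTest, ImplementationTypes); // ImplementationTypes: a ::testing::Types<...> list
TYPED_TEST(ConvTypedTest, Run) { /* TypeParam carries data type, direction, layout */ }

// Value-parametrized: direction and layout stay runtime values, so cases can be
// generated, filtered, and combined without new template instantiations.
class ConvValueTest : public ::testing::TestWithParam<ConvTestCase>
{
};
TEST_P(ConvValueTest, Run) { /* GetParam() carries the full configuration */ }
INSTANTIATE_TEST_SUITE_P(Smoke, ConvValueTest, ::testing::ValuesIn(GenerateConvCases()));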

@CAHEK7 (Contributor) commented Jun 11, 2024

I would suggest not mixing type-parametrized and value-parametrized tests in our test suite without a really strong reason - we've already got a quite complex construction, so let's not overcomplicate it further.

Most of the tests rely on value parametrization, and if we need different types it can be solved manually, or I can provide some macros - but in that case we can still use the standardized test declaration and all the power of value-parametrized tests, like passing a list of test cases or even combining and mixing cases.

@JehandadKhan may I also suggest pausing any testing-related activities until we finalize the test design documentation and create some sort of plan for how we will fix the current gtests, port the legacy remains to gtest, and add new tests. I expect another round of quite significant changes in the infrastructure.

@CAHEK7 (Contributor) commented Jul 28, 2024

May I ask you to update the test according to the document (either by updating the document, or the test, or even both)

@CAHEK7 (Contributor) left a comment

I still have doubts that we have to introduce one more way to run such tests, especially if there is no ability to run some particular configuration.
If you decide to go this way, don't forget to describe this approach in the documentation and add some recommendations on why and when we should use TYPED_TEST instead of the TEST_P approach with types:

class Test_Bwd_Mha_F32 : public Test_Bwd_Mha<float>

Comment on lines +50 to +69
// produces the cartesian product of the tuples in A and two copies of the elements of B
// all elements must be the same type
template <typename T, typename... Ts, typename = SameType<T, Ts...>>
inline auto cartesian_product_abb(const std::vector<std::tuple<Ts...>>& A, const std::vector<T>& B)
    -> std::vector<decltype(std::tuple_cat(A[0], std::make_tuple(B[0], B[0])))>
{
    auto C = std::vector<decltype(std::tuple_cat(A[0], std::make_tuple(B[0], B[0])))>{};
    for(auto a : A)
    {
        for(auto b : B)
        {
            for(auto b2 : B)
            {
                C.push_back(std::tuple_cat(a, std::make_tuple(b, b2)));
            }
        }
    }

    return C;
}
@CAHEK7 (Contributor) commented Jul 28, 2024

Everything in this file looks like a reinvention of testing::Combine. I'm not sure we have to do this, though it could be useful if we extend it with filtering and limiting capabilities.
But even in that case we can always post-process the result of testing::Combine, unless there are so many cases that testing::Combine itself would fail.
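For comparison, a rough sketch of producing the same A x B x B space with the stock generators (the fixture name and the value vectors are illustrative, assuming the fixture's parameter type is a matching std::tuple):

// base_cases and image_sizes are assumed to be std::vector<int>; testing::Combine
// yields the full cartesian product, one separately runnable test per element.
INSTANTIATE_TEST_SUITE_P(ConvImmediate,
                         ConvCombineTest, // hypothetical TEST_P fixture
                         testing::Combine(testing::ValuesIn(base_cases),
                                          testing::ValuesIn(image_sizes),   // first copy of B
                                          testing::ValuesIn(image_sizes))); // second copy of B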

Contributor Author:

There was some massive overhead using TEST_P and testing::Combine which I did not want to investigate. This is approximately 8x faster (about 8 seconds vs. about 60).

Contributor:

If TYPED_TEST is 8x faster, then why are we still using TEST_P? It is probably worth investigating the problem and proposing a better solution.
Is it testing::Combine? If not, then you can use testing::Combine for the cartesian product and this code is not needed.
Is it TEST_P and its internal infrastructure? If not, then your cartesian product should be used in every test.
Or maybe it's the test structure and resource management, and you can give some recommendations on how to do it better.

Contributor:

Also, what's the reasoning behind two copies of B?


// converts a vector of a type to a vector of tuples of the type
template <typename T>
inline std::vector<std::tuple<T>> vec_to_tuple(std::vector<T> in)
Contributor:

Suggested change:
- inline std::vector<std::tuple<T>> vec_to_tuple(std::vector<T> in)
+ inline std::vector<std::tuple<T>> vec_to_tuple(const std::vector<T>& in)

Passing std::vector by value looks suspicious. Is it intended?

Contributor Author:

It was not intended. I think it's the same with clang (due to inline), but yours is better anyway (clearer, more portable).

Contributor:

I think it's the same with clang (due to inline)

It is not, and there is no chance of that kind of optimization even for perfectly inlined code - the std::vector copy constructor may throw, alters global state, and has side effects, so the compiler can't remove the memory allocation.

Comment on lines +116 to +158
static std::vector<int> get_layout_lengths(int n, int c, std::vector<int>& dims)
{
    auto ret = std::vector<int>{n, c};
    ret.insert(ret.end(), dims.cbegin(), dims.cend());

    return ret;
}

static std::vector<int>
get_strides(std::vector<int>& lens, int dims, miopenTensorLayout_t tensor_layout)
{
    std::vector<int> strides;
    std::string layout_default = miopen::tensor_layout_get_default(dims + 2);
    std::string layout_string  = miopen::TensorDescriptor::GetLayoutStr(tensor_layout);

    miopen::tensor_layout_to_strides(lens, layout_default, layout_string, strides);

    constexpr int min_stride_multiplier = 1;
    constexpr int max_stride_multiplier = 4;

    auto c = prng::gen_A_to_B(min_stride_multiplier, max_stride_multiplier);
    for(auto& v : strides)
    {
        // cppcheck-suppress useStlAlgorithm
        v = v * c;
    }

    return strides;
}

static miopenTensorDescriptor_t init_tensor_descriptor(miopenDataType_t type,
                                                       const std::vector<int>& lens,
                                                       const std::vector<int>& strides)
{
    miopenTensorDescriptor_t desc;

    EXPECT_TRUE(miopenCreateTensorDescriptor(&desc) == miopenStatusSuccess);
    EXPECT_TRUE(
        miopenSetTensorDescriptor(desc, type, lens.size(), lens.data(), strides.data()) ==
        miopenStatusSuccess);

    return desc;
}
Contributor:

It does not depend on T, but each distinct type will generate its own set of these functions.
Though it's not a big deal.
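One possible way to address this, sketched with a hypothetical base-class name (the existing fixture layout may not allow it as-is), is to hoist the type-independent helpers into a non-template base so they are compiled only once:

class ReferenceConvHelpers // hypothetical non-template base
{
protected:
    static std::vector<int> get_layout_lengths(int n, int c, const std::vector<int>& dims)
    {
        std::vector<int> ret{n, c};
        ret.insert(ret.end(), dims.cbegin(), dims.cend());
        return ret;
    }
    // get_strides() and init_tensor_descriptor() could move here in the same way.
};

template <class Types>
class ReferenceConvBase : public ReferenceConvHelpers
{
    // only the genuinely type-dependent code stays in the template
};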

Contributor Author:

Interesting. I've never had a reason to be concerned about bloat in test code. I would personally prefer this over something like an unnamed namespace unless it was a ridiculous size difference. But that's just me :)

Contributor:

The tests take more compilation time than the library itself, so we should think about this too, as well as about wall-clock execution time.

Btw, an unnamed namespace and a static member function are not equivalent.
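A brief illustration of that last remark, with placeholder bodies:

#include <vector>

namespace {
// Internal linkage: a single copy per translation unit, shared by every fixture.
std::vector<int> make_lengths() { return {1, 2, 3}; }
} // anonymous namespace

template <class T>
struct Fixture
{
    // A distinct definition is instantiated for every T the fixture is used with,
    // which is the "each distinct type generates its own set" point above.
    static std::vector<int> make_lengths() { return {1, 2, 3}; }
};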


tensor<T> check{};
tensor<T> host;
miopen::TensorDescriptor descriptor;
Contributor:

I'm not sure a separate miopen::TensorDescriptor is required, since tensor<T> already includes one:

miopen::TensorDescriptor desc;

Contributor Author:

Oops, I missed this one. This was left over from an earlier revision that reused buffers. I think you're right, I'll try to make it go away.

Comment on lines +172 to +173
check = tensor<T>{descriptor.GetLengths(), descriptor.GetStrides()};
check.data = handle.Read<T>(ddata, check.data.size());
Contributor:

It's not efficient, since we allocate twice as much memory as we need.
I'd suggest using handle.ReadTo to read directly into the already allocated vector. Currently Read<T> creates a new vector and replaces the previously allocated one.
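A rough sketch of that suggestion; the exact ReadTo signature is assumed here (destination pointer, device buffer, byte count) and should be checked against the Handle API before use:

check = tensor<T>{descriptor.GetLengths(), descriptor.GetStrides()};
// Read into the buffer that tensor<T> already owns instead of allocating a second one.
handle.ReadTo(check.data.data(), ddata, check.data.size() * sizeof(T)); // assumed signature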

Contributor Author:

Nice, I didn't know about ReadTo. The time difference wasn't noticeable, but it still gives me a warm and fuzzy feeling.

Contributor:

Sure, it's not a problem for tiny tensors, just as replacing std::vector<T> with const std::vector<T>& in vec_to_tuple won't show any measurable difference.

Since you've provided an abstraction, it would be nice for it to be a good abstraction without obvious pitfalls, or those pitfalls should be visible and clearly documented.


static std::vector<int> get_batch_size() { return {1, 2}; }

static bool same(std::vector<int> v) { return std::equal(v.begin() + 1, v.end(), v.begin()); }
Contributor:

Passing std::vector by value looks suspicious.
Also, there is UB if v is empty.
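A minimal fix sketch for both points, should the helper be kept:

// Take the vector by const reference and treat empty (and single-element) input as
// trivially "same", which avoids the begin() + 1 past-the-end UB for an empty vector.
static bool same(const std::vector<int>& v)
{
    return v.empty() || std::equal(v.begin() + 1, v.end(), v.begin());
}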

Contributor Author:

It's no longer in use, I'll delete it.

Comment on lines +272 to +297
for(auto test_case : test_cases)
{
    int c, hi, wi, fy, fx, py, px;
    std::tie(c, hi, wi, fy, fx, py, px) = test_case;

    int n = get_batch_size()[prng::gen_canonical<size_t>()];
    int g = get_group_size()[prng::gen_canonical<size_t>()];
    int k = get_channel_size()[prng::gen_canonical<size_t>()];

    int sy = get_stride_dilation_size()[prng::gen_canonical<size_t>()];
    int sx = get_stride_dilation_size()[prng::gen_canonical<size_t>()];

    int dy = get_stride_dilation_size()[prng::gen_canonical<size_t>()];
    int dx = get_stride_dilation_size()[prng::gen_canonical<size_t>()];

    int ho = conv_out_size(hi, py, dy, fy, sy);
    int wo = conv_out_size(wi, px, dx, fx, sx);

    if(fy > hi || fx > wi || (fy - 1) < py || (fx - 1) < px || ho <= 0 || wo <= 0 ||
       c % g != 0 || k % g != 0)
        continue;
    if((fx == 3 && fy == 5) || (fx == 5 && fy == 3))
        continue;

    f(n, {wi, hi}, c, k, {fy, fx}, {py, px}, {sy, sx}, {dy, dx}, g);
}
@CAHEK7 (Contributor) commented Jul 28, 2024

The problem with this approach is that we don't have separate test cases, so if something fails in the middle, we have to rerun all of the cases.

We already have a bunch of similar problems in the tests, where we misuse testing::Values instead of testing::ValuesIn; let's not create another one.
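An illustration of the Values/ValuesIn distinction being referenced here (the fixture names and the cases vector are hypothetical):

// Assuming std::vector<ConvTestCase> cases = ...;
// One test whose single parameter is the whole vector - a failure in the middle
// cannot be isolated, filtered, or rerun on its own:
INSTANTIATE_TEST_SUITE_P(AllInOne, ConvBulkTest, testing::Values(cases));

// One test per element - each case is reported, filtered, and rerun individually:
INSTANTIATE_TEST_SUITE_P(PerCase, ConvCaseTest, testing::ValuesIn(cases));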

Contributor Author:

Valid point, but the overhead associated with those test cases is horrible for a test space this vast (it was over 80k test cases at one point). I usually add custom instrumentation anyway, so adding a filter or custom data to isolate the bad ones before rebuilding is trivial.

Contributor:

Sure, not a big deal. May I ask you to investigate that horrible overhead and provide some recommendations on when each approach should be used and why?

Contributor Author:

That investigation is far from trivial, so (just to be clear) it will not be part of this PR. Another team member pointed this out not long after I saw it, so there may already be a ticket (@BrianHarrisonAMD maybe?). If not, I'll make a ticket and we can discuss priority during planning.

Contributor:

I don't think I ever added anything.

Are you saying that the testing::Combine macros are leading to the delay when executing tests through make check?
I think we see this delay even if they should be skipped.

Oddly, I don't notice any delay when executing the tests through the executables that are generated.
I would have expected them to behave the same.

Contributor:

Then let's postpone this PR until that investigation has been done. We can't just do various things at random; we've already got a test broken due to this.

@randyspauldingamd (Contributor, Author) left a comment

misclick

@randyspauldingamd (Contributor, Author) commented:

I still have doubts that we have to introduce one more way to run such tests, especially if there is no ability to run some particular configuration.

There are some great discussion points here--I'm super busy for the next couple of weeks, but I would be interested in putting a meeting together at some point.

If you only have a single type parameter, I think a strong argument can be made that TYPED_TEST is cleaner. Here's another example I looked at:

https://stackoverflow.com/a/64048808

It doesn't work in this case--and I also see a strong argument against the ugliness of this multi-parameter solution. In this case, though, the wall-clock improvement wins (as I pointed out in another comment).

@randyspauldingamd (Contributor, Author) commented:

May I ask you to update the test according to the document (either by updating the document, or the test, or even both)

Cool, I had heard some discussion about this but I wasn't aware that it was live. I think there are some interesting discussions to come, so I'll hold off on the documentation for a while. I will see if I can fix the naming tomorrow though.

Comment on lines +58 to +60
template <typename T>
auto gen_value =
[](auto... is) { return static_cast<T>(prng::gen_A_to_B(RAND_INTEGER_MIN, RAND_INTEGER_MAX)); };
Contributor:

Mostly just curious, but is there a reason to use a templated lambda variable instead of a method for the generator?

Contributor:

If you pass is into the .generate() method, it supplies the input lambda with indexes based on the tensor's dimensions. The prng:: generator does not take any arguments, so we have to wrap it in a lambda that ignores the input indexes and just returns a random value on each call.
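A small usage sketch of what that wrapping buys, assuming lens and strides are defined elsewhere:

// .generate() invokes the functor once per element, passing that element's indices;
// gen_value ignores them and simply draws a fresh random value each time.
tensor<float> input{lens, strides};
input.generate(gen_value<float>);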

@BrianHarrisonAMD (Contributor) commented Jul 30, 2024

Oh sorry, I meant I just found it surprising that gen_value was a lambda variable instead of a method like:

template <typename T>
T GenValue(auto... is){ return static_cast<T>(prng::gen_A_to_B(RAND_INTEGER_MIN, RAND_INTEGER_MAX)); }

Contributor Author:

This was existing code, and I don't have an opinion. I went with the conservation of effort approach and simply relocated it.

Comment on lines +490 to +522
class ReferenceConv2d : public ReferenceConvBase<Types>
{
protected:
    void run() { this->iterate_conv_2d(this->run_conv_nd); }
};

// clang-format off
#define MakeTypeDefs(layout) \
    TypeDefs<miopen::conv::Direction::Forward, float, float, layout>, \
    TypeDefs<miopen::conv::Direction::Forward, float16, float16, layout>, \
    TypeDefs<miopen::conv::Direction::Forward, bfloat16, bfloat16, layout>, \
    TypeDefs<miopen::conv::Direction::Forward, int8_t, int32_t, layout>, \
    TypeDefs<miopen::conv::Direction::Forward, int8_t, float, layout>, \
    \
    TypeDefs<miopen::conv::Direction::BackwardData, float, float, layout>, \
    TypeDefs<miopen::conv::Direction::BackwardData, float16, float16, layout>, \
    TypeDefs<miopen::conv::Direction::BackwardData, bfloat16, bfloat16, layout>, \
    \
    TypeDefs<miopen::conv::Direction::BackwardWeights, float, float, layout>, \
    TypeDefs<miopen::conv::Direction::BackwardWeights, float16, float16, layout>, \
    TypeDefs<miopen::conv::Direction::BackwardWeights, bfloat16, bfloat16, layout>

// clang-format on

using Implementations2d =
    ::testing::Types<MakeTypeDefs(miopenTensorNCHW), MakeTypeDefs(miopenTensorNHWC)>;

TYPED_TEST_CASE(ReferenceConv2d, Implementations2d);

TYPED_TEST(ReferenceConv2d, Test) { this->run(); }

template <class Types>
struct ReferenceConv3d : ReferenceConvBase<Types>
Contributor:

Do these tests cover the same functionality as gpu_reference_kernel.cpp?

If yes, then should we remove the gpu_reference_kernel tests with this change?

Contributor Author:

Do these tests cover the same functionality as gpu_reference_kernel.cpp?

If yes, then should we remove the gpu_reference_kernel tests with this change?

Yes, this is a direct replacement/port to gtest. I've just kept the original for reference temporarily.

@amberhassaan (Contributor) left a comment

A couple of comments. Overall, I agree with @CAHEK7 that we should investigate the source of the overhead and then arrive at a solution.

Comment on lines +50 to +69
// produces the cartesian product of the tuples in A and two copies of the elements of B
// all elements must be the same type
template <typename T, typename... Ts, typename = SameType<T, Ts...>>
inline auto cartesian_product_abb(const std::vector<std::tuple<Ts...>>& A, const std::vector<T>& B)
-> std::vector<decltype(std::tuple_cat(A[0], std::make_tuple(B[0], B[0])))>
{
auto C = std::vector<decltype(std::tuple_cat(A[0], std::make_tuple(B[0], B[0])))>{};
for(auto a : A)
{
for(auto b : B)
{
for(auto b2 : B)
{
C.push_back(std::tuple_cat(a, std::make_tuple(b, b2)));
}
}
}

return C;
}
Contributor:

Also, what's the reasoning behind two copies of B?

{
    auto lens    = get_layout_lengths(_n, _c, _dims);
    auto strides = get_strides(lens, _dims.size(), _tensor_layout);
    descriptor   = miopen::TensorDescriptor{miopen_type<T>{}, lens, strides};
Contributor:

as @CAHEK7 suggested, it's better to use tensor<T> than TensorDescriptor.

Contributor Author:

Yes, please refer to my reply below.

Contributor Author:

On the 'two copies of B' question: I was just replicating the original test cases, which use repeated values of h and w for the image, filter, and pad sizes. It ends up running two squares and two rectangles of each. For example, the image sizes tested are [9x9], [14x9], [9x14], and [14x14].

@junliume (Collaborator) commented Aug 1, 2024

@randyspauldingamd can we follow up and avoid a hanging PR? Thanks!
