
Conversation

@SebastianBoblest
Contributor

This is a proposed change to the new topi x86 concat implementation in #11341. It uses simple lists of ints instead of te.tensor.Tensor for the array offsets and loop-variable extents. I have only inspected the generated C code and verified that the relevant unit tests pass.
New version:

  for (int32_t j = 0; j < 4; ++j) {
    concatenate_ext[j] = placeholder[j];
  }
  for (int32_t j1 = 0; j1 < 4; ++j1) {
    concatenate_ext[(j1 + 4)] = placeholder1[j1];
  }
  return 0;

Old version:

  void* const_vector_let = (&(global_workspace_6_var[64]));
  void* cumsum_let = (&(global_workspace_6_var[48]));
  for (int32_t i = 0; i < 2; ++i) {
    ((int64_t*)const_vector_let)[i] = ((i == 1) ? (int64_t)4 : ((i == 0) ? (int64_t)4 : (int64_t)0));
  }
  for (int32_t i1 = 0; i1 < 2; ++i1) {
    ((int64_t*)cumsum_let)[i1] = ((i1 == 1) ? (int64_t)4 : (int64_t)0);
  }

  // i.e. cumsum_let effectively holds {0, 4}

  for (int64_t j = 0; j < ((int64_t*)const_vector_let)[0]; ++j) {
    concatenate_ext[(((int64_t*)cumsum_let)[0] + j)] = placeholder[j];
  }
  for (int64_t j1 = 0; j1 < ((int64_t*)const_vector_let)[1]; ++j1) {
    concatenate_ext[(((int64_t*)cumsum_let)[1] + j1)] = placeholder1[j1];
  }
  return 0;
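The simplification comes from folding these offsets and extents at compile time. A minimal Python sketch of the idea (the function name and shapes are illustrative, not the actual topi code):

```python
# Hypothetical sketch of computing concat loop extents and output offsets
# as plain Python ints (as this PR does), rather than materializing them
# as te.tensor.Tensor values that must be filled at run time.
from itertools import accumulate

def concat_extents_and_offsets(input_shapes, axis):
    """Return per-input extents along `axis` and cumulative output offsets."""
    extents = [shape[axis] for shape in input_shapes]
    offsets = [0] + list(accumulate(extents))[:-1]
    return extents, offsets

# Two inputs of extent 4 along the concat axis, as in the snippets above:
extents, offsets = concat_extents_and_offsets([(4,), (4,)], axis=0)
print(extents, offsets)  # [4, 4] [0, 4]
```

Because the extents and offsets are already plain ints when the IR is built, no const_vector/cumsum buffers need to be allocated or filled at run time.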

@DzAvril @shtinsa @MichaelJKlaiber @UlrikHjort-Bosch @vdkhoi @masahi

@altanh
Contributor

altanh commented Jun 21, 2022

I got some promising improvements with this change (same setup as in #11341 (comment)):
[concat_compare benchmark chart]
thanks for the PR!

@masahi
Member

masahi commented Jun 22, 2022

@shtinsa I think const_vector things are intended to be unrolled during scheduling, which will probably end up in the same clean IR as the new version. That's what we do for winograd convolution, where we use const_matrix. Since the new concat is written in IR builder, I'm not sure if we can do the same unrolling for cumsum etc here.

@masahi masahi merged commit 5056eb7 into apache:main Jun 22, 2022
@shtinsa
Contributor

shtinsa commented Jun 22, 2022

Hello @masahi, yes, I saw the const_matrix solution and it works at the IR level, but IMHO it looks like a workaround. I suppose it is necessary to extend the codegen part to support const_vector with an implementation of a constexpr analogue from C++17, but for the provided concatenation case it is enough to have the following code snippets:
static const int cumsum_###[2] = {...};
static const int const_vector_###[2] = {...};
where the array sizes and contents should be formed during evaluation of the const_vector tensors. They are known in the case of static planning; with dynamic planning the concatenation code may have a similar structure, but cumsum_### and const_vector_### should be formed for batch size 1 and their values scaled by the batch size at execution time.
These arrays would be moved to the data section of the output .so file.
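To illustrate, a rough Python sketch of forming and emitting such tables (all names, and the batch-scaling helper, are hypothetical):

```python
# Hypothetical sketch of shtinsa's suggestion: form the offset tables at
# compile time (for batch size 1 in the dynamic case) and emit them as
# static const arrays that end up in the data section.
def emit_static_tables(suffix, cumsum, const_vector):
    def decl(ident, values):
        body = ", ".join(str(v) for v in values)
        return f"static const int {ident}_{suffix}[{len(values)}] = {{{body}}};"
    return "\n".join([decl("cumsum", cumsum), decl("const_vector", const_vector)])

def scale_by_batch(values, batch):
    # For dynamic planning: scale the batch-1 tables at execution time.
    return [v * batch for v in values]

print(emit_static_tables("0", [0, 4], [4, 4]))
# static const int cumsum_0[2] = {0, 4};
# static const int const_vector_0[2] = {4, 4};
```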

@areusch
Contributor

areusch commented Jun 22, 2022

we discussed the effect of this PR on the codebase at the TVM Community Meeting this morning. @SebastianBoblest notes that this PR caused a significant change in the code generated for their model (see the community meeting recording), and wonders:

  • how could we improve the development process so we can detect this during the code-review process? it's important for them to be able to track changes like this.

@SebastianBoblest notes that it's hard to write a test against the generated code at this level

  • Eric: might be able to test this more easily at the TIR level, since we could traverse the IR and make assertions on that.

@kparzysz-quic : what don't we like about this particular code snippet:

  • @SebastianBoblest : this code is mostly padding 0's so the additional arrays are unnecessary and suboptimal.
  • @kparzysz-quic: note it's going to be hard to read because it's essentially assembly. want to understand whether it's worse in terms of performance, memory, code size, etc. need specific metrics we are trying to address.

@manupa-arm :

  • would call this a "fake fusion"--function signature indicates that fusion should happen, but for loop structure looks like it hasn't happened. This PR may have broken fusion.
  • in this case, USMP was able to translate the tir.allocate node into a local array access that may enable the downstream C compiler to optimize it away. but in other cases, if the tir.allocate remains, no compiler would be able to optimize it.
  • something in TE is stopping the dependency computation done in the compute_inline pass.
  • not sure what type of test should be there to ensure operator fusion happens.
  • Michael: would be possible to check the memory footprint e.g. by checking the allocate nodes?
  • note there was a test in this PR that recorded a change in the global workspace size, but it is hard to detect a change like this just by looking at the global workspace.

@kparzysz-quic: a small change should not produce a large change downstream. this is difficult to guarantee because the complexity of the compiler can cause cascading differences.

  • this is difficult to detect in general. we need to come up with something to watch to help us know when this changes. the main concern is how do we algorithmically determine when something changes?

@driazati : one way we can address this is to make the generated code part of the test and require folks to update the test.

  • @kparzysz-quic: if someone unfamiliar with this architecture makes a change in TE and sees that test change, how would they decide whether that change is acceptable or not?
  • @areusch : we have a tool that translates graph specifications into TIR. at the point TIR hits an arch-specific codegen, it becomes hard for someone to assert what's an improvement vs detriment to the codebase. we could consider that this lends a sense of ownership over that component.
  • @areusch : we could also codegen a set of example workloads on various backends, then generate a report page in CI and highlight the differences. curious to know what folks would like to see in such a tool.
  • @kparzysz-quic: what should this tool show to you? we could have a tracking for e.g. allocations and report e.g. an aggregation and graph over time. change-detector tests might slow review down. but metrics can be too coarse.
  • @areusch: we don't have to make the diff tool into a test--it could just report.

to follow-up here, we should probably work up a small RFC or discuss post to describe the proposed additional tooling.
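As a sketch of what a non-gating report tool could look like, assuming we store the generated sources of a few example workloads per backend (all names are hypothetical):

```python
# Hypothetical sketch of the reporting tool discussed above: diff generated
# sources for a set of example workloads and emit a report instead of a
# pass/fail test result.
import difflib

def codegen_report(baseline: dict, current: dict) -> str:
    """baseline/current map workload name -> generated source text."""
    lines = []
    for name in sorted(set(baseline) | set(current)):
        old = baseline.get(name, "").splitlines()
        new = current.get(name, "").splitlines()
        diff = list(difflib.unified_diff(old, new, lineterm="", n=1))
        if diff:
            lines.append(f"=== {name}: {len(diff)} diff lines ===")
            lines.extend(diff)
        else:
            lines.append(f"=== {name}: unchanged ===")
    return "\n".join(lines)

report = codegen_report(
    {"concat_x86": "for (j = 0; j < 4; ++j)"},
    {"concat_x86": "for (j = 0; j < 8; ++j)"},
)
print(report)
```

CI could publish this report per pull request so reviewers see codegen drift without the diff becoming a change-detector test that blocks merges.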

@shtinsa
Contributor

shtinsa commented Jun 23, 2022

I watched the stream and it looks like the problem is related to incorrect fusing of the GRU/Attention output. @SebastianBoblest could you please provide a test scenario to reproduce the issue?

@SebastianBoblest
Contributor Author

> I watched the stream and it looks like the problem is related to incorrect fusing of the GRU/Attention output. @SebastianBoblest could you please provide a test scenario to reproduce the issue?

Hi, I will try to create a minimal script for this next week. Thanks for still looking into this!

@SebastianBoblest
Contributor Author

> I watched the stream and it looks like the problem is related to incorrect fusing of the GRU/Attention output. @SebastianBoblest could you please provide a test scenario to reproduce the issue?

Hi, I created a minimal example for an LSTM.
It includes the Python script, the model, and the C sources.
I used today's TVM main branch.
concat_example.zip

@shtinsa
Contributor

shtinsa commented Jun 30, 2022

Hello, thanks a lot!

blackkker pushed a commit to blackkker/tvm that referenced this pull request Jul 7, 2022
* changed x86/concat to use lists of ints instead of te.tensor.Tensor for loop extents and array offsets

* typos fixed

* removed unused import

* fixed micro model test

* fixed micro model test
@shtinsa
Contributor

shtinsa commented Jul 8, 2022

Hello @SebastianBoblest, sorry for the delayed answer. I would like to clarify the issue from the sample above. According to the C codegen output, the new functionality produces the following code snippet:

TVM_DLL int32_t tvmgen_default_fused_squeeze_concatenate_1(float* placeholder, float* placeholder1, float* placeholder2, float* concatenate_ext, uint8_t* global_const_workspace_14_var, uint8_t* global_workspace_15_var) {
  void* T_squeeze_let = (&(global_workspace_15_var[32]));
  for (int32_t ax0_ax1_fused = 0; ax0_ax1_fused < 3; ++ax0_ax1_fused) {
    ((float*)T_squeeze_let)[ax0_ax1_fused] = placeholder1[ax0_ax1_fused];
  }
  for (int32_t j = 0; j < 3; ++j) {
    concatenate_ext[j] = ((float*)T_squeeze_let)[j];
  }
  for (int32_t j1 = 0; j1 < 2; ++j1) {
    concatenate_ext[(j1 + 3)] = placeholder2[j1];
  }
  return 0;
}

Here the squeeze operation is not fused with the concatenation code. I suppose you would like to see something like this:

TVM_DLL int32_t tvmgen_default_fused_squeeze_concatenate_1(float* placeholder, float* placeholder1, float* placeholder2, float* concatenate_ext, uint8_t* global_const_workspace_14_var, uint8_t* global_workspace_15_var) {
  for (int32_t j = 0; j < 3; ++j) {
    concatenate_ext[j] = placeholder1[j];
  }
  for (int32_t j1 = 0; j1 < 2; ++j1) {
    concatenate_ext[(j1 + 3)] = placeholder2[j1];
  }
  return 0;
}

where the first and second loops are fused and the T_squeeze_let buffer is removed from the global workspace?

@SebastianBoblest
Contributor Author

> Hello @SebastianBoblest, sorry for the delayed answer, and I would like to clarify the issue from the sample above. […]

Hi @shtinsa,
thanks for taking the time to look into this. Yes, that is exactly the result I would like to see.
