
Conversation

@SebastianBoblest
Contributor

This is a proposed change to the new topi x86 concat implementation in #11341. It uses simple lists of ints instead of te.tensor.Tensor for the array offsets and loop-variable extents. I have only inspected the generated C code and verified that the relevant unit tests pass.
New version:

  for (int32_t j = 0; j < 4; ++j) {
    concatenate_ext[j] = placeholder[j];
  }
  for (int32_t j1 = 0; j1 < 4; ++j1) {
    concatenate_ext[(j1 + 4)] = placeholder1[j1];
  }
  return 0;

Old version:

  void* const_vector_let = (&(global_workspace_6_var[64]));
  void* cumsum_let = (&(global_workspace_6_var[48]));
  for (int32_t i = 0; i < 2; ++i) {
    ((int64_t*)const_vector_let)[i] = ((i == 1) ? (int64_t)4 : ((i == 0) ? (int64_t)4 : (int64_t)0));
  }
  for (int32_t i1 = 0; i1 < 2; ++i1) {
    ((int64_t*)cumsum_let)[i1] = ((i1 == 1) ? (int64_t)4 : (int64_t)0);
  }

  // i.e. cumsum_let effectively holds {0, 4}

  for (int64_t j = 0; j < ((int64_t*)const_vector_let)[0]; ++j) {
    concatenate_ext[(((int64_t*)cumsum_let)[0] + j)] = placeholder[j];
  }
  for (int64_t j1 = 0; j1 < ((int64_t*)const_vector_let)[1]; ++j1) {
    concatenate_ext[(((int64_t*)cumsum_let)[1] + j1)] = placeholder1[j1];
  }
  return 0;
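The simplification comes from folding these offsets and extents at compile time. A minimal Python sketch of the idea (the function name and shapes are illustrative, not the actual topi code):

```python
# Hypothetical sketch of computing concat loop extents and output offsets
# as plain Python ints (as this PR does), rather than materializing them
# as te.tensor.Tensor values that must be filled at run time.
from itertools import accumulate

def concat_extents_and_offsets(input_shapes, axis):
    """Return per-input extents along `axis` and cumulative output offsets."""
    extents = [shape[axis] for shape in input_shapes]
    offsets = [0] + list(accumulate(extents))[:-1]
    return extents, offsets

# Two inputs of extent 4 along the concat axis, as in the snippets above:
extents, offsets = concat_extents_and_offsets([(4,), (4,)], axis=0)
print(extents, offsets)  # [4, 4] [0, 4]
```

Because the extents and offsets are already plain ints when the IR is built, no const_vector/cumsum buffers need to be allocated or filled at run time.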

@DzAvril @shtinsa @MichaelJKlaiber @UlrikHjort-Bosch @vdkhoi @masahi

@altanh
Contributor

altanh commented Jun 21, 2022

I got some promising improvements with this change (same setup as in #11341 (comment)):
[concat_compare benchmark chart]
thanks for the PR!

@masahi
Member

masahi commented Jun 22, 2022

@shtinsa I think const_vector things are intended to be unrolled during scheduling, which will probably end up in the same clean IR as the new version. That's what we do for winograd convolution, where we use const_matrix. Since the new concat is written in IR builder, I'm not sure if we can do the same unrolling for cumsum etc here.

@masahi masahi merged commit 5056eb7 into apache:main Jun 22, 2022
@shtinsa
Contributor

shtinsa commented Jun 22, 2022

Hello @masahi, yes, I saw the const_matrix solution and it works at the IR level, but IMHO it looks like a workaround. I suppose it is necessary to extend the codegen part to support const_vector with an implementation of a constexpr analogue from C++17, but for the provided concatenation case it is enough to have the following code snippets:
static const int cumsum_###[2] = {...};
static const int const_vector_###[2] = {...};
where the array sizes and contents should be formed during evaluation of the const_vector tensors. They are known in the case of static planning; with dynamic planning the concatenation code may have a similar structure, but cumsum_### and const_vector_### should be formed for batch size 1 and their values scaled by the batch size at execution time.
These arrays would be moved to the data section of the output .so file.
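To illustrate, a rough Python sketch of forming and emitting such tables (all names, and the batch-scaling helper, are hypothetical):

```python
# Hypothetical sketch of shtinsa's suggestion: form the offset tables at
# compile time (for batch size 1 in the dynamic case) and emit them as
# static const arrays that end up in the data section.
def emit_static_tables(suffix, cumsum, const_vector):
    def decl(ident, values):
        body = ", ".join(str(v) for v in values)
        return f"static const int {ident}_{suffix}[{len(values)}] = {{{body}}};"
    return "\n".join([decl("cumsum", cumsum), decl("const_vector", const_vector)])

def scale_by_batch(values, batch):
    # For dynamic planning: scale the batch-1 tables at execution time.
    return [v * batch for v in values]

print(emit_static_tables("0", [0, 4], [4, 4]))
# static const int cumsum_0[2] = {0, 4};
# static const int const_vector_0[2] = {4, 4};
```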

@areusch
Contributor

areusch commented Jun 22, 2022

we discussed the effect of this PR on the codebase at the TVM Community Meeting this morning. @SebastianBoblest notes that this PR caused a significant change in the code generated for their model (see the community meeting recording), and wonders:

  • how could we improve the development process so we can detect this during the code-review process? it's important for them to be able to track changes like this.

@SebastianBoblest notes that it's hard to write a test against the generated code at this level

  • Eric: might be able to test this more easily at the TIR level, since we could traverse the IR and make assertions on that.

@kparzysz-quic : what don't we like about this particular code snippet:

  • @SebastianBoblest : this code is mostly padding 0's so the additional arrays are unnecessary and suboptimal.
  • @kparzysz-quic: note it's going to be hard to read because it's essentially assembly. want to understand whether it's worse in terms of performance, memory, code size, etc. need specific metrics we are trying to address.

@manupa-arm :

  • would call this a "fake fusion"--function signature indicates that fusion should happen, but for loop structure looks like it hasn't happened. This PR may have broken fusion.
  • in this case, USMP was able to translate the tir.allocate node into a local array access that may enable the downstream C compiler to optimize it away. but in other cases, if the tir.allocate remains, no compiler would be able to optimize it.
  • something in TE is stopping the dependency computation done in the compute_inline pass.
  • not sure what type of test should be there to ensure operator fusion happens.
  • Michael: would be possible to check the memory footprint e.g. by checking the allocate nodes?
  • note there was a test in this PR that recorded a change in the global workspace size, but it is hard to detect a change like this just by looking at the global workspace.

@kparzysz-quic: a small change should not produce a large change downstream. this is difficult to guarantee because the complexity of the compiler can cause cascading differences.

  • this is difficult to detect in general. we need to come up with something to watch to help us know when this changes. the main concern is how do we algorithmically determine when something changes?

@driazati : one way we can address this is to make the generated code part of the test and require folks to update the test.

  • @kparzysz-quic: if someone unfamiliar with this architecture makes a change in TE and sees that test change, how would they decide whether that change is acceptable or not?
  • @areusch : we have a tool that translates graph specifications into TIR. at the point TIR hits an arch-specific codegen, it becomes hard for someone to assert what's an improvement vs detriment to the codebase. we could consider that this lends a sense of ownership over that component.
  • @areusch : we could also codegen a set of example workloads on various backends, then generate a report page in CI and highlight the differences. curious to know what folks would like to see in such a tool.
  • @kparzysz-quic: what should this tool show to you? we could have a tracking for e.g. allocations and report e.g. an aggregation and graph over time. change-detector tests might slow review down. but metrics can be too coarse.
  • @areusch: we don't have to make the diff tool into a test--it could just report.

to follow-up here, we should probably work up a small RFC or discuss post to describe the proposed additional tooling.
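As a sketch of what a non-gating report tool could look like, assuming we store the generated sources of a few example workloads per backend (all names are hypothetical):

```python
# Hypothetical sketch of the reporting tool discussed above: diff generated
# sources for a set of example workloads and emit a report instead of a
# pass/fail test result.
import difflib

def codegen_report(baseline: dict, current: dict) -> str:
    """baseline/current map workload name -> generated source text."""
    lines = []
    for name in sorted(set(baseline) | set(current)):
        old = baseline.get(name, "").splitlines()
        new = current.get(name, "").splitlines()
        diff = list(difflib.unified_diff(old, new, lineterm="", n=1))
        if diff:
            lines.append(f"=== {name}: {len(diff)} diff lines ===")
            lines.extend(diff)
        else:
            lines.append(f"=== {name}: unchanged ===")
    return "\n".join(lines)

report = codegen_report(
    {"concat_x86": "for (j = 0; j < 4; ++j)"},
    {"concat_x86": "for (j = 0; j < 8; ++j)"},
)
print(report)
```

CI could publish this report per pull request so reviewers see codegen drift without the diff becoming a change-detector test that blocks merges.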

@shtinsa
Contributor

shtinsa commented Jun 23, 2022

I watched the stream and it looks like the problem is related to incorrect fusing of the GRU/Attention output. @SebastianBoblest could you please provide a test scenario to reproduce the issue?

@SebastianBoblest
Contributor Author

> I watched the stream and it looks like the problem is related to incorrect fusing of the GRU/Attention output. @SebastianBoblest could you please provide a test scenario to reproduce the issue?

Hi, I will try to create a minimal script for this next week. Thanks for still looking into this!

@SebastianBoblest
Contributor Author

> I watched the stream and it looks like the problem is related to incorrect fusing of the GRU/Attention output. @SebastianBoblest could you please provide a test scenario to reproduce the issue?

Hi, I created a minimal example for an LSTM.
It includes the Python script, the model, and the C sources.
I used today's TVM main branch.
concat_example.zip

@shtinsa
Contributor

shtinsa commented Jun 30, 2022

Hello, thanks a lot!

blackkker pushed a commit to blackkker/tvm that referenced this pull request Jul 7, 2022
* changed x86/concat to use lists of ints instead of te.tensor.Tensor for loop extents and array offsets

* typos fixed

* removed unused import

* fixed micro model test

* fixed micro model test
@shtinsa
Contributor

shtinsa commented Jul 8, 2022

Hello @SebastianBoblest, sorry for the delayed answer. I would like to clarify the issue from the sample above. According to the C codegen output, the new functionality produces the following code snippet:

TVM_DLL int32_t tvmgen_default_fused_squeeze_concatenate_1(float* placeholder, float* placeholder1, float* placeholder2, float* concatenate_ext, uint8_t* global_const_workspace_14_var, uint8_t* global_workspace_15_var) {
  void* T_squeeze_let = (&(global_workspace_15_var[32]));
  for (int32_t ax0_ax1_fused = 0; ax0_ax1_fused < 3; ++ax0_ax1_fused) {
    ((float*)T_squeeze_let)[ax0_ax1_fused] = placeholder1[ax0_ax1_fused];
  }
  for (int32_t j = 0; j < 3; ++j) {
    concatenate_ext[j] = ((float*)T_squeeze_let)[j];
  }
  for (int32_t j1 = 0; j1 < 2; ++j1) {
    concatenate_ext[(j1 + 3)] = placeholder2[j1];
  }
  return 0;
}

Here the squeeze operation is not fused with the concatenation code. I suppose you would like to see something like this:

TVM_DLL int32_t tvmgen_default_fused_squeeze_concatenate_1(float* placeholder, float* placeholder1, float* placeholder2, float* concatenate_ext, uint8_t* global_const_workspace_14_var, uint8_t* global_workspace_15_var) {
  for (int32_t j = 0; j < 3; ++j) {
    concatenate_ext[j] = placeholder1[j];
  }
  for (int32_t j1 = 0; j1 < 2; ++j1) {
    concatenate_ext[(j1 + 3)] = placeholder2[j1];
  }
  return 0;
}

where the first and second loops are fused and the T_squeeze_let buffer is removed from the global workspace?

@SebastianBoblest
Contributor Author

> Hello @SebastianBoblest, sorry for the delayed answer, and I would like to clarify the issue from the sample above. […]

Hi @shtinsa,
thanks for taking the time to look into this. Yes, that is exactly the result I would like to see.
