@junrushao junrushao commented Jul 2, 2023

This PR enhances Decode-GEMV rule with the following changes:

  • Normalize the GEMV iter domain to S-R-C via transform-block-layout.
    This helps further analysis and scheduling, for example when the
    original reduction block has no spatial loop.
  • Get rid of the ad hoc iter type analysis, including the logic calling
    into a TVM packed func tir.schedule.GetLoopIterType using
    tvm._ffi.get_global_func.
  • Split out the logic for two separate cases of scheduling, where the
    innermost dimension is spatial or reduction.
  • Introduce suggest_threads_per_block to guess the number of threads to
    allocate to each threadblock. This avoids the previous case where
    dlight allocates 256 threads for a workload whose degree of parallelism
    is only 128.
  • Misc improvements.

The rest of the changes are split out into separate PRs that are already
merged to main.
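
To illustrate the thread-count heuristic described in the list above, here is a minimal, self-contained sketch. It is not the dlight implementation: the function name `guess_threads_per_block`, its parameters, and the power-of-two policy are all assumptions made for illustration. The only point it demonstrates is that the thread count should be capped by the workload's own parallelism, so a loop of extent 128 gets 128 threads rather than 256.

```python
from typing import List, Optional


def guess_threads_per_block(
    extents: List[Optional[int]],
    max_threads: int = 256,
    dynamic_default: int = 32,
) -> List[int]:
    """Hypothetical helper: pick a power-of-two thread count for each loop.

    `extents[i]` is the static extent of loop i, or None if it is dynamic.
    The product of the returned counts never exceeds `max_threads`.
    """
    threads = [1] * len(extents)
    budget = max_threads

    # Static loops first: grow the thread count while it still divides the
    # extent and fits in the remaining budget.
    for i, extent in enumerate(extents):
        if extent is None:
            continue
        t = 1
        while t * 2 <= min(extent, budget) and extent % (t * 2) == 0:
            t *= 2
        threads[i] = t
        budget //= t

    # Dynamic loops get whatever budget is left, capped at `dynamic_default`
    # (an assumed cap, since the real extent is unknown at schedule time).
    for i, extent in enumerate(extents):
        if extent is not None:
            continue
        t = 1
        while t * 2 <= min(budget, dynamic_default):
            t *= 2
        threads[i] = t
        budget //= t

    return threads
```

For example, `guess_threads_per_block([128])` yields `[128]`, while a loop of extent 4096 is capped at the assumed limit of 256 threads.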

tvm-bot commented Jul 2, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

  • No users to tag found in teams: dlight See #10317 for details

Generated by tvm-bot

@junrushao junrushao force-pushed the feature/2023-07-01/gemv-compute-at branch 6 times, most recently from 0596441 to 5d715aa Compare July 3, 2023 06:08
@junrushao junrushao marked this pull request as ready for review July 3, 2023 06:08
@junrushao junrushao force-pushed the feature/2023-07-01/gemv-compute-at branch 5 times, most recently from dc62d15 to 0d8ff66 Compare July 4, 2023 23:51
The rest of the changes are split out into separate PRs that are already
merged to main:
- [x] Pass hints to the arithmetic analyzer that shape variables are
  positive (apache#15210)
- [x] Eliminate unnecessary block predicate generation, which should be
  provable via affine analysis (apache#15193)
- [x] Shrink local memory allocation if only one element `X[threadIdx.x]`
  is used (apache#15207)
@junrushao junrushao force-pushed the feature/2023-07-01/gemv-compute-at branch from 0d8ff66 to b25bd0b Compare July 5, 2023 00:51
@MasterJH5574 MasterJH5574 merged commit 04f22a9 into apache:unity Jul 5, 2023
dynamic: List[int] = []
for i, loop in enumerate(loops):
    loop_extent = loop.extent
    if isinstance(loop_extent, tir.IntImm):
We should be able to factor the loop extent into a constant component and a dynamic component; this would handle extents like 32 * n.
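
A rough sketch of what that factoring could look like, written in plain Python so it runs without TVM. It assumes the extent has already been flattened into a list of multiplicative factors, with `int` standing for a constant and `str` for a symbolic shape variable; in TVM the extent would be a `tir.PrimExpr` and the decomposition would have to walk the expression instead. The name `split_extent` is hypothetical.

```python
from typing import List, Tuple, Union

# Assumed representation: an int is a constant factor, a str is the name of
# a symbolic (dynamic) shape variable.
Factor = Union[int, str]


def split_extent(factors: List[Factor]) -> Tuple[int, List[str]]:
    """Hypothetical helper: split a product into constant and dynamic parts."""
    const = 1
    dynamic: List[str] = []
    for factor in factors:
        if isinstance(factor, int):
            const *= factor  # fold every constant factor together
        else:
            dynamic.append(factor)  # keep symbolic factors as they are
    return const, dynamic


# An extent of 32 * n splits into the constant 32 and the dynamic part [n].
const_part, dynamic_part = split_extent([32, "n"])
```

With this split, the constant part (32 here) could be used for thread allocation as if the loop were static, and only the remaining factor `n` would be treated as dynamic.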
