Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework the Scheduling Example #468

Merged
merged 12 commits into from
Jul 20, 2023
Merged

Rework the Scheduling Example #468

merged 12 commits into from
Jul 20, 2023

Conversation

rachitnigam
Copy link
Collaborator

I was reading the scheduling example and found it a bit magical so trying to rewrite it a bit and add more explanations for how exactly the scheduling operators work and what the goals are. Not ready for review yet so I'll ping when ready.

@codecov-commenter
Copy link

codecov-commenter commented Jul 17, 2023

Codecov Report

Merging #468 (96fabf3) into master (ffc8e6c) will decrease coverage by 0.06%.
The diff coverage is 16.66%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

@@            Coverage Diff             @@
##           master     #468      +/-   ##
==========================================
- Coverage   86.42%   86.37%   -0.06%     
==========================================
  Files          78       78              
  Lines       17929    17933       +4     
==========================================
- Hits        15496    15490       -6     
- Misses       2433     2443      +10     
Impacted Files Coverage Δ
src/exo/platforms/x86.py 100.00% <ø> (ø)
src/exo/API_scheduling.py 88.33% <16.66%> (-0.49%) ⬇️

... and 1 file with indirect coverage changes

@rachitnigam rachitnigam marked this pull request as ready for review July 17, 2023 00:33
@gilbo
Copy link
Contributor

gilbo commented Jul 17, 2023

First off, thanks so much for rewriting this. I have a tendency to just plow through with a ton of nitpicky criticisms, so please don't take it as anything other than constructive. This is so much better than what we had before.

I'm reading through the rendered version alone for comments here...

  • The inline LaTeX appears to be broken?
  • It may be worth motivating why we would want to write a 6x16xK dimension matrix multiplication "microkernel". You can refer to the BLIS project here and say that one popular approach to implementing high performance BLAS libraries is built around specializing these kinds of inner-loop routines for different architectures. Creating and tuning these sorts of high-performance micro-kernels up to near peak machine throughput is the most basic step in creating high performance GEMMs and similar routines.
  • You can probably assert that most vectorizing compilers will not correctly optimize this code. JRK may have useful opinions on how braggadocio you can safely get here, but it doesn't hurt to turn it up a bit within reason.
  • "Also, just to keep things easy to follow, we're going to rename our kernel" - a slight tweak, but we recommend that users always rename any variant of a procedure that they are trying to schedule. This prevents name conflicts if and when you try to output multiple versions in the same C-library later on.
  • We have called things like reorder_loops by the term "scheduling operators." I'm not sure if you want to use that term or not. ¯\_(ツ)_/¯ "scheduling command" sounds fine too.
  • You may want to mention that "proc" is short for "Procedure"
  • Your expansion of the 'j k' nested loop pattern is slightly wrong. The first line should not have a _ after the loop. The body is the second loop!
  • "Finally the print(avx) shows us..." You may want to suggest that the best way to do scheduling is ABP: Always Be Printing.
  • "...represented by 12 8-wide vectors..." -> "stored in"?
  • "Even though we're streaming..." You might want to consider referring to this as "register blocking" here. Readers may have less trouble Googling that term and will see that it's a standard optimization strategy for the inner-most loops of programs.
  • In "Vectorizing the Output" and continuing from there, you leave off printing the output. I realize it's a very verbose thing to do, but it might be a good idea none-the-less, given that the example writeup is already long?
  • "repeat which applies a scheduling operator till no new matches are found" A slight correction is to say that it keeps applying until the operation fails whether because of a pattern match failing or for whatever other reason.

Btw, for what K and machine specs are you reporting those perf numbers? It might be a good idea to make sure we're not doing something wrong performance-wise.

@jrk
Copy link
Contributor

jrk commented Jul 17, 2023

Dumb thing this made me notice (but was there all along): is it really only a 2.2x speedup over the trivial triple for loop?

@rachitnigam
Copy link
Collaborator Author

@gilbo Thanks for the comments! I'll address them soon. @jrk the perf numbers are from the old text so I haven't validated them. After @gilbo's comment, I was just planning to remove the output since I haven't been able to validate it on my M1 machine

@rachitnigam
Copy link
Collaborator Author

rachitnigam commented Jul 18, 2023

Okay, addressed the comments! As far as the number go, it seems that Rosetta 2 cannot emulate AVX2 instructions so I can get rid of the output from running that command for now and open and issue to add real numbers from running on an x86 machine in a future PR?

@rachitnigam
Copy link
Collaborator Author

@gilbo if the changes look good, we should merge

@rachitnigam rachitnigam enabled auto-merge (squash) July 19, 2023 17:20
@rachitnigam rachitnigam requested a review from gilbo July 19, 2023 17:20
Copy link
Contributor

@gilbo gilbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some very slight tweaks. Main thing is we want to pull out those performance numbers before merging the PR

examples/README.md Outdated Show resolved Hide resolved
examples/README.md Outdated Show resolved Hide resolved
examples/README.md Outdated Show resolved Hide resolved
@rachitnigam
Copy link
Collaborator Author

@gilbo addressed comments!

@rachitnigam rachitnigam merged commit 700fe3b into master Jul 20, 2023
5 of 6 checks passed
@rachitnigam rachitnigam deleted the examples branch July 20, 2023 17:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants