Rework the Scheduling Example #468

rachitnigam · 2023-07-10T19:52:46Z

I was reading the scheduling example and found it a bit magical so trying to rewrite it a bit and add more explanations for how exactly the scheduling operators work and what the goals are. Not ready for review yet so I'll ping when ready.

codecov-commenter · 2023-07-17T00:18:30Z

Codecov Report

Merging #468 (96fabf3) into master (ffc8e6c) will decrease coverage by 0.06%.
The diff coverage is 16.66%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

@@            Coverage Diff             @@
##           master     #468      +/-   ##
==========================================
- Coverage   86.42%   86.37%   -0.06%     
==========================================
  Files          78       78              
  Lines       17929    17933       +4     
==========================================
- Hits        15496    15490       -6     
- Misses       2433     2443      +10

Impacted Files	Coverage Δ
src/exo/platforms/x86.py	`100.00% <ø> (ø)`
src/exo/API_scheduling.py	`88.33% <16.66%> (-0.49%)`	⬇️

... and 1 file with indirect coverage changes

gilbo · 2023-07-17T07:29:31Z

First off, thanks so much for rewriting this. I have a tendency to just plow through with a ton of nitpicky criticisms, so please don't take it as anything other than constructive. This is so much better than what we had before.

I'm reading through the rendered version alone for comments here...

The inline LaTeX appears to be broken?
It may be worth motivating why we would want to write a 6x16xK dimension matrix multiplication "microkernel". You can refer to the BLIS project here and say that one popular approach to implementing high performance BLAS libraries is built around specializing these kinds of inner-loop routines for different architectures. Creating and tuning these sorts of high-performance micro-kernels up to near peak machine throughput is the most basic step in creating high performance GEMMs and similar routines.
You can probably assert that most vectorizing compilers will not correctly optimize this code. JRK may have useful opinions on how braggadocio you can safely get here, but it doesn't hurt to turn it up a bit within reason.
"Also, just to keep things easy to follow, we're going to rename our kernel" - a slight tweak, but we recommend that users always rename any variant of a procedure that they are trying to schedule. This prevents name conflicts if and when you try to output multiple versions in the same C-library later on.
We have called things like reorder_loops by the term "scheduling operators." I'm not sure if you want to use that term or not. ¯\_(ツ)_/¯ "scheduling command" sounds fine too.
You may want to mention that "proc" is short for "Procedure"
Your expansion of the 'j k' nested loop pattern is slightly wrong. The first line should not have a _ after the loop. The body is the second loop!
"Finally the print(avx) shows us..." You may want to suggest that the best way to do scheduling is ABP: Always Be Printing.
"...represented by 12 8-wide vectors..." -> "stored in"?
"Even though we're streaming..." You might want to consider referring to this as "register blocking" here. Readers may have less trouble Googling that term and will see that it's a standard optimization strategy for the inner-most loops of programs.
In "Vectorizing the Output" and continuing from there, you leave off printing the output. I realize it's a very verbose thing to do, but it might be a good idea none-the-less, given that the example writeup is already long?
"repeat which applies a scheduling operator till no new matches are found" A slight correction is to say that it keeps applying until the operation fails whether because of a pattern match failing or for whatever other reason.

Btw, for what K and machine specs are you reporting those perf numbers? It might be a good idea to make sure we're not doing something wrong performance-wise.

jrk · 2023-07-17T20:00:52Z

Dumb thing this made me notice (but was there all along): is it really only a 2.2x speedup over the trivial triple for loop?

rachitnigam · 2023-07-17T20:04:04Z

@gilbo Thanks for the comments! I'll address them soon. @jrk the perf numbers are from the old text so I haven't validated them. After @gilbo's comment, I was just planning to remove the output since I haven't been able to validate it on my M1 machine

rachitnigam · 2023-07-18T19:08:02Z

Okay, addressed the comments! As far as the number go, it seems that Rosetta 2 cannot emulate AVX2 instructions so I can get rid of the output from running that command for now and open and issue to add real numbers from running on an x86 machine in a future PR?

rachitnigam · 2023-07-19T17:20:00Z

@gilbo if the changes look good, we should merge

gilbo

Some very slight tweaks. Main thing is we want to pull out those performance numbers before merging the PR

examples/README.md

rachitnigam · 2023-07-19T21:10:03Z

@gilbo addressed comments!

rachitnigam added 7 commits July 10, 2023 13:38

Move instructions for repo installation to the top of the README

3466b2f

Rework the Scheduling Example

eb02569

Merge branch 'master' into examples

143e675

Handle cursor for reorder_loops

5ea7f07

First section of the scheduling tut done

37d0ccd

Merge branch 'master' into examples

9a65d1a

remove old explanation

c866407

finish off the rewrite

0d91b90

rachitnigam marked this pull request as ready for review July 17, 2023 00:33

rachitnigam added 2 commits July 18, 2023 14:44

fix math

4cb958e

address other comments

7d8dd21

rachitnigam enabled auto-merge (squash) July 19, 2023 17:20

rachitnigam requested a review from gilbo July 19, 2023 17:20

gilbo requested changes Jul 19, 2023

View reviewed changes

examples/README.md Outdated Show resolved Hide resolved

examples/README.md Outdated Show resolved Hide resolved

examples/README.md Outdated Show resolved Hide resolved

more comments

dd52059

Merge branch 'master' into examples

96fabf3

gilbo approved these changes Jul 20, 2023

View reviewed changes

rachitnigam merged commit 700fe3b into master Jul 20, 2023
5 of 6 checks passed

rachitnigam deleted the examples branch July 20, 2023 17:59

gigony mentioned this pull request Jul 16, 2024

Segfault happens with examples/main.c #678

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rework the Scheduling Example #468

Rework the Scheduling Example #468

rachitnigam commented Jul 10, 2023

codecov-commenter commented Jul 17, 2023 •

edited

Loading

gilbo commented Jul 17, 2023 •

edited

Loading

jrk commented Jul 17, 2023

rachitnigam commented Jul 17, 2023

rachitnigam commented Jul 18, 2023 •

edited

Loading

rachitnigam commented Jul 19, 2023

gilbo left a comment

rachitnigam commented Jul 19, 2023

Rework the Scheduling Example #468

Rework the Scheduling Example #468

Conversation

rachitnigam commented Jul 10, 2023

codecov-commenter commented Jul 17, 2023 • edited Loading

Codecov Report

gilbo commented Jul 17, 2023 • edited Loading

jrk commented Jul 17, 2023

rachitnigam commented Jul 17, 2023

rachitnigam commented Jul 18, 2023 • edited Loading

rachitnigam commented Jul 19, 2023

gilbo left a comment

Choose a reason for hiding this comment

rachitnigam commented Jul 19, 2023

codecov-commenter commented Jul 17, 2023 •

edited

Loading

gilbo commented Jul 17, 2023 •

edited

Loading

rachitnigam commented Jul 18, 2023 •

edited

Loading