Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different domains for statements #500

Closed
thisiscam opened this issue Oct 10, 2021 · 10 comments
Closed

Different domains for statements #500

thisiscam opened this issue Oct 10, 2021 · 10 comments

Comments

@thisiscam
Copy link

thisiscam commented Oct 10, 2021

I'm wondering if it is possible to specify a list of statements and associate each statement with its own domain in loopy?
I looked into the documentation, and wasn't sure if this is possible.

If I understand correctly, I might need to take the product of the domains of all the statements, so that loopy will do a projection for each statement? If that's the case, it seems fine but a hassle for my use case.

@kaushikcfd
Copy link
Collaborator

kaushikcfd commented Oct 10, 2021

Hi @thisiscam,
Yep, it's possible to do that. In loopy, a kernel is list of statements+list of domains. Each domain is described via an ISL basic set. For each statement it is known which "inames"1 the statement is a part of, and the final domain for the generated code of a statement is obtained by intersecting the projection of each domain onto the inames of a statement2.

Below is an example kernel that has 2 non-interacting domains and 2 statements, with one domain corresponding to each statement:

knl = lp.make_kernel(
    ["{[i0, i1]: 0<=i0, i1<10}",
     "{[i2, i3]: 0<=i2, i3<10}"],
    """
    <> tmp[i0, i1] = 2*i0 + i1
    out[i2, i3] = 2*tmp[i2, i3]
    """)

print(lp.generate_code_v2(knl).device_code())

generates the code:

__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global int *__restrict__ out)
{
  int tmp[10 * 10];

  for (int i0 = 0; i0 <= 9; ++i0)
    for (int i1 = 0; i1 <= 9; ++i1)
      tmp[10 * i0 + i1] = 2 * i0 + i1;
  for (int i2 = 0; i2 <= 9; ++i2)
    for (int i3 = 0; i3 <= 9; ++i3)
      out[10 * i2 + i3] = 2 * tmp[10 * i2 + i3];
}

Footnotes

  1. loopy's term for loop induction variables

  2. We maintain some internal variables so that the relevant domain is computed in O(N), N being the number of inames surrounding a statement.

@thisiscam
Copy link
Author

thisiscam commented Oct 10, 2021

Thanks for the quick response!

It wasn't clear to me that the list of domains and list of statements have an one-to-one correspondence.
For example, the following code taken from the tutorial has one domain but two statements -- which I don't understand how the mapping is specified/assumed? I looked at your explanation again and I understand this now. It seems like I need to use unique inames for each statement, so that they map to different domains?

knl = lp.make_kernel(
    "{ [i,j,ii,jj]: 0<=i,j,ii,jj<n }",
    """
    out[j,i] = a[i,j] {id=transpose}
    out[ii,jj] = 2*out[ii,jj]  {dep=transpose}
    """,
    [lp.GlobalArg("out", shape=lp.auto, is_input=False), ...])
knl = lp.prioritize_loops(knl, "i,j,ii,jj")

Further, is it possible to specify dependencies via ISL Maps, instead of the {dep=...} syntax?

@kaushikcfd
Copy link
Collaborator

It seems like I need to use unique inames for each statement, so that they map to different domains?

The inames need not be unique i.e. multiple statements can be within a single iname. For example here is a kernel that has multiple statements in the same iname, and the generated code will put the statements sharing inames into the same loop nest.

Further, is it possible to specify dependencies via ISL Maps, instead of the {dep=...} syntax?

Not yet. But #169 is WIP.

@thisiscam
Copy link
Author

thisiscam commented Oct 10, 2021

The inames need not be unique i.e. multiple statements can be within a single iname.

Sorry I wasn't clear. I meant that if I already have a list of statement-domain pairs, in order to map it to loopy representation, I would need to have unique inames for each statement-domain pair, so that loopy will do the correct one-to-one projection?

But #169 is WIP.

Thanks! Any idea on a timeline for this?

I should mention that my current goal is to find a code generator that operates similar to ISL's scheduling and code generation, while frees me from manually translating ISL ASTs into real CPU/GPU code.

@kaushikcfd
Copy link
Collaborator

I would need to have unique inames for each statement-domain pair, so that loopy will do the correct one-to-one projection?

Yep, that sounds correct. In loopy each iname gets mapped to a C-styled for-loop or a hardware loop, but a restriction to keep in mind is that each loop can be entered exactly once, if not possible due to the user provided statement DAG then a scheduling error is raised.

Thanks! Any idea on a timeline for this?

@inducer might have a better idea.

@thisiscam
Copy link
Author

but a restriction to keep in mind is that each loop can be entered exactly once, if not possible due to the user provided statement DAG then a scheduling error is raised.

This seems like a critical restriction. Can you give an example of a statement DAG? Thanks!

@kaushikcfd
Copy link
Collaborator

This kernel for example:

knl = lp.make_kernel(
    "{[i0, i1]: 0<=i0, i1<10}",
    """
    a[i0] = 1  {id=write_a}
    b[i1] = 2  {id=write_b, dep=write_a}
    c[i0] = 3  {id=write_c, dep=write_b}
    """)

print(lp.generate_code_v2(knl).device_code())

fails with the error message:

ERROR: Sorry--loopy did not find a schedule for your kernel

as unless the i0 is entered twice (once for writing "a" and once for writing "c") it won't be possible to have a schedulable loopy kernel.

A common trick in this case is to fix this by the transformation; knl = lp.duplcate_inames(knl, 'i0', within='id:write_c'), after which we generate the following OpenCL code;

__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global int *__restrict__ a, __global int *__restrict__ b, __global int *__restrict__ c)
{
  for (int i0 = 0; i0 <= 9; ++i0)
    a[i0] = 1;
  for (int i1 = 0; i1 <= 9; ++i1)
    b[i1] = 2;
  for (int i0_0 = 0; i0_0 <= 9; ++i0_0)
    c[i0_0] = 3;
}

@thisiscam
Copy link
Author

thisiscam commented Oct 10, 2021

I see.

Does loopy always schedule to multiple loops when the inames are different?
For example, can loopy do fusion for your aforementioned example (without changing to the same inames):

knl = lp.make_kernel(
    ["{[i0, i1]: 0<=i0, i1<10}",
     "{[i2, i3]: 0<=i2, i3<10}"],
    """
    <> tmp[i0, i1] = 2*i0 + i1
    out[i2, i3] = 2*tmp[i2, i3]
    """)

print(lp.generate_code_v2(knl).device_code())

?

A higher level question, I think, is what's the diff between loopy's scheduling and ISL's scheduling?
My understanding is that the two approach work differently: in ISL, I specify statements and their domain, and ISL will help me rearrange them into nested loops (with some automatic fusion etc.)
In loopy, the loop nests are contolled by inames

@kaushikcfd
Copy link
Collaborator

No loopy doesn't attempt fusion automatically because profitability of loop fusion is highly domain specific and hence must be user-guided. We have plans to implement loop fusion heuristics available in the literature in #493. (should be available in main in ~1 week)

@thisiscam
Copy link
Author

@kaushikcfd Many thanks. That helps a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants