
Output variable placement #1328

Merged
lgritz merged 1 commit into AcademySoftwareFoundation:master from lg-outputlayout on May 13, 2021

Conversation

@lgritz (Collaborator) commented Jan 21, 2021

Up until now, client apps declare which output variables are "renderer
outputs" before shader groups are compiled, and after a shader is
executed it is up to the app to extract the individual output values
and do with them whatever it needs to. Those outputs live on the shader
heap (the temporary storage used to run the shader), and information about
where they live on the heap can be retrieved via find_symbol() and
symbol_address().

New paradigm:

Various inputs and outputs of shaders live in different memory arenas,
and for those inputs/outputs, the app declares (prior to shader group
compilation) which arena it lives in, its offset within the arena, and
the stride between entries (for different shades, such as different
rays in flight, different points on a mesh, pixels in an image, etc.).
The shader JITs into code that automatically pulls inputs from their
sources and copies outputs to their destinations, with no need for client
app code to call find_symbol or symbol_address or do any extra data
copies.

This symbol location data is communicated to the shading system through
new calls to add_symlocs(), passing a new struct called SymLocationDesc,
which holds the aforementioned arena, offset, stride, etc.

Right now, the only supported arena is Outputs, and the actual call
signature to execute(), and to the shader JIT code itself, includes
extra parameters for the base pointer to the output arena (this could
conceivably change from shade to shade) and the integer shade index of
this shade.

I've tried to set this up in a general way, so that additional arenas
will be added later -- places where ShaderGlobals can be customized,
UserData, etc. Also there are placeholders for being able to talk about
absolute addresses and offsets into the heap, though I don't use them
at the moment. Some of the flexibility may be later removed as we learn
more about how we want to use this.

To show it in action, I have modified the "osldeformer" example to use
this new method to place the outputs directly where we want them.
It's just storing results into an array. The output arena base pointer
is the start of the array, the offset of the single output is 0, and
the stride is the size of each array element (one 3-float point).
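
To make this concrete, here is a minimal sketch (not the actual osldeformer code) of that kind of setup. The SymLocationDesc field order, the symbol name "layer1.out_position", and the exact execute() overload shown are assumptions based on the description above:

#include <vector>
#include <OSL/oslexec.h>
using namespace OSL;

void run_deformer(ShadingSystem* shadingsys, ShaderGroupRef group,
                  ShadingContext* ctx, int npoints)
{
    std::vector<Vec3> outputs(npoints);  // results land directly in this array

    // One output symbol in the Outputs arena: offset 0, stride of one
    // 3-float point per shade.
    SymLocationDesc symlocs[] = {
        SymLocationDesc("layer1.out_position",   // hypothetical symbol name
                        TypeDesc::TypePoint, /*derivs=*/false,
                        SymArena::Outputs,
                        /*offset=*/0, /*stride=*/sizeof(Vec3))
    };
    shadingsys->add_symlocs(group.get(), symlocs);

    for (int i = 0; i < npoints; ++i) {
        ShaderGlobals sg = {};  // fill in P etc. for point i (omitted here)
        // Per the description, execute() now also takes the output arena
        // base pointer and the integer shade index (argument order assumed).
        shadingsys->execute(*ctx, *group, /*shadeindex=*/i, sg,
                            /*output_arena_base=*/outputs.data());
    }
}

After the loop, outputs[i] holds the result for shade index i, with no find_symbol/symbol_address calls or extra copies.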

As a much more complex example, testshade also uses the new technique
(unless you use --no-output-placement, but I'm not sure how long I'll
keep that). This is a good example of storing to images. Testshade thinks
of itself as shading pixels in an image, so there is an ImageBuf for every
renderer output (think of it like an AOV). The output arena base pointer
is the address of the (0,0) pixel of the ImageBuf for the first AOV; the
"offset" of each output variable is the address difference between that
AOV's first ImageBuf pixel and the arena base pointer, the stride is the
pixel data size for each output, and the shadeindex is the pixel index of
that shade.
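
For image-style outputs like this, the offset/stride arithmetic boils down to something like the following sketch (the ImageBuf calls are standard OpenImageIO; the SymLocationDesc usage carries the same assumptions as the earlier sketch, and all AOVs are assumed to be 3-channel float color):

#include <vector>
#include <OpenImageIO/imagebuf.h>
#include <OSL/oslexec.h>

void setup_aov_symlocs(OSL::ShadingSystem* ss, OSL::ShaderGroup* group,
                       const std::vector<OSL::ustring>& aovnames,
                       std::vector<OIIO::ImageBuf>& aovs)
{
    // Arena base = address of the (0,0) pixel of the first AOV's ImageBuf.
    char* arena_base = (char*)aovs[0].pixeladdr(0, 0);

    std::vector<OSL::SymLocationDesc> symlocs;
    for (size_t i = 0; i < aovs.size(); ++i) {
        // Offset = address difference between this AOV's first pixel and the base.
        ptrdiff_t offset = (char*)aovs[i].pixeladdr(0, 0) - arena_base;
        // Stride = size of one pixel's data for this output.
        ptrdiff_t stride = (ptrdiff_t)aovs[i].spec().pixel_bytes();
        symlocs.emplace_back(aovnames[i], OIIO::TypeDesc::TypeColor,
                             /*derivs=*/false, OSL::SymArena::Outputs,
                             offset, stride);
    }
    ss->add_symlocs(group, symlocs);
    // The shade index for pixel (x,y) is then simply y*xres + x.
}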

It all works. Crazy!

OK, not quite all. Here's what hasn't been done yet:

  • testrender does not yet use the new API (only testshade)
  • GPU (have only tried CPU so far)
  • Batch shading -- I don't want to mess with this while Alex is
    still integrating all his work. We'll go back and catch the batch
    shading up after he's all done.

I will probably tackle some of these by amending this PR, though some may
wait for a separate one if it looks hairy. Also, there is admittedly some
cruft and scaffolding and leftovers from early ideas that didn't pan out,
that I will clean up over time.

All the old calls should still work, so this ought not to be a source
compatibility break for renderers using OSL. Not at this point, anyway.

Signed-off-by: Larry Gritz [email protected]

@AlexMWells (Contributor) commented:

Suggest supporting offsets down to the component level of the outputs, so the renderer can use SOA or AOS layouts for the outputs.

@lgritz (Collaborator, Author) commented Jan 22, 2021

Alex, do you suppose we need (a) fully arbitrary and separate strides for each component (3 channel strides for colors/points, 16 channel strides for matrices, etc.), (b) just a single component stride (assume R/G spacing is the same as G/B spacing), or (c) just a bool for AOS/SOA (only used for batch shading), giving the choice of

RGB       RGB       RGB            ...             RGB
<-stride->
<---- batch size --------------------------->

vs

RRRRRRRRGGGGGGGGBBBBBBBB   RRRRRRRRGGGGGGGGBBBBBBBB
<batchsz>
<-- stride --------------->

@AlexMWells (Contributor) commented:

I wouldn't assume that the entire set of outputs is AOS or SOA; they could be a mix.
However, I think it would be safe to assume the same component stride could be used for all elements of an output, but their offsets might be different (maybe they are totally separate arrays for all items in the workgroup, or maybe they are packed back to back for a shading count, which may or may not correspond to the batch size in OSL execution).
ie: Arrays (with no spatial relationship between component arrays)

GGGGGGGGGGGGGGGGGGGGGG
SOMETHING ELSE...
RRRRRRRRRRRRRRRRRRRRRRRRR
SOMETHING ELSE...
BBBBBBBBBBBBBBBBBBBBBBBBBB

For that layout, each component might need a different offset specified, and they can all share an output stride of 1.
address = memory_arena.ptr() + component_offset + shade_index*output_stride*sizeof(component)
Of course if output_stride is 1, it will get optimized away to
address = memory_arena.ptr() + component_offset + shade_index*sizeof(component)

ie: AOSOA (array of structures of arrays) for multiple batches in a single memory arena

RRRRRRRRGGGGGGGGBBBBBBBBRRRRRRRRGGGGGGGGBBBBBBBB
<bat_sz><bat_sz><bat_sz><bat_sz><bat_sz><bat_sz>
<-----batch_stride-----><-----batch_stride----->

This is more complicated: if the memory arena is shared by multiple batches, then an additional "batch index" might need to be passed through or calculated,
batch_index = shade_index / bat_sz
along with a batch stride, to calculate this correctly.
Let
local_index = shade_index % bat_sz
They can all share a component stride of 1.
Each component would have a different offset specified, plus a shared "batch stride" value:
address = memory_arena.ptr() + component_offset + batch_index*batch_stride + local_index*output_stride*sizeof(component)

where batch_index and local_index are not known at compile time.

Of course, given a batch_stride of 0, we would generate the simpler form not using the local_index:
ie: AOSOA (array of structures of arrays)

RRRRRRRRGGGGGGGGBBBBBBBB
<bat_sz><bat_sz><bat_sz>

address = memory_arena.ptr() + component_offset + shade_index*output_stride*sizeof(component)

ie: AOS

RGB       RGB       RGB            ...             RGB
<-stride->
<---- batch size --------------------------->

So here we could have 3 different component offsets and a shared output stride
address = memory_arena.ptr() + component_offset + shade_index*output_stride*sizeof(component)

So I guess I would prefer to think of this as an address calculation with enough flexibility to cover the different scenarios, vs. an AOS or SOA boolean flag. Maybe helpers that set up these values could do that, or it could just be left to the renderer to set them up correctly.

So it seems that with per-component offsets, a shared output_stride, and a shared batch_stride specified, the full formula I think simplifies down to cover the above scenarios:

if (batch_stride != 0) {
    address = memory_arena.ptr() + component_offset + batch_index*batch_stride + local_index*output_stride*sizeof(component);
} else {
    address = memory_arena.ptr() + component_offset + shade_index*output_stride*sizeof(component);
}

Maybe better naming would identify which offsets are in bytes, and which strides are in bytes vs. component sizes.
And perhaps some of these values "could" be inferred from others for certain scenarios (ie: batch_stride = batch_size*component_count), but there are counterexamples that break these assumptions.

Note: using batch_stride is not limited to CPU batching; GPU versions may want to use warp/sub-group sizes for SOA or AOSOA data layouts.
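
Pulling the above together, here is a small self-contained sketch of the unified address calculation (hypothetical helper and field names, not existing OSL API):

#include <cstddef>

struct OutputLayout {
    ptrdiff_t component_offset;  // bytes from the arena base to this component's first entry
    ptrdiff_t output_stride;     // per-shade stride, in units of the component size
    ptrdiff_t batch_stride;      // bytes between batches; 0 if the arena holds a single batch
    int       batch_size;        // shades per batch (bat_sz above)
    size_t    component_size;    // sizeof(component), e.g. sizeof(float)
};

// Where does one component of one shade's output live in the arena?
inline void* output_address(void* arena_base, const OutputLayout& lay, int shade_index)
{
    char* base = static_cast<char*>(arena_base);
    if (lay.batch_stride != 0) {
        // AOSOA: multiple batches share the arena, so split the shade index
        // into a batch index and an index local to that batch.
        int batch_index = shade_index / lay.batch_size;
        int local_index = shade_index % lay.batch_size;
        return base + lay.component_offset + batch_index * lay.batch_stride
               + local_index * lay.output_stride * (ptrdiff_t)lay.component_size;
    } else {
        // AOS or plain SOA: a single per-shade stride covers it.
        return base + lay.component_offset
               + shade_index * lay.output_stride * (ptrdiff_t)lay.component_size;
    }
}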

@etheory (Contributor) commented Feb 22, 2021

Thanks so much @lgritz! Apologies again for only just getting to this, but I'll be trying this out today. Will provide some thoughts/feedback shortly.

@lgritz (Collaborator, Author) commented Feb 22, 2021

No worries, @etheory. And this is a work in progress; I expect revision.

@etheory (Contributor) commented Feb 23, 2021

My biggest issue with this solution is as follows:

I could be wrong, so please correct any of the following statements, but I believe that you cannot call add_symlocs after you call optimize_group for possibly obvious reasons. However, we did speak about the concept of "partially" compiling a shader, to be able to tell what values it requests or provides, then being able to truncate the working set to that, then setting the input/output arena with add_symlocs, then calling optimize_group to compile.

Is there a way to "partially" compile a shader such that I can read its input and output values before finishing compilation?

My issue with the current setup, as I understand it, has to do with efficiency.

Say that the user requests the output channels "userOutputA", "userOutputB", "userOutputC" but the shader only defines "userOutputA".

This means that "userOutputB" and "userOutputC" will be empty for this shader. If all shaders in the scene do not define these, then they don't need to be defined at all in the renderer, which can just write an empty image, or choose to skip entirely.

There is also the opportunity to NOT define these in the add_symlocs working set, and save some execution time that would otherwise be wasted on unused values.

Is there a way to do this in the current setup?

Is that what execute_init is for?

@etheory (Contributor) commented Feb 23, 2021

For the basic case, I managed to set things up and get it working, which is fantastic! Thank you. I need to look more into what I said above to see how I could solve that.

@lgritz (Collaborator, Author) commented Feb 23, 2021

Ah, I hadn't really considered that issue for outputs, but I see what you mean about how for inputs you would want to find out (post-optimization) which inputs are really used, then tell the locations, then have it do the code generation.

Now, as it turns out, Alex Wells' work on the batched CPU shading actually did result in the optimize_group method getting a do_jit parameter. So I think we can aim for doing the optimize step without the JIT, then communicating the data layouts, and then calling optimize_group again with do_jit=true (it won't optimize twice, they are discrete steps internally, and the second call will know the optimize was already done).

I don't want to guarantee today that splitting it like this will 100% work (without looking over the code and seeing whether I need to move where certain copies of that symloc data happen, say), but it can surely be made to work with some rearrangement.

@etheory (Contributor) commented Feb 23, 2021

Thanks @lgritz that would be great. If it worked equivalently for inputs and outputs that would be perfect.
Then you could define the shader, let it partially optimize, ask it what's actually used and left over post-optimization, set up the inputs and outputs in the relevant arenas, then finalize the shader. That would give the user the greatest chance to minimize the amount of work that the shader actually has to do at runtime.
Thanks!

@etheory (Contributor) commented Feb 23, 2021

FYI @lgritz I just did a test and it seems to work.
I used the following pattern:

// Define the shader group as usual, then run just the optimize step (no JIT):
shadingsys->optimize_group(group.get(), ctx, false);
// Search for the symbols that are actually used, collect and process them,
// then register their locations ("somevalue" stands in for that data):
shadingsys->add_symlocs(group.get(), somevalue);
// Finally JIT; this should skip the already-completed optimization step.
shadingsys->optimize_group(group.get(), ctx, true);

However, is this actually optimal?

Will the second call to optimize_group correctly skip over anything already done by the first false call?

Will the entire process be roughly as fast as just calling optimize_group with true?

Whether it's currently flaky or not, this, IMO, is the ideal workflow (at least for us) to make the original vision I was imagining work exactly as we want it to. I am extremely excited!

@etheory (Contributor) commented Feb 23, 2021

@lgritz another suggestion for the API.

Currently SymLocationDesc assumes fully qualified names.
But find_symbol uses layername and symbolname.
It would be nice if SymLocationDesc also separated the layername and symbolname, since that way you could avoid a bunch of string concatenations.
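
To illustrate: find_symbol already takes the layer and symbol names separately, whereas SymLocationDesc currently takes one fully qualified name, which forces a concatenation. A small sketch (the names and the suggested overload are hypothetical):

#include <string>
#include <OSL/oslexec.h>

void illustrate(OSL::ShadingSystem* shadingsys, OSL::ShaderGroup& group)
{
    OSL::ustring layer("layer1"), sym("out_position");  // hypothetical names

    // find_symbol already accepts the two pieces separately:
    const OSL::ShaderSymbol* s = shadingsys->find_symbol(group, layer, sym);
    (void)s;

    // ...but SymLocationDesc wants one fully qualified string, so callers
    // end up concatenating:
    std::string fq = layer.string() + "." + sym.string();
    OSL::SymLocationDesc loc(fq, OSL::TypeDesc::TypePoint, /*derivs=*/false,
                             OSL::SymArena::Outputs, /*offset=*/0,
                             /*stride=*/sizeof(OSL::Vec3));
    (void)loc;

    // The suggestion: an overload taking (layer, sym) directly, which does
    // not exist at the time of this comment.
}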

@lgritz (Collaborator, Author) commented Feb 23, 2021

That should be just as efficient, yes.

@lgritz force-pushed the lg-outputlayout branch from b9bbdc7 to 46bc646 on May 13, 2021 19:59
@lgritz merged commit 1435e3a into AcademySoftwareFoundation:master on May 13, 2021
@lgritz deleted the lg-outputlayout branch on May 13, 2021 20:29
@lgritz changed the title from "Output variable placement (draft)" to "Output variable placement" on Feb 7, 2025
@lgritz added labels: enhancement (Improvement of existing/working features), shading system (Related to the runtime shader execution), strategic (Strategic initiative with far-reaching effects in time and code space), llvm (Needs LLVM knowledge) on Feb 7, 2025