
Output variable placement #1328

Merged
lgritz merged 1 commit into AcademySoftwareFoundation:master from lg-outputlayout on May 13, 2021

Conversation

@lgritz (Collaborator) commented Jan 21, 2021

Up until now, client apps declare which output variables are "renderer
outputs" before shader groups are compiled, and after a shader is
executed it is up to the app to extract the individual output values
and do with them whatever it needs to. Those outputs live on the shader
heap (the temporary storage used to run the shader), and information about
where they live on the heap can be retrieved via find_symbol() and
symbol_address().

New paradigm:

Various inputs and outputs of shaders live in different memory arenas,
and for those inputs/outputs, the app declares (prior to shader group
compilation) which arena it lives in, its offset within the arena, and
the stride between entries (for different shades, such as different
rays in flight, different points on a mesh, pixels in an image, etc.).
The shader JITs into code that automatically pulls inputs from their
sources and copies outputs to their destinations, with no need for client
app code to call find_symbol or symbol_address or do any extra data
copies.

This symbol location data is communicated to the shading system through
new calls to add_symlocs(), passing a new struct called SymLocationDesc,
which holds the aforementioned arena, offset, stride, etc.

Right now, the only supported arena is Outputs, and the actual call
signature to execute(), and to the shader JIT code itself, includes
extra parameters for the base pointer to the output arena (this could
conceivably change from shade to shade) and the integer shade index of
this shade.

I've tried to set this up in a general way, so that additional arenas
will be added later -- places where ShaderGlobals can be customized,
UserData, etc. Also there are placeholders for being able to talk about
absolute addresses and offsets into the heap, though I don't use them
at the moment. Some of the flexibility may be later removed as we learn
more about how we want to use this.

To show it in action, I have modified the "osldeformer" example to use
this new method to place the outputs directly where we want them.
It's just storing results into an array. The output arena base pointer
is the start of the array, the offset of the single output is 0, and
the stride is the size of each array element (one 3-float point).
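
To make this concrete, here is a minimal sketch (not the actual osldeformer code) of that kind of setup. The SymLocationDesc field order, the symbol name "layer1.out_position", and the exact execute() overload shown are assumptions based on the description above:

#include <vector>
#include <OSL/oslexec.h>
using namespace OSL;

void run_deformer(ShadingSystem* shadingsys, ShaderGroupRef group,
                  ShadingContext* ctx, int npoints)
{
    std::vector<Vec3> outputs(npoints);  // results land directly in this array

    // One output symbol in the Outputs arena: offset 0, stride of one
    // 3-float point per shade.
    SymLocationDesc symlocs[] = {
        SymLocationDesc("layer1.out_position",   // hypothetical symbol name
                        TypeDesc::TypePoint, /*derivs=*/false,
                        SymArena::Outputs,
                        /*offset=*/0, /*stride=*/sizeof(Vec3))
    };
    shadingsys->add_symlocs(group.get(), symlocs);

    for (int i = 0; i < npoints; ++i) {
        ShaderGlobals sg = {};  // fill in P etc. for point i (omitted here)
        // Per the description, execute() now also takes the output arena
        // base pointer and the integer shade index (argument order assumed).
        shadingsys->execute(*ctx, *group, /*shadeindex=*/i, sg,
                            /*output_arena_base=*/outputs.data());
    }
}

After the loop, outputs[i] holds the result for shade index i, with no find_symbol/symbol_address calls or extra copies.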

As a much more complex example, testshade also uses the new technique
(unless you use --no-output-placement, but I'm not sure how long I'll
keep that). This is a good example of storing to images. Testshade thinks
of itself as shading pixels in an image, so there is an ImageBuf for every
renderer output (think of it like an AOV). The output arena base pointer
is the address of the (0,0) pixel of the ImageBuf for the first AOV; the
"offset" of each output variable is the address difference between that
AOV's first ImageBuf pixel and the arena base pointer, the stride is the
pixel data size for each output, and the shadeindex is the pixel index of
that shade.
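
For image-style outputs like this, the offset/stride arithmetic boils down to something like the following sketch (the ImageBuf calls are standard OpenImageIO; the SymLocationDesc usage carries the same assumptions as the earlier sketch, and all AOVs are assumed to be 3-channel float color):

#include <vector>
#include <OpenImageIO/imagebuf.h>
#include <OSL/oslexec.h>

void setup_aov_symlocs(OSL::ShadingSystem* ss, OSL::ShaderGroup* group,
                       const std::vector<OSL::ustring>& aovnames,
                       std::vector<OIIO::ImageBuf>& aovs)
{
    // Arena base = address of the (0,0) pixel of the first AOV's ImageBuf.
    char* arena_base = (char*)aovs[0].pixeladdr(0, 0);

    std::vector<OSL::SymLocationDesc> symlocs;
    for (size_t i = 0; i < aovs.size(); ++i) {
        // Offset = address difference between this AOV's first pixel and the base.
        ptrdiff_t offset = (char*)aovs[i].pixeladdr(0, 0) - arena_base;
        // Stride = size of one pixel's data for this output.
        ptrdiff_t stride = (ptrdiff_t)aovs[i].spec().pixel_bytes();
        symlocs.emplace_back(aovnames[i], OIIO::TypeDesc::TypeColor,
                             /*derivs=*/false, OSL::SymArena::Outputs,
                             offset, stride);
    }
    ss->add_symlocs(group, symlocs);
    // The shade index for pixel (x,y) is then simply y*xres + x.
}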

It all works. Crazy!

OK, not quite all. Here's what hasn't been done yet:

  • testrender does not yet use the new API (only testshade)
  • GPU (have only tried CPU so far)
  • Batch shading -- I don't want to mess with this while Alex is
    still integrating all his work. We'll go back and catch the batch
    shading up after he's all done.

I will probably tackle some of these by amending this PR, though some may
wait for a separate one if it looks hairy. Also, there is admittedly some
cruft and scaffolding and leftovers from early ideas that didn't pan out,
that I will clean up over time.

All the old calls should still work, so this ought not to be a source
compatibility break for renderers using OSL. Not at this point, anyway.

Signed-off-by: Larry Gritz [email protected]

@AlexMWells (Contributor) commented:

Suggest supporting offsets down to the component level of the outputs, so the renderer can use SOA or AOS layouts for the outputs.

@lgritz (Collaborator, Author) commented Jan 22, 2021

Alex, do you suppose we need (a) fully arbitrary and separate strides for each component (3 channel strides for colors/points, 16 channel strides for matrices, etc.), (b) just a single component stride (assume R/G spacing is the same as G/B spacing), or (c) just a bool for AOS/SOA (only used for batch shading), giving the choice of

RGB       RGB       RGB            ...             RGB
<-stride->
<---- batch size --------------------------->

vs

RRRRRRRRGGGGGGGGBBBBBBBB   RRRRRRRRGGGGGGGGBBBBBBBB
<batchsz>
<-- stride --------------->

@AlexMWells (Contributor) commented:

I wouldn't assume that the entire set of outputs is AOS or SOA; they could be a mix.
However, I think it would be safe to assume the same component stride could be used for all elements of an output, but their offsets might be different (maybe they are totally separate arrays for all items in the workgroup, or maybe they are packed back to back for a shading count, which may or may not correspond to the batch size in OSL execution).
ie: Arrays (with no spatial relationship between component arrays)

GGGGGGGGGGGGGGGGGGGGGG
SOMETHING ELSE...
RRRRRRRRRRRRRRRRRRRRRRRRR
SOMETHING ELSE...
BBBBBBBBBBBBBBBBBBBBBBBBBB

For that layout, each component might need a different offset specified, and they can all share an output stride of 1.
address = memory_arena.ptr() + component_offset + shade_index*output_stride*sizeof(component)
Of course if output_stride is 1, it will get optimized away to
address = memory_arena.ptr() + component_offset + shade_index*sizeof(component)

ie: AOSOA (array of structures of arrays) for multiple batches in a single memory arena

RRRRRRRRGGGGGGGGBBBBBBBBRRRRRRRRGGGGGGGGBBBBBBBB
<bat_sz><bat_sz><bat_sz><bat_sz><bat_sz><bat_sz>
<-----batch_stride-----><-----batch_stride----->

This is more complicated: if the memory arena is shared by multiple batches, then an additional "batch index" might need to be passed through or calculated,
batch_index = shade_index / bat_sz
along with a batch stride, to calculate this correctly.
Let
local_index = shade_index % bat_sz
They can all share a component stride of 1.
Each component would have a different offset specified, plus a shared "batch stride" value:
address = memory_arena.ptr() + component_offset + batch_index*batch_stride + local_index*output_stride*sizeof(component)

where batch_index and local_index are not known at compile time.

Of course, given a batch_stride of 0, we would generate the simpler form not using the local_index:
ie: AOSOA (array of structures of arrays)

RRRRRRRRGGGGGGGGBBBBBBBB
<bat_sz><bat_sz><bat_sz>

address = memory_arena.ptr() + component_offset + shade_index*output_stride*sizeof(component)

ie: AOS

RGB       RGB       RGB            ...             RGB
<-stride->
<---- batch size --------------------------->

So here we could have 3 different component offsets and a shared output stride
address = memory_arena.ptr() + component_offset + shade_index*output_stride*sizeof(component)

So I guess I would prefer to think of this as an address calculation with enough flexibility to cover the different scenarios, vs. an AOS or SOA boolean flag. Maybe helpers that set up these values could do that, or it could just be left to the renderer to set them up correctly.

So it seems that with per-component offsets, a shared output_stride, and a shared batch_stride specified, the full formula I think simplifies down to cover the above scenarios:

if (batch_stride != 0) {
    address = memory_arena.ptr() + component_offset + batch_index*batch_stride + local_index*output_stride*sizeof(component);
} else {
    address = memory_arena.ptr() + component_offset + shade_index*output_stride*sizeof(component);
}

Maybe better naming would identify which offsets are in bytes, and which strides are in bytes vs. component sizes.
And perhaps some of these values "could" be inferred from others for certain scenarios (ie: batch_stride = batch_size*component_count), but there are counterexamples that break these assumptions.

Note: using batch_stride is not limited to CPU batching; GPU versions may want to use warp/sub-group sizes for SOA or AOSOA data layouts.
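
Pulling the above together, here is a small self-contained sketch of the unified address calculation (hypothetical helper and field names, not existing OSL API):

#include <cstddef>

struct OutputLayout {
    ptrdiff_t component_offset;  // bytes from the arena base to this component's first entry
    ptrdiff_t output_stride;     // per-shade stride, in units of the component size
    ptrdiff_t batch_stride;      // bytes between batches; 0 if the arena holds a single batch
    int       batch_size;        // shades per batch (bat_sz above)
    size_t    component_size;    // sizeof(component), e.g. sizeof(float)
};

// Where does one component of one shade's output live in the arena?
inline void* output_address(void* arena_base, const OutputLayout& lay, int shade_index)
{
    char* base = static_cast<char*>(arena_base);
    if (lay.batch_stride != 0) {
        // AOSOA: multiple batches share the arena, so split the shade index
        // into a batch index and an index local to that batch.
        int batch_index = shade_index / lay.batch_size;
        int local_index = shade_index % lay.batch_size;
        return base + lay.component_offset + batch_index * lay.batch_stride
               + local_index * lay.output_stride * (ptrdiff_t)lay.component_size;
    } else {
        // AOS or plain SOA: a single per-shade stride covers it.
        return base + lay.component_offset
               + shade_index * lay.output_stride * (ptrdiff_t)lay.component_size;
    }
}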

@etheory (Contributor) commented Feb 22, 2021

Thanks so much @lgritz! Apologies again for only just getting to this, but I'll be trying this out today. Will provide some thoughts/feedback shortly.

@lgritz (Collaborator, Author) commented Feb 22, 2021

No worries, @etheory. And this is a work in progress; I expect revision.

@etheory (Contributor) commented Feb 23, 2021

My biggest issue with this solution is as follows:

I could be wrong, so please correct any of the following statements, but I believe that you cannot call add_symlocs after you call optimize_group for possibly obvious reasons. However, we did speak about the concept of "partially" compiling a shader, to be able to tell what values it requests or provides, then being able to truncate the working set to that, then setting the input/output arena with add_symlocs, then calling optimize_group to compile.

Is there a way to "partially" compile a shader such that I can read its input and output values before finishing compilation?

My issue with the current setup, as I understand it, has to do with efficiency.

Say that the user requests the output channels "userOutputA", "userOutputB", "userOutputC" but the shader only defines "userOutputA".

This means that "userOutputB" and "userOutputC" will be empty for this shader. If all shaders in the scene do not define these, then they don't need to be defined at all in the renderer, which can just write an empty image, or choose to skip entirely.

There is also the opportunity to NOT define these in the add_symlocs working set, and save some execution time that would otherwise be wasted on unused values.

Is there a way to do this in the current setup?

Is that what execute_init is for?

@etheory (Contributor) commented Feb 23, 2021

For the basic case, I managed to set things up and get it working, which is fantastic! Thank you. I need to look more into what I said above to see how I could solve that.

@lgritz (Collaborator, Author) commented Feb 23, 2021

Ah, I hadn't really considered that issue for outputs, but I see what you mean about how for inputs you would want to find out (post-optimization) which inputs are really used, then tell the locations, then have it do the code generation.

Now, as it turns out, Alex Wells' work on the batched CPU shading actually did result in the optimize_group method getting a do_jit parameter. So I think we can aim for doing the optimize step without the JIT, then communicating the data layouts, and then calling optimize_group again with do_jit=true (it won't optimize twice, they are discrete steps internally, and the second call will know the optimize was already done).

I don't want to guarantee today that splitting it like this will 100% work (without looking over the code and seeing whether I need to move where certain copies of that symloc data happen, say), but it can surely be made to work with some rearrangement.

@etheory (Contributor) commented Feb 23, 2021

Thanks @lgritz that would be great. If it worked equivalently for inputs and outputs that would be perfect.
Then you could define the shader, let it partially optimize, ask it what's actually used and left over post-optimization, set up the inputs and outputs in the relevant arenas, then finalize the shader. That would give the user the greatest chance to minimize the amount of work that the shader actually has to do at runtime.
Thanks!

@etheory (Contributor) commented Feb 23, 2021

FYI @lgritz I just did a test and it seems to work.
I used the following pattern:

// Define the shader group as usual, then run just the optimize step (no JIT):
shadingsys->optimize_group(group.get(), ctx, false);
// Search for the symbols that are actually used, collect and process them,
// then register their locations ("somevalue" stands in for that data):
shadingsys->add_symlocs(group.get(), somevalue);
// Finally JIT; this should skip the already-completed optimization step.
shadingsys->optimize_group(group.get(), ctx, true);

However, is this actually optimal?

Will the second call to optimize_group correctly skip over anything already done by the first false call?

Will the entire process be roughly as fast as just calling optimize_group with true?

Whether it's currently flaky or not, this, IMO, is the ideal workflow (at least for us) to make the original vision I was imagining work exactly as we want it to. I am extremely excited!

@etheory (Contributor) commented Feb 23, 2021

@lgritz another suggestion for the API.

Currently SymLocationDesc assumes fully qualified names.
But find_symbol uses layername and symbolname.
It would be nice if SymLocationDesc also separated the layername and symbolname, since that way you could avoid a bunch of string concatenations.
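
To illustrate: find_symbol already takes the layer and symbol names separately, whereas SymLocationDesc currently takes one fully qualified name, which forces a concatenation. A small sketch (the names and the suggested overload are hypothetical):

#include <string>
#include <OSL/oslexec.h>

void illustrate(OSL::ShadingSystem* shadingsys, OSL::ShaderGroup& group)
{
    OSL::ustring layer("layer1"), sym("out_position");  // hypothetical names

    // find_symbol already accepts the two pieces separately:
    const OSL::ShaderSymbol* s = shadingsys->find_symbol(group, layer, sym);
    (void)s;

    // ...but SymLocationDesc wants one fully qualified string, so callers
    // end up concatenating:
    std::string fq = layer.string() + "." + sym.string();
    OSL::SymLocationDesc loc(fq, OSL::TypeDesc::TypePoint, /*derivs=*/false,
                             OSL::SymArena::Outputs, /*offset=*/0,
                             /*stride=*/sizeof(OSL::Vec3));
    (void)loc;

    // The suggestion: an overload taking (layer, sym) directly, which does
    // not exist at the time of this comment.
}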

@lgritz (Collaborator, Author) commented Feb 23, 2021

That should be just as efficient, yes.

@lgritz force-pushed the lg-outputlayout branch from b9bbdc7 to 46bc646 on May 13, 2021 19:59
@lgritz merged commit 1435e3a into AcademySoftwareFoundation:master on May 13, 2021
@lgritz deleted the lg-outputlayout branch on May 13, 2021 20:29
@lgritz changed the title from "Output variable placement (draft)" to "Output variable placement" on Feb 7, 2025
@lgritz added labels: enhancement (Improvement of existing/working features), shading system (Related to the runtime shader execution), strategic (Strategic initiative with far-reaching effects in time and code space), llvm (Needs LLVM knowledge) on Feb 7, 2025