
Userdata input placement #1391

Merged
merged 1 commit into master on Aug 19, 2021

Conversation

lgritz
Collaborator

@lgritz lgritz commented Jul 22, 2021

This is analogous to the "output placement" that we added recently, but
for input of userdata.

Reminder: "userdata" means shader parameters marked as lockgeom=0 that
are expected to be supplied on a per-shade basis, such as interpolated
vertex data on geometric primitives. The primitive data binds by name to
matching shader parameters.

This patch concerns how the shader gets this per-point data.

Old school: in the middle of the shader, when userdata is needed, the
shader calls osl_bind_interpolated_param(), which makes a callback to
RendererServices::get_userdata(), which... does something magical and
renderer-specific to retrieve the data by name. This is expensive and
also challenging to implement on the GPU (which is why it remains
unimplemented).

This patch changes the methodology to the following, which is somewhat
analogous to how we did output placement:

  0. It is assumed that there is an arena of memory where the userdata
     is going to be, and the pointer to its start is passed down to the
     shader via execute() (just like how the shaderglobals and output
     arena pointers are passed), along with an index of the current
     shade point.

  1. The renderer pre-declares SymLocationDesc records for the userdata,
     which specify the offsets and strides within the arena where the
     data can be found.

  2. Prior to launching the shader execute(), the renderer is expected
     to fill in those values in the userdata arena for the points to be
     shaded.

  3. When JITing the part of the shader that initializes those userdata
     parameter values, we simply JIT a simple data copy directly from
     the computed position within the arena (offsetting the right amount
     for this data, at this shade index) -- just a couple of
     instructions per shade, no function calls, no RendererServices
     callbacks, no name lookups!

  4. There is no 4. That's it.

See the changes to osldeformer.cpp for an example of how this looks from
the renderer side. It's really straightforward.

Currently, if there is no SymLocationDesc for a lockgeom=0 parameter
that needs initialization, it will fall back to the old callback. We
may fully deprecate the old way, but not for a while, so this is back
compatible for now.

This seems to work, passes all tests, doesn't break anything, and works
for the osldeformer example!

There is follow-up work to do: I think we can pretty easily do the
same thing for ShaderGlobals ("make your own SG struct"), named matrix
retrieval, and other attributes. Will do those in subsequent PRs if
this meets everybody's approval.

Signed-off-by: Larry Gritz [email protected]

@lgritz lgritz force-pushed the lg-inputplacement branch from a955955 to 1a3a69a Compare July 22, 2021 22:38
@johnhaddon
Contributor

There is follow-up work to do: I think we can pretty easily do the
same thing for ShaderGlobals ("make your own SG struct"), named matrix
retrieval, and other attributes.

I've got a few questions about how this approach might apply to the "other attributes" part, which I assume means replacing calls to RendererServices::get_attribute():

  • How would it interact with unknown_attributes_needed? Will get_attribute() always be available to provide for this use case, or will it eventually be removed, with all attributes required to be determined up front?
  • In get_attribute() and get_userdata(), we have the opportunity to convert from our source type to the TypeDesc required by OSL. We use an extended version of ShadingSystem::convert_value() for that in Gaffer. Will type conversions be available via the new approach, and if so, would you consider a PR for the more laissez-faire conversions in the link?

@sfriedmapixar
Contributor

The pull request looks good to me overall, but I wanted to provide some context around it for a possible future roadmap. Forgive the long essay, but I wanted this recorded somewhere, and this seems to be a reasonable place.

This is very close to where we were in our usage of OSL, but we have since moved on to something more akin to how the library bitcode is provided in the optix datapath. Instead of the definition of the buffer layout living in OSL, the renderer provides implementations of the shadeop code as a bitcode library that can be put into the module and called directly by the JIT. For example, you can replace the codegen call to `OSL_SHADEOP int osl_bind_interpolated_param` with a call to a function that's defined in a bitcode library provided as a string to the shading system. In fact, this can be done for basically any shadeop.

The main benefit this has is that the renderer then has control over the data layout of the buffer that values are coming from. Stride and offset are pretty flexible, but if you want to do fancier things like compressed lookup, they aren't enough. And as this pull request shows, you have to pipe the base pointer around, which isn't too onerous, but with the bitcode method, any and all of these could be passed as part of the blind renderstate stored on the shader globals, and interpreted by the renderer side code in renderer native types/layouts.

That being said, it's good to provide a least-common-denominator in the open-source distribution so that different software packages can get OSL integrated as easily as possible. In fact, it could be set up so that when building OSL, the built-in osl_bind_interpolated_param is compiled to bitcode and embedded as the default value of the bitcode attribute on the shading system; then a renderer just sets the shading system attribute to new bitcode iff it wants custom handling.

Additionally, if we look at the userdata call, there are several things like the type and whether or not there are derivatives that we might also want to handle with separate entry points instead of passing them as data to be decoded on the other end. This would be akin to template expansion in C++, using call-site vectoring to skip all the decoding steps for those parameters. It could in fact be implemented on the user side with C entry points calling a C++ templated implementation, eliminating a bunch of the otherwise branchy code in those implementations.

The other thing to change to make this efficient is to have "userdata" handles, similar to texture handles, that allow the renderer to pre-resolve and stash info about the remaining args earlier on -- effectively a blind version of the SymLocationDesc. Basically, find_symloc is replaced with grabbing the handle, and the implementation of symloc_ptr + the memcpy to dst is what would be in the implementation bitcode. To John's first point above, a getattribute or getuserdata replaced this way could fall back to dynamic callbacks if the handle was not found (nullptr), like you currently do when find_symloc fails.

I don't think this pull request interferes with that future, and indeed provides an improved fallback case, so I think that it is a good first step.

Going further than the custom bitcode, you start to get into doing something where the OSL code-gen makes a callback into the renderer to let it do custom code-gen at run-time. That is likely too far for OSL because that then puts the onus on the renderer developers to understand how LLVM works, or for OSL to provide some sort of API that's a reasonable subset of LLVM's code-gen. The beauty of not going that far and just providing bitcode via a shadingsystem->setattribute call is that the developers of the renderer can write the code in basically any language that has an LLVM frontend, use that to compile their source to bitcode, and get that dumped into the text section of their binary, basically "it's just a string." So developers only need to use an LLVM based compiler as a binary tool in their build process (which OSL already requires for advanced usage), they don't have to use it as a library they understand and code against.

@sfriedmapixar
Contributor

To answer part of John's second question above, if I understand it correctly, you would use your extended convert_value when creating the buffer of values that the userdata_base_ptr points to. What might not be clear, and could probably use some clarification in the example code, is that the renderer using this interface is expected to ask the shading system which values it'll need to provide after the renderer has asked the shading system to optimize the group. Then the renderer sets up the symloc records indicating where it will put those values in the buffer, and before each shading execution, it pre-fills the buffer with all of those values. This does mean it may fill in values that dynamically will not be used, but the idea is that doing this for all values upfront and avoiding all the searching in the middle of shading ends up providing a decent performance gain.

@lgritz lgritz force-pushed the lg-inputplacement branch from 1a3a69a to b7503bf Compare August 3, 2021 15:07
@lgritz
Collaborator Author

lgritz commented Aug 19, 2021

@johnhaddon, sorry for the delay. DigiPro, Open Source Day, and SIGGRAPH definitely sapped the free time.

* How would it interact with `unknown_attributes_needed`? Will `get_attribute()` always be available to provide for this use case, or will it eventually be removed, with all attributes required to be determined up front?

On the CPU side, currently any attributes that have not been pre-declared with the input placement mechanism will continue to fall back on the RendererServices call.

On the GPU side, I think we've taken for granted for a while that attributes, textures, etc., whose names can't be figured out by the time the runtime optimization is complete will probably fail or not be supported, though the degree to which it might ultimately be made to work is still TBD.

As noted elsewhere in this discussion, the "placement" approach may eventually be subsumed or augmented with some more general "supply the code" approach via LLVM modules. I think this places a much higher technical and implementation burden on renderers, but I support the idea and note that this also may restore the plausibility of supporting types of queries whose values can't be reduced to constants during the runtime optimization stage.

* In `get_attribute()` and `get_userdata()`, we have the opportunity to convert from our source type to the `TypeDesc` required by OSL. We use an [extended version of `ShadingSystem::convert_value()`](https://github.com/GafferHQ/gaffer/blob/master/src/GafferOSL/ShadingEngine.cpp#L208-L259) for that in Gaffer. Will type conversions be available via the new approach, and if so, would you consider a PR for the more laissez-faire conversions in the link?

Definitely we would entertain PRs or requests for more flexible conversion rules. It's all evolving as we go. The patch you cite above looks sensible, I think. We would welcome a PR to add this to the logic of our convert_value.

@lgritz
Collaborator Author

lgritz commented Aug 19, 2021

@sfriedmapixar (and @AlexMWells who has also brought this up): I do like the "supply the bitcode and we'll add it to the module" approach. I think certainly we would want to add such a mechanism, probably accounting for almost everything we currently do with RendererServices.

My intuition is that we would want this to be in addition to the input/output placement work, rather than 100% replacing it, merely because the placement approach lets renderers integrate OSL very simply, without the substantial burden of additional code to write or of adding LLVM/clang as a more direct dependency (not merely as a black box that comes along with OSL, but requiring their attention and catering). But maybe a simpler way of saying that is "supplying module code may be the core implementation, which renderers may use when they need the flexibility, and the implementation of placement and RendererServices may migrate to simply using the default set of those modules that come already supplied by OSL."

If you have already gone down this road and figured out what works well, I would definitely encourage (if not beg) you either to make a direct PR or to spell it out in as much detail as possible. That would be much more efficient than my fumbling around to re-implement it from scratch, followed by several rounds of back and forth in which you tell me what I did wrong and how to avoid pitfalls that you already hit and fixed when you did it the first time.

@lgritz
Collaborator Author

lgritz commented Aug 19, 2021

I was hoping to have input from @etheory, but this has languished long enough, so I'm going to merge so that I can move on to the next steps in this process. This is only in master and is known to be in flux, so I'm not worried if minor problems are uncovered and we wish to continue revising the approach.

@lgritz lgritz force-pushed the lg-inputplacement branch from b7503bf to d05a932 Compare August 19, 2021 20:32
@etheory
Contributor

etheory commented Aug 19, 2021

So sorry @lgritz, I'll still have a look now anyway. This has been on my to do list all week.

@lgritz lgritz merged commit 30d2a4a into AcademySoftwareFoundation:master Aug 19, 2021
@johnhaddon
Contributor

Thanks for the update @lgritz - I'll carry on with the RendererServices approach for now, but will look into a PR and/or the "supply the code" approach when we come to upgrade to the latest and greatest version.

@lgritz lgritz deleted the lg-inputplacement branch August 23, 2021 04:32
@etheory
Contributor

etheory commented Apr 3, 2023

@lgritz is there a timeframe for when we can expect the same as this for get_attribute calls? And if not, how could we help make it happen? I don't think we are confident enough to write it all ourselves, but we could definitely contribute to something. @AlexMWells I recall that you added stuff for get_transform and transform_points but I can't seem to find that PR.

We would need something within the next 2 months or we'd be in some trouble on our end. How could we make this happen? Tagging @curtisblack

@AlexMWells
Contributor

@etheory the current design we have intends for renderer services to be implemented by bitcode free functions, with a fallback allowing non-bitcode (precompiled on the host) free functions to be registered so that shader llvm code gen can generate calls to them. The main difference between bitcode and precompiled is that bitcode is included in the same module as the shader llvm code gen and can be inlined/optimized as such (and would be required for GPUs), whereas a precompiled free function would require a real function call.

There are already means for these bitcode free-function renderer services in the main branch, but only those called by opmatrix.cpp so far:
https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/blob/main/src/include/OSL/rs_free_function.h
Testshade does have a mode to exercise the bitcode interface: --use_rs_bitcode or setenv TESTSHADE_RS_BITCODE 1

We have a PR almost ready to go which adds free-function-based error reporting (printf, warning, and error messages); there are obviously more renderer services calls to make free-function versions of, but the general pattern is there and working.

Now for getattribute, we have a design we like, which shifts most of the spaghetti code for handling different attribute names and types from runtime to compile time. This is done by allowing a renderer to provide (register) different function names and/or parameters for getattribute shader calls. So during llvm code gen, the renderer is given the attribute name and type and is asked for a specific function name and constant parameters to pass to that function. The beautiful thing is that when bitcode is provided, those calls can be inlined and do whatever they want, such as dereferencing an index-based lookup table inside the renderer using an index/offset known at compile time (because the renderer provided the index/offset at compile time). More complex renderers could do multiple dereferences or data decompression if needed.

To implement this, we would introduce the concept of an AttributeGetterSpec, which has the name of the function for llvm code gen to call, along with a list of extra compile-time constant arguments to pass to the function. At code gen time, the renderer would be asked to fill out an AttributeGetterSpec for each combination of attribute name and type. The code gen would generate calls to the function named in the AttributeGetterSpec along with any additional compile-time constant arguments. The renderer would have supplied the bitcode for those functions to the ShadingSystem, so it all gets inlined/optimized together. And for backwards compatibility (for non-bitcode) on the host, an actual host-side function address could be specified as part of the AttributeGetterSpec and plumbed into the lazy function helper, which is basically the linker marrying precompiled shader functions (noise, spline, etc.) with llvm code gen; there's no reason it could not handle precompiled renderer free functions as well.

Here is what the design looks like (this also has a replacement for populating shader globals):
https://www.godbolt.org/z/1vsG55svM

That was the direction we were heading, you can see if it applies to your situation and run with it. If the design doesn't address your issues, we would like to know that as well.

@curtisblack
Contributor

Hi @AlexMWells, thanks for the explanation. That all sounds very useful, I can see a lot of potential for that to simplify and streamline things on our end.

For our usage we know the memory locations of all attributes at shader compile time, so our aim is to replace any named lookup with a direct memory read at an offset from a base pointer inside the shader. If I'm understanding correctly, this sounds possible within your proposed design, so it would work perfectly for us.

I look forward to seeing this developed. Is it too early to estimate when this might be available or how much work is involved?
