Userdata input placement #1391
Conversation
Force-pushed from a955955 to 1a3a69a
I've got a few questions about how this approach might apply to the "other attributes" part, which I assume means replacing calls to
The pull request looks good to me overall, but I wanted to provide some context around it for a possible future roadmap. Forgive the long essay, but I wanted this recorded somewhere, and this seems like a reasonable place.

This is very close to where we were in our usage of OSL, but we have since moved on to something more akin to how the library bitcode is provided in the OptiX datapath. Instead of the definition of the buffer layout living in OSL, the renderer provides implementations of the shadeop code as a bitcode library that can be put in the module and that the JIT can call directly. For example, you can replace the codegen call to "OSL_SHADEOP int

The main benefit is that the renderer then has control over the data layout of the buffer that values come from. Stride and offset are pretty flexible, but if you want to do fancier things like compressed lookup, they aren't enough. And as this pull request shows, you have to pipe the base pointer around, which isn't too onerous; but with the bitcode method, any and all of these could be passed as part of the blind renderstate stored on the shader globals, and interpreted by the renderer-side code in renderer-native types/layouts.

That being said, it's good to provide a least common denominator in the open-source distribution so that different software packages can get OSL integrated as easily as possible. In fact, it could be set up so that when building OSL, the built-in osl_bind_interpolated_param is compiled to bitcode and embedded as the default value of the bitcode attribute on the shading system; then a renderer simply sets the shading system attribute to new bitcode iff it wants custom handling.

Additionally, if we look at the userdata call, there are several things, like the type and whether or not there are derivatives, that we might also want to handle with separate entry points instead of passing them as data to be decoded on the other end. This would be akin to template expansion in C++, using call-site vectoring to skip all the decoding steps for those parameters. This could in fact be implemented on the user side with C entry points calling a C++ templated implementation, eliminating a bunch of the otherwise branchy code in those implementations.

The other change that would make this efficient is to have "userdata" handles, similar to texture handles, that allow the renderer to pre-resolve and stash info about the remaining args earlier on -- effectively a blind version of the SymLocationDesc. Basically, find_symloc is replaced with grabbing the handle, and the implementation of symloc_ptr plus the memcpy to dst is what would be in the implementation bitcode. To John's first point above, a getattribute or getuserdata replaced this way could fall back to dynamic callbacks if the handle was not found (nullptr), like you currently do when find_symloc fails. I don't think this pull request interferes with that future, and indeed it provides an improved fallback case, so I think it is a good first step.

Going further than the custom bitcode, you start to get into something where the OSL codegen makes a callback into the renderer to let it do custom code generation at run time. That is likely too far for OSL, because it would put the onus on renderer developers to understand how LLVM works, or require OSL to provide some sort of API that's a reasonable subset of LLVM's codegen. The beauty of not going that far, and just providing bitcode via a shadingsystem->setattribute call, is that the renderer's developers can write the code in basically any language that has an LLVM frontend, use that to compile their source to bitcode, and get that dumped into the text section of their binary -- basically, "it's just a string." So developers only need to use an LLVM-based compiler as a binary tool in their build process (which OSL already requires for advanced usage); they don't have to use it as a library they understand and code against.
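To make the bitcode idea above a bit more concrete, here is a minimal sketch of what a renderer-supplied binding function could look like. All names here (RenderState, renderer_bind_param) are invented for illustration; the real OSL entry point is osl_bind_interpolated_param, and its actual signature differs. The point is that the function itself interprets the blind renderstate in renderer-native terms, and because it would ship as bitcode in the same LLVM module as the shader, the JIT could inline it.

```cpp
#include <cstring>
#include <string>
#include <unordered_map>

// Hypothetical blind renderstate that only the renderer knows how to
// interpret; here it is just a name -> value map for simplicity.
struct RenderState {
    std::unordered_map<std::string, float> userdata;
};

// Hypothetical C entry point the generated code would call. Shipped as
// bitcode, this lookup could be inlined and specialized by the JIT.
extern "C" int renderer_bind_param(void* rs_, const char* name,
                                   void* dst, int nbytes) {
    auto* rs = static_cast<RenderState*>(rs_);
    auto it = rs->userdata.find(name);
    if (it == rs->userdata.end())
        return 0;  // not found: caller falls back to the parameter default
    std::memcpy(dst, &it->second, nbytes);
    return 1;
}
```

A fancier renderer could replace the map lookup with compressed or indexed storage without OSL ever knowing about the layout.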
To answer part of John's second question above, if I understand it correctly: you would use your extended convert_value when creating the buffer of values that the userdata_base_ptr points to. What might not be clear, and could probably use some clarification in the example code, is that a renderer using this interface is expected to ask the shading system which values it will need to provide after the renderer has asked the shading system to optimize the group. Then the renderer sets up the symloc records indicating where it will put those values in the buffer, and before each shading execution it pre-fills the buffer with all of those values. This does mean it may fill in values that dynamically will not be used, but the idea is that doing this for all values upfront, and avoiding all the searching in the middle of shading, ends up providing a decent performance gain.
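The renderer-side half of that workflow can be sketched roughly as follows. SymLoc and prefill_userdata are invented names for illustration, not the OSL API: the renderer records an offset and stride per symbol it must supply, then fills the arena for every shade point before execution begins.

```cpp
#include <cstring>
#include <vector>

// Hypothetical per-symbol location record: where this symbol's values
// live within the userdata arena.
struct SymLoc {
    size_t offset;  // byte offset of the symbol's first value in the arena
    size_t stride;  // byte stride between successive shade points
};

// Pre-fill the arena for all points to be shaded, before execute() runs.
// Some values may go unused dynamically, but no searching happens later.
inline void prefill_userdata(std::vector<unsigned char>& arena,
                             const SymLoc& loc, const float* values,
                             int npoints) {
    for (int i = 0; i < npoints; ++i)
        std::memcpy(arena.data() + loc.offset + i * loc.stride,
                    &values[i], sizeof(float));
}
```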
Force-pushed from 1a3a69a to b7503bf
@johnhaddon, sorry for the delay. DigiPro, Open Source Day, and SIGGRAPH definitely sapped the free time.
On the CPU side, currently any attributes not retrieved this way (because they haven't been pre-declared with the input placement stuff) will continue to fall back on the RendererServices call. On the GPU side, I think we've taken for granted for a while that attributes, textures, etc., whose names can't be figured out by the time the runtime optimization is complete will probably fail or not be supported, though the ultimate degree to which it might be made to work is still TBD. As noted elsewhere in this discussion, the "placement" approach may eventually be subsumed or augmented by some more general "supply the code" approach via LLVM modules. I think this places a much higher technical and implementation burden on renderers, but I support the idea, and note that it also may restore the plausibility of supporting types of queries whose values can't be reduced to constants during the runtime optimization stage.

Definitely we would entertain PRs or requests for more flexible conversion rules. It's all evolving as we go. The patch you cite above looks sensible, I think. We would welcome a PR to add this to the logic of our convert_value.
@sfriedmapixar (and @AlexMWells who has also brought this up): I do like the "supply the bitcode and we'll add it to the module" approach. I think we would certainly want to add such a mechanism, probably accounting for almost everything we currently do with RendererServices. My intuition is that we would want this in addition to the input/output placement work, rather than 100% replacing it, merely because the placement approach lets renderers integrate OSL very simply, without taking on a substantial burden of additional code to write or adding LLVM/clang as a more direct dependency (not merely coming as a black box along with OSL, but requiring their attention and catering). But maybe a simpler way of saying that is: supplying module code may be the core implementation, which renderers may use when they need the flexibility, and the implementation of placement and RendererServices may migrate to simply using the default set of those modules that come already supplied by OSL.

If you have already gone down this road and figured out what works well, I would definitely encourage (if not beg) you to either make a direct PR or spell it out in as much detail as possible. If at all possible for you, that would seem much more efficient than my fumbling around to re-implement it myself from scratch and then having several back-and-forths in which you tell me what I did wrong and how to fix pitfalls that you already experienced and fixed when you did it the first time.
I was hoping to have input from @etheory, but this has languished long enough, so I'm going to merge so that I can move on to the next steps in this process. This is only in master and is known to be in flux, so I'm not worried if minor problems are uncovered and we wish to continue revising the approach.
Force-pushed from b7503bf to d05a932
So sorry @lgritz, I'll still have a look now anyway. This has been on my to-do list all week.
Thanks for the update @lgritz - I'll carry on with the RendererServices approach for now, but will look into a PR and/or the "supply the code" approach when we come to upgrade to the latest and greatest version.
@lgritz is there a timeframe for when we can expect the same as this for get_attribute calls? And if not, how could we help do this? I don't think we are confident enough to write it all ourselves, but we could definitely contribute to something. @AlexMWells I recall that you added stuff for get_transform and transform_points, but I can't seem to find that PR. We would need something within the next 2 months or we'd be in some trouble on our end. How could we make this happen? Tagging @curtisblack
@etheory the current design we have intends for renderer services to be implemented by bitcode free functions, with a fallback that lets non-bitcode (precompiled on the host) free functions be registered so that shader llvm code gen can generate calls to them. The main difference between bitcode and precompiled is that bitcode is included in the same module as the shader llvm code gen and can be inlined/optimized as such (and would be required for GPUs), whereas a precompiled free function requires a real function call. There is already a means for these bitcode free function renderer services in the main branch, but only for those called by opmatrix.cpp so far. We have a PR almost ready to go which adds free function based error reporting, printf, warning, and error messages (there are obviously more renderer services calls to make free function versions of, but the general pattern is there and working).

Now for getattribute, we have a design we like, which shifts most of the spaghetti code of handling different attribute names and types from runtime to compile time. This is done by allowing a renderer to provide (register) different function names and/or parameters for getattribute shader calls. So during llvm code gen, the renderer is given the attribute name and type and is asked for a specific function name and constant parameters to pass to that function. The beautiful thing is that when bitcode is provided, those calls can be inlined and do whatever they want, such as dereference an index-based lookup table inside the renderer using an index/offset known at compile time (because the renderer provided the index/offset at compile time). More complex renderers could do multiple dereferences or data decompression if needed.

To implement this, we would introduce the concept of an AttributeGetterSpec, which has the name of the function for llvm code gen to call, along with a list of extra compile-time constant arguments to pass the function. So at code gen time, the renderer would be asked to fill out an AttributeGetterSpec for each combination of attribute name and type. The code gen would generate calls to the function named in the AttributeGetterSpec along with any additional compile-time constant arguments. The renderer would have supplied the bitcode for those functions to the ShadingSystem, so it all gets inlined/optimized together. And for backwards compatibility (for non-bitcode) on the host, an actual host-side function address could be specified as part of the AttributeGetterSpec and plumbed into the lazy function helper, which is basically the linker marrying precompiled shader functions (noise, spline, etc.) with llvm code gen; no reason it could not handle precompiled renderer free functions as well.

Here is what the design looks like (this also has a replacement for populating shader globals):

That was the direction we were heading; you can see if it applies to your situation and run with it. If the design doesn't address your issues, we would like to know that as well.
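A rough sketch of the AttributeGetterSpec idea described above might look like the following. The struct name comes from the comment, but every other name here (resolve_getter, rs_get_object_id, the table, the attribute names) is invented for illustration and is not the actual OSL design: at code-gen time the renderer maps an attribute name to a function name plus compile-time constants, and the generated call then reduces to an indexed load with no runtime name decoding.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// What the renderer hands back at code-gen time for a given
// (attribute name, type) combination.
struct AttributeGetterSpec {
    std::string function_name;        // free function for code gen to call
    std::vector<int64_t> const_args;  // extra compile-time constants
};

// Hypothetical renderer-side resolution, invoked during llvm code gen.
inline AttributeGetterSpec resolve_getter(const std::string& attr_name) {
    if (attr_name == "object:id")
        return {"rs_get_object_id", {7}};  // 7 = precomputed table index
    return {"rs_get_attr_generic", {}};    // fallback: decode at runtime
}

// The bitcode free function: a single indexed load, no name lookup,
// fully inlinable when supplied as bitcode in the same module.
static const float g_attr_table[16] = {0, 0, 0, 0, 0, 0, 0, 42.0f};
extern "C" float rs_get_object_id(int64_t index) {
    return g_attr_table[index];
}
```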
Hi @AlexMWells, thanks for the explanation. That all sounds very useful; I can see a lot of potential for it to simplify and streamline things on our end. For our usage we know the memory locations of all attributes at shader compile time, so our aim is to replace any named lookup with a direct memory read at an offset from a base pointer inside the shader. If I'm understanding correctly, this sounds possible within your proposed design, so it would work perfectly for us. I look forward to seeing this developed. Is it too early to estimate when this might be available or how much work is involved?
This is analogous to the "output placement" that we added recently, but
for input of userdata.
Reminder: "userdata" means shader parameters marked as lockgeom=0 that
are expected to be supplied on a per-shade basis, such as interpolated
vertex data on geometric primitives. The primitive data binds by name to
matching shader parameters.
This patch concerns how the shader gets this per-point data.
Old school: in the middle of the shader, when userdata is needed, the
shader calls osl_bind_interpolated_param(), which makes a callback to
RendererServices::get_userdata(), which... does something magical and
renderer-specific to retrieve the data by name. This is expensive and
also challenging to implement on the GPU (which is why it remains
unimplemented).
This patch changes the methodology to the following, which is somewhat
analogous to how we did output placement:
0. It is assumed that there is an arena of memory where the userdata
is going to be, and the pointer to its start is passed down to the
shader via execute() (just like how the shaderglobals and output
arena pointers are passed), along with an index of the current
shade point.
1. The renderer pre-declares SymLocationDesc records for the userdata,
which specify the offsets and strides within the arena where the
data can be found.
2. Prior to launching the shader execute(), the renderer is expected
to fill in those values in the userdata arena for the points to be
shaded.
3. When JITing the part of the shader that initializes those userdata
parameter values, we simply JIT a simple data copy directly from
the computed position within the arena (offsetting the right amount
for this data, at this shade index) -- just a couple of
instructions per shade, no function calls, no RendererServices
callbacks, no name lookups!
4. There is no 4. That's it.
See the changes to osldeformer.cpp for an example of how this looks from
the renderer side. It's really straightforward.
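The per-parameter copy that step 3 JITs amounts to something like the following (a sketch only; copy_userdata and its parameter names are illustrative, not the actual generated code): base pointer plus offset plus index times stride, then a straight copy into the parameter's storage.

```cpp
#include <cstring>

// Equivalent C++ for the couple of instructions the JIT emits per
// userdata parameter: no name lookup, no callback, just arithmetic
// and a copy. (All names here are hypothetical.)
inline void copy_userdata(void* param,                    // parameter storage
                          const void* userdata_base_ptr,  // arena start
                          size_t offset, size_t stride,
                          int shade_index, size_t nbytes) {
    const unsigned char* src =
        static_cast<const unsigned char*>(userdata_base_ptr)
        + offset + size_t(shade_index) * stride;
    std::memcpy(param, src, nbytes);
}
```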
Currently, if there is no SymLocationDesc for a lockgeom=0 parameter
that needs initialization, it will fall back to the old callback. We
may fully deprecate the old way, but not for a while, so this is back
compatible for now.
This seems to work, passes all tests, doesn't break anything, and works
for the osldeformer example!
There is follow-up work to do: I think we can pretty easily do the
same thing for ShaderGlobals ("make your own SG struct"), named matrix
retrieval, and other attributes. Will do those in subsequent PRs if
this meets everybody's approval.
Signed-off-by: Larry Gritz [email protected]