-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PR/OptBatchedAnalysis #1318
PR/OptBatchedAnalysis #1318
Conversation
… which triggers additional analysis during runtime optimization to set Symbol::is_uniform, Symbol::forced_llvm_bool(), and Opcode::requires_masking. Only coalesce_temps for symbols whose is_uniform() and forced_llvm_bool() values match. Update testshade to set opt_batched_analysis when batched execution is requested.
…s a BatchedRendererServices for WidthOf<16> or WidthOf<8>. Changed testshade from explicitly enabling opt_batched_analysis to explicitly disabling it when not in batched execution mode. Removed test for opt_batched_analysis from coalesce symbols pass and just rely on a Symbol's is_uniform() and forced_llvm_bool() to be set correctly.
src/liboslexec/runtimeoptimize.cpp
Outdated
@@ -2718,7 +2719,10 @@ RuntimeOptimizer::coalesce_temporaries () | |||
if (coalescable (*t) && | |||
equivalent (s->typespec(), t->typespec()) && | |||
s->has_derivs() == t->has_derivs() && | |||
(slast < t->firstuse() || sfirst > t->lastuse())) { | |||
(slast < t->firstuse() || sfirst > t->lastuse()) && | |||
( (!m_opt_batched_analysis) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious... is this test !m_opt_batched_analysis
necessary? If the analysis is not done, then all the symbols will match each other in their is_uniform and forced_llvm_bool, right? Those will never be different without batched analysis, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, if m_opt_batched_analysis is false, then is_uniform == true && forced_llvm_bool == false for all symbols, so you are correct there. However as analysis of forcing boolean was originally done independent of the uniform vs varying analysis, and theoretically could be used by scalar code I thought it would be useful to call out that the extra restrictions are only necessary to support the batched code generator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, we can drop the "(!m_opt_batched_analysis) ||" portion, as the 2 flags being checked should be set properly to function during scalar and we aren't really saving much by skipping checking to 2 symbol flags vs. 1 global flag.
I'm not sure what the attribute is for. Why not just do the analysis if it's a SS that supports batch shading (which you can tell from the rendererservices, right?), and not if you don't? As I pointed out inline, I think it's not actually necessary to test the state of the option in coalesce_temporaries. So maybe the whole option is not necessary? Would you ever want to disable the analysis if you had a batched renderer, or enable it if you didn't? |
RendererServices::batched(WidthOf<16>) and/or RendererServices::batched(WidthOf<8>) right now return a BatchedRendererServices<#> *. We could see if either returns a non-nullptr and choose to perform batched analysis or not. But renderers may support batching but have controls to enable/disable it (testshade --batched), so those controls would need to be plumbed to the RendererServices methods in that case. That would also require BatchedRendererServices to run the batched analysis (which for debugging it might be nice to do in scalar mode). I think I wanted a simple way to enable/disable optimizing for batching or not independent of a renderers actually support. Main use case is disabling batching and not paying the overhead of batched analysis. This does perhaps make me thing this is more of a ShadingGroup decision that ShadingSystem. |
I would propose this opt_batched_analysis be moved into ShaderGroup, then render can choose per shadergroup if it should be optimized to allow batching or not. Would be logic bug (assert) if attempt to batched execution shading group that didn't have the attribute set. |
It does seem strange to me that if you're using batch shading, there's this additional attribute you need to set separately for every group, otherwise you'll have errors. The feels like an unnecessary extra step that will frustrate people. Maybe... turn it on if the renderer seems to support batched shading, but in the special case where (a) the renderer supports batched shading, and (b) the app isn't going to use batched shading, and (c) the batched analysis is a noticeable perf issue, THEN have an attribute that allows you to force it off even though it looks like it should be turned on? |
Sounds reasonable, should it be a shading group attribute, so it can be controlled individually? Or have it be an explicit disable: |
Let's keep it as "opt_batched_analysis", and like you said, set to 0 explicitly to disable. By default, it will enable if the RS appears to support batch. I don't have a strong feeling for whether it should be per-group. I think it's ok globally, too, which would basically give you three modes:
We are losing the (I think uncommon) case where the renderer supports batch, and the app will use it on some groups but not others, and wants to turn it off group-by-group. (Is the analysis expensive enough to care?) But on the other hand, this gives us the extra flexibility of being able to use or not use batch shading on any group in this case, without having to pre-declare for each group our intent ahead of time. But if the batched analysis is expensive and you feel strongly that it needs to be able to be turned off for just a subset of groups, then I won't object to making it a per-group attribute. |
Analysis cost is not free, but its not terribly expensive, somewhat linear with # of operations. I think you have a good point that it should not be a logic error and just work if a renderer has BatchedRendererServices. So lets assume it's enabled by default: Questions:
I requested some additional input on this, so lets see if more comments come in... |
Slight refinement: Don't accidentally incur significant overhead. If the batched analysis is already small compared to the rest of the runtime optimization step and JIT, then I'm not inclined to add any app-side complexity by making it selectable.
I have no intuition about whether this is important, so I defer completely to the experience of those using this code already in batched renderers. If the PRMan folks say that it's important for performance to run only some things batched and not others (and in particular, to also not even do the analysis where not strictly necessary), then it's case closed and we definitely want an option per-group. But if they say "who cares", I lean toward the choice that makes fewer options and less app-side complexity. |
As far as making sure we disable the batched analysis when optimizing a non-batched shader, I agree that the bar is "significant" overhead. This PR isn't quite what we have deployed as Alex has done some clean-up and improvement since then on it, so I don't have any production measurement of what the overhead here is. I will say that in our experience, some times it is good to run the batched and others it's good to run the non-batched in the same render -- YMMV based on your renderer -- so the per-shader-group setting would be needed if the overhead is significant, and it would be reasonable for the default to depend on whether any batched shading would happen or not (and up to the renderer to turn it off at a finer grain if necessary). A third and slightly different goal I wanted to mention is to enable debugging of the uniform optimization. It's great that we get this optimization now without the RSL-era uniform/varying keywords having to be manually placed by the shader authors, however it is non-trivial and I have run into bugs in this in production. So as a renderer developer, it would be nice to have a knob setting that always assumes varying in batched mode to help debug that particular pass of optimization -- which is different than the assumption of uniform made here. |
My advice is to add the explicit opt_batched_analysis as global, and if the analysis time appears to be important, it is not hard to add per-group override later. Yes, a "make everything varying" option sounds like a great idea! That also has the benefit of being able to turn it on just to more easily measure precisely how much we're saving by exploiting uniforms at all. You may be amused to know that in the earliest days of OSL -- before perf issues sent us down the LLVM JIT road, back when it was envisioned as a grid-based interpreter a la Reyes (even though we were using it as a ray tracer, we would try very hard to batch rays by material) -- we had uniform/varying analysis, and it was by far the coolest of any renderer I'd worked on. It wasn't merely that it analyzed the shader to figure out which things were uniform or varying without needing those keywords in the language. No, it was better. It did it dynamically. That is, it figured out what was "definitely uniform" and "maybe varying", and for the latter it would allocate enough space for the whole grid of values, but treated it as uniform (only using slot [0]) until something varying actually got assigned to it (or something uniform when all runflags were not on), at which point it would copy slot [0] to all the other slots and treat it varying thereafter, for that execution of the shader on that grid. And if the then-varying variable was subsequently assigned a uniform value and all grid runflags were turned on, it would demote it back to uniform! All this was an additional savings, because actually even the variables that needed to be varying didn't need to do varying operations for their whole lifetimes. It was the slickest Reyes interpreter I ever wrote. But alas, in a ray tracer, it was very hard to make big enough grids to be as performant as we liked. It needed to be comparable speed to our old C compiled shaders to be accepted by production, and it sure wasn't, a big crisis putting the whole project in jeopardy. So we scrapped it for single-point-shading but fully JITed to machine code using LLVM, and that's what got us ready for production (in fact, we ended up faster than the native C shaders, because of all the runtime optimization). On the renderer side, we also were able to then simplify by removing the complicated code dedicated to trying to generate coherent batches, and that was also nice. |
So this pull request as is has the "opt_batched_analysis as global", right? The difference might be the default, right now it just defaults to false, instead it could default to "true" if a BatchRendererServices is provided by RendererServices. Then in testshade, instead of explicitly enabling opt_batched_analysis=1, it would explicitly disable opt_batched_analysis=0 when --batched command line option is not present. |
Yes, that sounds right to me, Alex. |
ok, PR has been updated as requested, passing checks except CI / Linux bleeding edge: gcc10, C++17, llvm11, oiio/ocio/exr/pybind-master, avx2 (pull_request) whose issue appears unrelated. |
Yeah, looks like an API-breaking change in OpenColorIO master is breaking our "bleeding edge" test. Not related to your patch at all. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Add ShadingSystem attribute "opt_batched_analysis"(defaults to false) which triggers additional analysis during runtime optimization to set Symbol::is_uniform, Symbol::forced_llvm_bool(), and Opcode::requires_masking. Only coalesce_temps for symbols whose is_uniform() and forced_llvm_bool() values match. Update testshade to set opt_batched_analysis when batched execution is requested.
Description
Tests
Checklist: