PR/OptBatchedAnalysis #1318

AlexMWells · 2020-12-22T19:12:55Z

Add ShadingSystem attribute "opt_batched_analysis"(defaults to false) which triggers additional analysis during runtime optimization to set Symbol::is_uniform, Symbol::forced_llvm_bool(), and Opcode::requires_masking. Only coalesce_temps for symbols whose is_uniform() and forced_llvm_bool() values match. Update testshade to set opt_batched_analysis when batched execution is requested.

Description

Tests

Checklist:

I have read the contribution guidelines.
I have previously submitted a Contributor License Agreement.
I have updated the documentation, if applicable.
I have ensured that the change is tested somewhere in the testsuite (adding new test cases if necessary).
My code follows the prevailing code style of this project.

… which triggers additional analysis during runtime optimization to set Symbol::is_uniform, Symbol::forced_llvm_bool(), and Opcode::requires_masking. Only coalesce_temps for symbols whose is_uniform() and forced_llvm_bool() values match. Update testshade to set opt_batched_analysis when batched execution is requested.

…s a BatchedRendererServices for WidthOf<16> or WidthOf<8>. Changed testshade from explicitly enabling opt_batched_analysis to explicitly disabling it when not in batched execution mode. Removed test for opt_batched_analysis from coalesce symbols pass and just rely on a Symbol's is_uniform() and forced_llvm_bool() to be set correctly.

lgritz · 2020-12-25T01:44:13Z

src/liboslexec/runtimeoptimize.cpp

@@ -2718,7 +2719,10 @@ RuntimeOptimizer::coalesce_temporaries ()
                if (coalescable (*t) &&
                      equivalent (s->typespec(), t->typespec()) &&
                      s->has_derivs() == t->has_derivs() &&
-                      (slast < t->firstuse() || sfirst > t->lastuse())) {
+                      (slast < t->firstuse() || sfirst > t->lastuse()) &&
+                      ( (!m_opt_batched_analysis) ||


Just curious... is this test !m_opt_batched_analysis necessary? If the analysis is not done, then all the symbols will match each other in their is_uniform and forced_llvm_bool, right? Those will never be different without batched analysis, correct?

Currently, if m_opt_batched_analysis is false, then is_uniform == true && forced_llvm_bool == false for all symbols, so you are correct there. However as analysis of forcing boolean was originally done independent of the uniform vs varying analysis, and theoretically could be used by scalar code I thought it would be useful to call out that the extra restrictions are only necessary to support the batched code generator.

OK, we can drop the "(!m_opt_batched_analysis) ||" portion, as the 2 flags being checked should be set properly to function during scalar and we aren't really saving much by skipping checking to 2 symbol flags vs. 1 global flag.

lgritz · 2020-12-25T01:49:17Z

I'm not sure what the attribute is for. Why not just do the analysis if it's a SS that supports batch shading (which you can tell from the rendererservices, right?), and not if you don't? As I pointed out inline, I think it's not actually necessary to test the state of the option in coalesce_temporaries. So maybe the whole option is not necessary?

Would you ever want to disable the analysis if you had a batched renderer, or enable it if you didn't?

AlexMWells · 2021-01-04T19:14:04Z

RendererServices::batched(WidthOf<16>) and/or RendererServices::batched(WidthOf<8>) right now return a BatchedRendererServices<#> *. We could see if either returns a non-nullptr and choose to perform batched analysis or not. But renderers may support batching but have controls to enable/disable it (testshade --batched), so those controls would need to be plumbed to the RendererServices methods in that case. That would also require BatchedRendererServices to run the batched analysis (which for debugging it might be nice to do in scalar mode).

I think I wanted a simple way to enable/disable optimizing for batching or not independent of a renderers actually support. Main use case is disabling batching and not paying the overhead of batched analysis. This does perhaps make me thing this is more of a ShadingGroup decision that ShadingSystem.

AlexMWells · 2021-01-04T19:25:47Z

I would propose this opt_batched_analysis be moved into ShaderGroup, then render can choose per shadergroup if it should be optimized to allow batching or not. Would be logic bug (assert) if attempt to batched execution shading group that didn't have the attribute set.

lgritz · 2021-01-04T19:34:13Z

It does seem strange to me that if you're using batch shading, there's this additional attribute you need to set separately for every group, otherwise you'll have errors. The feels like an unnecessary extra step that will frustrate people.

Maybe... turn it on if the renderer seems to support batched shading, but in the special case where (a) the renderer supports batched shading, and (b) the app isn't going to use batched shading, and (c) the batched analysis is a noticeable perf issue, THEN have an attribute that allows you to force it off even though it looks like it should be turned on?

AlexMWells · 2021-01-04T20:07:13Z

Sounds reasonable, should it be a shading group attribute, so it can be controlled individually?
And naming suggestions for the attribute? Could leave it as
opt_batched_analysis
and they need to set it to 0 to disable. It would default to enabled if RendererServices::batched(WidthOf<16>) or RendererServices::batched(WidthOf<8>) return non-null.

Or have it be an explicit disable:
disable_opt_batched_analysis=1

lgritz · 2021-01-04T20:31:21Z

Let's keep it as "opt_batched_analysis", and like you said, set to 0 explicitly to disable. By default, it will enable if the RS appears to support batch.

I don't have a strong feeling for whether it should be per-group. I think it's ok globally, too, which would basically give you three modes:

Renderer doesn't suppport batch, so batched analysis is not done.
Renderer supports batch, batched analysis is on by default.
Renderer supports batch but the app is not going to use it (at all), and so disables analysis via the attribute.

We are losing the (I think uncommon) case where the renderer supports batch, and the app will use it on some groups but not others, and wants to turn it off group-by-group. (Is the analysis expensive enough to care?) But on the other hand, this gives us the extra flexibility of being able to use or not use batch shading on any group in this case, without having to pre-declare for each group our intent ahead of time. But if the batched analysis is expensive and you feel strongly that it needs to be able to be turned off for just a subset of groups, then I won't object to making it a per-group attribute.

AlexMWells · 2021-01-04T20:55:58Z

Analysis cost is not free, but its not terribly expensive, somewhat linear with # of operations.
Goal 1: Don't accidentally incur overhead. For example testshade when --batch is not specified, testshade supports batching but we are choosing not to use it, so would want to explicitly set "opt_batched_analysis=0" either in each ShadingGroup or ShadingSystem.
Goal 2: Allow renderer to pick and choose which shading groups it will support batching, perhaps from profile guided tuning an renderer decides to disable batching on a subset of shading groups and wishes to not have any additional overhead (even if it is small).

I think you have a good point that it should not be a logic error and just work if a renderer has BatchedRendererServices. So lets assume it's enabled by default:

Questions:

Do we want to be able to disable batched analysis only at the ShadingSystem?
Do we want to be able to disable batched analysis only at the ShaderGroup?
Do we just always want to do batched analysis (no way to avoid overhead) when a BatchedRendererServices exists and not have any additional attributes?
Do we want to be able to disable batched analysis both at the ShadingSystem and ShaderGroup?

I requested some additional input on this, so lets see if more comments come in...

lgritz · 2021-01-04T21:06:54Z

Goal 1: Don't accidentally incur overhead. For example testshade when --batch is not specified, testshade supports batching but we are choosing not to use it, so would want to explicitly set "opt_batched_analysis=0" either in each ShadingGroup or ShadingSystem.

Slight refinement: Don't accidentally incur significant overhead. If the batched analysis is already small compared to the rest of the runtime optimization step and JIT, then I'm not inclined to add any app-side complexity by making it selectable.

Goal 2: Allow renderer to pick and choose which shading groups it will support batching, perhaps from profile guided tuning an renderer decides to disable batching on a subset of shading groups and wishes to not have any additional overhead (even if it is small).

I have no intuition about whether this is important, so I defer completely to the experience of those using this code already in batched renderers. If the PRMan folks say that it's important for performance to run only some things batched and not others (and in particular, to also not even do the analysis where not strictly necessary), then it's case closed and we definitely want an option per-group. But if they say "who cares", I lean toward the choice that makes fewer options and less app-side complexity.

sfriedmapixar · 2021-01-05T01:43:43Z

As far as making sure we disable the batched analysis when optimizing a non-batched shader, I agree that the bar is "significant" overhead. This PR isn't quite what we have deployed as Alex has done some clean-up and improvement since then on it, so I don't have any production measurement of what the overhead here is.

I will say that in our experience, some times it is good to run the batched and others it's good to run the non-batched in the same render -- YMMV based on your renderer -- so the per-shader-group setting would be needed if the overhead is significant, and it would be reasonable for the default to depend on whether any batched shading would happen or not (and up to the renderer to turn it off at a finer grain if necessary).

A third and slightly different goal I wanted to mention is to enable debugging of the uniform optimization. It's great that we get this optimization now without the RSL-era uniform/varying keywords having to be manually placed by the shader authors, however it is non-trivial and I have run into bugs in this in production. So as a renderer developer, it would be nice to have a knob setting that always assumes varying in batched mode to help debug that particular pass of optimization -- which is different than the assumption of uniform made here.

lgritz · 2021-01-05T02:05:47Z

My advice is to add the explicit opt_batched_analysis as global, and if the analysis time appears to be important, it is not hard to add per-group override later.

Yes, a "make everything varying" option sounds like a great idea! That also has the benefit of being able to turn it on just to more easily measure precisely how much we're saving by exploiting uniforms at all.

You may be amused to know that in the earliest days of OSL -- before perf issues sent us down the LLVM JIT road, back when it was envisioned as a grid-based interpreter a la Reyes (even though we were using it as a ray tracer, we would try very hard to batch rays by material) -- we had uniform/varying analysis, and it was by far the coolest of any renderer I'd worked on. It wasn't merely that it analyzed the shader to figure out which things were uniform or varying without needing those keywords in the language. No, it was better. It did it dynamically. That is, it figured out what was "definitely uniform" and "maybe varying", and for the latter it would allocate enough space for the whole grid of values, but treated it as uniform (only using slot [0]) until something varying actually got assigned to it (or something uniform when all runflags were not on), at which point it would copy slot [0] to all the other slots and treat it varying thereafter, for that execution of the shader on that grid. And if the then-varying variable was subsequently assigned a uniform value and all grid runflags were turned on, it would demote it back to uniform! All this was an additional savings, because actually even the variables that needed to be varying didn't need to do varying operations for their whole lifetimes.

It was the slickest Reyes interpreter I ever wrote. But alas, in a ray tracer, it was very hard to make big enough grids to be as performant as we liked. It needed to be comparable speed to our old C compiled shaders to be accepted by production, and it sure wasn't, a big crisis putting the whole project in jeopardy. So we scrapped it for single-point-shading but fully JITed to machine code using LLVM, and that's what got us ready for production (in fact, we ended up faster than the native C shaders, because of all the runtime optimization). On the renderer side, we also were able to then simplify by removing the complicated code dedicated to trying to generate coherent batches, and that was also nice.

AlexMWells · 2021-01-05T02:37:57Z

So this pull request as is has the "opt_batched_analysis as global", right? The difference might be the default, right now it just defaults to false, instead it could default to "true" if a BatchRendererServices is provided by RendererServices. Then in testshade, instead of explicitly enabling opt_batched_analysis=1, it would explicitly disable opt_batched_analysis=0 when --batched command line option is not present.

lgritz · 2021-01-05T04:27:36Z

Yes, that sounds right to me, Alex.

AlexMWells · 2021-01-05T21:26:12Z

ok, PR has been updated as requested, passing checks except CI / Linux bleeding edge: gcc10, C++17, llvm11, oiio/ocio/exr/pybind-master, avx2 (pull_request) whose issue appears unrelated.

lgritz · 2021-01-05T21:31:36Z

Yeah, looks like an API-breaking change in OpenColorIO master is breaking our "bleeding edge" test. Not related to your patch at all.

lgritz

LGTM

AlexMWells added 2 commits February 3, 2016 13:32

lgritz reviewed Dec 25, 2020

View reviewed changes

lgritz approved these changes Jan 5, 2021

View reviewed changes

lgritz merged commit 9d8aaab into AcademySoftwareFoundation:master Jan 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PR/OptBatchedAnalysis #1318

PR/OptBatchedAnalysis #1318

AlexMWells commented Dec 22, 2020

lgritz Dec 25, 2020

AlexMWells Jan 4, 2021

AlexMWells Jan 5, 2021

lgritz commented Dec 25, 2020

AlexMWells commented Jan 4, 2021

AlexMWells commented Jan 4, 2021

lgritz commented Jan 4, 2021

AlexMWells commented Jan 4, 2021

lgritz commented Jan 4, 2021

AlexMWells commented Jan 4, 2021

lgritz commented Jan 4, 2021

sfriedmapixar commented Jan 5, 2021

lgritz commented Jan 5, 2021

AlexMWells commented Jan 5, 2021

lgritz commented Jan 5, 2021

AlexMWells commented Jan 5, 2021

lgritz commented Jan 5, 2021

lgritz left a comment

PR/OptBatchedAnalysis #1318

PR/OptBatchedAnalysis #1318

Conversation

AlexMWells commented Dec 22, 2020

Description

Tests

Checklist:

lgritz Dec 25, 2020

Choose a reason for hiding this comment

AlexMWells Jan 4, 2021

Choose a reason for hiding this comment

AlexMWells Jan 5, 2021

Choose a reason for hiding this comment

lgritz commented Dec 25, 2020

AlexMWells commented Jan 4, 2021

AlexMWells commented Jan 4, 2021

lgritz commented Jan 4, 2021

AlexMWells commented Jan 4, 2021

lgritz commented Jan 4, 2021

AlexMWells commented Jan 4, 2021

lgritz commented Jan 4, 2021

sfriedmapixar commented Jan 5, 2021

lgritz commented Jan 5, 2021

AlexMWells commented Jan 5, 2021

lgritz commented Jan 5, 2021

AlexMWells commented Jan 5, 2021

lgritz commented Jan 5, 2021

lgritz left a comment

Choose a reason for hiding this comment