Pr/batched spline #1422

AlexMWells · 2021-10-11T23:27:28Z

Description

Implement spline and spline inverse for batched SIMD implementation.

Added

        template <int M00, int M01, int M02, int M03,
                  int M10, int M11, int M12, int M13,
                  int M20, int M21, int M22, int M23,
                  int M30, int M31, int M32, int M33, int DivisorT>
        struct sfm::StaticMatrix44

to model a matrix whose elements are known at compile time. Internally it utilizes proxys instead of floats in order to optimize for special cases of 0, 1, -1 elements and not even generate math when possible. The StaticMatrix44 is used by wide_opspline.cpp to create specialized functions for each of the spline basis types. A dispatch function routes the compile time optimized spline or spline inverse routine.

Altered splineimpl.h to provide "int basis_type_of(StringParam basis_name)" to allow the batched implementation to map a string to spline enum value used as an index in the dispatch function to functions optimized for each type of spline basis (instead of generic 4x4 matrix implementation)

Added "T sfm::select_val(bool cond, const T left, const T right)" as alternative to the ternary operator which appeared to possibly take address of its arguments which prevented privatization of those variables when compiled inside a SIMD loop.

Added note to BatchedAnalysis implementation identifying that local variables scope is lost between .osl and .oso and when a function is inlined multiple times, its local variables could be treated as varying when they could have been uniform if somehow identified as a local variable by the Analyzer.

Performance observations: initial testing shows for splineinverse shows up to 1.5x speedup dispatching to the compile time optimized StaticMatrix44 versions. Vectorization of 16 data lanes (512bit) can speedup to 2x+ on top of that.

Tests

Enable BATCHED testing of:

spline-boundarybug
spline-derivbug
spline-nonuniformknots
spline
splineinverse-ident
splineinverse

Added new BATCHED_REGRESSION tests:

spline-reg
splineinverse-knots-ascend-reg
splineinverse-knots-descend-reg

Checklist:

I have read the contribution guidelines.
I have previously submitted a Contributor License Agreement.
I have updated the documentation, if applicable.
I have ensured that the change is tested somewhere in the testsuite (adding new test cases if necessary).
My code follows the prevailing code style of this project.

Added batched llvm code gen for: llvm_gen_blackbody, llvm_gen_luminance, llvm_gen_transformc (including code gen of loop to identify unique "from" and "to" space combinations to call library function with). Made ColorSystem const correct. Pulled much of ColorSystem's implementation into src/liboslexec/opcolor_impl.h so it could be inlined in both opcolor.cpp and wide_opcolor.cpp. To enable SIMD fast path for ColorSystem::blackbody_rgb added new methods to expose underlying lookup table optimization. bool can_lookup_blackbody(float T /*Kelvin*/) const; Color3 lookup_blackbody_rgb (float T /*Kelvin*/) const; Color3 compute_blackbody_rgb (float T /*Kelvin*/) const; When can_lookup_blackbody() returns true then lookup_blackbody_rgb can be safely called, otherwise the expensive compute_blackbody_rgb is required. Updated implementation of ColorSystem to succesfully vectorize on multiple compilers, optimized lookup table data types and indexed dereferences to generate prefered 32bit indexed gathers vs 64bit gathers which require multiple instructions and registers. To improve code generation and less chances of aliasing issues, changed nested M[i][j] operator access to Matrix in ImathMatrix.h to directly access underlying 2d array M.x[i][j] also changed Vector operator V[i] to directly access V.x, V.y, V.z in OSL/IMathx/IMath*.h Moved helper testIfAnyLaneIsNonZero from wide_opmatrix.cpp to be testIfAnyLaneIsNonZero in OSL/wide.h Added sfm::min_val to provide an implementation of min which returns values not references to the original through a ternary (which can cause issues with vectorization). Added Clang specific sfm::min_val and sfm::max_val implementations to assist in vectorization. Enabled BATCHED execution of existing testsuites: blackbody, color, transformc, wavelength_color Added new testsuite BATCHED regression tests: blackbody-reg, color-reg, luminance-reg, transformc-reg, wavelength_color-reg Added sfm::min_val to provide an implementation of min which returns values not references to the original through a ternary (which can cause issues with vectorization). Added Clang specific sfm::min_val and sfm::max_val implementations to assist in vectorization. Signed-off-by: Alex M. Wells <[email protected]>

Use single precision for bb_spectrum object/operator and manually implement std::pow(wlm,-5.0f) to better enable vectorization. Reduced masked code generation for rgb_to_hsl. Removed duplicated code by implementing ColorSystem::blackbody_rgb(T) in terms of ColorSystem::can_lookup_blackbody(T), ColorSystem::lookup_blackbody_rgb(T), and ColorSystem::compute_blackbody_rgb(T). Removed some casting to uint16_t that was unnecessary. Removed some test shaders that were not intended to be promoted. Signed-off-by: Alex M. Wells <[email protected]>

Added batched llvm code gen for: llvm_gen_blackbody, llvm_gen_luminance, llvm_gen_transformc (including code gen of loop to identify unique "from" and "to" space combinations to call library function with). Made ColorSystem const correct. Pulled much of ColorSystem's implementation into src/liboslexec/opcolor_impl.h so it could be inlined in both opcolor.cpp and wide_opcolor.cpp. To enable SIMD fast path for ColorSystem::blackbody_rgb added new methods to expose underlying lookup table optimization. bool can_lookup_blackbody(float T /*Kelvin*/) const; Color3 lookup_blackbody_rgb (float T /*Kelvin*/) const; Color3 compute_blackbody_rgb (float T /*Kelvin*/) const; When can_lookup_blackbody() returns true then lookup_blackbody_rgb can be safely called, otherwise the expensive compute_blackbody_rgb is required. Updated implementation of ColorSystem to succesfully vectorize on multiple compilers, optimized lookup table data types and indexed dereferences to generate prefered 32bit indexed gathers vs 64bit gathers which require multiple instructions and registers. To improve code generation and less chances of aliasing issues, changed nested M[i][j] operator access to Matrix in ImathMatrix.h to directly access underlying 2d array M.x[i][j] also changed Vector operator V[i] to directly access V.x, V.y, V.z in OSL/IMathx/IMath*.h Moved helper testIfAnyLaneIsNonZero from wide_opmatrix.cpp to be testIfAnyLaneIsNonZero in OSL/wide.h Added sfm::min_val to provide an implementation of min which returns values not references to the original through a ternary (which can cause issues with vectorization). Added Clang specific sfm::min_val and sfm::max_val implementations to assist in vectorization. Enabled BATCHED execution of existing testsuites: blackbody, color, transformc, wavelength_color Added new testsuite BATCHED regression tests: blackbody-reg, color-reg, luminance-reg, transformc-reg, wavelength_color-reg Added sfm::min_val to provide an implementation of min which returns values not references to the original through a ternary (which can cause issues with vectorization). Added Clang specific sfm::min_val and sfm::max_val implementations to assist in vectorization. Signed-off-by: Alex M. Wells <[email protected]>

Use single precision for bb_spectrum object/operator and manually implement std::pow(wlm,-5.0f) to better enable vectorization. Reduced masked code generation for rgb_to_hsl. Removed duplicated code by implementing ColorSystem::blackbody_rgb(T) in terms of ColorSystem::can_lookup_blackbody(T), ColorSystem::lookup_blackbody_rgb(T), and ColorSystem::compute_blackbody_rgb(T). Removed some casting to uint16_t that was unnecessary. Removed some test shaders that were not intended to be promoted. Signed-off-by: Alex M. Wells <[email protected]>

Added template <int M00, int M01, int M02, int M03, int M10, int M11, int M12, int M13, int M20, int M21, int M22, int M23, int M30, int M31, int M32, int M33, int DivisorT> struct sfm::StaticMatrix44 to model a matrix whose elements are known at compile time. Internally it utilizes proxys instead of floats in order to optimize for special cases of 0, 1, -1 elements and not even generate math when possible. The StaticMatrix44 is used by wide_opspline.cpp to create specialized functions for each of the spline basis types. A dispatch function routes the compile time optimized spline or spline inverse routine. Altered splineimpl.h to provide "int basis_type_of(StringParam basis_name)" to allow the batched implementation to map a string to spline enum value used as an index in the dispatch function to functions optimized for each type of spline basis (instead of generic 4x4 matrix implementation) Added "T sfm::select_val(bool cond, const T left, const T right)" as alternative to the ternary operator which appeared to possibly take address of its arguments which prevented privatization of those variables when compiled inside a SIMD loop. Added note to BatchedAnalysis implementation identifying that local variables scope is lost between .osl and .oso and when a function is inlined multiple times, its local variables could be treated as varying when they could have been uniform if somehow identified as a local variable by the Analyzer. Enable BATCHED testing of: spline-boundarybug spline-derivbug spline-nonuniformknots spline splineinverse-ident splineinverse Added new BATCHED_REGRESSION tests: spline-reg splineinverse-knots-ascend-reg splineinverse-knots-descend-reg Signed-off-by: Alex M. Wells <[email protected]>

Signed-off-by: Alex M. Wells <[email protected]>

…nverse as OIIO::invert has a loop which was called from inside the search interval loop, created a nested loop. Under SIMD/SIMT the nested loop could execute incoherently. Instead we choose to use the class sfm::Inverter to manage state between is_bracketed_by(min,max), meant to be called from the search interval loop, and bracketed_invert(min,max) meant to be called after exiting the search interval loop, avoiding a nested loop that could be executed at different interval's for each SIMD/SIMT lane/thread. Signed-off-by: Alex M. Wells <[email protected]>

Signed-off-by: Alex M. Wells <[email protected]>

lgritz · 2021-10-19T05:11:35Z

src/include/OSL/sfm_staticmatrix.h

@@ -0,0 +1,479 @@
+// Copyright Contributors to the Open Shading Language project.


Aside: I wonder if this file should be in src/liboslexec (or even in src/liboslexec/wide)?

I imagine it will eventually get used by the scalar code gen if the code gen generates calls to spline's whose basis is known at compile time: spline_bspline(....), spline_hermite(....). So if moving it, probably src/liboslexec. It is an implementation detail at this point. So I can move it.

I would like to use it in the scalar code, too. But I'm not sure we want to include it in the exported public headers. Anything exposed there requires a lot more caution about any future deprecation or ABI changes.

Understood, I'll move the file as part of the PR/BatchedTexture.

Or feel free to move it, sooner

There's no rush, whenever you're touching it next.

lgritz

This all looks fine. Will merge.

AlexMWells added 8 commits October 11, 2021 16:10

clang-format of src/include/OSL/sfm_staticmatrix.h

814e593

Signed-off-by: Alex M. Wells <[email protected]>

Fix maybe-uninitialized warning for gcc builds

d4258ec

Signed-off-by: Alex M. Wells <[email protected]>

lgritz reviewed Oct 19, 2021

View reviewed changes

lgritz approved these changes Oct 19, 2021

View reviewed changes

lgritz merged commit 03a31f5 into AcademySoftwareFoundation:master Oct 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pr/batched spline #1422

Pr/batched spline #1422

AlexMWells commented Oct 11, 2021

lgritz Oct 19, 2021

AlexMWells Oct 19, 2021

lgritz Oct 19, 2021

AlexMWells Oct 19, 2021

AlexMWells Oct 19, 2021

lgritz Oct 19, 2021

lgritz left a comment

		@@ -0,0 +1,479 @@
		// Copyright Contributors to the Open Shading Language project.

Pr/batched spline #1422

Pr/batched spline #1422

Conversation

AlexMWells commented Oct 11, 2021

Description

Tests

Checklist:

lgritz Oct 19, 2021

Choose a reason for hiding this comment

AlexMWells Oct 19, 2021

Choose a reason for hiding this comment

lgritz Oct 19, 2021

Choose a reason for hiding this comment

AlexMWells Oct 19, 2021

Choose a reason for hiding this comment

AlexMWells Oct 19, 2021

Choose a reason for hiding this comment

lgritz Oct 19, 2021

Choose a reason for hiding this comment

lgritz left a comment

Choose a reason for hiding this comment