-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pr/batched spline #1422
Pr/batched spline #1422
Conversation
Added batched llvm code gen for: llvm_gen_blackbody, llvm_gen_luminance, llvm_gen_transformc (including code gen of loop to identify unique "from" and "to" space combinations to call library function with). Made ColorSystem const correct. Pulled much of ColorSystem's implementation into src/liboslexec/opcolor_impl.h so it could be inlined in both opcolor.cpp and wide_opcolor.cpp. To enable SIMD fast path for ColorSystem::blackbody_rgb added new methods to expose underlying lookup table optimization. bool can_lookup_blackbody(float T /*Kelvin*/) const; Color3 lookup_blackbody_rgb (float T /*Kelvin*/) const; Color3 compute_blackbody_rgb (float T /*Kelvin*/) const; When can_lookup_blackbody() returns true then lookup_blackbody_rgb can be safely called, otherwise the expensive compute_blackbody_rgb is required. Updated implementation of ColorSystem to succesfully vectorize on multiple compilers, optimized lookup table data types and indexed dereferences to generate prefered 32bit indexed gathers vs 64bit gathers which require multiple instructions and registers. To improve code generation and less chances of aliasing issues, changed nested M[i][j] operator access to Matrix in ImathMatrix.h to directly access underlying 2d array M.x[i][j] also changed Vector operator V[i] to directly access V.x, V.y, V.z in OSL/IMathx/IMath*.h Moved helper testIfAnyLaneIsNonZero from wide_opmatrix.cpp to be testIfAnyLaneIsNonZero in OSL/wide.h Added sfm::min_val to provide an implementation of min which returns values not references to the original through a ternary (which can cause issues with vectorization). Added Clang specific sfm::min_val and sfm::max_val implementations to assist in vectorization. Enabled BATCHED execution of existing testsuites: blackbody, color, transformc, wavelength_color Added new testsuite BATCHED regression tests: blackbody-reg, color-reg, luminance-reg, transformc-reg, wavelength_color-reg Added sfm::min_val to provide an implementation of min which returns values not references to the original through a ternary (which can cause issues with vectorization). Added Clang specific sfm::min_val and sfm::max_val implementations to assist in vectorization. Signed-off-by: Alex M. Wells <[email protected]>
Use single precision for bb_spectrum object/operator and manually implement std::pow(wlm,-5.0f) to better enable vectorization. Reduced masked code generation for rgb_to_hsl. Removed duplicated code by implementing ColorSystem::blackbody_rgb(T) in terms of ColorSystem::can_lookup_blackbody(T), ColorSystem::lookup_blackbody_rgb(T), and ColorSystem::compute_blackbody_rgb(T). Removed some casting to uint16_t that was unnecessary. Removed some test shaders that were not intended to be promoted. Signed-off-by: Alex M. Wells <[email protected]>
Added batched llvm code gen for: llvm_gen_blackbody, llvm_gen_luminance, llvm_gen_transformc (including code gen of loop to identify unique "from" and "to" space combinations to call library function with). Made ColorSystem const correct. Pulled much of ColorSystem's implementation into src/liboslexec/opcolor_impl.h so it could be inlined in both opcolor.cpp and wide_opcolor.cpp. To enable SIMD fast path for ColorSystem::blackbody_rgb added new methods to expose underlying lookup table optimization. bool can_lookup_blackbody(float T /*Kelvin*/) const; Color3 lookup_blackbody_rgb (float T /*Kelvin*/) const; Color3 compute_blackbody_rgb (float T /*Kelvin*/) const; When can_lookup_blackbody() returns true then lookup_blackbody_rgb can be safely called, otherwise the expensive compute_blackbody_rgb is required. Updated implementation of ColorSystem to succesfully vectorize on multiple compilers, optimized lookup table data types and indexed dereferences to generate prefered 32bit indexed gathers vs 64bit gathers which require multiple instructions and registers. To improve code generation and less chances of aliasing issues, changed nested M[i][j] operator access to Matrix in ImathMatrix.h to directly access underlying 2d array M.x[i][j] also changed Vector operator V[i] to directly access V.x, V.y, V.z in OSL/IMathx/IMath*.h Moved helper testIfAnyLaneIsNonZero from wide_opmatrix.cpp to be testIfAnyLaneIsNonZero in OSL/wide.h Added sfm::min_val to provide an implementation of min which returns values not references to the original through a ternary (which can cause issues with vectorization). Added Clang specific sfm::min_val and sfm::max_val implementations to assist in vectorization. Enabled BATCHED execution of existing testsuites: blackbody, color, transformc, wavelength_color Added new testsuite BATCHED regression tests: blackbody-reg, color-reg, luminance-reg, transformc-reg, wavelength_color-reg Added sfm::min_val to provide an implementation of min which returns values not references to the original through a ternary (which can cause issues with vectorization). Added Clang specific sfm::min_val and sfm::max_val implementations to assist in vectorization. Signed-off-by: Alex M. Wells <[email protected]>
Use single precision for bb_spectrum object/operator and manually implement std::pow(wlm,-5.0f) to better enable vectorization. Reduced masked code generation for rgb_to_hsl. Removed duplicated code by implementing ColorSystem::blackbody_rgb(T) in terms of ColorSystem::can_lookup_blackbody(T), ColorSystem::lookup_blackbody_rgb(T), and ColorSystem::compute_blackbody_rgb(T). Removed some casting to uint16_t that was unnecessary. Removed some test shaders that were not intended to be promoted. Signed-off-by: Alex M. Wells <[email protected]>
Added template <int M00, int M01, int M02, int M03, int M10, int M11, int M12, int M13, int M20, int M21, int M22, int M23, int M30, int M31, int M32, int M33, int DivisorT> struct sfm::StaticMatrix44 to model a matrix whose elements are known at compile time. Internally it utilizes proxys instead of floats in order to optimize for special cases of 0, 1, -1 elements and not even generate math when possible. The StaticMatrix44 is used by wide_opspline.cpp to create specialized functions for each of the spline basis types. A dispatch function routes the compile time optimized spline or spline inverse routine. Altered splineimpl.h to provide "int basis_type_of(StringParam basis_name)" to allow the batched implementation to map a string to spline enum value used as an index in the dispatch function to functions optimized for each type of spline basis (instead of generic 4x4 matrix implementation) Added "T sfm::select_val(bool cond, const T left, const T right)" as alternative to the ternary operator which appeared to possibly take address of its arguments which prevented privatization of those variables when compiled inside a SIMD loop. Added note to BatchedAnalysis implementation identifying that local variables scope is lost between .osl and .oso and when a function is inlined multiple times, its local variables could be treated as varying when they could have been uniform if somehow identified as a local variable by the Analyzer. Enable BATCHED testing of: spline-boundarybug spline-derivbug spline-nonuniformknots spline splineinverse-ident splineinverse Added new BATCHED_REGRESSION tests: spline-reg splineinverse-knots-ascend-reg splineinverse-knots-descend-reg Signed-off-by: Alex M. Wells <[email protected]>
Signed-off-by: Alex M. Wells <[email protected]>
…nverse as OIIO::invert has a loop which was called from inside the search interval loop, created a nested loop. Under SIMD/SIMT the nested loop could execute incoherently. Instead we choose to use the class sfm::Inverter to manage state between is_bracketed_by(min,max), meant to be called from the search interval loop, and bracketed_invert(min,max) meant to be called after exiting the search interval loop, avoiding a nested loop that could be executed at different interval's for each SIMD/SIMT lane/thread. Signed-off-by: Alex M. Wells <[email protected]>
Signed-off-by: Alex M. Wells <[email protected]>
@@ -0,0 +1,479 @@ | |||
// Copyright Contributors to the Open Shading Language project. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside: I wonder if this file should be in src/liboslexec (or even in src/liboslexec/wide)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I imagine it will eventually get used by the scalar code gen if the code gen generates calls to spline's whose basis is known at compile time: spline_bspline(....), spline_hermite(....). So if moving it, probably src/liboslexec. It is an implementation detail at this point. So I can move it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to use it in the scalar code, too. But I'm not sure we want to include it in the exported public headers. Anything exposed there requires a lot more caution about any future deprecation or ABI changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, I'll move the file as part of the PR/BatchedTexture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or feel free to move it, sooner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no rush, whenever you're touching it next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks fine. Will merge.
Description
Implement spline and spline inverse for batched SIMD implementation.
Added
to model a matrix whose elements are known at compile time. Internally it utilizes proxys instead of floats in order to optimize for special cases of 0, 1, -1 elements and not even generate math when possible. The StaticMatrix44 is used by wide_opspline.cpp to create specialized functions for each of the spline basis types. A dispatch function routes the compile time optimized spline or spline inverse routine.
Altered splineimpl.h to provide "int basis_type_of(StringParam basis_name)" to allow the batched implementation to map a string to spline enum value used as an index in the dispatch function to functions optimized for each type of spline basis (instead of generic 4x4 matrix implementation)
Added "T sfm::select_val(bool cond, const T left, const T right)" as alternative to the ternary operator which appeared to possibly take address of its arguments which prevented privatization of those variables when compiled inside a SIMD loop.
Added note to BatchedAnalysis implementation identifying that local variables scope is lost between .osl and .oso and when a function is inlined multiple times, its local variables could be treated as varying when they could have been uniform if somehow identified as a local variable by the Analyzer.
Performance observations: initial testing shows for splineinverse shows up to 1.5x speedup dispatching to the compile time optimized StaticMatrix44 versions. Vectorization of 16 data lanes (512bit) can speedup to 2x+ on top of that.
Tests
Enable BATCHED testing of:
Added new BATCHED_REGRESSION tests:
Checklist: