Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pr/batched spline #1422

Merged
merged 8 commits into from
Oct 19, 2021
Merged

Conversation

AlexMWells
Copy link
Contributor

Description

Implement spline and spline inverse for batched SIMD implementation.

Added

        template <int M00, int M01, int M02, int M03,
                  int M10, int M11, int M12, int M13,
                  int M20, int M21, int M22, int M23,
                  int M30, int M31, int M32, int M33, int DivisorT>
        struct sfm::StaticMatrix44

to model a matrix whose elements are known at compile time. Internally it utilizes proxys instead of floats in order to optimize for special cases of 0, 1, -1 elements and not even generate math when possible. The StaticMatrix44 is used by wide_opspline.cpp to create specialized functions for each of the spline basis types. A dispatch function routes the compile time optimized spline or spline inverse routine.

Altered splineimpl.h to provide "int basis_type_of(StringParam basis_name)" to allow the batched implementation to map a string to spline enum value used as an index in the dispatch function to functions optimized for each type of spline basis (instead of generic 4x4 matrix implementation)

Added "T sfm::select_val(bool cond, const T left, const T right)" as alternative to the ternary operator which appeared to possibly take address of its arguments which prevented privatization of those variables when compiled inside a SIMD loop.

Added note to BatchedAnalysis implementation identifying that local variables scope is lost between .osl and .oso and when a function is inlined multiple times, its local variables could be treated as varying when they could have been uniform if somehow identified as a local variable by the Analyzer.

Performance observations: initial testing shows for splineinverse shows up to 1.5x speedup dispatching to the compile time optimized StaticMatrix44 versions. Vectorization of 16 data lanes (512bit) can speedup to 2x+ on top of that.

Tests

Enable BATCHED testing of:

  • spline-boundarybug
  • spline-derivbug
  • spline-nonuniformknots
  • spline
  • splineinverse-ident
  • splineinverse

Added new BATCHED_REGRESSION tests:

  • spline-reg
  • splineinverse-knots-ascend-reg
  • splineinverse-knots-descend-reg

Checklist:

  • I have read the contribution guidelines.
  • I have previously submitted a Contributor License Agreement.
  • I have updated the documentation, if applicable.
  • I have ensured that the change is tested somewhere in the testsuite (adding new test cases if necessary).
  • My code follows the prevailing code style of this project.

Added batched llvm code gen for: llvm_gen_blackbody, llvm_gen_luminance, llvm_gen_transformc (including code gen of loop to identify unique "from" and "to" space combinations to call library function with).

Made ColorSystem const correct.
Pulled much of ColorSystem's implementation into src/liboslexec/opcolor_impl.h so it could be inlined in both opcolor.cpp and wide_opcolor.cpp.
To enable SIMD fast path for ColorSystem::blackbody_rgb added new methods to expose underlying lookup table optimization.
    bool can_lookup_blackbody(float T /*Kelvin*/) const;

    Color3 lookup_blackbody_rgb (float T /*Kelvin*/) const;

    Color3 compute_blackbody_rgb (float T /*Kelvin*/) const;
When can_lookup_blackbody() returns true
then lookup_blackbody_rgb can be safely called,
otherwise the expensive compute_blackbody_rgb is required.

Updated implementation of ColorSystem to succesfully vectorize on multiple compilers, optimized lookup table data types and indexed dereferences to generate prefered 32bit indexed gathers vs 64bit gathers which require multiple instructions and registers.

To improve code generation and less chances of aliasing issues, changed nested M[i][j] operator access to Matrix in ImathMatrix.h to directly access underlying 2d array M.x[i][j] also changed Vector operator V[i] to directly access V.x, V.y, V.z in OSL/IMathx/IMath*.h

Moved helper testIfAnyLaneIsNonZero from wide_opmatrix.cpp to be testIfAnyLaneIsNonZero in OSL/wide.h

Added sfm::min_val to provide an implementation of min which returns values not references to the original through a ternary (which can cause issues with vectorization).  Added Clang specific sfm::min_val and sfm::max_val implementations to assist in vectorization.

Enabled BATCHED execution of existing testsuites: blackbody, color, transformc, wavelength_color
Added new testsuite BATCHED regression tests:  blackbody-reg, color-reg, luminance-reg, transformc-reg, wavelength_color-reg
Added sfm::min_val to provide an implementation of min which returns values not references to the original through a ternary (which can cause issues with vectorization).  Added Clang specific sfm::min_val and sfm::max_val implementations to assist in vectorization.

Signed-off-by: Alex M. Wells <[email protected]>
Use single precision for bb_spectrum object/operator and manually implement std::pow(wlm,-5.0f) to better enable vectorization.
Reduced masked code generation for rgb_to_hsl.
Removed duplicated code by implementing ColorSystem::blackbody_rgb(T) in terms of ColorSystem::can_lookup_blackbody(T), ColorSystem::lookup_blackbody_rgb(T), and ColorSystem::compute_blackbody_rgb(T).  Removed some casting to uint16_t that was unnecessary.

Removed some test shaders that were not intended to be promoted.

Signed-off-by: Alex M. Wells <[email protected]>
Added batched llvm code gen for: llvm_gen_blackbody, llvm_gen_luminance, llvm_gen_transformc (including code gen of loop to identify unique "from" and "to" space combinations to call library function with).

Made ColorSystem const correct.
Pulled much of ColorSystem's implementation into src/liboslexec/opcolor_impl.h so it could be inlined in both opcolor.cpp and wide_opcolor.cpp.
To enable SIMD fast path for ColorSystem::blackbody_rgb added new methods to expose underlying lookup table optimization.
    bool can_lookup_blackbody(float T /*Kelvin*/) const;

    Color3 lookup_blackbody_rgb (float T /*Kelvin*/) const;

    Color3 compute_blackbody_rgb (float T /*Kelvin*/) const;
When can_lookup_blackbody() returns true
then lookup_blackbody_rgb can be safely called,
otherwise the expensive compute_blackbody_rgb is required.

Updated implementation of ColorSystem to succesfully vectorize on multiple compilers, optimized lookup table data types and indexed dereferences to generate prefered 32bit indexed gathers vs 64bit gathers which require multiple instructions and registers.

To improve code generation and less chances of aliasing issues, changed nested M[i][j] operator access to Matrix in ImathMatrix.h to directly access underlying 2d array M.x[i][j] also changed Vector operator V[i] to directly access V.x, V.y, V.z in OSL/IMathx/IMath*.h

Moved helper testIfAnyLaneIsNonZero from wide_opmatrix.cpp to be testIfAnyLaneIsNonZero in OSL/wide.h

Added sfm::min_val to provide an implementation of min which returns values not references to the original through a ternary (which can cause issues with vectorization).  Added Clang specific sfm::min_val and sfm::max_val implementations to assist in vectorization.

Enabled BATCHED execution of existing testsuites: blackbody, color, transformc, wavelength_color
Added new testsuite BATCHED regression tests:  blackbody-reg, color-reg, luminance-reg, transformc-reg, wavelength_color-reg
Added sfm::min_val to provide an implementation of min which returns values not references to the original through a ternary (which can cause issues with vectorization).  Added Clang specific sfm::min_val and sfm::max_val implementations to assist in vectorization.

Signed-off-by: Alex M. Wells <[email protected]>
Use single precision for bb_spectrum object/operator and manually implement std::pow(wlm,-5.0f) to better enable vectorization.
Reduced masked code generation for rgb_to_hsl.
Removed duplicated code by implementing ColorSystem::blackbody_rgb(T) in terms of ColorSystem::can_lookup_blackbody(T), ColorSystem::lookup_blackbody_rgb(T), and ColorSystem::compute_blackbody_rgb(T).  Removed some casting to uint16_t that was unnecessary.

Removed some test shaders that were not intended to be promoted.

Signed-off-by: Alex M. Wells <[email protected]>
Added
    template <int M00, int M01, int M02, int M03,
              int M10, int M11, int M12, int M13,
              int M20, int M21, int M22, int M23,
              int M30, int M31, int M32, int M33, int DivisorT>
    struct sfm::StaticMatrix44
to model a matrix whose elements are known at compile time.  Internally it utilizes proxys instead of floats in order to optimize for special cases of 0, 1, -1 elements and not even generate math when possible.  The StaticMatrix44 is used by wide_opspline.cpp to create specialized functions for each of the spline basis types.  A dispatch function routes the compile time optimized spline or spline inverse routine.

Altered splineimpl.h to provide "int basis_type_of(StringParam basis_name)" to allow the batched implementation to map a string to spline enum value used as an index in the dispatch function to functions optimized for each type of spline basis (instead of generic 4x4 matrix implementation)

Added "T sfm::select_val(bool cond, const T left, const T right)" as alternative to the ternary operator which appeared to possibly take address of its arguments which prevented privatization of those variables when compiled inside a SIMD loop.

Added note to BatchedAnalysis implementation identifying that local variables scope is lost between .osl and .oso and when a function is inlined multiple times, its local variables could be treated as varying when they could have been uniform if somehow identified as a local variable by the Analyzer.

Enable BATCHED testing of:
  spline-boundarybug
  spline-derivbug
  spline-nonuniformknots
  spline
  splineinverse-ident
  splineinverse

Added new BATCHED_REGRESSION tests:
  spline-reg
  splineinverse-knots-ascend-reg
  splineinverse-knots-descend-reg

Signed-off-by: Alex M. Wells <[email protected]>
…nverse as OIIO::invert has a loop which was called from inside the search interval loop, created a nested loop. Under SIMD/SIMT the nested loop could execute incoherently. Instead we choose to use the class sfm::Inverter to manage state between is_bracketed_by(min,max), meant to be called from the search interval loop, and bracketed_invert(min,max) meant to be called after exiting the search interval loop, avoiding a nested loop that could be executed at different interval's for each SIMD/SIMT lane/thread.

Signed-off-by: Alex M. Wells <[email protected]>
@@ -0,0 +1,479 @@
// Copyright Contributors to the Open Shading Language project.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: I wonder if this file should be in src/liboslexec (or even in src/liboslexec/wide)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine it will eventually get used by the scalar code gen if the code gen generates calls to spline's whose basis is known at compile time: spline_bspline(....), spline_hermite(....). So if moving it, probably src/liboslexec. It is an implementation detail at this point. So I can move it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to use it in the scalar code, too. But I'm not sure we want to include it in the exported public headers. Anything exposed there requires a lot more caution about any future deprecation or ABI changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, I'll move the file as part of the PR/BatchedTexture.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or feel free to move it, sooner

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no rush, whenever you're touching it next.

Copy link
Collaborator

@lgritz lgritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks fine. Will merge.

@lgritz lgritz merged commit 03a31f5 into AcademySoftwareFoundation:master Oct 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants