Introduce cpp (and also adds lasr3 and steqr3) #991
Conversation
```cpp
 * @param C
 */
template <typename T, typename TC, typename TS, typename idx_t>
inline void rot_c_wrapper(const Vector<T, idx_t>& x,
```
Have you considered overloading? The advantage of overloading at this point would be that the user can add his own overloads:

```cpp
template <typename idx_t>
inline void rot_c_wrapper(const Vector<float, idx_t>& x,
                          const Vector<float, idx_t>& y, float c, float s) {
    assert(x.size() == y.size());
    return lapack_c_srot(x.size(), x.ptr(), x.stride(), y.ptr(), y.stride(), c, s);
}

template <typename idx_t>
inline void rot_c_wrapper(const Vector<double, idx_t>& x,
                          const Vector<double, idx_t>& y, double c, double s) {
    assert(x.size() == y.size());
    return lapack_c_drot(x.size(), x.ptr(), x.stride(), y.ptr(), y.stride(), c, s);
}

...
```
`rot_c_wrapper` is not meant to be user-callable. It is just a utility so we can call it when instantiating the function we actually want to overload: `rot`. It is a pattern I designed for gemm, where the logic is more complicated. I guess having this function at all does not make that much sense here.
```cpp
        lapack_c_zrot(x.size(), x.ptr(), x.stride(), y.ptr(), y.stride(), c, s);
    }
    else {
        assert(false);
```
The code below fails at compile time (rather than at run time like `assert(false)`):

```cpp
static_assert(!std::is_same<T, T>::value, "unknown data type T");
```

References:
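A caveat worth noting (my reading, not from the thread): with a plain run-time `if`/`else` inside the template, the dependent-false assertion would fire for every instantiation, so the trick needs `if constexpr` (C++17) or tag dispatch. A minimal sketch with illustrative names, the actual calls commented out:

```cpp
#include <complex>
#include <type_traits>

template <typename T>
void rot_dispatch(/* x, y, c, s */) {
    if constexpr (std::is_same<T, float>::value) {
        // lapack_c_srot(...);
    } else if constexpr (std::is_same<T, std::complex<double>>::value) {
        // lapack_c_zrot(...);
    } else {
        // The condition depends on T, so the assertion is only evaluated
        // when this branch is instantiated for an unsupported type --
        // turning the old run-time assert(false) into a compile error.
        static_assert(!std::is_same<T, T>::value, "unknown data type T");
    }
}
```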
C++11 introduced uniform initialization. The PR writes, e.g., `const idx_t n_timings = 100;`. In C++11 style I would write: `const auto n_timings = idx_t{100};`. The C++11 style avoids all coercions, and non-representable values cause compiler errors. With the C++98 style, non-representable values are silently truncated/rounded/etc; GCC 12 with [...] What are your thoughts about this?
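A minimal illustration of the difference (assuming a 32-bit `idx_t`; this snippet is not from the PR):

```cpp
#include <cstdint>

using idx_t = std::int32_t;

int main() {
    // C++98 style: the out-of-range initializer is silently converted;
    // most compilers only warn.
    const idx_t a = 5000000000;
    // C++11 style: narrowing inside braces is ill-formed, so this line
    // is a hard compiler error.
    const auto b = idx_t{5000000000};
    (void)a;
    (void)b;
}
```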
I can't quite wrap my head around the file names, specifically the repeated [...]

I drafted a CMake build in christoph-conrads/introduce-cpp that you may be interested in.
The file names are mostly there to make them unique, so that someone exporting all the .o files to a single obj folder does not run into issues. `lapack_c` defines all the C interfaces; it just happens to be a lot easier to define those interfaces in Fortran than in C. But perhaps the name can be improved.
I was not familiar with that style. My opinion on it is mixed. I can try to start using it, but I worry that enforcing language-specific things at that level may waste a lot of time. I also feel that way about move vs. reference. On the other hand, if we start from a good example, it may not be a large cost.
I've found a pretty significant issue with this PR. I rely on template instantiation in the .cpp files, but that may fail if the original function was inlined. I am not entirely sure how to get around that.
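For context, the mechanism in question is presumably the usual explicit-instantiation pattern, sketched below with illustrative names (not the PR's exact declarations). If the template is additionally declared `inline`, the compiler may instantiate and inline the body at call sites instead of using the library's out-of-line copy, which interferes with this scheme:

```cpp
// rot.hpp (illustrative): the template definition lives in the header...
template <typename T>
void rot(T* x, T* y, int n, T c, T s) {
    for (int i = 0; i < n; ++i) {
        const T xi = x[i];
        x[i] = c * xi + s * y[i];
        y[i] = c * y[i] - s * xi;
    }
}

// ...but implicit instantiation is suppressed in client code, so callers
// link against the copies compiled into the library:
extern template void rot<float>(float*, float*, int, float, float);
extern template void rot<double>(double*, double*, int, double, double);
```

```cpp
// rot.cpp (illustrative): explicit instantiation definitions emit the
// object code for the supported types exactly once.
#include "rot.hpp"

template void rot<float>(float*, float*, int, float, float);
template void rot<double>(double*, double*, int, double, double);
```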
Related to issue #988.
This PR introduces some C++ code to LAPACK.
A lot of things are introduced in this PR; the general design of the C++ code and the individual pieces are described section by section below.
Directory structure
This PR adds some new folders:
Because we implement the cpp code using templates, the routines are all defined in headers, which reside in the `include` folder. The `src` folder mostly contains wrappers.

There are three main routines taking part in this PR: `lasr3`, `rot`, and `lartg`.

- `lasr3` is written in C++; the main code is in `include/lapack_cpp/lapack/lasr3.hpp`. The wrappers to call this function from C are defined in `include/lapack_c/lasr3.h` and `src/lapack_c/lasr3.cpp`. The Fortran wrappers are defined in `src/lapack_fortran/<sdcz>lasr3.f90`.
- `rot` has a full C++ implementation, defined in `include/lapack_cpp/blas/rot.hpp`, but this is just the default implementation. Through C wrappers defined in `include/lapack_c/rot.h`, `src/lapack_c/rot.cpp`, and `src/lapack_cpp/rot.cpp`, it actually uses the Fortran implementation for supported types if `USE_FORTRAN_BLAS` is set.
- `lartg` does not have a full C++ implementation. The header defined in `include/lapack_cpp/lapack/lartg.hpp` is empty. The actual implementation is provided through the interfaces with Fortran, defined in `src/lapack_cpp/lartg.cpp`, `src/lapack_c/rot.f90`, and `include/lapack_c/rot.h`.
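Assembled from just the paths listed above, the layout looks roughly like this:

```
include/
  lapack_c/        C headers (lasr3.h, rot.h, ...)
  lapack_cpp/
    blas/          C++ templates for BLAS routines (rot.hpp, ...)
    lapack/        C++ templates for LAPACK routines (lasr3.hpp, lartg.hpp, ...)
src/
  lapack_c/        C wrapper sources (lasr3.cpp, rot.cpp, rot.f90, ...)
  lapack_cpp/      C++ sources and glue (rot.cpp, lartg.cpp, ...)
  lapack_fortran/  Fortran wrappers (<sdcz>lasr3.f90, ...)
```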
This directory structure is not necessarily optimal. Perhaps it is better to put the headers and source files in the same directory for easy navigation (it is trivial for build tools to copy them to an include directory).
Building
Currently, the example can be built using GNU Make, but ideally there would be a proper build that also includes the newly added lasr3 algorithm in the LAPACK binary, both using GNU Make and CMake.
Workspace
An inevitable discussion for any LAPACK variant is workspace. I chose the following design:
larf
call internally allocated a vector.lwork = -1
workspace query design.lasr3
, this islasr3_workquery
, which returns an int, representing the required workspace size.MemoryBlock
, but this argument is optional. If it is not provided, a workspace query is performed, aMemoryBlock
is internally allocated and provided to the full routine. By default, this workspace is allocated using aligned_alloc, just for a tiny bit of extra performance.Matrix class
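To make the calling convention concrete, here is a self-contained sketch of that pattern; `foo`, `foo_workquery`, and the `std::vector`-based `MemoryBlock` stand-in are illustrative, not the PR's actual declarations:

```cpp
#include <cassert>
#include <vector>

// Stand-in for the PR's MemoryBlock (the real one uses aligned_alloc).
using MemoryBlock = std::vector<double>;

// Workspace query: how many doubles does foo(n) need?
int foo_workquery(int n) { return 2 * n; }

// Full routine: the caller provides the workspace...
void foo(int n, MemoryBlock& work) {
    assert(static_cast<int>(work.size()) >= foo_workquery(n));
    // ...the actual computation would use `work` here...
}

// ...but the argument is optional: this overload performs the query,
// allocates, and forwards to the full routine.
void foo(int n) {
    MemoryBlock work(foo_workquery(n));
    foo(n, work);
}

int main() {
    foo(100);  // convenient: allocates internally

    // efficient: query once, reuse the workspace across calls
    MemoryBlock work(foo_workquery(100));
    for (int i = 0; i < 5; ++i) foo(100, work);
}
```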
Matrix class

A matrix/vector is just a lightweight wrapper around a pointer and a few integers defining its leading dimension and size. Because it knows its size, it can internally check for out-of-bounds access, with much more precision than something like valgrind ever could. The following code would fail inside `srot`; not only could the matrix class have detected such a mistake at its source, it is actually impossible to write the mistake.
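A hypothetical illustration of that kind of mistake (the `Vector` and `rot` stand-ins below are mine, not the PR's exact API):

```cpp
#include <cassert>

// Minimal stand-in for the PR's Vector view: pointer + length (+ stride).
template <typename T, typename idx_t>
struct Vector {
    T* p;
    idx_t n;
    idx_t stride = 1;
    idx_t size() const { return n; }
};

// rot on views: the sizes are checked once, here, at the source.
template <typename T, typename idx_t>
void rot(Vector<T, idx_t> x, Vector<T, idx_t> y, T c, T s) {
    assert(x.size() == y.size());
    for (idx_t i = 0; i < x.size(); ++i) {
        const T xi = x.p[i * x.stride];
        x.p[i * x.stride] = c * xi + s * y.p[i * y.stride];
        y.p[i * y.stride] = c * y.p[i * y.stride] - s * xi;
    }
}

int main() {
    float x[10] = {}, y[10] = {};

    // Raw BLAS style: the caller asserts the length, so n = 20 would read
    // and write past both arrays and only fail (if at all) inside srot:
    //     cblas_srot(20, x, 1, y, 1, 0.5f, 0.5f);

    // With the view, the length travels with the data; passing a wrong
    // n = 20 cannot even be expressed:
    rot(Vector<float, int>{x, 10}, Vector<float, int>{y, 10}, 0.5f, 0.5f);
}
```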
A question that was posed is: why don't we use `mdspan` or `Eigen`? Eigen matrices own their data, making it a bit more difficult to write Fortran wrappers. The expression-template system and fixed-size matrices, while great for many applications, are extra features that would take effort to fully support, and I don't think they are appropriate for LAPACK. mdspan is a closer match to our matrix design, but it again has extra features that I don't want to have to support; an mdspan matrix can have a more general layout than just column/row major. Because mdspan is also a thin wrapper around some data, it is trivial to convert between the two matrix types (so long as they have a compatible stride), so users using mdspan should be able to use the cpp interface without much effort.
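Such a conversion could be a one-liner along these lines (a sketch assuming C++23 `std::mdspan` and a made-up layout for the PR's Matrix; the real class surely differs):

```cpp
#include <mdspan>  // C++23

// Minimal stand-in for the PR's Matrix view: pointer plus a few integers.
template <typename T, typename idx_t>
struct Matrix {
    T* ptr;
    idx_t m, n;  // rows, columns
    idx_t ldim;  // leading dimension
};

// Converting a column-major mdspan is just copying the pointer and sizes.
template <typename T>
Matrix<T, int> from_mdspan(
    std::mdspan<T, std::dextents<int, 2>, std::layout_left> A) {
    return {A.data_handle(), static_cast<int>(A.extent(0)),
            static_cast<int>(A.extent(1)), static_cast<int>(A.stride(1))};
}
```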
About optimized BLAS/LAPACK

Linking the C++ code to optimized BLAS or even LAPACK can be achieved through the C/Fortran wrappers. But vendors could eventually choose to just provide their own binaries for those wrappers, essentially linking our C++ with their C/C++ code, without going through Fortran.
About the algorithm
Why would we care about efficiently applying Givens rotations? Essentially, because of eigenvalues. The implicit QR versions of the SVD and the symmetric QR algorithm do not achieve high flop rates, mainly because of the lack of an efficient method to apply Givens rotations. The routine to calculate a Hessenberg-triangular reduction achieves high efficiency by accumulating blocks of rotations into an orthogonal matrix and then using gemm, but that involves extra flops (not just to do the accumulation, but also because the orthogonal matrix for `k*k` rotations is usually of size `(2*(k+1)) x (2*(k+1))`). Applying that to a `(2*(k+1)) x n` matrix takes just `6*k*k*n` flops using rotations, but approximately `8*k*k*n` flops using gemm.

Naively applying the rotations using a triple for loop and rot achieves about 3 GFlop/s when applying from the left, and 10 GFlop/s when applying from the right. This implementation achieves about 40 GFlop/s in double precision for both sides on my laptop (Intel i7-1255U), which is very close to the serial gemm rate (57 GFlop/s). Also note that the theoretical peak of this routine is only 75% of the peak gemm rate, because the ratio of additions and multiplications is not 1.
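To sanity-check those counts (my arithmetic, not from the PR): one rotation applied to a pair of rows of length n costs 4 multiplications and 2 additions per column, i.e. 6n flops, while multiplying an m x m matrix into an m x n block costs 2m²n flops. With m = 2(k+1):

$$ \underbrace{k^2 \cdot 6n}_{\text{rotations}} = 6k^2 n \qquad \text{vs.} \qquad \underbrace{2\,(2(k+1))^2\, n}_{\text{gemm}} = 8(k+1)^2 n \approx 8k^2 n. $$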
Final note
I expect (or hope) that a lot of people will have comments on whether we should use C++ and on my implementation. To keep the discussions easy to follow, I suggest that conceptual comments on whether or not we should use C++ be directed to issue #998, while comments on the specific way I introduced C++ and on my implementation of the algorithm be posted here.