Introduce cpp (and also adds lasr3 and steqr3) #991
Conversation
```cpp
 * @param C
 */
template <typename T, typename TC, typename TS, typename idx_t>
inline void rot_c_wrapper(const Vector<T, idx_t>& x,
```
Have you considered overloading? The advantage of overloading at this point would be that the user can add his own overloads:

```cpp
template <typename idx_t>
inline void rot_c_wrapper(const Vector<float, idx_t>& x,
                          const Vector<float, idx_t>& y, float c, float s) {
    assert(x.size() == y.size());
    return lapack_c_srot(x.size(), x.ptr(), x.stride(), y.ptr(), y.stride(), c, s);
}

template <typename idx_t>
inline void rot_c_wrapper(const Vector<double, idx_t>& x,
                          const Vector<double, idx_t>& y, double c, double s) {
    assert(x.size() == y.size());
    return lapack_c_drot(x.size(), x.ptr(), x.stride(), y.ptr(), y.stride(), c, s);
}

...
```
`rot_c_wrapper` is not meant to be user-callable. It is just a utility so we can call it when instantiating the function we actually want to overload: `rot`. It is a pattern I designed for gemm, where the logic is more complicated. I guess having this function at all does not make that much sense here.
```cpp
        lapack_c_zrot(x.size(), x.ptr(), x.stride(), y.ptr(), y.stride(), c, s);
    }
    else {
        assert(false);
```
The code below fails at compile time (rather than at run time like `assert(false)`):

```cpp
static_assert(!std::is_same<T, T>::value, "unknown data type T");
```

References:
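A caveat worth noting (my reading, not from the thread): with a plain run-time `if`/`else` inside the template, the dependent-false assertion would fire for every instantiation, so the trick needs `if constexpr` (C++17) or tag dispatch. A minimal sketch with illustrative names, the actual calls commented out:

```cpp
#include <complex>
#include <type_traits>

template <typename T>
void rot_dispatch(/* x, y, c, s */) {
    if constexpr (std::is_same<T, float>::value) {
        // lapack_c_srot(...);
    } else if constexpr (std::is_same<T, std::complex<double>>::value) {
        // lapack_c_zrot(...);
    } else {
        // The condition depends on T, so the assertion is only evaluated
        // when this branch is instantiated for an unsupported type --
        // turning the old run-time assert(false) into a compile error.
        static_assert(!std::is_same<T, T>::value, "unknown data type T");
    }
}
```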
C++11 introduced uniform initialization. The PR writes, e.g., `const idx_t n_timings = 100;`. In C++11 style I would write: `const auto n_timings = idx_t{100};`. The C++11 style avoids all coercions, and non-representable values cause compiler errors. With the C++98 style, non-representable values are silently truncated/rounded/etc; GCC 12 with [...] What are your thoughts about this?
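A minimal illustration of the difference (assuming a 32-bit `idx_t`; this snippet is not from the PR):

```cpp
#include <cstdint>

using idx_t = std::int32_t;

int main() {
    // C++98 style: the out-of-range initializer is silently converted;
    // most compilers only warn.
    const idx_t a = 5000000000;
    // C++11 style: narrowing inside braces is ill-formed, so this line
    // is a hard compiler error.
    const auto b = idx_t{5000000000};
    (void)a;
    (void)b;
}
```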
I can't quite wrap my head around the file names, specifically the repeated [...]

I drafted a CMake build in christoph-conrads/introduce-cpp that you may be interested in.
The file names are mostly there to make them unique, so that someone exporting all the .o files to a single obj folder does not run into issues. `lapack_c` defines all the C interfaces; it just happens to be a lot easier to define those interfaces in Fortran than in C. But perhaps the name can be improved.
I was not familiar with that style. My opinion on it is mixed. I can try to start using it, but I worry that enforcing language-specific things at that level may waste a lot of time. I also feel that way about move vs. reference. On the other hand, if we start from a good example, it may not be a large cost.
I've found a pretty significant issue with this PR. I rely on template instantiation in the .cpp files, but that may fail if the original function was inlined. I am not entirely sure how to get around that.
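For context, the mechanism in question is presumably the usual explicit-instantiation pattern, sketched below with illustrative names (not the PR's exact declarations). If the template is additionally declared `inline`, the compiler may instantiate and inline the body at call sites instead of using the library's out-of-line copy, which interferes with this scheme:

```cpp
// rot.hpp (illustrative): the template definition lives in the header...
template <typename T>
void rot(T* x, T* y, int n, T c, T s) {
    for (int i = 0; i < n; ++i) {
        const T xi = x[i];
        x[i] = c * xi + s * y[i];
        y[i] = c * y[i] - s * xi;
    }
}

// ...but implicit instantiation is suppressed in client code, so callers
// link against the copies compiled into the library:
extern template void rot<float>(float*, float*, int, float, float);
extern template void rot<double>(double*, double*, int, double, double);
```

```cpp
// rot.cpp (illustrative): explicit instantiation definitions emit the
// object code for the supported types exactly once.
#include "rot.hpp"

template void rot<float>(float*, float*, int, float, float);
template void rot<double>(double*, double*, int, double, double);
```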
Related to issue #988.
This PR introduces some C++ code to LAPACK.
A lot of things are introduced in this PR; the general design of the C++ code and the individual pieces are described section by section below.
Directory structure
This PR adds some new folders:
Because we implement the cpp code using templates, the routines are all defined in headers, which reside in the `include` folder. The `src` folder mostly contains wrappers.

There are three main routines taking part in this PR: `lasr3`, `rot`, and `lartg`.

- `lasr3` is written in C++; the main code is in `include/lapack_cpp/lapack/lasr3.hpp`. The wrappers to call this function from C are defined in `include/lapack_c/lasr3.h` and `src/lapack_c/lasr3.cpp`. The Fortran wrappers are defined in `src/lapack_fortran/<sdcz>lasr3.f90`.
- `rot` has a full C++ implementation, defined in `include/lapack_cpp/blas/rot.hpp`, but this is just the default implementation. Through C wrappers defined in `include/lapack_c/rot.h`, `src/lapack_c/rot.cpp`, and `src/lapack_cpp/rot.cpp`, it actually uses the Fortran implementation for supported types if `USE_FORTRAN_BLAS` is set.
- `lartg` does not have a full C++ implementation. The header defined in `include/lapack_cpp/lapack/lartg.hpp` is empty. The actual implementation is provided through the interfaces with Fortran, defined in `src/lapack_cpp/lartg.cpp`, `src/lapack_c/rot.f90`, and `include/lapack_c/rot.h`.
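Assembled from just the paths listed above, the layout looks roughly like this:

```
include/
  lapack_c/        C headers (lasr3.h, rot.h, ...)
  lapack_cpp/
    blas/          C++ templates for BLAS routines (rot.hpp, ...)
    lapack/        C++ templates for LAPACK routines (lasr3.hpp, lartg.hpp, ...)
src/
  lapack_c/        C wrapper sources (lasr3.cpp, rot.cpp, rot.f90, ...)
  lapack_cpp/      C++ sources and glue (rot.cpp, lartg.cpp, ...)
  lapack_fortran/  Fortran wrappers (<sdcz>lasr3.f90, ...)
```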
This directory structure is not necessarily optimal. Perhaps it is better to put the headers and source files in the same directory for easy navigation (it is trivial for build tools to copy them to an include directory).
Building
Currently, the example can be built using GNU Make, but ideally there would be a proper build that also includes the newly added lasr3 algorithm in the LAPACK binary, both using GNU Make and CMake.
Workspace
An inevitable discussion for any LAPACK variant is workspace. I chose the following design:
larf
call internally allocated a vector.lwork = -1
workspace query design.lasr3
, this islasr3_workquery
, which returns an int, representing the required workspace size.MemoryBlock
, but this argument is optional. If it is not provided, a workspace query is performed, aMemoryBlock
is internally allocated and provided to the full routine. By default, this workspace is allocated using aligned_alloc, just for a tiny bit of extra performance.Matrix class
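To make the calling convention concrete, here is a self-contained sketch of that pattern; `foo`, `foo_workquery`, and the `std::vector`-based `MemoryBlock` stand-in are illustrative, not the PR's actual declarations:

```cpp
#include <cassert>
#include <vector>

// Stand-in for the PR's MemoryBlock (the real one uses aligned_alloc).
using MemoryBlock = std::vector<double>;

// Workspace query: how many doubles does foo(n) need?
int foo_workquery(int n) { return 2 * n; }

// Full routine: the caller provides the workspace...
void foo(int n, MemoryBlock& work) {
    assert(static_cast<int>(work.size()) >= foo_workquery(n));
    // ...the actual computation would use `work` here...
}

// ...but the argument is optional: this overload performs the query,
// allocates, and forwards to the full routine.
void foo(int n) {
    MemoryBlock work(foo_workquery(n));
    foo(n, work);
}

int main() {
    foo(100);  // convenient: allocates internally

    // efficient: query once, reuse the workspace across calls
    MemoryBlock work(foo_workquery(100));
    for (int i = 0; i < 5; ++i) foo(100, work);
}
```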
Matrix class

A matrix/vector is just a lightweight wrapper around a pointer and a few integers defining its leading dimension and size. Because it knows its size, it can internally check for out-of-bounds access, with much more precision than something like valgrind ever could. The following code would fail inside `srot`; not only could the matrix class have detected such a mistake at its source, it is actually impossible to write the mistake.
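A hypothetical illustration of that kind of mistake (the `Vector` and `rot` stand-ins below are mine, not the PR's exact API):

```cpp
#include <cassert>

// Minimal stand-in for the PR's Vector view: pointer + length (+ stride).
template <typename T, typename idx_t>
struct Vector {
    T* p;
    idx_t n;
    idx_t stride = 1;
    idx_t size() const { return n; }
};

// rot on views: the sizes are checked once, here, at the source.
template <typename T, typename idx_t>
void rot(Vector<T, idx_t> x, Vector<T, idx_t> y, T c, T s) {
    assert(x.size() == y.size());
    for (idx_t i = 0; i < x.size(); ++i) {
        const T xi = x.p[i * x.stride];
        x.p[i * x.stride] = c * xi + s * y.p[i * y.stride];
        y.p[i * y.stride] = c * y.p[i * y.stride] - s * xi;
    }
}

int main() {
    float x[10] = {}, y[10] = {};

    // Raw BLAS style: the caller asserts the length, so n = 20 would read
    // and write past both arrays and only fail (if at all) inside srot:
    //     cblas_srot(20, x, 1, y, 1, 0.5f, 0.5f);

    // With the view, the length travels with the data; passing a wrong
    // n = 20 cannot even be expressed:
    rot(Vector<float, int>{x, 10}, Vector<float, int>{y, 10}, 0.5f, 0.5f);
}
```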
A question that was posed is: why don't we use `mdspan` or `Eigen`? Eigen matrices own their data, making it a bit more difficult to write Fortran wrappers. The expression-template system and fixed-size matrices, while great for many applications, are extra features that would take effort to fully support, and I don't think they are appropriate for LAPACK. mdspan is a closer match to our matrix design, but it again has extra features that I don't want to have to support; an mdspan matrix can have a more general layout than just column/row major. Because mdspan is also a thin wrapper around some data, it is trivial to convert between the two matrix types (so long as they have a compatible stride), so users using mdspan should be able to use the cpp interface without much effort.
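Such a conversion could be a one-liner along these lines (a sketch assuming C++23 `std::mdspan` and a made-up layout for the PR's Matrix; the real class surely differs):

```cpp
#include <mdspan>  // C++23

// Minimal stand-in for the PR's Matrix view: pointer plus a few integers.
template <typename T, typename idx_t>
struct Matrix {
    T* ptr;
    idx_t m, n;  // rows, columns
    idx_t ldim;  // leading dimension
};

// Converting a column-major mdspan is just copying the pointer and sizes.
template <typename T>
Matrix<T, int> from_mdspan(
    std::mdspan<T, std::dextents<int, 2>, std::layout_left> A) {
    return {A.data_handle(), static_cast<int>(A.extent(0)),
            static_cast<int>(A.extent(1)), static_cast<int>(A.stride(1))};
}
```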
About optimized BLAS/LAPACK

Linking the C++ code to optimized BLAS or even LAPACK can be achieved through the C/Fortran wrappers. But vendors could eventually choose to just provide their own binaries for those wrappers, essentially linking our C++ with their C/C++ code, without going through Fortran.
About the algorithm
Why would we care about efficiently applying Givens rotations? Essentially, because of eigenvalues. The implicit QR versions of the SVD and the symmetric QR algorithm do not achieve high flop rates, mainly because of the lack of an efficient method to apply Givens rotations. The routine to calculate a Hessenberg-triangular reduction achieves high efficiency by accumulating blocks of rotations into an orthogonal matrix and then using gemm, but that involves extra flops (not just to do the accumulation, but also because the orthogonal matrix for `k*k` rotations is usually of size `(2*(k+1)) x (2*(k+1))`). Applying that to a `(2*(k+1)) x n` matrix takes just `6*k*k*n` flops using rotations, but approximately `8*k*k*n` flops using gemm.

Naively applying the rotations using a triple for loop and rot achieves about 3 GFlop/s when applying from the left, and 10 GFlop/s when applying from the right. This implementation achieves about 40 GFlop/s in double precision for both sides on my laptop (Intel i7-1255U), which is very close to the serial gemm rate (57 GFlop/s). Also note that the theoretical peak of this routine is only 75% of the peak gemm rate, because the ratio of additions and multiplications is not 1.
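To sanity-check those counts (my arithmetic, not from the PR): one rotation applied to a pair of rows of length n costs 4 multiplications and 2 additions per column, i.e. 6n flops, while multiplying an m x m matrix into an m x n block costs 2m²n flops. With m = 2(k+1):

$$ \underbrace{k^2 \cdot 6n}_{\text{rotations}} = 6k^2 n \qquad \text{vs.} \qquad \underbrace{2\,(2(k+1))^2\, n}_{\text{gemm}} = 8(k+1)^2 n \approx 8k^2 n. $$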
Final note
I expect (or hope) that a lot of people will have comments on whether we should use C++ and on my implementation. To keep the discussions easy to follow, I suggest that conceptual comments on whether or not we should use C++ be directed to issue #998, while comments on the specific way I introduced C++ and on my implementation of the algorithm be posted here.