Incorporating SVML #22

Open
aminya opened this issue Nov 24, 2019 · 32 comments

@aminya (Member) commented Nov 24, 2019

Integrate Intel SVML
https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-intrinsics-for-short-vector-math-library-svml-operations

@Crown421 (Collaborator) commented

Could you explain what you mean?
From what I can tell, SVML is not a shared library of specialized functions like VML, but rather a collection of intrinsics that operate on "packed" vectors. I understand this as essentially operating on abcd instead of the array [a, b, c, d].
I can't find a native packed array type in Julia; instead, this blog post indicates that intrinsics are part of SIMD and compiler optimization.
This suggests to me that using SVML requires significantly more magic than this package provides, plus integration into the compiler.
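
For reference, here is a minimal sketch of what a "packed" vector looks like on the Julia side: NTuple{4,VecElement{Float64}} maps to LLVM's <4 x double>, the kind of data these intrinsics operate on. The vadd function is a hypothetical illustration, not part of any package discussed here:

# Packed vectors in Julia: NTuple{4,VecElement{Float64}} <-> LLVM <4 x double>
const Vec4 = NTuple{4,VecElement{Float64}}

# Hypothetical packed add via llvmcall; LLVM lowers this to one SIMD instruction.
vadd(a::Vec4, b::Vec4) = Base.llvmcall(
    """%3 = fadd <4 x double> %0, %1
    ret <4 x double> %3""",
    Vec4, Tuple{Vec4,Vec4}, a, b)

x = VecElement.((1.0, 2.0, 3.0, 4.0))
y = VecElement.((10.0, 20.0, 30.0, 40.0))
vadd(x, y)   # (11.0, 22.0, 33.0, 44.0), each element wrapped in VecElement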

@aminya (Member Author) commented Nov 26, 2019

@chriselrod Yesterday, you were talking about vectorization in the compiler. Is that related to this issue?

@Crown421 (Collaborator) commented Nov 26, 2019

In the link you cited it says "[...] Numba automatically configures the LLVM back end to use the SVML intrinsic functions where ever possible".
I understand this as something they do in the LLVM compiler when e.g. loops are unrolled. From the blog post that I posted above, I think Julia already packs e.g. two Float64 into a 128-bit register and essentially operates on both of them at the same time (so a loop over 10 elements effectively takes only 5 iterations).
From what I understand, SVML would replace the instructions on that packed vector with the specialized/proprietary Intel ones.
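
A quick way to see the auto-vectorization Julia already does (plain Base, no extra packages) is to inspect the LLVM IR of a simple loop; with bounds checks removed, the IR contains blocks operating on <2 x double>, <4 x double>, or <8 x double> depending on the CPU, whereas a broadcast over a function like cos just calls the scalar implementation once per element:

function add1!(y, x)
    @inbounds for i in eachindex(y, x)
        y[i] = x[i] + 1.0
    end
    return y
end

# julia> @code_llvm debuginfo=:none add1!(rand(16), rand(16))
# look for fadd on <2 x double> / <4 x double> / <8 x double>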

@aminya (Member Author) commented Nov 26, 2019

So from what I understand, SVML should probably be implemented at a lower level.
https://github.com/eschnett/SIMD.jl or https://github.com/KristofferC/SIMDIntrinsics.jl are good places to start.

eschnett/SIMD.jl#59 (comment)

@aminya (Member Author) commented Nov 26, 2019

Maybe @KristofferC and @eschnett can help.

@chriselrod commented

This comment is sort of a meandering mess, so I at least labeled sections.

-fveclib= and Numba

Clang/LLVM has the optional flag -fveclib=.
This is probably what Numba is doing.
You also have to link the library.

Vectorization in GCC

When I use vectorization, I normally mean to operate on packed vectors in registers. This is different from what it commonly means in languages like R or Stan.

This is also what all fast vectorized math libraries rely on.
GCC has a similar option called -mveclibabi=, but if you don't use it, GCC will use and automatically link its own (incredibly fast) vector library, so I haven't bothered.
I showed some of the assembly from gfortran yesterday:

.L4:
	movq	%rbx, %r12
	salq	$6, %r12
	vmovupd	(%r14,%r12), %zmm0
	incq	%rbx
	call	_ZGVeN8v_cos@PLT
	vmovupd	%zmm0, 0(%r13,%r12)
	cmpq	%rbx, %r15
	jne	.L4

This is doing packed operations like @Crown421 described, except it is operating on 8 at a time (to fill 512 bit registers), not just 2.
This benchmarked at close to 9x faster, which means a single call _ZGVeN8v_cos@PLT is actually faster than a single call to Base.cos, even though Base.cos calculates one answer while _ZGVeN8v_cos calculates eight at a time.

In this way, vectorization is well integrated into recent versions of GCC.

LoopVectorization

LoopVectorization.jl automates this unrolling, so you don't have to do it manually.

using LoopVectorization
function cos_sleef!(a, b)
    @vvectorize for i in eachindex(a, b)
        a[i] = cos(b[i])
    end
end
b = randn(2000,2000); a = similar(b);

Using @code_llvm debuginfo=:none cos_sleef!(a, b):

; julia> @code_llvm debuginfo=:none cos_sleef!(a, b)

define nonnull %jl_value_t addrspace(10)* @"japi1_cos_sleef!_17907"(%jl_value_t addrspace(10)*, %jl_value_t addrspace(10)**, i32) #0 {
top:
  %3 = alloca %jl_value_t addrspace(10)**, align 8
  store volatile %jl_value_t addrspace(10)** %1, %jl_value_t addrspace(10)*** %3, align 8
  %4 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %1, align 8
  %5 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %1, i64 1
  %6 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %5, align 8
  %7 = addrspacecast %jl_value_t addrspace(10)* %4 to %jl_value_t addrspace(11)*
  %8 = bitcast %jl_value_t addrspace(11)* %7 to %jl_array_t addrspace(11)*
  %9 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %8, i64 0, i32 1
  %10 = load i64, i64 addrspace(11)* %9, align 8
  %11 = addrspacecast %jl_value_t addrspace(10)* %6 to %jl_value_t addrspace(11)*
  %12 = bitcast %jl_value_t addrspace(11)* %11 to %jl_array_t addrspace(11)*
  %13 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %12, i64 0, i32 1
  %14 = load i64, i64 addrspace(11)* %13, align 8
  %15 = icmp slt i64 %14, %10
  %16 = select i1 %15, i64 %14, i64 %10
  %17 = lshr i64 %16, 3
  %18 = and i64 %16, 7
  %19 = addrspacecast %jl_value_t addrspace(11)* %7 to %jl_value_t*
  %20 = bitcast %jl_value_t* %19 to i64*
  %21 = load i64, i64* %20, align 8
  %22 = addrspacecast %jl_value_t addrspace(11)* %11 to %jl_value_t*
  %23 = bitcast %jl_value_t* %22 to i64*
  %24 = load i64, i64* %23, align 8
  %25 = icmp eq i64 %17, 0
  br i1 %25, label %L292, label %L22.preheader

L22.preheader:                                    ; preds = %top
  %26 = inttoptr i64 %24 to i8*
  %27 = inttoptr i64 %21 to i8*
  br label %L22

L22:                                              ; preds = %L22.preheader, %L22
  %value_phi2 = phi i64 [ %34, %L22 ], [ 1, %L22.preheader ]
  %value_phi3 = phi i64 [ %32, %L22 ], [ 0, %L22.preheader ]
  %28 = shl i64 %value_phi3, 3
  %29 = getelementptr i8, i8* %26, i64 %28
  %ptr.i = bitcast i8* %29 to <8 x double>*
  %res.i = load <8 x double>, <8 x double>* %ptr.i, align 8
  %res.i112 = fmul fast <8 x double> %res.i, <double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883>
  %res.i111 = fadd fast <8 x double> %res.i112, <double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883>
  %res.i110 = call <8 x double> @llvm.trunc.v8f64(<8 x double> %res.i111)
  %res.i109 = fmul fast <8 x double> %res.i, <double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883>
  %res.i108 = fadd fast <8 x double> %res.i109, <double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01>
  %res.i107.neg = fmul fast <8 x double> %res.i110, <double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000>
  %res.i106 = fadd fast <8 x double> %res.i108, %res.i107.neg
  %res.i105 = call <8 x double> @llvm.rint.v8f64(<8 x double> %res.i106)
  %res.i104 = fmul fast <8 x double> %res.i105, <double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00>
  %res.i103 = fadd fast <8 x double> %res.i104, <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>
  %res.i102 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i110, <8 x double> <double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000>, <8 x double> %res.i)
  %res.i101 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i103, <8 x double> <double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000>, <8 x double> %res.i102)
  %res.i100 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i110, <8 x double> <double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000>, <8 x double> %res.i101)
  %res.i99 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i103, <8 x double> <double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000>, <8 x double> %res.i100)
  %res.i98 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i110, <8 x double> <double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000>, <8 x double> %res.i99)
  %res.i97 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i103, <8 x double> <double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000>, <8 x double> %res.i98)
  %res.i96 = fmul fast <8 x double> %res.i110, <double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000>
  %res.i95 = fadd fast <8 x double> %res.i103, %res.i96
  %res.i94 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i95, <8 x double> <double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A>, <8 x double> %res.i97)
  %res.i93 = fmul fast <8 x double> %res.i94, %res.i94
  %30 = fptosi <8 x double> %res.i103 to <8 x i64>
  %res.i92 = and <8 x i64> %30, <i64 2, i64 2, i64 2, i64 2, i64 2, i64 2, i64 2, i64 2>
  %res.i90 = icmp eq <8 x i64> %res.i92, zeroinitializer
  %res.i89 = fsub fast <8 x double> <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %res.i94
  %res.i88 = select <8 x i1> %res.i90, <8 x double> %res.i89, <8 x double> %res.i94
  %res.i86 = fmul fast <8 x double> %res.i93, %res.i93
  %res.i85 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i93, <8 x double> <double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F>, <8 x double> <double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555>)
  %res.i84 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i93, <8 x double> <double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50>, <8 x double> <double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7>)
  %res.i83 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i93, <8 x double> <double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966>, <8 x double> <double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786>)
  %res.i82 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i93, <8 x double> <double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592>, <8 x double> <double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF>)
  %res.i81 = fmul fast <8 x double> %res.i86, %res.i86
  %res.i80 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i84, <8 x double> %res.i86, <8 x double> %res.i85)
  %res.i79 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i82, <8 x double> %res.i86, <8 x double> %res.i83)
  %res.i78 = fmul fast <8 x double> %res.i81, %res.i81
  %res.i77 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i79, <8 x double> %res.i81, <8 x double> %res.i80)
  %res.i76 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i78, <8 x double> <double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE>, <8 x double> %res.i77)
  %res.i75 = fmul fast <8 x double> %res.i88, %res.i76
  %res.i74 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i93, <8 x double> %res.i75, <8 x double> %res.i88)
  %res.i73 = call <8 x double> @llvm.fabs.v8f64(<8 x double> %res.i)
  %res.i71 = fcmp une <8 x double> %res.i73, <double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000>
  %res.i65 = fcmp ogt <8 x double> %res.i73, <double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14>
  %resb.i64113 = and <8 x i1> %res.i71, %res.i65
  %res.i62 = select <8 x i1> %resb.i64113, <8 x double> zeroinitializer, <8 x double> %res.i74
  %31 = getelementptr i8, i8* %27, i64 %28
  %ptr.i60 = bitcast i8* %31 to <8 x double>*
  store <8 x double> %res.i62, <8 x double>* %ptr.i60, align 8
  %32 = add nuw i64 %value_phi3, 8
  %33 = icmp eq i64 %value_phi2, %17
  %34 = add nuw nsw i64 %value_phi2, 1
  br i1 %33, label %L292.loopexit, label %L22

L292.loopexit:                                    ; preds = %L22
  %phitmp = shl i64 %32, 3
  br label %L292

L292:                                             ; preds = %L292.loopexit, %top
  %value_phi6 = phi i64 [ 0, %top ], [ %phitmp, %L292.loopexit ]
  %35 = icmp eq i64 %18, 0
  br i1 %35, label %L561, label %L295

L295:                                             ; preds = %L292
  %36 = trunc i64 %18 to i8
  %notmask = shl nsw i8 -1, %36
  %37 = xor i8 %notmask, -1
  %38 = inttoptr i64 %24 to i8*
  %39 = getelementptr i8, i8* %38, i64 %value_phi6
  %ptr.i57 = bitcast i8* %39 to <8 x double>*
  %mask.i58 = bitcast i8 %37 to <8 x i1>
  %res.i59 = call <8 x double> @llvm.masked.load.v8f64.p0v8f64(<8 x double>* %ptr.i57, i32 8, <8 x i1> %mask.i58, <8 x double> zeroinitializer)
  %res.i56 = fmul fast <8 x double> %res.i59, <double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883>
  %res.i55 = fadd fast <8 x double> %res.i56, <double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883>
  %res.i54 = call <8 x double> @llvm.trunc.v8f64(<8 x double> %res.i55)
  %res.i53 = fmul fast <8 x double> %res.i59, <double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883>
  %res.i52 = fadd fast <8 x double> %res.i53, <double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01>
  %res.i51.neg = fmul fast <8 x double> %res.i54, <double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000>
  %res.i50 = fadd fast <8 x double> %res.i52, %res.i51.neg
  %res.i49 = call <8 x double> @llvm.rint.v8f64(<8 x double> %res.i50)
  %res.i48 = fmul fast <8 x double> %res.i49, <double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00>
  %res.i47 = fadd fast <8 x double> %res.i48, <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>
  %res.i46 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i54, <8 x double> <double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000>, <8 x double> %res.i59)
  %res.i45 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i47, <8 x double> <double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000>, <8 x double> %res.i46)
  %res.i44 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i54, <8 x double> <double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000>, <8 x double> %res.i45)
  %res.i43 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i47, <8 x double> <double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000>, <8 x double> %res.i44)
  %res.i42 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i54, <8 x double> <double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000>, <8 x double> %res.i43)
  %res.i41 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i47, <8 x double> <double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000>, <8 x double> %res.i42)
  %res.i40 = fmul fast <8 x double> %res.i54, <double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000>
  %res.i39 = fadd fast <8 x double> %res.i47, %res.i40
  %res.i38 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i39, <8 x double> <double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A>, <8 x double> %res.i41)
  %res.i37 = fmul fast <8 x double> %res.i38, %res.i38
  %40 = fptosi <8 x double> %res.i47 to <8 x i64>
  %res.i36 = and <8 x i64> %40, <i64 2, i64 2, i64 2, i64 2, i64 2, i64 2, i64 2, i64 2>
  %res.i34 = icmp eq <8 x i64> %res.i36, zeroinitializer
  %res.i33 = fsub fast <8 x double> <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %res.i38
  %res.i32 = select <8 x i1> %res.i34, <8 x double> %res.i33, <8 x double> %res.i38
  %res.i30 = fmul fast <8 x double> %res.i37, %res.i37
  %res.i29 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i37, <8 x double> <double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F>, <8 x double> <double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555>)
  %res.i28 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i37, <8 x double> <double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50>, <8 x double> <double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7>)
  %res.i27 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i37, <8 x double> <double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966>, <8 x double> <double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786>)
  %res.i26 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i37, <8 x double> <double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592>, <8 x double> <double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF>)
  %res.i25 = fmul fast <8 x double> %res.i30, %res.i30
  %res.i24 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i28, <8 x double> %res.i30, <8 x double> %res.i29)
  %res.i23 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i26, <8 x double> %res.i30, <8 x double> %res.i27)
  %res.i22 = fmul fast <8 x double> %res.i25, %res.i25
  %res.i21 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i23, <8 x double> %res.i25, <8 x double> %res.i24)
  %res.i20 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i22, <8 x double> <double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE>, <8 x double> %res.i21)
  %res.i19 = fmul fast <8 x double> %res.i32, %res.i20
  %res.i18 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i37, <8 x double> %res.i19, <8 x double> %res.i32)
  %res.i17 = call <8 x double> @llvm.fabs.v8f64(<8 x double> %res.i59)
  %res.i15 = fcmp une <8 x double> %res.i17, <double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000>
  %res.i10 = fcmp ogt <8 x double> %res.i17, <double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14>
  %resb.i114 = and <8 x i1> %res.i15, %res.i10
  %res.i9 = select <8 x i1> %resb.i114, <8 x double> zeroinitializer, <8 x double> %res.i18
  %41 = inttoptr i64 %21 to i8*
  %42 = getelementptr i8, i8* %41, i64 %value_phi6
  %ptr.i8 = bitcast i8* %42 to <8 x double>*
  call void @llvm.masked.store.v8f64.p0v8f64(<8 x double> %res.i9, <8 x double>* %ptr.i8, i32 8, <8 x i1> %mask.i58)
  br label %L561

L561:                                             ; preds = %L295, %L292
  ret %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140112848148096 to %jl_value_t*) to %jl_value_t addrspace(10)*)
}

Notice how many operations there are on <8 x double>? These are again the packed, unrolled operations.
Like the zmm registers, they're architecture-dependent. A computer with AVX2 but not AVX-512 would have seen ymm in the GCC-compiled code, and <4 x double> in the @code_llvm output above.

The advantage of the compiler handling it

The advantage of going to the LLVM level is that the optimal implementations for vectorized and unvectorized versions are different. Unvectorized versions tend to have a lot of branches, to handle special cases well, and to pick extremely accurate and fast approximations based on which part of the domain the arguments are in.

But with vectorized functions, the entire vector has to follow each branch. Meaning if you have 8 numbers that want to go three different ways, you have to take them all three ways (masking off the respective lanes that aren't supposed to go that way). In practice that means slower approximations valid over larger domains are faster for the vectorized versions. Still, the vectorized versions do not handle special cases as well, which is why you need -ffast-math or similar flags (assume no NaN or Inf) to use them.

For that reason, you want the same software that is deciding whether or not to vectorize to be aware of this, so it can choose the optimal implementation.
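
To make the "follow every branch, then mask" idea concrete, here is a small illustration in plain Julia broadcasting (this is just the general pattern, not any particular library's implementation; the "small-argument" polynomial is only for show):

# Every lane evaluates both branches; the mask then selects per lane.
function piecewise(x::AbstractVector{Float64})
    small  = abs.(x) .< 1.0          # per-lane mask
    taylor = x .- x .^ 3 ./ 6        # cheap approximation for the "small" lanes
    full   = sin.(x)                 # more expensive path for the remaining lanes
    return ifelse.(small, taylor, full)
end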

Returning to the @vvectorize macro, it unrolls the loop and swaps functions:

julia> using MacroTools

julia> prettify(@macroexpand @vvectorize (for i in eachindex(a, b)
               a[i] = cos(b[i])
           end))
quote
    horse = $(Expr(:gc_preserve_begin, :a, :b))
    mallard = begin
            coati = min(length(a), length(b))
            (crocodile, flamingo) = (coati >>> 3, coati & 7)
            cockroach = LoopVectorization.vectorizable(a)
            ant = LoopVectorization.vectorizable(b)
            alligator = 0
            for dunlin = 1:crocodile
                wallaby = LoopVectorization.vload(NTuple{8,VecElement{Float64}}, ant, alligator)
                LoopVectorization.vstore!(cockroach, LoopVectorization.SLEEFPirates.cos_fast(wallaby), alligator)
                alligator += 8
            end
            if flamingo > 0
                dolphin = LoopVectorization.VectorizationBase.mask(Val{8}(), flamingo)
                wallaby = LoopVectorization.vload(NTuple{8,VecElement{Float64}}, ant, alligator, dolphin)
                LoopVectorization.vstore!(cockroach, LoopVectorization.SLEEFPirates.cos_fast(wallaby), alligator, dolphin)
                alligator += 8
            end
            nothing
        end
    $(Expr(:gc_preserve_end, :horse))
    mallard
    nothing
end

It replaces cos with LoopVectorization.SLEEFPirates.cos_fast, and loads "packed" vectors to feed it.
So while it's not an optimization triggered by the compiler but by the user deciding to apply the macro, this solution at least performs the vectorization and the implementation swap at the same time.
@xvectorize from xsimdwrap does the same thing, but just passes LoopVectorization a different substitution dictionary.
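
The substitution-dictionary idea itself is simple. A toy version (not the actual @xvectorize or @vvectorize implementation; FastLib is a placeholder module name) could look like this:

# Walk an expression tree and swap call targets according to a dictionary.
function swapcalls(ex, subs::Dict{Symbol,Any})
    ex isa Expr || return ex
    args = Any[swapcalls(a, subs) for a in ex.args]
    if ex.head === :call && args[1] isa Symbol && haskey(subs, args[1])
        args[1] = subs[args[1]]
    end
    return Expr(ex.head, args...)
end

macro swapmath(ex)
    subs = Dict{Symbol,Any}(:cos => :(FastLib.cos), :sin => :(FastLib.sin))
    return esc(swapcalls(ex, subs))
end

# @swapmath a[i] = cos(b[i])    # expands to a[i] = FastLib.cos(b[i])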

Returning to SVML

For the sake of it, let's run some benchmarks of Clang using -fveclib=SVML vs GCC.
godbolt-asm and C code for vectorized exp, log, sin, and cos functions. Because we're using C, GCC needs -lm to link the math functions (unlike Fortran, which links them automatically), while Clang needs to link the external SVML library.
Without -lm, GCC produces the same code, but it will crash when it tries to use either the vectorized or scalar versions. Clang needs -fveclib to produce the vectorized version at all.

Compiling with:

gcc -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -fPIC -shared vectorized_special.c -o libgccspecial.so -lm
clang -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -fveclib=SVML -fPIC -shared vectorized_special.c -o libclangspecial.so -lsvml

Benchmarking them, we get:

julia> const CLIBPATH = "/home/chriselrod/Documents/progwork/C"
"/home/chriselrod/Documents/progwork/C"

julia> # gcc -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -fPIC -shared vectorized_special.c -o libgccspecial.so -lm
       # clang -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -fveclib=SVML -fPIC -shared vectorized_special.c -o libclangspecial.so -lsvml
       
       function wrap(compiler, func)
           lib = "lib" * string(compiler) * "special.so"
           funcbase = Symbol(func, :_, compiler)
           funcinplace = Symbol(funcbase, :!)
           quote
               function $funcinplace(a::AbstractVector{Float64}, b::AbstractVector{Float64})
                   ccall(
                       ($(QuoteNode(Symbol(:v, func))), $(joinpath(CLIBPATH, lib))), Cvoid,
                       (Ref{Float64}, Ref{Float64}, Int),
                       a, b, length(a)
                   )
                   a
               end
               $funcbase(a::AbstractVector{Float64}) = $funcinplace(similar(a), a)
           end
       end
wrap (generic function with 1 method)

julia> for func ∈ (:exp, :log, :sin, :cos)
           for compiler ∈ (:gcc, :clang)
               eval(wrap(compiler, func))
           end
       end

julia> b = rand(2000); a = similar(b);

julia> all(exp_gcc(b) .≈ exp.(b))
true

julia> all(exp_clang(b) .≈ exp.(b))
true

julia> all(log_gcc(b) .≈ log.(b))
true

julia> all(log_clang(b) .≈ log.(b))
true

julia> all(sin_gcc(b) .≈ sin.(b))
true

julia> all(sin_clang(b) .≈ sin.(b))
true

julia> all(cos_gcc(b) .≈ cos.(b))
true

julia> all(cos_clang(b) .≈ cos.(b))
true

julia> using BenchmarkTools

julia> @btime @. $a = exp($b);
  12.579 μs (0 allocations: 0 bytes)

julia> @btime exp_gcc!($a, $b);
  1.197 μs (0 allocations: 0 bytes)

julia> @btime exp_clang!($a, $b);
  1.306 μs (0 allocations: 0 bytes)

julia> @btime @. $a = log($b);
  9.982 μs (0 allocations: 0 bytes)

julia> @btime log_gcc!($a, $b);
  1.700 μs (0 allocations: 0 bytes)

julia> @btime log_clang!($a, $b);
  1.952 μs (0 allocations: 0 bytes)

julia> @btime @. $a = sin($b);
  10.779 μs (0 allocations: 0 bytes)

julia> @btime sin_gcc!($a, $b);
  1.279 μs (0 allocations: 0 bytes)

julia> @btime sin_clang!($a, $b);
  1.758 μs (0 allocations: 0 bytes)

julia> @btime @. $a = cos($b);
  11.615 μs (0 allocations: 0 bytes)

julia> @btime cos_gcc!($a, $b);
  1.499 μs (0 allocations: 0 bytes)

julia> @btime cos_clang!($a, $b);
  1.822 μs (0 allocations: 0 bytes)

@aminya (Member Author) commented Nov 27, 2019

@chriselrod Thank you for this post!

This clarifies that SVML is not a standalone library! We can take C or Julia (?) code and compile it using SVML or GCC.

Implementation in Julia is possible through LoopVectorization. Using that, we can vectorize any code, and we can also build a vectorized library from Julia source code.
Is it possible to use LoopVectorization together with GCC or SVML?

Also, GCC seems to be a bit faster than Clang+SVML! However, I think both should be supported, letting the user decide.

Regarding the license, I checked https://www.gnu.org/licenses/gcc-exception-3.1.en.html and https://softwareengineering.stackexchange.com/questions/327622/what-are-the-licensing-requirements-for-a-program-i-compile-using-gcc. The exception states that software generated by the GCC compiler is not covered by the GPL, even if it includes GCC's libraries and headers. This means it is OK to use the GCC libraries in any situation (including proprietary ones).

@aminya (Member Author) commented Nov 27, 2019

I found these issues in Julia that are related:
JuliaLang/julia#8869
JuliaLang/julia#21454
JuliaLang/julia#15265

aminya added the Future label Nov 27, 2019

@Crown421 (Collaborator) commented

@chriselrod This is very interesting, thank you. Once I have a bit more time, I will read it and try to understand it better.

I think in general this is way beyond the scope of this package, based on the other issues you have cited. This is a "simple" package to load some specialized functions, whereas integrating SVML requires serious messing with the compiler.
This would also significantly increase the maintenance effort.

@aminya (Member Author) commented Nov 27, 2019

We can transfer the issue to a more suitable repository.

@RoyiAvital commented

I think Intel has open-sourced SVML and injected the code into the latest versions of GCC.
I don't remember where I saw it, but one could see GCC generating SVML calls natively, without -mveclibabi=svml.

@aminya (Member Author) commented Feb 4, 2020

I think Intel has open-sourced SVML and injected the code into the latest versions of GCC.
I don't remember where I saw it, but one could see GCC generating SVML calls natively, without -mveclibabi=svml.

OMG 🚀 If they open-source SVML, it will probably end up in Clang (LLVM) too, which Julia's backend uses. So we could get SVML for free. I hope that rumor is true.

@RoyiAvital commented

I found where I saw it:

Sorry, conflated the two libraries as one
It is vectorized (acting on 512 bit zmm registers).

Here is the GPLed source of glibc. That specific link is for avx512 sincos, named svml_d_sincos8_core_avx512.S.

The source file names for vectorized math functions are of the form “svml_{datatype}{func}{vector width}{arch}.S”, and I recall reading on Phoronix that Intel contributed a lot of SIMD math code.
Meaning that SVML, or parts of it, have been open sourced and contributed into glibc itself. Hence, why you don’t need to do any special linking.

I think the best thing is to have the ability to use SVML when broadcasting.
By the way, Intel VML is basically a set of wrappers around broadcasted (array) SVML functions.
Using the SVML functions directly would reduce overhead and increase performance, specifically for smaller arrays.

@chriselrod commented Feb 4, 2020

BTW, SLEEFPirates now tries to find libmvec.so.1 on Linux systems. This is the glibc shared library that contains some SVML functions.

If it can find it, it'll use it for exp, log, sin, cos, and ^. (I'll get around to rerunning the exp LoopVectorization benchmark, but I expect it to come closer to gfortran and the Intel compilers now).

Currently, it looks here:

[ "/usr/lib64/", "/usr/lib", "/lib/x86_64-linux-gnu" ]

On Clear Linux, it was located in "/usr/lib64/", while in Ubuntu it's in "/lib/x86_64-linux-gnu".
For example, on Ubuntu (running on domino/aws), I can:

> nm -D /lib/x86_64-linux-gnu/libmvec.so.1 | grep log
                 U __logf_finite
                 U __log_finite
0000000000001bb0 i _ZGVbN2v_log
0000000000001c80 i _ZGVbN4v_logf
0000000000001bd0 T _ZGVcN4v_log
0000000000001ca0 T _ZGVcN8v_logf
0000000000001c10 i _ZGVdN4v_log
0000000000001ce0 i _ZGVdN8v_logf
0000000000001d10 i _ZGVeN16v_logf
0000000000001c40 i _ZGVeN8v_log

I don't know about other distribution families like Arch, Fedora, Void, etc. If you're on one of them, please find it and file an issue if it's located elsewhere so I can add its location to the path and make sure the build script finds it.

Given that lots of computers already have extremely well optimized implementations just sitting there, it made sense to try and use them.

Unfortunately, libmvec seems incomplete. No tan, for example.
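
For anyone who wants to experiment, here is a hedged sketch of calling one of the libmvec symbols listed above directly via ccall. The symbol name comes from the nm output; the library path, the available vector widths, and whether the glibc vector-function ABI lines up with a plain ccall on your system are assumptions you would need to verify:

# _ZGVdN4v_log: the AVX2 variant of log, 4 packed Float64 in, 4 out.
const Vec4 = NTuple{4,VecElement{Float64}}
const LIBMVEC = "/lib/x86_64-linux-gnu/libmvec.so.1"   # adjust per distribution

vlog4(x::Vec4) = ccall((:_ZGVdN4v_log, LIBMVEC), Vec4, (Vec4,), x)

# vlog4(VecElement.((1.0, 2.0, 4.0, 8.0)))   # ≈ (0.0, log(2), log(4), log(8))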

@aminya (Member Author) commented Feb 4, 2020

BTW, SLEEFPirates now tries to find libmvec.so.1 on Linux systems. [...] Unfortunately, libmvec seems incomplete. No tan, for example.

That is nice! Does Windows have that too?

From what I see in this link https://github.com/bminor/glibc/blob/5cb226d7e4e710939cff7288bf9970cb52ec0dfa/sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.S, they have not actually open-sourced the compiler; it is just the set of functions that @chriselrod mentioned, and IntelVectorMath has a bigger set of those functions.

@chriselrod commented Feb 4, 2020

That is nice! Does Windows have that too?

Not that I know of.
Macs have "AppleAccelerate" which may have some functions. I'd accept PRs wrapping it (as well as any libraries found on Windows or *BSD systems).
Note that the name mangling will probably be different.

And yes, it seems they only released the source of a few of their functions.
I'll have to try wrapping sincos again later. Calling it would crash Julia the first couple times I tried.

@aminya (Member Author) commented Feb 4, 2020

Not that I know of. Macs have "AppleAccelerate" which may have some functions. [...]

By the way, it would be nice if one of your vectorization packages provided a macro that replaces functions with their vectorized versions.

For example, @ivm @. sin(x) would replace this with the IntelVectorMath function, and @applacc @. sin(x) would call AppleAccelerate.

We can provide such macros from IntelVectorMath.jl too, but maybe having all of them in one place would be better.

@Crown421 what do you think?

@chriselrod commented Feb 4, 2020

The major improvement these libraries provide is that they're vectorized. If x is a scalar, there isn't much benefit, if there is any at all.
Older versions of LoopVectorization provided an @vectorize macro (since removed) which naively swapped calls and changed the loop increment (i.e., instead of 1:N, it would iterate 1:W:N, plus some code to handle the remainder). @avx does this better.

If x is a vector, calling @avx sin.(x) or IntelVectorMath.sin(x) works (although a macro could search a whole block of code and swap the calls to use IntelVectorMath).
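
For reference, the "1:W:N plus a remainder" pattern described above looks roughly like this when written out by hand (a sketch, not the macro's actual output; in real code W would be picked from the hardware vector width and a packed kernel would replace the scalar f in the main loop):

function strided_map!(f, out, x, W::Int)
    N = length(x)
    r = N % W
    @inbounds for i in 1:W:(N - r)       # main loop over full packs of W
        out[i:i+W-1] .= f.(view(x, i:i+W-1))
    end
    @inbounds for i in (N - r + 1):N     # scalar loop for the remainder
        out[i] = f(x[i])
    end
    return out
end

# strided_map!(cos, similar(b), b, 8)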

@RoyiAvital commented Feb 5, 2020

From what I see in this link https://github.com/bminor/glibc/blob/5cb226d7e4e710939cff7288bf9970cb52ec0dfa/sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.S, they have not actually open-sourced the compiler; it is just the set of functions that @chriselrod mentioned, and IntelVectorMath has a bigger set of those functions.

@aminya, the point is different. SVML is a set of functions whose inputs are either scalars or SSE/AVX vector elements. Basically any compiler could use those functions if given the code / lib file.

Intel VML basically wraps the SVML functions in a loop to handle arrays (taking care of the edge case where the array length isn't an integer multiple of the vector width).

@chriselrod, if Intel injected their code into GCC, can't those specific functions be compiled on Windows, or are they too deeply integrated into glibc?

@chriselrod commented

@RoyiAvital
Technically the code is in glibc. You could also look at the code for the specific functions you do want to compile.

Ubuntu WSL on my Windows laptop has libmvec. VS Code also interfaces well with WSL, but that doesn't seem to be supported by the Julia extension at the moment.
Emacs on WSL is also unusably buggy (this may be better on more recent releases; the place I work is >1 year behind on Windows builds). Maybe Vim works well, but I haven't used it, though I have definitely considered giving it a try.

The hardest part would probably be getting the table of constants right.

If you do that, you could even call these functions directly from Julia via llvmcall and LLVM's call asm (inline assembly). Then anyone who can run Julia and has the appropriate hardware can use these functions.

Of course, always respect the licenses.
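
As an illustration of the llvmcall + call asm route (a hedged sketch only: it assumes an AVX-capable CPU and wraps a single vsqrtpd instruction rather than any actual SVML kernel):

const V4 = NTuple{4,VecElement{Float64}}

# Inline assembly from Julia: the "x" constraint places the <4 x double>
# in a vector register when AVX is available.
vsqrt4(x::V4) = Base.llvmcall(
    """%2 = call <4 x double> asm "vsqrtpd \$1, \$0", "=x,x"(<4 x double> %0)
    ret <4 x double> %2""",
    V4, Tuple{V4}, x)

# vsqrt4(VecElement.((1.0, 4.0, 9.0, 16.0)))   # ≈ (1.0, 2.0, 3.0, 4.0)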

@RoyiAvital commented

@chriselrod, what you suggested is exactly what I meant: if the license allows, extract the needed functions into an independent C file, then compile it and use it from Julia.

The questions are:

  1. Does the license allow it?
  2. How easy is it, or were the functions written with deep integration into glibc?

@chriselrod commented Feb 5, 2020

The license is just the GPL.

The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.

Meaning you can do more or less whatever you want with it, as long as you keep it GPLed (v2.1 or later). Just put it in a GPLed Julia library (instead of an MIT one).
Other MIT libraries can still depend on it and remain MIT, so IMO it's not much to worry about.
When it could be problematic is if you're thinking about writing your own versions.
If you do that, you may have to make that library GPL if it qualifies as a derived work, which it would if you based your implementation on theirs (and that would be hard to resist doing, given the performance).

All the SVML code is written in assembly, so it'd be .S files.
Because of that, you may as well use llvmcall directly on the assembly code (using Julia's assembler), instead of distributing it in compiled binaries.

The hardest part will probably be figuring out constants like _LogRcp_lookup. If you think you can do that, go for it.

@aminya (Member Author) commented Feb 5, 2020

Regarding the license, I checked https://www.gnu.org/licenses/gcc-exception-3.1.en.html and https://softwareengineering.stackexchange.com/questions/327622/what-are-the-licensing-requirements-for-a-program-i-compile-using-gcc. The exception states that software generated by the GCC compiler is not covered by the GPL, even if it includes GCC's libraries and headers. This means it is OK to use the GCC libraries in any situation (including proprietary ones).

See the above regarding the license

@chriselrod commented

That just means that compiling with GCC has no impact on the license of the code we're compiling.

The fact that the code we'd be compiling (glibc) is GPLed, however, does matter. Especially if we're talking about repackaging the GPLed glibc code.

@aminya (Member Author) commented Feb 5, 2020

I guess so. If the library is used directly, the result gets GPLed, but if we let the GCC compiler decide, it falls under the exception. 😄

@chriselrod commented

If you're planning on just ccalling glibc, I think the Julia wrapper can be MIT.

@chriselrod commented

I think stevenjg is an expert on licenses as well as special function implementations. Not sure what the etiquette standard is on tagging someone who wasn't a part of the conversation.

@aminya (Member Author) commented Feb 5, 2020

I think stevenjg is an expert on licenses as well as special function implementations. Not sure what the etiquette standard is on tagging someone who wasn't a part of the conversation.

Part of the license (the exception) states that

A Compilation Process is "Eligible" if it is done using GCC, alone or with other GPL-compatible software, or if it is done without using any work based on GCC. For example, using non-GPL-compatible Software to optimize any GCC intermediate representations would not qualify as an Eligible Compilation Process.

IMO, this suggests that the GCC compiler should be part of the process; otherwise the result gets GPLed.

@chriselrod commented

What are you trying to suggest?

@aminya (Member Author) commented Feb 5, 2020

I am just saying these functions should be compiled using GCC to be usable in non-GPLed (e.g. proprietary) situations; otherwise they're under the GPL.
I will try to avoid GPLing any code I write, so I prefer to use IntelVectorMath instead.

@aminya (Member Author) commented Feb 5, 2020

Oh wait! The library is LGPL, so using ccall is fine!
