Incorporating SVML #22
Could you explain what you mean? |
Numba for Python has SVML. |
@chriselrod Yesterday, you were talking about vectorization in the compiler. Is that related to this issue? |
In the link you cited it says "[...] Numba automatically configures the LLVM back end to use the SVML intrinsic functions where ever possible". |
So from what I understand, SVML support should probably be implemented at a lower level. |
Maybe @KristofferC and @eschnett can help. |
This comment is sort of a meandering mess, so I at least labeled sections.

-fveclib= and Numba

Clang/LLVM has the optional flag -fveclib=; that is how Numba gets LLVM to emit SVML calls.

Vectorization in GCC

When I use vectorization, I normally mean operating on packed vectors in registers. This is different from what it commonly means in languages like R or Stan. This is also what all the fast implementations do. As an example, here is the hot loop GCC produces for a C function applying cos elementwise, targeting AVX-512 with fast math:

.L4:
movq %rbx, %r12
salq $6, %r12
vmovupd (%r14,%r12), %zmm0
incq %rbx
call _ZGVeN8v_cos@PLT
vmovupd %zmm0, 0(%r13,%r12)
cmpq %rbx, %r15
jne .L4

This is doing packed operations like @Crown421 described, except it is operating on 8 doubles at a time (to fill 512-bit registers), not just 2. In this way, vectorization is well integrated into recent versions of GCC.

LoopVectorization

LoopVectorization.jl automates this unrolling:

using LoopVectorization
function cos_sleef!(a, b)
@vvectorize for i in eachindex(a, b)
a[i] = cos(b[i])
end
end
b = randn(2000,2000); a = similar(b);

Using @code_llvm, we can inspect what this compiles to:

julia> @code_llvm debuginfo=:none cos_sleef!(a, b)
define nonnull %jl_value_t addrspace(10)* @"japi1_cos_sleef!_17907"(%jl_value_t addrspace(10)*, %jl_value_t addrspace(10)**, i32) #0 {
top:
%3 = alloca %jl_value_t addrspace(10)**, align 8
store volatile %jl_value_t addrspace(10)** %1, %jl_value_t addrspace(10)*** %3, align 8
%4 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %1, align 8
%5 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %1, i64 1
%6 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %5, align 8
%7 = addrspacecast %jl_value_t addrspace(10)* %4 to %jl_value_t addrspace(11)*
%8 = bitcast %jl_value_t addrspace(11)* %7 to %jl_array_t addrspace(11)*
%9 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %8, i64 0, i32 1
%10 = load i64, i64 addrspace(11)* %9, align 8
%11 = addrspacecast %jl_value_t addrspace(10)* %6 to %jl_value_t addrspace(11)*
%12 = bitcast %jl_value_t addrspace(11)* %11 to %jl_array_t addrspace(11)*
%13 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %12, i64 0, i32 1
%14 = load i64, i64 addrspace(11)* %13, align 8
%15 = icmp slt i64 %14, %10
%16 = select i1 %15, i64 %14, i64 %10
%17 = lshr i64 %16, 3
%18 = and i64 %16, 7
%19 = addrspacecast %jl_value_t addrspace(11)* %7 to %jl_value_t*
%20 = bitcast %jl_value_t* %19 to i64*
%21 = load i64, i64* %20, align 8
%22 = addrspacecast %jl_value_t addrspace(11)* %11 to %jl_value_t*
%23 = bitcast %jl_value_t* %22 to i64*
%24 = load i64, i64* %23, align 8
%25 = icmp eq i64 %17, 0
br i1 %25, label %L292, label %L22.preheader
L22.preheader: ; preds = %top
%26 = inttoptr i64 %24 to i8*
%27 = inttoptr i64 %21 to i8*
br label %L22
L22: ; preds = %L22.preheader, %L22
%value_phi2 = phi i64 [ %34, %L22 ], [ 1, %L22.preheader ]
%value_phi3 = phi i64 [ %32, %L22 ], [ 0, %L22.preheader ]
%28 = shl i64 %value_phi3, 3
%29 = getelementptr i8, i8* %26, i64 %28
%ptr.i = bitcast i8* %29 to <8 x double>*
%res.i = load <8 x double>, <8 x double>* %ptr.i, align 8
%res.i112 = fmul fast <8 x double> %res.i, <double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883>
%res.i111 = fadd fast <8 x double> %res.i112, <double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883>
%res.i110 = call <8 x double> @llvm.trunc.v8f64(<8 x double> %res.i111)
%res.i109 = fmul fast <8 x double> %res.i, <double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883>
%res.i108 = fadd fast <8 x double> %res.i109, <double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01>
%res.i107.neg = fmul fast <8 x double> %res.i110, <double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000>
%res.i106 = fadd fast <8 x double> %res.i108, %res.i107.neg
%res.i105 = call <8 x double> @llvm.rint.v8f64(<8 x double> %res.i106)
%res.i104 = fmul fast <8 x double> %res.i105, <double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00>
%res.i103 = fadd fast <8 x double> %res.i104, <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>
%res.i102 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i110, <8 x double> <double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000>, <8 x double> %res.i)
%res.i101 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i103, <8 x double> <double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000>, <8 x double> %res.i102)
%res.i100 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i110, <8 x double> <double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000>, <8 x double> %res.i101)
%res.i99 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i103, <8 x double> <double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000>, <8 x double> %res.i100)
%res.i98 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i110, <8 x double> <double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000>, <8 x double> %res.i99)
%res.i97 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i103, <8 x double> <double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000>, <8 x double> %res.i98)
%res.i96 = fmul fast <8 x double> %res.i110, <double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000>
%res.i95 = fadd fast <8 x double> %res.i103, %res.i96
%res.i94 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i95, <8 x double> <double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A>, <8 x double> %res.i97)
%res.i93 = fmul fast <8 x double> %res.i94, %res.i94
%30 = fptosi <8 x double> %res.i103 to <8 x i64>
%res.i92 = and <8 x i64> %30, <i64 2, i64 2, i64 2, i64 2, i64 2, i64 2, i64 2, i64 2>
%res.i90 = icmp eq <8 x i64> %res.i92, zeroinitializer
%res.i89 = fsub fast <8 x double> <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %res.i94
%res.i88 = select <8 x i1> %res.i90, <8 x double> %res.i89, <8 x double> %res.i94
%res.i86 = fmul fast <8 x double> %res.i93, %res.i93
%res.i85 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i93, <8 x double> <double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F>, <8 x double> <double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555>)
%res.i84 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i93, <8 x double> <double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50>, <8 x double> <double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7>)
%res.i83 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i93, <8 x double> <double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966>, <8 x double> <double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786>)
%res.i82 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i93, <8 x double> <double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592>, <8 x double> <double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF>)
%res.i81 = fmul fast <8 x double> %res.i86, %res.i86
%res.i80 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i84, <8 x double> %res.i86, <8 x double> %res.i85)
%res.i79 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i82, <8 x double> %res.i86, <8 x double> %res.i83)
%res.i78 = fmul fast <8 x double> %res.i81, %res.i81
%res.i77 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i79, <8 x double> %res.i81, <8 x double> %res.i80)
%res.i76 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i78, <8 x double> <double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE>, <8 x double> %res.i77)
%res.i75 = fmul fast <8 x double> %res.i88, %res.i76
%res.i74 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i93, <8 x double> %res.i75, <8 x double> %res.i88)
%res.i73 = call <8 x double> @llvm.fabs.v8f64(<8 x double> %res.i)
%res.i71 = fcmp une <8 x double> %res.i73, <double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000>
%res.i65 = fcmp ogt <8 x double> %res.i73, <double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14>
%resb.i64113 = and <8 x i1> %res.i71, %res.i65
%res.i62 = select <8 x i1> %resb.i64113, <8 x double> zeroinitializer, <8 x double> %res.i74
%31 = getelementptr i8, i8* %27, i64 %28
%ptr.i60 = bitcast i8* %31 to <8 x double>*
store <8 x double> %res.i62, <8 x double>* %ptr.i60, align 8
%32 = add nuw i64 %value_phi3, 8
%33 = icmp eq i64 %value_phi2, %17
%34 = add nuw nsw i64 %value_phi2, 1
br i1 %33, label %L292.loopexit, label %L22
L292.loopexit: ; preds = %L22
%phitmp = shl i64 %32, 3
br label %L292
L292: ; preds = %L292.loopexit, %top
%value_phi6 = phi i64 [ 0, %top ], [ %phitmp, %L292.loopexit ]
%35 = icmp eq i64 %18, 0
br i1 %35, label %L561, label %L295
L295: ; preds = %L292
%36 = trunc i64 %18 to i8
%notmask = shl nsw i8 -1, %36
%37 = xor i8 %notmask, -1
%38 = inttoptr i64 %24 to i8*
%39 = getelementptr i8, i8* %38, i64 %value_phi6
%ptr.i57 = bitcast i8* %39 to <8 x double>*
%mask.i58 = bitcast i8 %37 to <8 x i1>
%res.i59 = call <8 x double> @llvm.masked.load.v8f64.p0v8f64(<8 x double>* %ptr.i57, i32 8, <8 x i1> %mask.i58, <8 x double> zeroinitializer)
%res.i56 = fmul fast <8 x double> %res.i59, <double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883, double 0x3E645F306DC9C883>
%res.i55 = fadd fast <8 x double> %res.i56, <double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883, double 0xBE545F306DC9C883>
%res.i54 = call <8 x double> @llvm.trunc.v8f64(<8 x double> %res.i55)
%res.i53 = fmul fast <8 x double> %res.i59, <double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883>
%res.i52 = fadd fast <8 x double> %res.i53, <double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01, double -5.000000e-01>
%res.i51.neg = fmul fast <8 x double> %res.i54, <double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000, double 0xC160000000000000>
%res.i50 = fadd fast <8 x double> %res.i52, %res.i51.neg
%res.i49 = call <8 x double> @llvm.rint.v8f64(<8 x double> %res.i50)
%res.i48 = fmul fast <8 x double> %res.i49, <double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00, double 2.000000e+00>
%res.i47 = fadd fast <8 x double> %res.i48, <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>
%res.i46 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i54, <8 x double> <double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000, double 0xC17921FB50000000>, <8 x double> %res.i59)
%res.i45 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i47, <8 x double> <double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000, double 0xBFF921FB50000000>, <8 x double> %res.i46)
%res.i44 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i54, <8 x double> <double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000, double 0xBFD110B460000000>, <8 x double> %res.i45)
%res.i43 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i47, <8 x double> <double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000, double 0xBE5110B460000000>, <8 x double> %res.i44)
%res.i42 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i54, <8 x double> <double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000, double 0xBE11A62630000000>, <8 x double> %res.i43)
%res.i41 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i47, <8 x double> <double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000, double 0xBC91A62630000000>, <8 x double> %res.i42)
%res.i40 = fmul fast <8 x double> %res.i54, <double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000, double 0x4170000000000000>
%res.i39 = fadd fast <8 x double> %res.i47, %res.i40
%res.i38 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i39, <8 x double> <double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A, double 0xBAE8A2E03707344A>, <8 x double> %res.i41)
%res.i37 = fmul fast <8 x double> %res.i38, %res.i38
%40 = fptosi <8 x double> %res.i47 to <8 x i64>
%res.i36 = and <8 x i64> %40, <i64 2, i64 2, i64 2, i64 2, i64 2, i64 2, i64 2, i64 2>
%res.i34 = icmp eq <8 x i64> %res.i36, zeroinitializer
%res.i33 = fsub fast <8 x double> <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %res.i38
%res.i32 = select <8 x i1> %res.i34, <8 x double> %res.i33, <8 x double> %res.i38
%res.i30 = fmul fast <8 x double> %res.i37, %res.i37
%res.i29 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i37, <8 x double> <double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F, double 0x3F8111111111110F>, <8 x double> <double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555, double 0xBFC5555555555555>)
%res.i28 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i37, <8 x double> <double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50, double 0x3EC71DE3A5568A50>, <8 x double> <double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7, double 0xBF2A01A01A019FC7>)
%res.i27 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i37, <8 x double> <double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966, double 0x3DE6124601C23966>, <8 x double> <double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786, double 0xBE5AE64567CB5786>)
%res.i26 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i37, <8 x double> <double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592, double 0x3CE94FA618796592>, <8 x double> <double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF, double 0xBD6AE7EA531357BF>)
%res.i25 = fmul fast <8 x double> %res.i30, %res.i30
%res.i24 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i28, <8 x double> %res.i30, <8 x double> %res.i29)
%res.i23 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i26, <8 x double> %res.i30, <8 x double> %res.i27)
%res.i22 = fmul fast <8 x double> %res.i25, %res.i25
%res.i21 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i23, <8 x double> %res.i25, <8 x double> %res.i24)
%res.i20 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i22, <8 x double> <double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE, double 0xBC62622B22D526BE>, <8 x double> %res.i21)
%res.i19 = fmul fast <8 x double> %res.i32, %res.i20
%res.i18 = call <8 x double> @llvm.fmuladd.v8f64(<8 x double> %res.i37, <8 x double> %res.i19, <8 x double> %res.i32)
%res.i17 = call <8 x double> @llvm.fabs.v8f64(<8 x double> %res.i59)
%res.i15 = fcmp une <8 x double> %res.i17, <double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000, double 0x7FF0000000000000>
%res.i10 = fcmp ogt <8 x double> %res.i17, <double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14, double 1.000000e+14>
%resb.i114 = and <8 x i1> %res.i15, %res.i10
%res.i9 = select <8 x i1> %resb.i114, <8 x double> zeroinitializer, <8 x double> %res.i18
%41 = inttoptr i64 %21 to i8*
%42 = getelementptr i8, i8* %41, i64 %value_phi6
%ptr.i8 = bitcast i8* %42 to <8 x double>*
call void @llvm.masked.store.v8f64.p0v8f64(<8 x double> %res.i9, <8 x double>* %ptr.i8, i32 8, <8 x i1> %mask.i58)
br label %L561
L561: ; preds = %L295, %L292
ret %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140112848148096 to %jl_value_t*) to %jl_value_t addrspace(10)*)
}

Notice how many operations there are on <8 x double> vectors.

The advantage of the compiler handling it

The advantage of going to the LLVM level is that the optimal implementations for vectorized and unvectorized versions are different. Unvectorized versions will tend to have a lot of branches, to handle special cases well, and to pick extremely accurate and fast approximations based on which part of the domain the arguments are in. But with vectorized functions, the entire vector has to follow each branch. Meaning if you have 8 numbers that want to go three different ways, you have to take them all three ways (masking off the respective lanes that aren't supposed to go that way). In practice, that means slower approximations valid over larger domains are faster for the vectorized versions. Still, the vectorized versions do not handle special cases as well, which is why you need fast-math-style flags to opt into them.

For that reason, you want the same software that is deciding whether or not to vectorize to be aware of this, so it can choose the optimal implementation.

Returning to the macro, we can use MacroTools to see what it expands to:

julia> using MacroTools
julia> prettify(@macroexpand @vvectorize (for i in eachindex(a, b)
a[i] = cos(b[i])
end))
quote
horse = $(Expr(:gc_preserve_begin, :a, :b))
mallard = begin
coati = min(length(a), length(b))
(crocodile, flamingo) = (coati >>> 3, coati & 7)
cockroach = LoopVectorization.vectorizable(a)
ant = LoopVectorization.vectorizable(b)
alligator = 0
for dunlin = 1:crocodile
wallaby = LoopVectorization.vload(NTuple{8,VecElement{Float64}}, ant, alligator)
LoopVectorization.vstore!(cockroach, LoopVectorization.SLEEFPirates.cos_fast(wallaby), alligator)
alligator += 8
end
if flamingo > 0
dolphin = LoopVectorization.VectorizationBase.mask(Val{8}(), flamingo)
wallaby = LoopVectorization.vload(NTuple{8,VecElement{Float64}}, ant, alligator, dolphin)
LoopVectorization.vstore!(cockroach, LoopVectorization.SLEEFPirates.cos_fast(wallaby), alligator, dolphin)
alligator += 8
end
nothing
end
$(Expr(:gc_preserve_end, :horse))
mallard
nothing
end

The expansion processes 8 elements per iteration, replacing cos with SLEEFPirates.cos_fast and masking the remainder iteration (prettify swapped the gensyms for readable animal names).
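Here, mask(Val{8}(), flamingo) builds a lane mask covering the flamingo leftover elements. As a rough sketch of the value such a mask encodes (tailmask is illustrative, not the actual VectorizationBase implementation):

# For `rem` leftover elements (0 < rem < 8), set the low `rem` of 8 bits;
# the masked vload/vstore! then only touch those lanes.
tailmask(rem) = UInt8((1 << rem) - 1)

tailmask(3)  # 0x07 == 0b00000111: only the first three lanes are active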
Returning to SVML

For the sake of it, let's run some benchmarks of clang using SVML (-fveclib=SVML). Compiling with the gcc and clang invocations shown as comments below, and benchmarking them, we get:

julia> const CLIBPATH = "/home/chriselrod/Documents/progwork/C"
"/home/chriselrod/Documents/progwork/C"
julia> # gcc -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -fPIC -shared vectorized_special.c -o libgccspecial.so -lm
# clang -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -fveclib=SVML -fPIC -shared vectorized_special.c -o libclangspecial.so -lsvml
function wrap(compiler, func)
lib = "lib" * string(compiler) * "special.so"
funcbase = Symbol(func, :_, compiler)
funcinplace = Symbol(funcbase, :!)
quote
function $funcinplace(a::AbstractVector{Float64}, b::AbstractVector{Float64})
ccall(
($(QuoteNode(Symbol(:v, func))), $(joinpath(CLIBPATH, lib))), Cvoid,
(Ref{Float64}, Ref{Float64}, Int),
a, b, length(a)
)
a
end
$funcbase(a::AbstractVector{Float64}) = $funcinplace(similar(a), a)
end
end
wrap (generic function with 1 method)
julia> for func ∈ (:exp, :log, :sin, :cos)
for compiler ∈ (:gcc, :clang)
eval(wrap(compiler, func))
end
end
julia> b = rand(2000); a = similar(b);
julia> all(exp_gcc(b) .≈ exp.(b))
true
julia> all(exp_clang(b) .≈ exp.(b))
true
julia> all(log_gcc(b) .≈ log.(b))
true
julia> all(log_clang(b) .≈ log.(b))
true
julia> all(sin_gcc(b) .≈ sin.(b))
true
julia> all(sin_clang(b) .≈ sin.(b))
true
julia> all(cos_gcc(b) .≈ cos.(b))
true
julia> all(cos_clang(b) .≈ cos.(b))
true
julia> using BenchmarkTools
julia> @btime @. $a = exp($b);
12.579 μs (0 allocations: 0 bytes)
julia> @btime exp_gcc!($a, $b);
1.197 μs (0 allocations: 0 bytes)
julia> @btime exp_clang!($a, $b);
1.306 μs (0 allocations: 0 bytes)
julia> @btime @. $a = log($b);
9.982 μs (0 allocations: 0 bytes)
julia> @btime log_gcc!($a, $b);
1.700 μs (0 allocations: 0 bytes)
julia> @btime log_clang!($a, $b);
1.952 μs (0 allocations: 0 bytes)
julia> @btime @. $a = sin($b);
10.779 μs (0 allocations: 0 bytes)
julia> @btime sin_gcc!($a, $b);
1.279 μs (0 allocations: 0 bytes)
julia> @btime sin_clang!($a, $b);
1.758 μs (0 allocations: 0 bytes)
julia> @btime @. $a = cos($b);
11.615 μs (0 allocations: 0 bytes)
julia> @btime cos_gcc!($a, $b);
1.499 μs (0 allocations: 0 bytes)
julia> @btime cos_clang!($a, $b);
1.822 μs (0 allocations: 0 bytes) |
@chriselrod Thank you for this post! It makes clear that SVML is not simply a library! We can have C or Julia (?) code and compile it using SVML or GCC. The implementation in Julia is possible through LoopVectorization: using that, we can vectorize code, and we can also build a vectorized library from Julia source. Also, GCC seems to be a bit faster than Clang+SVML! However, I think both should be implemented, and the user should be left to decide. Regarding the license, I checked https://www.gnu.org/licenses/gcc-exception-3.1.en.html and https://softwareengineering.stackexchange.com/questions/327622/what-are-the-licensing-requirements-for-a-program-i-compile-using-gcc. The exception states that software generated by the GCC compiler is not covered by the GPL, even if it includes GCC's libraries and headers. This means it is OK to use GCC libraries in any situation (including proprietary software). |
I found these two issues in Julia that are related: |
@chriselrod This is very interesting, thank you. Once I have a bit more time, I will read and understand better. I think in general this is way beyond the scope of this package, based on the other issues you have cited. This is a "simple" package to load some specialized functions, whereas integrating SVML requires serious messing with the compiler. |
We can transfer the issue to a more suitable repository. |
I think Intel has open-sourced SVML and contributed the code to the latest versions of GCC. |
OMG 🚀 If they open-source SVML, it will probably end up in Clang (LLVM) too, which Julia uses as its backend. So we could get SVML for free. I hope that rumor is true. |
I found where I saw it:
I think the best thing would be the ability to use SVML when broadcasting. |
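A sketch of what that could look like: intercept broadcasting for selected functions and route it to a vectorized kernel. Everything here is hypothetical (vexp! stands in for a real SVML-backed routine), and redefining broadcasted for Base types is piracy a real package would structure more carefully:

import Base.Broadcast: broadcasted

# Hypothetical vectorized kernel; a real one would call an SVML/libmvec routine.
# map! avoids re-entering the broadcast machinery we override below.
vexp!(dest, x) = (map!(exp, dest, x); dest)

# Route exp.(x) for Float64 vectors through the kernel instead of the
# generic broadcast implementation.
broadcasted(::typeof(exp), x::Vector{Float64}) = vexp!(similar(x), x)

exp.(rand(8))  # now dispatches to vexp!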
BTW, SLEEFPirates now tries to find GLIBC's libmvec. If it can find it, it'll use it for exp, log, sin, cos, and ^. (I'll get around to rerunning the exp LoopVectorization benchmark, but I expect it to come closer to gfortran and the Intel compilers now.) Currently, it looks here:

[ "/usr/lib64/", "/usr/lib", "/lib/x86_64-linux-gnu" ]

On Clear Linux it was located in "/usr/lib64/", while on Ubuntu it's in "/lib/x86_64-linux-gnu".

> nm -D /lib/x86_64-linux-gnu/libmvec.so.1 | grep log
U __logf_finite
U __log_finite
0000000000001bb0 i _ZGVbN2v_log
0000000000001c80 i _ZGVbN4v_logf
0000000000001bd0 T _ZGVcN4v_log
0000000000001ca0 T _ZGVcN8v_logf
0000000000001c10 i _ZGVdN4v_log
0000000000001ce0 i _ZGVdN8v_logf
0000000000001d10 i _ZGVeN16v_logf
0000000000001c40 i _ZGVeN8v_log

I don't know about other distribution families like Arch, Fedora, Void, etc. If you're on one of them, please find it, and file an issue if it's located elsewhere, so I can add its location to the path and make sure the build script finds it. Given that lots of computers already have extremely well-optimized implementations just sitting there, it made sense to try to use them. Unfortunately, libmvec seems incomplete; some functions are missing. |
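Given that listing, here is a sketch of calling one of those symbols directly from Julia via ccall (assumes an x86-64 Linux system with AVX whose libmvec exports _ZGVdN4v_log, as above; the path handling is simplified):

using Libdl

# Open glibc's vector math library; adjust the name/path per distribution.
const libmvec = Libdl.dlopen("libmvec.so.1")

# NTuple{4,VecElement{Float64}} is passed and returned as an LLVM <4 x double>,
# matching the ymm-register convention the _ZGVdN4v_ mangling implies.
const V4F64 = NTuple{4,VecElement{Float64}}

vlog4(x::V4F64) = ccall(Libdl.dlsym(libmvec, :_ZGVdN4v_log), V4F64, (V4F64,), x)

x = map(VecElement, (1.0, 2.0, 4.0, 8.0))
vlog4(x)  # four logarithms from a single call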
That is nice! Does Windows have that too? From what I see in this link https://github.com/bminor/glibc/blob/5cb226d7e4e710939cff7288bf9970cb52ec0dfa/sysdeps/x86_64/fpu/multiarch/svml_d_sincos8_core_avx512.S they have not actually open-sourced the compiler; it is just the set of functions that @chriselrod mentioned. |
Not that I know of. And yes, it seems they only released the source of a few of their functions. |
By the way, it would be nice if one of your vectorization packages provided a macro that replaces functions with their vectorized versions. We can provide such macros from IntelVectorMath.jl too, but maybe having all of them in one place would be better. @Crown421 what do you think? |
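A minimal sketch of such a macro, assuming hypothetical vectorized replacements named vexp, vlog, etc. (none of these names are an existing IntelVectorMath.jl API):

using MacroTools: postwalk

# Hypothetical table mapping scalar functions to vectorized counterparts.
const REPLACEMENTS = Dict(:exp => :vexp, :log => :vlog, :sin => :vsin, :cos => :vcos)

# Rewrite every matching symbol in the expression. A real implementation
# would only rewrite call heads, so variables named `exp` stay untouched.
macro vectorized(ex)
    esc(postwalk(s -> s isa Symbol ? get(REPLACEMENTS, s, s) : s, ex))
end

# Usage: @vectorized a .= exp.(b) .+ sin.(b)   # becomes vexp.(b) .+ vsin.(b)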
The major improvement these provide is that they're vectorized. If the inputs are scalars, there is little to gain; if they are a vector, calling the vectorized version is where the speedup comes from. |
@aminya, the point is different. SVML is a set of functions whose inputs are either scalars or SSE / AVX elements. Basically any compiler could use those functions if given the code / lib file. Intel VML is basically SVML functions wrapped in a loop for handling arrays (taking care of the edge case where the array length isn't an integer multiple of the vector width). @chriselrod, if Intel injected their code into GCC, can't those specific functions be compiled on Windows, or are they too deeply integrated into GLIBC? |
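That wrapping pattern is easy to sketch in Julia (vml_style_map! is illustrative, not the actual VML implementation): process the array in vector-width chunks, then finish the tail with scalar calls.

# `fvec!` stands in for an SVML-style kernel handling W elements at once;
# `fscalar` cleans up the tail when the length isn't a multiple of W.
function vml_style_map!(fvec!, fscalar, a, b, W)
    n = min(length(a), length(b))
    i = 1
    while i + W - 1 <= n              # bulk: full W-element chunks
        fvec!(view(a, i:i+W-1), view(b, i:i+W-1))
        i += W
    end
    while i <= n                      # remainder: one element at a time
        a[i] = fscalar(b[i])
        i += 1
    end
    return a
end

# Usage sketch: vml_style_map!((o, x) -> map!(cos, o, x), cos, a, b, 8)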
@RoyiAvital Ubuntu WSL on my Windows laptop has libmvec. The hardest part would probably be getting the table of constants right. If you do that, you could even call these functions directly in Julia via llvmcall + the symbols above. Of course, always respect the licenses. |
@chriselrod, what you suggested is exactly what I meant: if the license allows, extract the needed functions into an independent library. The questions are:
|
The license is just the GPL.
Meaning you can do more or less whatever you want with it, as long as you keep it GPLed (v2.1 or later). Just put it in a GPLed Julia library (instead of an MIT one). All the SVML code is written in assembly. The hardest part will probably be figuring out constants like the ones in the lookup tables. |
See the above regarding the license |
That just means that compiling with GCC has no impact on the license of the code we're compiling. The fact that the code we'd be compiling (GLIBC) is GPLed, however, does matter. Especially if we're talking about repackaging the GPLed GLIBC code. |
I guess so. If the library is used directly, it gets GPLed, but if we let the GCC compiler decide, it falls under the exception. 😄 |
If you're planning on just |
I think stevenjg is an expert on licenses as well as special function implementations. Not sure what the etiquette standard is on tagging someone who wasn't a part of the conversation. |
Part of the license (the exception) states that:

IMO, this suggests that the GCC compiler should be part of the process. Otherwise, the result gets GPLed. |
What are you trying to suggest? |
I am just saying these functions should be compiled using GCC to be usable in non-GPLed (e.g. proprietary) situations; otherwise they're under the GPL. |
Oh wait! The library is LGPL, so using |
Integrate Intel SVML
https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-intrinsics-for-short-vector-math-library-svml-operations