Guidelines for efficient faer dynamic library #108

guiburon · 2024-02-26T18:18:14Z

Hi!

I am really impressed by your colossal work on this math kernel!
I am writing a Julia wrapper to benchmark faer against OpenBLAS and MKL.

So far I have only studied the dense matrix-matrix multiplication. My preliminary results show faer approximately 50% slower than OpenBLAS and 25% slower than MKL on an AMD Ryzen 5 7640U on 8 threads.

This is basically my first Rust project and I want to be fair to faer: is this a reasonable dynamic library exposing faer inplace matrix multiplication using the C ABI?

I am not sure if opening an issue is the right way to ask but the faer documentation is very sparse at the moment on how to import external matrices.

use faer::modules::core::mul::matmul;
use faer::{mat, Parallelism};
use std::usize;

// inplace c = a * b
#[no_mangle]
pub unsafe extern "C" fn mult(
    c_ptr: *mut f64,
    c_nrows: u64,
    c_ncols: u64,
    c_row_stride: u64,
    c_col_stride: u64,
    a_ptr: *const f64,
    a_nrows: u64,
    a_ncols: u64,
    a_row_stride: u64,
    a_col_stride: u64,
    b_ptr: *const f64,
    b_nrows: u64,
    b_ncols: u64,
    b_row_stride: u64,
    b_col_stride: u64,
    nthreads: u32,
) {
    assert!(!c_ptr.is_null());
    assert!(!a_ptr.is_null());
    assert!(!b_ptr.is_null());

    let c = unsafe {
        mat::from_raw_parts_mut::<f64>(
            c_ptr,
            c_nrows as usize,
            c_ncols as usize,
            c_row_stride as isize,
            c_col_stride as isize,
        )
    };

    let a = unsafe {
        mat::from_raw_parts::<f64>(
            a_ptr,
            a_nrows as usize,
            a_ncols as usize,
            a_row_stride as isize,
            a_col_stride as isize,
        )
    };

    let b = unsafe {
        mat::from_raw_parts::<f64>(
            b_ptr,
            b_nrows as usize,
            b_ncols as usize,
            b_row_stride as isize,
            b_col_stride as isize,
        )
    };

    matmul(c, a, b, None, 1.0, Parallelism::Rayon(nthreads as usize));
}

The text was updated successfully, but these errors were encountered:

sarah-quinones · 2024-02-26T18:24:58Z

yeah, that looks reasonable enough to me. im surprised about the results though. could you share your benchmark setup?

guiburon · 2024-02-26T20:54:59Z

Thanks for your input!

I am multiplying dense f64 rectangular matrices of size (20,000x8,000) and (8,000x4,000). I am preallocating the result matrix before the benchmark. The benchmark macro @btime discards the first run (Julia is JIT) and then compute the average execution time and allocations over multiple runs.

Regarding the results I talked about, sorry I missed that OpenBLAS ignored my thread count target. I rerun the benchmark on all of my available hardware threads to circumvent that.
EDIT: And the MKL ran on 6 threads so nothing was comparable. Sorry, new laptop, something is odd with my current setup.

I will do more rigorous and thorough benchmarking later in the week. On Intel hardware as well.

Hardware and software

AMD Ryzen 5 7640U: 6 cores/12 threads
32Go DDR5 5600 MHz
Linux 6.7.6-arch1-1
Julia 1.10
rustc 1.76.0
faer 0.17.1

12 threads run

❯ julia -t 12 --project=. benchmark.jl
--- OpenBLAS ---
  3.862 s (0 allocations: 0 bytes)
--- MKL ---
  4.638 s (0 allocations: 0 bytes)
--- faer ---
  4.396 s (0 allocations: 0 bytes)

The MKL chooses to run on 6 threads according to htop while OpenBLAS and faer use hyperthreading and run on 12 threads.

benchmark.jl

nthreads = Base.Threads.nthreads()

ENV["OPENBLAS_NUM_THREADS"] = nthreads    # does not work
ENV["OMP_NUM_THREADS"] = nthreads    # useless?
ENV["MKL_NUM_THREADS"] = nthreads

using LinearAlgebra
using BenchmarkTools

include("wrapper.jl")
using .faer

ma = 20_000
na = 8_000

mb = na
nb = 4_000

a = rand(Float64, ma, na)
b = rand(Float64, mb, nb)

# --- OpenBLAS ---
println("--- OpenBLAS ---")
c = Matrix{Float64}(undef, ma, nb)
@btime mul!($c, $a, $b)

# --- MKL ---
using MKL
println("--- MKL ---")
c = Matrix{Float64}(undef, ma, nb)
@btime mul!($c, $a, $b)

# --- faer ---
println("--- faer ---")
c = Matrix{Float64}(undef, ma, nb)
@btime mult!($c, $a, $b; nthreads=$nthreads)

sarah-quinones · 2024-02-26T22:10:57Z

one thing that could make a difference is building faer with the nightly feature (and a nightly toolchain)
this enables avx512 instructions that are currently unstable, whereas openblas uses them by default

guiburon · 2024-02-27T00:25:10Z

I switched to my desktop (AMD Ryzen 9 7950X3D 16C/32T, 64Go DDR5-6000) because I think my laptop may thermal throttle and artificially lower MKL and faer results. I also set the random seed for repeatability.

nightly is worse right now unfortunately.

(20,000 x 8,000) * (8,000 x 4,000)

rustc 1.78.0-nightly

❯ julia -t 32 --project=. benchmark.jl
--- OpenBLAS ---
  1.166 s (0 allocations: 0 bytes)
--- MKL ---
  1.369 s (0 allocations: 0 bytes)
--- faer ---
  1.277 s (0 allocations: 0 bytes)

rustc 1.76.0

--- OpenBLAS ---
  1.181 s (0 allocations: 0 bytes)
--- MKL ---
  1.361 s (0 allocations: 0 bytes)
--- faer ---
  1.197 s (0 allocations: 0 bytes)

(40,000 x 16,000) * (16,000 x 8,000)

rustc 1.78.0-nightly

❯ julia -t 32 --project=. benchmark.jl
--- OpenBLAS ---
  9.221 s (0 allocations: 0 bytes)
--- MKL ---
  10.432 s (0 allocations: 0 bytes)
--- faer ---
  10.384 s (0 allocations: 0 bytes)

rustc 1.76.0

--- OpenBLAS ---
  9.278 s (0 allocations: 0 bytes)
--- MKL ---
  10.369 s (0 allocations: 0 bytes)
--- faer ---
  9.803 s (0 allocations: 0 bytes)

Are those results reasonable? They look good to me but I don't know the expected performance of faer.

I will do different benchmarks when I have more time. Are you interested? And if so where should I share them?

sarah-quinones · 2024-02-27T01:07:24Z

the results look pretty reasonable to me. it's hard to know exactly what is making faer slower without taking a closer look.
especially since im not able to bench on a wide variety of computers and the optimization settings can be tuned differently for each one.

guiburon · 2024-02-27T12:13:27Z

FYI I ran the same benchmark on Intel hardware. I fixed my thread count problem: everything effectively run on 8 threads here.

faer seems a lot less competitive on this hardware. nightly is still worse.

Hardware and software

Intel Xeon Gold 6136: 12 cores/12 threads (hyperthreading disabled)
64Go DDR4-2666
Linux WSL2 5.15.133.1
Julia 1.10
faer 0.17.1

(20,000 x 8,000) * (8,000 x 4,000)

rustc 1.78.0-nightly

❯ julia -t 8 --project benchmark.jl
--- OpenBLAS ---
  2.326 s (0 allocations: 0 bytes)
--- MKL ---
  2.082 s (0 allocations: 0 bytes)
--- faer ---
  5.636 s (0 allocations: 0 bytes)

rustc 1.76.0

--- OpenBLAS ---
  2.277 s (0 allocations: 0 bytes)
--- MKL ---
  2.032 s (0 allocations: 0 bytes)
--- faer ---
  4.662 s (0 allocations: 0 bytes)

sarah-quinones · 2024-02-27T12:59:58Z

what happens if you initialize the matrix instead of using undef? do you still get the same results?

sarah-quinones · 2024-02-27T13:58:05Z

i just got an idea! what happens if you benchmark faer without any of the other libraries running? i vaguely remember some issues with openmp's threadpool interfering with rayon's, which caused significant slowdowns on faer's side of things

i would be curious to see those as well as single threaded results if that's alright with you

guiburon · 2024-02-27T17:53:17Z

what happens if you initialize the matrix instead of using undef? do you still get the same results?

No change

i just got an idea! what happens if you benchmark faer without any of the other libraries running? i vaguely remember some issues with openmp's threadpool interfering with rayon's, which caused significant slowdowns on faer's side of things

No change

i would be curious to see those as well as single threaded results if that's alright with you

Large performance difference here! nightly does not change anything this time. I monitored the CPU usage to make sure it was all monothread.
I might run the same benchmark on my AMD hardware later.

Hardware and software

Intel Xeon Gold 6136: 12 cores/12 threads (hyperthreading disabled)
64Go DDR4-2666
Linux WSL2 5.15.133.1
Julia 1.10
faer 0.17.1

(20,000 x 8,000) * (8,000 x 4,000)

rustc 1.78.0-nightly
Parallelism::Rayon(1)

❯ julia -t 1 --project=. benchmark.jl
--- OpenBLAS ---
  15.627 s (0 allocations: 0 bytes)
--- MKL ---
  16.032 s (0 allocations: 0 bytes)
--- faer ---
  37.625 s (0 allocations: 0 bytes)

rustc 1.76.0
Parallelism::Rayon(1)

--- OpenBLAS ---
  15.977 s (0 allocations: 0 bytes)
--- MKL ---
  17.235 s (0 allocations: 0 bytes)
--- faer ---
  37.824 s (0 allocations: 0 bytes)

rustc 1.78.0-nightly
Parallelism::None

❯ julia -t 1 --project=. benchmark.jl
--- OpenBLAS ---
  16.049 s (0 allocations: 0 bytes)
--- MKL ---
  17.687 s (0 allocations: 0 bytes)
--- faer ---
  41.687 s (0 allocations: 0 bytes)

rustc 1.76.0
Parallelism::None
Not done.

sarah-quinones · 2024-02-27T18:52:27Z

yeah, no idea what's happening then. if you can share your full benchmark i can see if i can reproduce the results.

guiburon · 2024-02-28T15:09:50Z

yeah, no idea what's happening then. if you can share your full benchmark i can see if i can reproduce the results.

https://github.com/guiburon/faer-api

FYI something seems odd right now with BLAS.set_num_threads so I suggest monitoring the CPU usage to be sure OpenBLAS runs on the requested thread count. You might have to export OMP_NUM_THREADS before launching Julia if BLAS.set_num_threads does not work.

I don't know if you are familiar with Julia. Don't hesitate to ask if you want some pointers.

sarah-quinones · 2024-02-28T15:47:44Z

i tried the benchmark and im getting close results for all 3 libraries

--- faer ---
  5.089 s (0 allocations: 0 bytes)
--- OpenBLAS ---
  4.978 s (0 allocations: 0 bytes)
--- MKL ---
  4.887 s (0 allocations: 0 bytes)

one thing i noticed though, was that faer seems to be running slower in julia than rust for some reason? in rust the timings range from 4.2s to 4.8s on my machine (i5-11400 @ 2.60GHz with 12 threads)

guiburon · 2024-02-28T16:49:24Z

I ran the benchmark in monothread (Rayon(1)) on my Ryzen 5 7640U @ 4.9GHz and got close results for all 3 libs.

❯ julia -t 1 --project=. benchmark.jl
--- faer ---
  19.374 s (0 allocations: 0 bytes)
--- OpenBLAS ---
  18.557 s (0 allocations: 0 bytes)
--- MKL ---
  19.655 s (0 allocations: 0 bytes)

So the only hardware where faer is far behind (both mono and multithread) is on that Xeon Gold 6136? It does not seem to be due to Intel hardware judging by your i5 results. Maybe it's due to WSL but I can't easily run the benchmark directly on Windows. I will exclude that hardware from my benchmarks for now.

guiburon added the enhancement New feature or request label Feb 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Guidelines for efficient faer dynamic library #108

Guidelines for efficient faer dynamic library #108

guiburon commented Feb 26, 2024

sarah-quinones commented Feb 26, 2024

guiburon commented Feb 26, 2024 •

edited

Loading

sarah-quinones commented Feb 26, 2024

guiburon commented Feb 27, 2024

sarah-quinones commented Feb 27, 2024

guiburon commented Feb 27, 2024

sarah-quinones commented Feb 27, 2024

sarah-quinones commented Feb 27, 2024

guiburon commented Feb 27, 2024

sarah-quinones commented Feb 27, 2024

guiburon commented Feb 28, 2024

sarah-quinones commented Feb 28, 2024 •

edited

Loading

guiburon commented Feb 28, 2024

Guidelines for efficient faer dynamic library #108

Guidelines for efficient faer dynamic library #108

Comments

guiburon commented Feb 26, 2024

sarah-quinones commented Feb 26, 2024

guiburon commented Feb 26, 2024 • edited Loading

sarah-quinones commented Feb 26, 2024

guiburon commented Feb 27, 2024

sarah-quinones commented Feb 27, 2024

guiburon commented Feb 27, 2024

sarah-quinones commented Feb 27, 2024

sarah-quinones commented Feb 27, 2024

guiburon commented Feb 27, 2024

sarah-quinones commented Feb 27, 2024

guiburon commented Feb 28, 2024

sarah-quinones commented Feb 28, 2024 • edited Loading

guiburon commented Feb 28, 2024

guiburon commented Feb 26, 2024 •

edited

Loading

sarah-quinones commented Feb 28, 2024 •

edited

Loading