Here is a series of D and Julia benchmarks timed against popular NumPy functions.
We test standard D functions as well as the Mir numerical library across tasks such as multiplication, dot product, sorting and one general neural network data preprocessing task. Additionally, we test standard D and Mir memory allocation, reallocation and garbage collection against NumPy.
*Julia benchmarks were added out of curiosity.
Each benchmark was run 20 times with a 0 sec. timeout; the timings were then collected and averaged (NumPy, D).
Julia code was benchmarked with the `@btime` macro from the BenchmarkTools package.
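For the NumPy and D benchmarks, the run-and-average loop is straightforward. Below is a minimal sketch of the approach in D; the harness in the repository may differ in details, and `timeIt` is an illustrative name:

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// Illustrative timing helper, not the repository's actual harness.
void timeIt(alias fun)(string description, int runs = 20)
{
    double totalSeconds = 0;
    foreach (_; 0 .. runs)
    {
        auto sw = StopWatch(AutoStart.yes);
        fun();                                            // operation under test
        totalSeconds += sw.peek.total!"usecs" / 1e6;
    }
    writefln("%s | %s", description, totalSeconds / runs); // mean over all runs
}
```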
- D compiler LDC 1.19.0, mir-algorithm 3.7.19, mir-random 2.2.11, mir-blas 1.1.10 (OpenBLAS)
- (Anaconda) Python 3.7.6, NumPy 1.18.1 (MKL)
- Julia 1.4.0 (OpenBLAS)
- CPU: Quad Core Intel Core i7-7700HQ (-MT MCP-) speed/min/max: 919/800/3800 MHz Kernel: 5.5.7-1-MANJARO x86_64
- Mem: 2814.2/48147.6 MiB (5.8%) Storage: 489.05 GiB (6.6% used) Procs: 271 Shell: fish 3.1.0
- D: `dub run --compiler=ldc2 --build=release --force`
- Julia: `julia julia_bench.jl`
- NumPy: `python3 other_benchmarks/basic_ops_bench.py`
Description | NumPy (sec.) | Standard D (sec.) | Mir D (sec.) |
---|---|---|---|
Dot (scalar) product of two 300000 arrays (float64), (1000 loops) | 0.10081 | 0.17074 (x1.7) | 0.0892 (x1/1.1) |
Element-wise sum of two 100x100 matrices (int), (1000 loops) | 0.00387 | 0.00369 (x1/1.1) | 0.00134 (x1/2.9) |
Element-wise multiplication of two 100x100 matrices (float64), (1000 loops) | 0.00419 | 0.03762 (x9) | 0.00301 (x1/1.4) |
L2 norm of 500x600 matrix (float64), (1000 loops) | 0.06302 | 0.11729 (x1.9) | 0.03903 (x1/1.6) |
Matrix product of 500x600 and 600x500 matrices (float64) | 0.00556 | 0.15769 (x28) * | 0.00592 (x1.1) |
Sort of 500x600 matrix (float64) | 0.00963 | 0.01104 (x1.2) | 0.01136 (x1.2) |
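For reference, the Mir kernels behind rows like the dot product are short. A minimal sketch, assuming mir-algorithm's multi-tensor `reduce` (the actual benchmark code lives in the repository):

```d
import mir.ndslice;
import mir.algorithm.iteration : reduce;

// Dot product as a three-parameter reduce: acc = acc + x * y.
double dot(double[] x, double[] y)
{
    return reduce!"a + b * c"(0.0, x.sliced, y.sliced);
}
```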
Description | NumPy (sec.) | Standard D (sec.) | Mir D (sec.) |
---|---|---|---|
Allocation, writing and deallocation of a [30000000] array | 0.94646 | 0.96885 (x1) | 0.92168 (x1) |
Allocation, writing and deallocation of several big arrays of different sizes | 0.32987 | 0.31707 (x1) | 0.91351 (x2.8) |
Slicing [30000] array into another array (30000 loops) | 0.39881 | 0.32689 (x1/1.2) | 0.39911 (x1) |
Description | NumPy (sec.) | Mir D (sec.) | Julia (sec.) |
---|---|---|---|
Dot (scalar) product of two 300000 arrays (float64), (1000 loops) | 0.03528 | 0.03091 (x1/1.1) | 0.03 (x1/1.1) |
Element-wise sum of two 100x100 matrices (int), (1000 loops) | 0.00379 | 0.00152 (x1/2.5) | 0.0063 (x1.6) |
Element-wise multiplication of two 100x100 matrices (float64), (1000 loops) | 0.00419 | 0.00293 (x1/1.4) | 0.00617 (x1.5) |
L2 norm of 500x600 matrix (float64), (1000 loops) | 0.02391 | 0.03982 (x1.7) | 0.097 (x4.1) |
Matrix product of 500x600 and 600x500 matrices (float64) | 0.00186 | 0.00207 (x1.1) | 0.01988 (x10.7) |
Sort of 500x600 matrix (float64) | 0.01033 | 0.0113 (x1.1) | 0.0161 (x1.6) |
Description | Python + NumPy (sec.) | Standard D + Mir (sec.) |
---|---|---|
Neural network training data preprocessing (1.5 MB) | 0.15563 | 0.04602 (x1/3.4) |
Neural network training data preprocessing (16 MB) | 1.86498 | 0.45454 (x1/4.1) |
To limit the number of threads, set the corresponding environment variable prior to running the benchmarks. For example, Anaconda NumPy uses Intel MKL, therefore the number of threads is controlled with the `MKL_NUM_THREADS` variable.
Check which backend is used:
```python
In [1]: import numpy as np

In [2]: np.show_config()
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/pavel/miniconda3/envs/torch/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/pavel/miniconda3/envs/torch/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/pavel/miniconda3/envs/torch/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/pavel/miniconda3/envs/torch/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/pavel/miniconda3/envs/torch/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/pavel/miniconda3/envs/torch/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/pavel/miniconda3/envs/torch/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/pavel/miniconda3/envs/torch/include']
```
Your NumPy threads are controlled by one or several of the variables below.
Bash:

```bash
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export OMP_NUM_THREADS=1
```

Fish: use `set -x ENV_VAR val`.
Description | Time (sec.) |
---|---|
Dot (scalar) product of two 300000 arrays (float64), (1000 loops) | 0.10080685440025264 |
Element-wise sum of two 100x100 matrices (int), (1000 loops) | 0.0038669097997626523 |
Element-wise multiplication of two 100x100 matrices (float64), (1000 loops) | 0.004186091849987861 |
Matrix product of 500x600 and 600x500 matrices (float64) | 0.005560713250088156 |
L2 norm of 500x600 matrix (float64), (1000 loops) | 0.06301813390018651 |
Sort of 500x600 matrix (float64) | 0.009630701900277927 |
Description | Time (sec.) |
---|---|
(Reference only) non-optimized matrix product of [500x600] and [600x500] struct matrices (double) | 0.157694 |
Dot (scalar) product of two [300000] arrays (double), (1000 loops) | 0.170738 |
Element-wise multiplication of two [100x100] arrays of arrays (double), (1000 loops) | 0.0675184 |
Element-wise multiplication of two [100x100] struct matrices (double), (1000 loops) | 0.037623 |
Element-wise sum of two [100x100] arrays of arrays (int), (1000 loops) | 0.0728854 |
Element-wise sum of two [100x100] struct matrices (int), (1000 loops) | 0.00368572 |
L2 norm of [500x600] struct matrix (double), (1000 loops) | 0.117289 |
Sort of [500x600] struct matrix (double) | 0.0110437 |
Set environment variables.
Bash: `export OPENBLAS_NUM_THREADS=1`
Fish: `set -x OPENBLAS_NUM_THREADS 1`
Description | Time (sec.) |
---|---|
Dot (scalar) product of two [300000] slices (double), (1000 loops) | 0.0892025 |
Dot (scalar) product of two [300000] slices (double), (OpenBLAS), (1000 loops) | 0.0900235 |
Dot (scalar) product of two [300000] slices (double), (plain loop), (1000 loops) | 0.0893657 |
Element-wise multiplication of two [100x100] slices (double), (1000 loops) | 0.00301221 |
Element-wise sum of two [100x100] slices (int), (1000 loops) | 0.00133979 |
L2 norm of [500x600] slice (double), (1000 loops) | 0.0390259 |
Matrix product of [500x600] and [600x500] slices (double), (OpenBLAS) | 0.00591477 |
Sort of [500x600] slice (double) | 0.011357 |
Your NumPy threads are controlled by one or several of the variables below.
Bash:

```bash
export OPENBLAS_NUM_THREADS=4
export MKL_NUM_THREADS=4
export NUMEXPR_NUM_THREADS=4
export VECLIB_MAXIMUM_THREADS=4
export OMP_NUM_THREADS=4
```

Fish: use `set -x ENV_VAR val`.
Description | Time (sec.) |
---|---|
Element-wise sum of two 100x100 matrices (int), (1000 loops) | 0.0037877704002312385 |
Element-wise multiplication of two 100x100 matrices (float64), (1000 loops) | 0.004193491550176986 |
Dot (scalar) product of two 300000 arrays (float64), (1000 loops) | 0.03528142820068751 |
Matrix product of 500x600 and 600x500 matrices (float64) | 0.0018566828504845035 |
L2 norm of 500x600 matrix (float64), (1000 loops) | 0.023907507749936486 |
Sort of 500x600 matrix (float64) | 0.010326230399914493 |
Multi-threaded benchmarks are not implemented for standard D; a minimal sketch of how multi-threading can be used in standard D follows.
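A sketch with `std.parallelism` (illustrative only; this is not what the benchmarks run):

```d
import std.parallelism : taskPool;

// Element-wise sum distributed across the default task pool.
void parallelSum(int[] result, const(int)[] a, const(int)[] b)
{
    foreach (i, ref r; taskPool.parallel(result))
        r = a[i] + b[i];
}
```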
Set environment variables.
Bash: `export OPENBLAS_NUM_THREADS=4`
Fish: `set -x OPENBLAS_NUM_THREADS 4`
Description | Time (sec.) |
---|---|
Dot (scalar) product of two [300000] slices (double), (1000 loops) | 0.0863238 |
Dot (scalar) product of two [300000] slices (double), (OpenBLAS), (1000 loops) | 0.0309097 |
Dot (scalar) product of two [300000] slices (double), (plain loop), (1000 loops) | 0.0860322 |
Element-wise multiplication of two [100x100] slices (double), (1000 loops) | 0.00293436 |
Element-wise sum of two [100x100] slices (int), (1000 loops) | 0.0015176 |
L2 norm of [500x600] slice (double), (1000 loops) | 0.0398216 |
Matrix product of [500x600] and [600x500] slices (double), (OpenBLAS) | 0.00206505 |
Sort of [500x600] slice (double) | 0.0112988 |
According to the docs, Julia uses a single thread by default, but `htop` was reporting all cores busy.
Set environment variables.
Bash: `export JULIA_NUM_THREADS=4`
Fish: `set -x JULIA_NUM_THREADS 4`
Test it with `julia -e 'println(Threads.nthreads())'`.
Description | Time (sec.) |
---|---|
Element-wise sum of two 100x100 matrices (int), (1000 loops) | 0.0063 |
Element-wise multiplication of two 100x100 matrices (float64), (1000 loops) | 0.00617 |
Dot (scalar) product of two 300000 arrays (float64), (1000 loops) | 0.03 |
Matrix product of 500x600 and 600x500 matrices (float64) | 0.01988 |
L2 norm of 500x600 matrix (float64), (1000 loops) | 0.097 |
Sort of 500x600 matrix (float64) | 0.0161 |
Description | Time (sec.) |
---|---|
Allocation, writing and deallocation of a [30000000] array | 0.9464583551500254 |
Allocation, writing and deallocation of several big arrays of different sizes | 0.3298667574499632 |
Slicing [30000] array into another array (30000 loops) | 0.3988089733500601 |
Description | Time (sec.) |
---|---|
Allocation, writing and deallocation of a [30000000] array | 0.968854 |
Allocation, writing and deallocation of several big arrays of different sizes | 0.317072 |
Slicing [30000] array into another array (30000 loops) | 0.326886 |
Description | Time (sec.) |
---|---|
Allocation, writing and deallocation of a [30000000] array | 0.921682 |
Allocation, writing and deallocation of several big arrays of different sizes | 0.913509 |
Slicing [30000] array into another array (30000 loops) | 0.399107 |
The standard D library does not have a function for matrix product, therefore we use a plain loop implementation. Although the looped function is pretty fast with small to medium-sized matrices, it becomes prohibitively slow with bigger ones (efficient matrix multiplication is a field of its own).
NumPy uses the heavily optimized BLAS general matrix multiplication routine `gemm`.
Nothing stops you from using the same routine in your D code via the CBLAS bindings.
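For example, the mir-blas dependency already used in these benchmarks wraps `gemm` for Mir slices. A minimal sketch, assuming mir-blas's `gemm(alpha, a, b, beta, c)` signature:

```d
import mir.ndslice;
import mir.blas : gemm;

void main()
{
    auto a = slice!double(500, 600);
    auto b = slice!double(600, 500);
    auto c = slice!double(500, 500);
    a[] = 1.0;
    b[] = 2.0;
    // c = 1.0 * (a . b) + 0.0 * c, dispatched to the BLAS backend (OpenBLAS here)
    gemm(1.0, a, b, 0.0, c);
}
```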
Timings of the non-optimized `matrixDotProduct` function:
Matrix Sizes | Time (sec.) |
---|---|
2 x [100 x 100] | 0.01 |
2 x [1000 x 1000] | 2.21 |
2 x [1500 x 1000] | 5.6 |
2 x [1500 x 1500] | 9.28 |
2 x [2000 x 2000] | 44.59 |
2 x [2100 x 2100] | 55.13 |
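The plain-loop product behind these numbers is the textbook triple loop. An illustrative sketch over built-in arrays (the repository's `matrixDotProduct` operates on its own matrix struct, so details differ):

```d
// Naive O(n^3) matrix product over built-in arrays.
double[][] matMul(const double[][] a, const double[][] b)
{
    auto res = new double[][](a.length, b[0].length);
    foreach (i; 0 .. a.length)
        foreach (j; 0 .. b[0].length)
        {
            double acc = 0;
            foreach (k; 0 .. b.length)
                acc += a[i][k] * b[k][j]; // row i of a dotted with column j of b
            res[i][j] = acc;
        }
    return res;
}
```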
General-purpose function benchmarks are great, but they tell us little about language efficiency in real-world tasks. Therefore, we implemented a small neural network data preprocessing benchmark.
For this benchmark we used actual code that has been run thousands of times to preprocess training data for offline BiLSTM model training. The BiLSTM model is a word classifier, trained to do named entity recognition for a single specific class. In order to train the model, we need to read the training data, preprocess it and convert it into multidimensional tensors. The tensors represent input data and should be sliceable into shapes [batch_size x seq_length x feature_dim], e.g. [32 x 25 x 4].
Our original implementation was written in Python and NumPy for converting to tensors and slicing. The D version follows the same pattern and uses the Mir numerical library to represent multidimensional arrays and do the slicing.
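The slicing requirement maps directly onto Mir. A minimal sketch of viewing a flat feature buffer as a [batch_size x seq_length x feature_dim] tensor (names are illustrative, not the repository's actual code):

```d
import mir.ndslice;

void main()
{
    enum batchSize = 32, seqLength = 25, featureDim = 4;
    // Flat buffer of preprocessed feature values.
    auto flat = new double[batchSize * seqLength * featureDim];
    // Zero-copy view as a [32 x 25 x 4] tensor.
    auto tensor = flat.sliced(batchSize, seqLength, featureDim);
    auto batch = tensor[0]; // one [25 x 4] training sequence
}
```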
The two test datasets (1.5 MB and 16 MB) used in the benchmarks contain private information and cannot be provided. However, the original data looks like the example table below.
document_id | masked_token | original_token | feature1 | feature2 | feature3 | feature4 | target_label |
---|---|---|---|---|---|---|---|
2314 | ich | Ich | 231.23 | 20.10 | 1 | 1 | 0 |
2314 | bin | bin | 235.1 | 20.10 | 0 | 0 | 0 |
2314 | kartoffel | Kartoffel | 240.5 | 20.10 | 1 | 0 | 0 |
2314 | <dig> | 2 | 244.2 | 20.10 | 0 | 0 | 0 |
2314 | <prep> | für | 250 | 20.10 | 0 | 0 | 0 |
2314 | <dig> | 3 | 255 | 20.10 | 0 | 0 | 0 |
2314 | <punct> | ! | 240.5 | 20.10 | 0 | 0 | 0 |
2314 | münchen | München | 340.32 | 130.23 | 1 | 0 | 1 |
2314 | <func> | ist | 355.21 | 130.23 | 0 | 1 | 0 |
2314 | grün | grün | 364.78 | 130.23 | 0 | 0 | 0 |