
DOT(sparse, dense) performance #60

Closed
eric-haibin-lin opened this issue May 31, 2017 · 5 comments

@eric-haibin-lin (Owner)

Opening a separate issue to track the performance.

Setup: p2.8xlarge, commit be1e63b. Compiled with:

USE_OPENMP=1, DEV=0, DEBUG=0, USE_MKL2017=0, USE_MKL2017_EXPERIMENTAL=0, USE_BLAS=openblas
ubuntu@ip-172-31-33-77:~ $ python ./benchmark/python/sparse_op.py
A = sparse NDArray of shape(m, k)
B = dense NDArray of shape(k, n)

dot_forward     dot(csr, dns)
density(%)      context n       m       k       t_sparse        t_dense t_sparse/t_dense
5.0             cpu(0)  50      512     50000   0.107473        0.07950 1.35
2.0             cpu(0)  50      512     50000   0.037989        0.04614 0.82
1.0             cpu(0)  50      512     50000   0.019648        0.04613 0.43
0.5             cpu(0)  50      512     50000   0.010569        0.04618 0.23
0.1             cpu(0)  50      512     50000   0.002689        0.04633 0.06

5.0             cpu(0)  100     512     100000  0.706307        0.07776 9.08
2.0             cpu(0)  100     512     100000  0.175632        0.07767 2.26
1.0             cpu(0)  100     512     100000  0.081770        0.07748 1.06
0.5             cpu(0)  100     512     100000  0.044006        0.07770 0.57
0.1             cpu(0)  100     512     100000  0.010350        0.07731 0.13

dot_backward    dot(csr.T, dns)
density(%)      context n       m       k       t_sparse        t_dense t_sparse/t_dense
5.0             cpu(0)  50      512     50000   89.038583       0.05922 1503.65
2.0             cpu(0)  50      512     50000   74.109367       0.05911 1253.69
1.0             cpu(0)  50      512     50000   71.950517       0.05902 1219.13
0.5             cpu(0)  50      512     50000   55.482324       0.05920 937.19
0.1             cpu(0)  50      512     50000   24.582513       0.05941 413.74
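For anyone who wants to reproduce the shape of these numbers without an MXNet build, here is a rough scipy-based sketch of the same measurement. The benchmark script `sparse_op.py` is not reproduced here; the function name and repeat count are my own, and only the A(m, k) x B(k, n) setup follows the tables above.

```python
import time
import numpy as np
import scipy.sparse as sp

def bench_dot(m, k, n, density, repeat=3):
    """Time dot(csr, dns) against the equivalent dense dot."""
    a_sp = sp.random(m, k, density=density, format="csr", dtype=np.float64)
    a_dn = a_sp.toarray()          # same matrix, dense layout
    b = np.random.rand(k, n)

    t0 = time.perf_counter()
    for _ in range(repeat):
        c_sparse = a_sp.dot(b)     # CSR x dense
    t_sparse = (time.perf_counter() - t0) / repeat

    t0 = time.perf_counter()
    for _ in range(repeat):
        c_dense = a_dn.dot(b)      # dense x dense baseline
    t_dense = (time.perf_counter() - t0) / repeat

    assert np.allclose(c_sparse, c_dense)  # both paths agree
    return t_sparse, t_dense
```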
@reminisce (Collaborator) commented Jun 1, 2017

Benchmark data after improving dot(csr.T(), dns)=dns and comparison with scipy.

Setup: p2.xlarge.
PR: #61
Compiled with:

USE_OPENMP=1, USE_CUDA=0, DEV=0, DEBUG=0, USE_MKL2017=0, USE_MKL2017_EXPERIMENTAL=0, USE_BLAS=openblas
device		operation		parallelization
cpu (4 threads)	dot(csr, dns)=dns	row blocks
cpu (4 threads)	dot(csr.T(), dns)=dns	row blocks
A = sparse NDArray of shape(m, k)
B = dense NDArray of shape(k, n)
dot_forward	dot(csr, dns)=dns
density(%)	context	n	m	k	t_sparse	t_dense	t_sparse/t_dense	t_scipy_sparse	t_scipy_dense	t_scipy_sparse/t_scipy_dense
5.0		cpu(0)	50	512	50000	0.002691	0.00788	0.34			0.048267	0.06281		0.77
2.0		cpu(0)	50	512	50000	0.001649	0.00793	0.21			0.023380	0.06344		0.37
1.0		cpu(0)	50	512	50000	0.000980	0.00771	0.13			0.013439	0.06336		0.21
0.5		cpu(0)	50	512	50000	0.001260	0.00767	0.16			0.007220	0.06292		0.11
0.1		cpu(0)	50	512	50000	0.000054	0.00770	0.01			0.001582	0.06770		0.02

5.0		cpu(0)	100	512	100000	0.009179	0.01946	0.47			0.261017	0.14550		1.79
2.0		cpu(0)	100	512	100000	0.004635	0.01940	0.24			0.116671	0.14529		0.80
1.0		cpu(0)	100	512	100000	0.002479	0.01955	0.13			0.047389	0.14522		0.33
0.5		cpu(0)	100	512	100000	0.001623	0.01961	0.08			0.025809	0.14595		0.18
0.1		cpu(0)	100	512	100000	0.000188	0.01947	0.01			0.005126	0.15187		0.03

dot_backward	dot(csr.T, dns)=dns
density(%)	context	n	m	k	t_sparse	t_dense	t_sparse/t_dense	t_scipy_sparse	t_scipy_dense	t_scipy_sparse/t_scipy_dense
5.0		cpu(0)	50	512	50000	0.029762	0.10533	0.28			0.047543	0.09001		0.53
2.0		cpu(0)	50	512	50000	0.017676	0.10564	0.17			0.021487	0.08963		0.24
1.0		cpu(0)	50	512	50000	0.013754	0.10349	0.13			0.013814	0.08873		0.16
0.5		cpu(0)	50	512	50000	0.011742	0.10394	0.11			0.007770	0.08869		0.09
0.1		cpu(0)	50	512	50000	0.004027	0.10331	0.04			0.002200	0.09080		0.02

5.0		cpu(0)	100	512	100000	0.102324	0.26337	0.39			0.220872	0.21498		1.03
2.0		cpu(0)	100	512	100000	0.046840	0.26338	0.18			0.116241	0.21422		0.54
1.0		cpu(0)	100	512	100000	0.028831	0.26262	0.11			0.060634	0.21256		0.29
0.5		cpu(0)	100	512	100000	0.022656	0.26276	0.09			0.029284	0.21312		0.14
0.1		cpu(0)	100	512	100000	0.014599	0.26468	0.06			0.012112	0.21223		0.06
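The kernel in the PR is C++/OpenMP, but the "row blocks" scheme listed above can be sketched in Python. The function name and block partitioning below are illustrative, not the actual kernel: output rows (i.e. columns of A) are split into contiguous blocks, each block scans the whole CSR and accumulates only the nonzeros whose column index falls inside it, so no two blocks ever write the same output row and they could run in parallel.

```python
import numpy as np
import scipy.sparse as sp

def dot_csrT_dns_row_blocks(a_csr, b, num_blocks=4):
    """Compute dot(A.T, B) for CSR A without materializing A.T.

    (A.T B)[j, :] = sum_i A[i, j] * B[i, :], so each nonzero A[i, j]
    scatters a scaled copy of row B[i] into output row j.
    """
    m, k = a_csr.shape
    n = b.shape[1]
    out = np.zeros((k, n))
    bounds = np.linspace(0, k, num_blocks + 1).astype(int)
    for lo, hi in zip(bounds[:-1], bounds[1:]):   # one block of output rows
        for i in range(m):                        # scan every CSR row
            start, end = a_csr.indptr[i], a_csr.indptr[i + 1]
            for p in range(start, end):
                j = a_csr.indices[p]
                if lo <= j < hi:                  # only this block's rows
                    out[j] += a_csr.data[p] * b[i]
    return out
```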

@eric-haibin-lin (Owner, Author)

@reminisce could you also document the setup so that it's reproducible next time (e.g. hardware, commit id, etc.)?

@jiajiechen commented Jun 7, 2017

Setup: c3.8xlarge. (32 threads)
Compiled with

USE_OPENMP=1, USE_CUDA=0, DEV=0, DEBUG=0, USE_MKL2017=0, USE_MKL2017_EXPERIMENTAL=0, USE_BLAS=openblas

Note: the dense dot operator is executed in a single thread.

Synthetic data (uniform distribution)

A = CSR NDArray of shape(m, k)
B = dense NDArray of shape(k, n)
dot_forward	dot(csr, dns)
density(%)	context	n	m	k	t_dense/t_sparse	t_dense	t_sparse	t_scipy_dense/t_scipy_sparse	t_scipy_dense	t_scipy_sparse
100.0		cpu(0)	64	512	50000	5.36			0.02	0.00390		0.15				0.106657	0.73276
90.0		cpu(0)	64	512	50000	4.41			0.02	0.00475		0.16				0.107326	0.66499
70.0		cpu(0)	64	512	50000	6.93			0.02	0.00301		0.20				0.106693	0.52502
50.0		cpu(0)	64	512	50000	8.34			0.02	0.00251		0.27				0.106659	0.38793
30.0		cpu(0)	64	512	50000	11.11			0.02	0.00189		0.43				0.108297	0.25304
20.0		cpu(0)	64	512	50000	10.68			0.02	0.00196		0.59				0.106615	0.17954
10.0		cpu(0)	64	512	50000	20.32			0.02	0.00102		1.12				0.106978	0.09549
7.0		cpu(0)	64	512	50000	22.92			0.02	0.00091		1.55				0.107462	0.06930
5.0		cpu(0)	64	512	50000	26.59			0.02	0.00078		2.18				0.106927	0.04912
2.0		cpu(0)	64	512	50000	69.43			0.02	0.00030		5.14				0.107367	0.02089
1.0		cpu(0)	64	512	50000	104.20			0.02	0.00020		10.58				0.107327	0.01014
0.5		cpu(0)	64	512	50000	154.40			0.02	0.00013		20.27				0.107358	0.00530
0.1		cpu(0)	64	512	50000	243.30			0.02	0.00009		91.78				0.107332	0.00117
100.0		cpu(0)	128	512	100000	5.47			0.08	0.01440		0.03				0.106269	3.15615
90.0		cpu(0)	128	512	100000	5.91			0.08	0.01340		0.03				0.102373	2.94345
70.0		cpu(0)	128	512	100000	6.20			0.08	0.01232		0.03				0.081981	2.64635
50.0		cpu(0)	128	512	100000	7.62			0.08	0.01036		0.05				0.109099	2.32388
30.0		cpu(0)	128	512	100000	10.65			0.08	0.00710		0.07				0.106034	1.59388
20.0		cpu(0)	128	512	100000	10.50			0.08	0.00732		0.05				0.075211	1.64093
10.0		cpu(0)	128	512	100000	14.99			0.08	0.00525		0.12				0.073087	0.61370
7.0		cpu(0)	128	512	100000	18.44			0.08	0.00427		0.17				0.110425	0.63676
5.0		cpu(0)	128	512	100000	17.47			0.08	0.00451		0.22				0.079013	0.36717
2.0		cpu(0)	128	512	100000	42.68			0.08	0.00185		0.50				0.081154	0.16210
1.0		cpu(0)	128	512	100000	85.00			0.08	0.00093		1.00				0.086644	0.08625
0.5		cpu(0)	128	512	100000	112.83			0.08	0.00068		2.23				0.124563	0.05598
0.1		cpu(0)	128	512	100000	325.71			0.08	0.00024		10.76				0.119254	0.01108

dot_backward	dot(csr.T, dns)
density(%)	context	n	m	k	t_dense/t_sparse	t_dense	t_sparse	t_scipy_dense/t_scipy_sparse	t_scipy_dense	t_scipy_sparse
100.0		cpu(0)	64	512	50000	3.22			0.22	0.06767		0.17				0.122950	0.73488
90.0		cpu(0)	64	512	50000	2.47			0.21	0.08632		0.18				0.121640	0.66491
70.0		cpu(0)	64	512	50000	3.04			0.21	0.06970		0.23				0.122812	0.53754
50.0		cpu(0)	64	512	50000	3.52			0.21	0.06018		0.31				0.123057	0.40275
30.0		cpu(0)	64	512	50000	6.13			0.22	0.03615		0.53				0.133585	0.25196
20.0		cpu(0)	64	512	50000	7.06			0.22	0.03125		0.75				0.133698	0.17817
10.0		cpu(0)	64	512	50000	11.12			0.22	0.01954		1.27				0.124703	0.09806
7.0		cpu(0)	64	512	50000	13.65			0.20	0.01484		1.78				0.122862	0.06911
5.0		cpu(0)	64	512	50000	17.58			0.21	0.01208		2.48				0.122755	0.04956
2.0		cpu(0)	64	512	50000	36.78			0.21	0.00576		6.09				0.122539	0.02013
1.0		cpu(0)	64	512	50000	50.39			0.21	0.00421		11.61				0.123022	0.01060
0.5		cpu(0)	64	512	50000	59.36			0.20	0.00341		22.12				0.122475	0.00554
0.1		cpu(0)	64	512	50000	79.61			0.21	0.00266		66.76				0.123492	0.00185
100.0		cpu(0)	128	512	100000	2.20			0.81	0.36856		0.04				0.133066	3.36961
90.0		cpu(0)	128	512	100000	2.46			0.80	0.32468		0.04				0.136473	3.21853
70.0		cpu(0)	128	512	100000	3.56			0.81	0.22623		0.04				0.133270	3.09333
50.0		cpu(0)	128	512	100000	4.47			0.77	0.17326		0.05				0.142082	2.71753
30.0		cpu(0)	128	512	100000	7.29			0.81	0.11053		0.07				0.140831	1.91718
20.0		cpu(0)	128	512	100000	10.75			0.81	0.07519		0.10				0.130642	1.32151
10.0		cpu(0)	128	512	100000	19.21			0.80	0.04152		0.19				0.126030	0.67883
7.0		cpu(0)	128	512	100000	31.04			0.81	0.02600		0.24				0.122869	0.50638
5.0		cpu(0)	128	512	100000	40.17			0.77	0.01912		0.37				0.138255	0.37648
2.0		cpu(0)	128	512	100000	68.28			0.80	0.01172		0.83				0.136068	0.16385
1.0		cpu(0)	128	512	100000	64.97			0.80	0.01230		1.46				0.129803	0.08876
0.5		cpu(0)	128	512	100000	129.39			0.81	0.00624		2.55				0.133773	0.05242
0.1		cpu(0)	128	512	100000	118.27			0.80	0.00675		7.96				0.127005	0.01595

avazu-app Data

density(%)	n	m	k	t_dense/t_sparse	t_dense	t_sparse
0.0015		64	500	1000000	4493.29			4.0891	0.000910
0.0015		128	500	1000000	6860.97			7.6305	0.001112

kdda data

density(%)	n	m	k		t_dense/t_sparse	t_dense	t_sparse
0.018663	64	200	20216830	31760.71		4.2127	0.000133
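For clarity, density(%) in all of these tables is nnz over the full matrix size. A small hypothetical helper (not part of the benchmark script):

```python
import scipy.sparse as sp

def density_percent(a):
    """density(%) as reported in the tables: nnz / (rows * cols) * 100."""
    m, k = a.shape
    return a.nnz / (m * k) * 100.0
```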

@reminisce (Collaborator)

Since we are doing more and more benchmarks, shall we start using pandas DataFrames to store the benchmark results from now on? They make the results easier to display and provide convenient utility functions for statistical analysis.
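Something like the following (column names are invented here to mirror the table headers above):

```python
import pandas as pd

# Hypothetical rows mirroring the table layout; a real run would
# append one dict per benchmark configuration.
rows = [
    {"density_pct": 5.0, "n": 50, "m": 512, "k": 50000,
     "t_sparse": 0.002691, "t_dense": 0.00788},
    {"density_pct": 1.0, "n": 50, "m": 512, "k": 50000,
     "t_sparse": 0.000980, "t_dense": 0.00771},
]
df = pd.DataFrame(rows)
# Derived ratio column, same as the t_sparse/t_dense column in the tables.
df["t_sparse/t_dense"] = df["t_sparse"] / df["t_dense"]
print(df.to_string(index=False))
```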

@eric-haibin-lin (Owner, Author)

GPU dot operator result:
apache#6937

dot(csr, row_sparse) operator result:
apache#6902

eric-haibin-lin pushed a commit that referenced this issue Apr 4, 2018