
DOT(sparse, dense) performance #60

Closed
eric-haibin-lin opened this issue May 31, 2017 · 5 comments

@eric-haibin-lin (Owner)

Opening a separate issue to track the performance.

Setup: p2.8xlarge, commit be1e63b. Compiled with:

USE_OPENMP=1, DEV=0, DEBUG=0, USE_MKL2017=0, USE_MKL2017_EXPERIMENTAL=0, USE_BLAS=openblas
ubuntu@ip-172-31-33-77:~ $ python ./benchmark/python/sparse_op.py
A = sparse NDArray of shape(m, k)
B = dense NDArray of shape(k, n)

dot_forward     dot(csr, dns)
density(%)      context n       m       k       t_sparse        t_dense t_sparse/t_dense
5.0             cpu(0)  50      512     50000   0.107473        0.07950 1.35
2.0             cpu(0)  50      512     50000   0.037989        0.04614 0.82
1.0             cpu(0)  50      512     50000   0.019648        0.04613 0.43
0.5             cpu(0)  50      512     50000   0.010569        0.04618 0.23
0.1             cpu(0)  50      512     50000   0.002689        0.04633 0.06

5.0             cpu(0)  100     512     100000  0.706307        0.07776 9.08
2.0             cpu(0)  100     512     100000  0.175632        0.07767 2.26
1.0             cpu(0)  100     512     100000  0.081770        0.07748 1.06
0.5             cpu(0)  100     512     100000  0.044006        0.07770 0.57
0.1             cpu(0)  100     512     100000  0.010350        0.07731 0.13

dot_backward    dot(csr.T, dns)
density(%)      context n       m       k       t_sparse        t_dense t_sparse/t_dense
5.0             cpu(0)  50      512     50000   89.038583       0.05922 1503.65
2.0             cpu(0)  50      512     50000   74.109367       0.05911 1253.69
1.0             cpu(0)  50      512     50000   71.950517       0.05902 1219.13
0.5             cpu(0)  50      512     50000   55.482324       0.05920 937.19
0.1             cpu(0)  50      512     50000   24.582513       0.05941 413.74
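For anyone who wants to reproduce the shape of these numbers without an MXNet build, here is a rough scipy-based sketch of the same measurement. The benchmark script `sparse_op.py` is not reproduced here; the function name and repeat count are my own, and only the A(m, k) x B(k, n) setup follows the tables above.

```python
import time
import numpy as np
import scipy.sparse as sp

def bench_dot(m, k, n, density, repeat=3):
    """Time dot(csr, dns) against the equivalent dense dot."""
    a_sp = sp.random(m, k, density=density, format="csr", dtype=np.float64)
    a_dn = a_sp.toarray()          # same matrix, dense layout
    b = np.random.rand(k, n)

    t0 = time.perf_counter()
    for _ in range(repeat):
        c_sparse = a_sp.dot(b)     # CSR x dense
    t_sparse = (time.perf_counter() - t0) / repeat

    t0 = time.perf_counter()
    for _ in range(repeat):
        c_dense = a_dn.dot(b)      # dense x dense baseline
    t_dense = (time.perf_counter() - t0) / repeat

    assert np.allclose(c_sparse, c_dense)  # both paths agree
    return t_sparse, t_dense
```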
@reminisce (Collaborator) commented Jun 1, 2017

Benchmark data after improving dot(csr.T(), dns)=dns and comparison with scipy.

Setup: p2.xlarge.
PR: #61
Compiled with:

USE_OPENMP=1, USE_CUDA=0, DEV=0, DEBUG=0, USE_MKL2017=0, USE_MKL2017_EXPERIMENTAL=0, USE_BLAS=openblas
device		operation		parallelization
cpu (4 threads)	dot(csr, dns)=dns	row blocks
cpu (4 threads)	dot(csr.T(), dns)=dns	row blocks
A = sparse NDArray of shape(m, k)
B = dense NDArray of shape(k, n)
dot_forward	dot(csr, dns)=dns
density(%)	context	n	m	k	t_sparse	t_dense	t_sparse/t_dense	t_scipy_sparse	t_scipy_dense	t_scipy_sparse/t_scipy_dense
5.0		cpu(0)	50	512	50000	0.002691	0.00788	0.34			0.048267	0.06281		0.77
2.0		cpu(0)	50	512	50000	0.001649	0.00793	0.21			0.023380	0.06344		0.37
1.0		cpu(0)	50	512	50000	0.000980	0.00771	0.13			0.013439	0.06336		0.21
0.5		cpu(0)	50	512	50000	0.001260	0.00767	0.16			0.007220	0.06292		0.11
0.1		cpu(0)	50	512	50000	0.000054	0.00770	0.01			0.001582	0.06770		0.02

5.0		cpu(0)	100	512	100000	0.009179	0.01946	0.47			0.261017	0.14550		1.79
2.0		cpu(0)	100	512	100000	0.004635	0.01940	0.24			0.116671	0.14529		0.80
1.0		cpu(0)	100	512	100000	0.002479	0.01955	0.13			0.047389	0.14522		0.33
0.5		cpu(0)	100	512	100000	0.001623	0.01961	0.08			0.025809	0.14595		0.18
0.1		cpu(0)	100	512	100000	0.000188	0.01947	0.01			0.005126	0.15187		0.03

dot_backward	dot(csr.T, dns)=dns
density(%)	context	n	m	k	t_sparse	t_dense	t_sparse/t_dense	t_scipy_sparse	t_scipy_dense	t_scipy_sparse/t_scipy_dense
5.0		cpu(0)	50	512	50000	0.029762	0.10533	0.28			0.047543	0.09001		0.53
2.0		cpu(0)	50	512	50000	0.017676	0.10564	0.17			0.021487	0.08963		0.24
1.0		cpu(0)	50	512	50000	0.013754	0.10349	0.13			0.013814	0.08873		0.16
0.5		cpu(0)	50	512	50000	0.011742	0.10394	0.11			0.007770	0.08869		0.09
0.1		cpu(0)	50	512	50000	0.004027	0.10331	0.04			0.002200	0.09080		0.02

5.0		cpu(0)	100	512	100000	0.102324	0.26337	0.39			0.220872	0.21498		1.03
2.0		cpu(0)	100	512	100000	0.046840	0.26338	0.18			0.116241	0.21422		0.54
1.0		cpu(0)	100	512	100000	0.028831	0.26262	0.11			0.060634	0.21256		0.29
0.5		cpu(0)	100	512	100000	0.022656	0.26276	0.09			0.029284	0.21312		0.14
0.1		cpu(0)	100	512	100000	0.014599	0.26468	0.06			0.012112	0.21223		0.06
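The kernel in the PR is C++/OpenMP, but the "row blocks" scheme listed above can be sketched in Python. The function name and block partitioning below are illustrative, not the actual kernel: output rows (i.e. columns of A) are split into contiguous blocks, each block scans the whole CSR and accumulates only the nonzeros whose column index falls inside it, so no two blocks ever write the same output row and they could run in parallel.

```python
import numpy as np
import scipy.sparse as sp

def dot_csrT_dns_row_blocks(a_csr, b, num_blocks=4):
    """Compute dot(A.T, B) for CSR A without materializing A.T.

    (A.T B)[j, :] = sum_i A[i, j] * B[i, :], so each nonzero A[i, j]
    scatters a scaled copy of row B[i] into output row j.
    """
    m, k = a_csr.shape
    n = b.shape[1]
    out = np.zeros((k, n))
    bounds = np.linspace(0, k, num_blocks + 1).astype(int)
    for lo, hi in zip(bounds[:-1], bounds[1:]):   # one block of output rows
        for i in range(m):                        # scan every CSR row
            start, end = a_csr.indptr[i], a_csr.indptr[i + 1]
            for p in range(start, end):
                j = a_csr.indices[p]
                if lo <= j < hi:                  # only this block's rows
                    out[j] += a_csr.data[p] * b[i]
    return out
```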

@eric-haibin-lin (Owner, Author)

@reminisce could you also document the setup so that it's reproducible next time (e.g. hardware, commit id, etc.)?

@jiajiechen commented Jun 7, 2017

Setup: c3.8xlarge. (32 threads)
Compiled with

USE_OPENMP=1, USE_CUDA=0, DEV=0, DEBUG=0, USE_MKL2017=0, USE_MKL2017_EXPERIMENTAL=0, USE_BLAS=openblas

Note: the dense dot operator is executed in a single thread.

Synthetic data (uniform distribution)

A = CSR NDArray of shape(m, k)
B = dense NDArray of shape(k, n)
dot_forward	dot(csr, dns)
density(%)	context	n	m	k	t_dense/t_sparse	t_dense	t_sparse	t_scipy_dense/t_scipy_sparse	t_scipy_dense	t_scipy_sparse
100.0		cpu(0)	64	512	50000	5.36			0.02	0.00390		0.15				0.106657	0.73276
90.0		cpu(0)	64	512	50000	4.41			0.02	0.00475		0.16				0.107326	0.66499
70.0		cpu(0)	64	512	50000	6.93			0.02	0.00301		0.20				0.106693	0.52502
50.0		cpu(0)	64	512	50000	8.34			0.02	0.00251		0.27				0.106659	0.38793
30.0		cpu(0)	64	512	50000	11.11			0.02	0.00189		0.43				0.108297	0.25304
20.0		cpu(0)	64	512	50000	10.68			0.02	0.00196		0.59				0.106615	0.17954
10.0		cpu(0)	64	512	50000	20.32			0.02	0.00102		1.12				0.106978	0.09549
7.0		cpu(0)	64	512	50000	22.92			0.02	0.00091		1.55				0.107462	0.06930
5.0		cpu(0)	64	512	50000	26.59			0.02	0.00078		2.18				0.106927	0.04912
2.0		cpu(0)	64	512	50000	69.43			0.02	0.00030		5.14				0.107367	0.02089
1.0		cpu(0)	64	512	50000	104.20			0.02	0.00020		10.58				0.107327	0.01014
0.5		cpu(0)	64	512	50000	154.40			0.02	0.00013		20.27				0.107358	0.00530
0.1		cpu(0)	64	512	50000	243.30			0.02	0.00009		91.78				0.107332	0.00117
100.0		cpu(0)	128	512	100000	5.47			0.08	0.01440		0.03				0.106269	3.15615
90.0		cpu(0)	128	512	100000	5.91			0.08	0.01340		0.03				0.102373	2.94345
70.0		cpu(0)	128	512	100000	6.20			0.08	0.01232		0.03				0.081981	2.64635
50.0		cpu(0)	128	512	100000	7.62			0.08	0.01036		0.05				0.109099	2.32388
30.0		cpu(0)	128	512	100000	10.65			0.08	0.00710		0.07				0.106034	1.59388
20.0		cpu(0)	128	512	100000	10.50			0.08	0.00732		0.05				0.075211	1.64093
10.0		cpu(0)	128	512	100000	14.99			0.08	0.00525		0.12				0.073087	0.61370
7.0		cpu(0)	128	512	100000	18.44			0.08	0.00427		0.17				0.110425	0.63676
5.0		cpu(0)	128	512	100000	17.47			0.08	0.00451		0.22				0.079013	0.36717
2.0		cpu(0)	128	512	100000	42.68			0.08	0.00185		0.50				0.081154	0.16210
1.0		cpu(0)	128	512	100000	85.00			0.08	0.00093		1.00				0.086644	0.08625
0.5		cpu(0)	128	512	100000	112.83			0.08	0.00068		2.23				0.124563	0.05598
0.1		cpu(0)	128	512	100000	325.71			0.08	0.00024		10.76				0.119254	0.01108

dot_backward	dot(csr.T, dns)
density(%)	context	n	m	k	t_dense/t_sparse	t_dense	t_sparse	t_scipy_dense/t_scipy_sparse	t_scipy_dense	t_scipy_sparse
100.0		cpu(0)	64	512	50000	3.22			0.22	0.06767		0.17				0.122950	0.73488
90.0		cpu(0)	64	512	50000	2.47			0.21	0.08632		0.18				0.121640	0.66491
70.0		cpu(0)	64	512	50000	3.04			0.21	0.06970		0.23				0.122812	0.53754
50.0		cpu(0)	64	512	50000	3.52			0.21	0.06018		0.31				0.123057	0.40275
30.0		cpu(0)	64	512	50000	6.13			0.22	0.03615		0.53				0.133585	0.25196
20.0		cpu(0)	64	512	50000	7.06			0.22	0.03125		0.75				0.133698	0.17817
10.0		cpu(0)	64	512	50000	11.12			0.22	0.01954		1.27				0.124703	0.09806
7.0		cpu(0)	64	512	50000	13.65			0.20	0.01484		1.78				0.122862	0.06911
5.0		cpu(0)	64	512	50000	17.58			0.21	0.01208		2.48				0.122755	0.04956
2.0		cpu(0)	64	512	50000	36.78			0.21	0.00576		6.09				0.122539	0.02013
1.0		cpu(0)	64	512	50000	50.39			0.21	0.00421		11.61				0.123022	0.01060
0.5		cpu(0)	64	512	50000	59.36			0.20	0.00341		22.12				0.122475	0.00554
0.1		cpu(0)	64	512	50000	79.61			0.21	0.00266		66.76				0.123492	0.00185
100.0		cpu(0)	128	512	100000	2.20			0.81	0.36856		0.04				0.133066	3.36961
90.0		cpu(0)	128	512	100000	2.46			0.80	0.32468		0.04				0.136473	3.21853
70.0		cpu(0)	128	512	100000	3.56			0.81	0.22623		0.04				0.133270	3.09333
50.0		cpu(0)	128	512	100000	4.47			0.77	0.17326		0.05				0.142082	2.71753
30.0		cpu(0)	128	512	100000	7.29			0.81	0.11053		0.07				0.140831	1.91718
20.0		cpu(0)	128	512	100000	10.75			0.81	0.07519		0.10				0.130642	1.32151
10.0		cpu(0)	128	512	100000	19.21			0.80	0.04152		0.19				0.126030	0.67883
7.0		cpu(0)	128	512	100000	31.04			0.81	0.02600		0.24				0.122869	0.50638
5.0		cpu(0)	128	512	100000	40.17			0.77	0.01912		0.37				0.138255	0.37648
2.0		cpu(0)	128	512	100000	68.28			0.80	0.01172		0.83				0.136068	0.16385
1.0		cpu(0)	128	512	100000	64.97			0.80	0.01230		1.46				0.129803	0.08876
0.5		cpu(0)	128	512	100000	129.39			0.81	0.00624		2.55				0.133773	0.05242
0.1		cpu(0)	128	512	100000	118.27			0.80	0.00675		7.96				0.127005	0.01595

avazu-app Data

density(%)	n	m	k	t_dense/t_sparse	t_dense	t_sparse
0.0015		64	500	1000000	4493.29			4.0891	0.000910
0.0015		128	500	1000000	6860.97			7.6305	0.001112

kdda data

density(%)	n	m	k		t_dense/t_sparse	t_dense	t_sparse
0.018663	64	200	20216830	31760.71		4.2127	0.000133
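For clarity, density(%) in all of these tables is nnz over the full matrix size. A small hypothetical helper (not part of the benchmark script):

```python
import scipy.sparse as sp

def density_percent(a):
    """density(%) as reported in the tables: nnz / (rows * cols) * 100."""
    m, k = a.shape
    return a.nnz / (m * k) * 100.0
```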

@reminisce (Collaborator)

Since we are doing more and more benchmarks, shall we start using pandas DataFrames to store the benchmark results from now on? They make the results easier to display and provide convenient utility functions for statistical analysis.
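Something like the following (column names are invented here to mirror the table headers above):

```python
import pandas as pd

# Hypothetical rows mirroring the table layout; a real run would
# append one dict per benchmark configuration.
rows = [
    {"density_pct": 5.0, "n": 50, "m": 512, "k": 50000,
     "t_sparse": 0.002691, "t_dense": 0.00788},
    {"density_pct": 1.0, "n": 50, "m": 512, "k": 50000,
     "t_sparse": 0.000980, "t_dense": 0.00771},
]
df = pd.DataFrame(rows)
# Derived ratio column, same as the t_sparse/t_dense column in the tables.
df["t_sparse/t_dense"] = df["t_sparse"] / df["t_dense"]
print(df.to_string(index=False))
```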

@eric-haibin-lin (Owner, Author)

GPU dot operator result:
apache#6937

dot(csr, row_sparse) operator result:
apache#6902

eric-haibin-lin pushed a commit that referenced this issue Apr 4, 2018