[MXNET-263] Support for dot(dns, csr) = dns and dot(dns, csr.T) = dns on GPU #10371
Conversation
dot(dns, csr) outputs a csr ndarray when the cpu context is used, and a dns ndarray when the gpu context is used? This should at least be well documented somewhere.
@anirudh2290 Corresponding documentation will be added once the implementation and tests are complete.
src/operator/tensor/dot-inl.h
Outdated
@@ -235,13 +235,21 @@ inline bool DotForwardInferStorageType(const nnvm::NodeAttrs& attrs,
                                       DispatchMode::kFComputeEx);
  }
  if (!dispatched && lhs_stype == kDefaultStorage && rhs_stype == kCSRStorage &&
      !param.transpose_a && !param.transpose_b) {
I thought we were going to add an stype hint argument which defaults to None?
Working on that now.
src/operator/tensor/dot.cu
Outdated
namespace mxnet {
namespace op {

template<typename gpu>
It looks like the gpu kernels for dot are in dot-inl.cuh. Let's move the implementation there.
Done
src/operator/tensor/dot.cu
Outdated
MSHADOW_SGL_DBL_TYPE_SWITCH(csr_data.type_flag_, DType, {     // data type
  MSHADOW_IDX_TYPE_SWITCH(csr_indices.type_flag_, IType, {    // indices type
    MSHADOW_IDX_TYPE_SWITCH(csr_indptr.type_flag_, CType, {   // indptr type
      /* Allocate workspace */
Remove this comment?
Done
src/operator/tensor/dot-inl.cuh
Outdated
for (CType col_id = rhs_indptr[k]; col_id < rhs_indptr[k + 1]; ++col_id) {
  sum += lhs_data[i * lhs_num_cols + rhs_indices[col_id]] * rhs_data[col_id];
}
out[i * out_num_cols + k] = sum;
use tid directly?
Good catch, done.
Please fix the test.
src/operator/tensor/dot.cc
Outdated
- dot(csr, default) = default
- dot(csr.T, default) = row_sparse
- dot(csr, row_sparse) = default
- dot(default, csr) = csr
- dot(default, csr) = csr on CPU only
- dot(default, csr) = dense on GPU only
Could you show the output storage for specific values of forward_stype_hint? e.g.
dot(csr, dense, trx_a=True) = row_sparse
dot(csr, dense, forward_stype_hint='default', tx_a=True) = default
dot(default, csr, forward_stype_hint='default') = default (GPU only)
What happens if someone uses dot(dense, dense, forward_stype_hint='csr')?
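For illustration, a rough sketch of how those combinations could look from the Python API; the spelling forward_stype_hint follows this revision of the patch, and the commented output types reflect the behavior discussed in this thread rather than verified output:

import mxnet as mx

dns = mx.nd.random.uniform(shape=(3, 4))
csr = mx.nd.random.uniform(shape=(4, 5)).tostype('csr')

# No hint: the inferred output storage depends on the backend
# (csr on CPU, dense on GPU for this operand combination).
out = mx.nd.dot(dns, csr)

# With a hint, the caller pins the forward output storage explicitly.
out_dense = mx.nd.dot(dns, csr, forward_stype_hint='default')

# Per the discussion below, a hint with no matching implementation
# (e.g. dot(dense, dense, forward_stype_hint='csr')) is expected to
# fall back to the dense operator.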
src/operator/tensor/dot-inl.h
Outdated
DMLC_DECLARE_PARAMETER(DotParam) {
  DMLC_DECLARE_FIELD(transpose_a)
    .describe("If true then transpose the first input before dot.")
    .set_default(false);
  DMLC_DECLARE_FIELD(transpose_b)
    .describe("If true then transpose the second input before dot.")
    .set_default(false);
  DMLC_DECLARE_FIELD(forward_stype_hint)
    .describe("Desired storage type of the forward output.")
Need to be more detailed than this. Explicitly state that if no such combination is implemented, dense op is used.
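As a sketch only, assuming the field is a dmlc::optional enum as elsewhere in MXNet (not necessarily the wording that was merged), the expanded description could read:

// Sketch of a more detailed parameter doc; exact default/type are assumptions.
DMLC_DECLARE_FIELD(forward_stype_hint)
  .describe("Desired storage type of the forward output. "
            "If the combination of input storage types and this hint is not "
            "implemented, the dense operator is used and the output falls "
            "back to default storage.")
  .set_default(dmlc::optional<int>());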
src/operator/tensor/dot-inl.h
Outdated
if (!dispatched && lhs_stype == kDefaultStorage &&
    rhs_stype == kDefaultStorage) {
  // dns, dns -> dns
  dispatched = storage_type_assign(&out_stype, kDefaultStorage, dispatch_mode,
                                   DispatchMode::kFCompute);
  target_stype = (param.forward_stype_hint.has_value()) ?
nit: This line is quite long. Maybe cache the value of param.forward_stype_hint.has_value()?
Done
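A possible shape of that refactor (the local variable name and the exact right-hand side are illustrative only):

// Cache the optional-hint check once to keep the line short.
const bool has_stype_hint = param.forward_stype_hint.has_value();
target_stype = has_stype_hint
  ? static_cast<NDArrayStorageType>(param.forward_stype_hint.value())
  : kUndefinedStorage;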
src/operator/tensor/dot-inl.cuh
Outdated
DType* csc_data,
unsigned long long* csc_indices,
unsigned long long* csc_indptr,
unsigned long long* workspace,
Rename "workspace" to "col_counters"? Better document what this space is used for:
e.g. used to count the offset of column indices atomically
Done
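For context, a minimal CUDA-style sketch of the counter-based scatter this comment refers to; the kernel name and parameter list are illustrative, not the patch's actual code:

// Illustrative only: scatter CSR entries into CSC storage. col_counters
// holds one atomic counter per output column; each thread claims the next
// free slot within the destination column via atomicAdd.
template <typename DType, typename IType, typename CType>
__global__ void CscScatterKernel(const DType* csr_data,
                                 const IType* csr_indices,
                                 const CType* csr_indptr,
                                 DType* csc_data,
                                 unsigned long long* csc_indices,
                                 const unsigned long long* csc_indptr,
                                 unsigned long long* col_counters,
                                 const int num_rows) {
  const int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= num_rows) return;
  for (CType i = csr_indptr[tid]; i < csr_indptr[tid + 1]; ++i) {
    const IType target_col = csr_indices[i];
    // Atomically reserve the next free slot within this column.
    const unsigned long long offset = atomicAdd(&col_counters[target_col], 1ULL);
    const unsigned long long dst = csc_indptr[target_col] + offset;
    csc_data[dst] = csr_data[i];
    csc_indices[dst] = static_cast<unsigned long long>(tid);  // row index in CSC
  }
}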
src/operator/tensor/dot-inl.cuh
Outdated
if (tid < num_rows) {
  for (CType i = csr_indptr[tid]; i < csr_indptr[tid + 1]; ++i) {
    // target column
    IType target_col = csr_indices[i];
nit: use const wherever it applies
Done
src/operator/tensor/dot-inl.cuh
Outdated
// if dot(dense, csr) = dns, transform to csc first
if (!transpose_b) {
  // LOG(FATAL) << "dot(dns, csr) = dns not implemented yet";
Remove unused code please
Done
src/operator/tensor/dot-inl.cuh
Outdated
MSHADOW_IDX_TYPE_SWITCH(csr_indices.type_flag_, IType, {
  MSHADOW_IDX_TYPE_SWITCH(csr_indptr.type_flag_, CType, {
    DType* csc_data_ptr = NULL;
    unsigned long long* csc_indices_ptr = NULL;
maybe typedef to "AtomicIType"?
Okay, done
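A sketch of the suggested alias (the concrete name is whatever the patch settles on):

// Index type wide enough to be used with atomicAdd on the GPU.
typedef unsigned long long AtomicIType;
AtomicIType* csc_indices_ptr = NULL;
AtomicIType* csc_indptr_ptr = NULL;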
src/operator/tensor/dot-inl.cuh
Outdated
unsigned long long* csc_indices_ptr = NULL;
unsigned long long* csc_indptr_ptr = NULL;
unsigned long long* col_counters = NULL;
size_t ull_mem_size = sizeof(unsigned long long);
What does ull mean?
Oh, never mind: unsigned long long. Not obvious, though.
src/operator/tensor/dot-inl.cuh
Outdated
temp_storage_bytes += (ull_mem_size - (temp_storage_bytes % ull_mem_size));
Tensor<gpu, 1, char> workspace =
    ctx.requested[0].get_space_typed<gpu, 1, char>(
        Shape1(nnz*sizeof(DType) + nnz*ull_mem_size +
Can we calculate nnz*ull_mem_size once and assign it to a meaningful local variable? That improves readability.
Done
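For example (variable name illustrative):

// Hoist the repeated product into a named local.
const size_t csc_indices_bytes = nnz * ull_mem_size;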
src/operator/tensor/dot-inl.cuh
Outdated
                                Stream<gpu>::GetStream(s));
// Reset values for col_counter, ready for the final transform
mxnet_op::Kernel<mxnet_op::set_zero, gpu>::Launch(
    s, csr_cols + 1, col_counters);
Can we rename "csr_cols" to "num_csr_cols"? From its name it is not obvious whether it is a count or a pointer to data, unless you read backward through the code.
Done
src/operator/tensor/dot-inl.cuh
Outdated
MSHADOW_SGL_DBL_TYPE_SWITCH(csr_data.type_flag_, DType, {     // data type
  MSHADOW_IDX_TYPE_SWITCH(csr_indices.type_flag_, IType, {    // indices type
    MSHADOW_IDX_TYPE_SWITCH(csr_indptr.type_flag_, CType, {   // indptr type
      CType out_num_rows = ret->shape()[0];
Looks like these two lines are duplicates of line 1044 and can be moved outside of the if-else?
Sure, done
src/operator/tensor/dot-inl.cuh
Outdated
for (CType i = csr_indptr[tid]; i < csr_indptr[tid + 1]; ++i) {
  // target column
  const IType target_col = csr_indices[i];
  const int target_offset = atomicAdd(&col_counters[target_col], 1);
I don't think this provides a deterministic result. The order of accumulation could differ across runs.
No, this only affects the order of data within the column; when we do the final multiplication-accumulation the result should still be the same.
Well... maybe you are right about this, but then I guess we'll have to add a sort after this step to ensure the order.
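A tiny illustration of why the accumulation order matters at all: floating-point addition is not associative, so the same set of values can sum to slightly different results depending on order. The numbers below are just a contrived example.

import numpy as np

vals = [np.float32(1e8), np.float32(1.0), np.float32(-1e8), np.float32(1.0)]
print(sum(vals, np.float32(0.0)))            # left-to-right order  -> 1.0
print(sum(reversed(vals), np.float32(0.0)))  # reversed order       -> 0.0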
test_infer_forward_stype(lhs_shape, (lhs_shape[0], rnd.randint(10, 20)),
                         lhs_d, rhs_d, True, False)
test_infer_forward_stype(lhs_shape, (rnd.randint(10, 20), lhs_shape[0]),
                         lhs_d, rhs_d, True, True)
We should think of a test to check whether the result is deterministic.
Sure
src/operator/tensor/dot-inl.cuh
Outdated
/*!
 * \brief GPU Kernel of generation of transposed csr matrix
 * \param tid global thread id
nit: remove lines 472 and 473
Done
src/operator/tensor/dot.cc
Outdated
- dot(default, csr) = csr
- otherwise, ``dot`` generates output with default storage
- dot(default, csr) = csr on CPU only
- dot(default, csr) = dense on GPU only
Is it better to write "dot(default, csr, forward_stype_hint='default')"?
Also dot(default, csr, transpose_b=True, forward_stype_hint='default') ?
Also, just pick one of "default" or "dense"; mixing them on the same line is confusing.
Done
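One possible consolidated wording for those doc lines (a sketch using the hint spelling discussed above, not necessarily what was merged):

- dot(default, csr) = csr (CPU only)
- dot(default, csr, forward_stype_hint='default') = default (GPU only)
- dot(default, csr, transpose_b=True, forward_stype_hint='default') = default (GPU only)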
rhs = rhs_nd.tostype(rhs_stype)
out = mx.nd.dot(lhs, rhs, forward_stype_hint=forward_stype,
                transpose_a=trans_a, transpose_b=trans_b)
assert_almost_equal(out.tostype('default').asnumpy(), out_np, rtol=1e-4, atol=1e-5)
out.asnumpy() should be fine
                   forward_stype_hint=forward_stype,
                   transpose_a=trans_a, transpose_b=trans_b)
location = {'lhs': lhs, 'rhs': rhs}
check_symbolic_forward(out, location, [out_np], rtol=1e-3, atol=1e-4)
Also check_symbolic_backward?
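A hedged sketch of what the additional backward check could look like; lhs_grad_np and rhs_grad_np are hypothetical names for numpy-computed reference gradients:

# Feed an all-ones output gradient and compare against numpy references.
check_symbolic_backward(out, location, [np.ones(out_np.shape)],
                        [lhs_grad_np, rhs_grad_np],
                        rtol=1e-3, atol=1e-4)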
@with_seed()
def test_sparse_dot_determinism():
    def test_dot_determinism(lhs_stype, rhs_stype, lhs_density, rhs_density, transpose_a, transpose_b):
        lhs_row = rnd.randint(200, 400)
an int between 200 and 400 seems too large for a unit test. Maybe just choose between 50 and 100?
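A minimal sketch of the determinism check itself: run the same sparse dot twice and require bit-identical output (helper name illustrative, not the test that was merged):

def check_dot_determinism(lhs, rhs, **kwargs):
    res1 = mx.nd.dot(lhs, rhs, **kwargs).asnumpy()
    res2 = mx.nd.dot(lhs, rhs, **kwargs).asnumpy()
    # zero tolerances: any run-to-run difference fails the test
    assert_almost_equal(res1, res2, rtol=0.0, atol=0.0)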
src/operator/tensor/dot.cu
Outdated
@@ -23,6 +23,7 @@
 */

#include "./dot-inl.h"
#include <cub/cub.cuh>
Is this needed here?
src/operator/tensor/dot-inl.cuh
Outdated
MSHADOW_SGL_DBL_TYPE_SWITCH(csr_data.type_flag_, DType, {     // data type
  MSHADOW_IDX_TYPE_SWITCH(csr_indices.type_flag_, IType, {    // indices type
    MSHADOW_IDX_TYPE_SWITCH(csr_indptr.type_flag_, CType, {   // indptr type
      const CType out_num_rows = ret->shape()[0];
Shape is always dim_t
Done
src/operator/tensor/dot-inl.cuh
Outdated
int num_bits = 1;
unsigned int a = num_csr_cols - 1;
while (a >>= 1) num_bits++;
nit: can we reuse the ilog2 function? The variable "a" is quite unreadable.
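For reference, the loop just computes how many bits are needed to radix-sort values in [0, num_csr_cols); a small named helper along these lines would read better (sketch only; mshadow/mxnet may already provide an equivalent ilog2 utility):

// Number of bits needed to represent max_value.
inline int BitsForRange(unsigned int max_value) {
  int bits = 1;
  while (max_value >>= 1) ++bits;
  return bits;
}
// usage: const int num_bits = BitsForRange(num_csr_cols - 1);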
Could you also paste the benchmark script & result later? Thx
                   forward_stype=forward_stype,
                   transpose_a=trans_a, transpose_b=trans_b)
location = {'lhs': lhs, 'rhs': rhs}
check_symbolic_forward(out, location, [out_np], rtol=1e-3, atol=1e-4)
can we test check_symbolic_backward for dot(dns, csr)?
Done
  dispatched = dispatch_fallback(out_attrs, dispatch_mode);
  target_stype = (target_stype == kUndefinedStorage) ? kDefaultStorage : target_stype;
  dispatched = storage_type_assign(&out_stype, target_stype, dispatch_mode,
                                   DispatchMode::kFComputeFallback);
}
I think you should also update InferStorageType for backward dot
Done
src/operator/tensor/dot-inl.cuh
Outdated
 * \brief GPU Impl of dot(dns, csr) = csr
 */
template<typename gpu>
inline void DotDnsCsrCsrImpl(const OpContext& ctx,
template<>
inline void DotDnsCsrCsrImpl<gpu>(const OpContext& ctx,...)
Done
src/operator/tensor/dot.cc
Outdated
- dot(default, csr) = csr
- otherwise, ``dot`` generates output with default storage
- dot(default, csr) = csr on CPU only
- dot(default, csr) = default on GPU only
pls update doc
Done
src/operator/tensor/dot-inl.cuh
Outdated
size_t csc_indptr_bytes = (num_csr_cols + 1) * sizeof(CType);
size_t csc_data_bytes = nnz * sizeof(DType);
size_t scan_temp_storage_bytes = 0;
size_t temp_storage_bytes = SortByKeyWorkspaceSize<dim_t, dim_t, gpu>(nnz);
nit: shouldn't it be SortByKeyWorkspaceSize&lt;IType, IType, gpu&gt;(nnz)? We might add 32-bit int for the idx dtype in the future to speed up / reduce memory.
Done
Benchmark results: [density of rhs csr(%)] [speedup rate]
benchmark script:

import mxnet as mx
import numpy as np
import scipy.sparse
from mxnet.test_utils import assert_almost_equal
import time
def measure_cost(repeat, f, *args, **kwargs):
    # start bench
    start = time.time()
    results = []
    for i in range(repeat):
        results.append(f(*args, **kwargs))
    for result in results:
        result.wait_to_read()
    end = time.time()
    diff = end - start
    return diff / repeat

def measure_fallback(repeat, a):
    # start bench
    start = time.time()
    results = []
    for i in range(repeat):
        results.append(a.tostype('default'))
    for result in results:
        result.wait_to_read()
    end = time.time()
    diff = end - start
    return diff / repeat

def main():
    shape_lhs = (256, 30000)
    shape_rhs = (30000, 30000)
    dns = np.random.uniform(size=shape_lhs)
    mx_dns = mx.nd.array(dns, ctx=mx.gpu())
    mx_dns_cpy = mx_dns.copy()
    for density in [0.01, 0.005, 0.001, 0.0005, 0.0001]:
        csr = scipy.sparse.random(shape_rhs[0], shape_rhs[1], density=density, format='csr', dtype=np.float32)
        mx_csr = mx.nd.sparse.csr_matrix((csr.data, csr.indices, csr.indptr), shape=shape_rhs, ctx=mx.gpu())
        mx_csr_dns = mx_csr.tostype('default')
        sparse_cost = 0.0
        dns_cost = 0.0
        mx.nd.waitall()
        # warmup
        check = mx.nd.dot(mx_dns, mx_csr, forward_stype='default')
        check_np = np.dot(dns, mx_csr_dns.asnumpy())
        assert_almost_equal(check.asnumpy(), check_np, atol=1e-5, rtol=1e-4)
        print(check.shape)
        mx.nd.waitall()
        for i in range(50):
            sparse_cost += measure_cost(1, mx.nd.dot, mx_dns, mx_csr, forward_stype='default')
            dns_cost += measure_fallback(1, mx_csr)
            dns_cost += measure_cost(1, mx.nd.dot, mx_dns, mx_csr_dns)
        print("%.2f %%" % (density*100), dns_cost / sparse_cost)
        sparse_cost = 0.0
        dns_cost = 0.0
        check = mx.nd.dot(mx_dns, mx_csr, transpose_b=True, forward_stype='default')
        check_np = np.dot(dns, mx_csr_dns.asnumpy().T)
        assert_almost_equal(check.asnumpy(), check_np, atol=1e-5, rtol=1e-4)
        print(check.shape)
        mx.nd.waitall()
        for i in range(50):
            sparse_cost += measure_cost(1, mx.nd.dot, mx_dns, mx_csr, transpose_b=True, forward_stype='default')
            dns_cost += measure_fallback(1, mx_csr)
            dns_cost += measure_cost(1, mx.nd.dot, mx_dns, mx_csr_dns, transpose_b=True)
        print("%.2f %%" % (density*100), dns_cost / sparse_cost)

if __name__ == "__main__":
    main()
… on GPU (apache#10371)

* add support for dot(dns, csr) = dns and dot(dns, csr.T) = dns on GPU
* add unit test for new op and forward_stype_hint parameter to dot
* update documentation for dot
* address code reviews
* fix flaky test_gluon:test_lambda through loosening the atol
* switch dot(dns, csr) case to a deterministic algorithm with unit test for determinism
* address code reviews and add backward
Description
Support dot(dns, csr) = dns and dot(dns, csr.T) = dns.
Comments
The storage type hint gives the user a way to specify the desired output storage, so that we can avoid surprises when they hit an unsupported case. A test with full coverage of all possible input cases is added.
The algorithm for dot(dense, csr) = dense will be changed to a deterministic version, and a determinism test will be added as well.
Profiling results for dot(dense, csr.T), time in ns:

| CSR density | Initialization Phase | Transpose Phase | Multiplication-Addition Phase |
|-------------|----------------------|-----------------|-------------------------------|
| 1.00%       | 2877416              | 93137180        | 4862214884                    |
| 0.10%       | 24891                | 7864861         | 560496940                     |
| 0.01%       | 15276                | 652855          | 58704271                      |
Most of the time is spent in the Multiplication-Addition Phase; we'll be working on improving the computation kernel soon. @eric-haibin-lin
Tested with the new warp kernel but saw no fundamental improvement; the kernel code: