Fix non-determinism of dot(csr.T, dns) = dns with tests #11825
Conversation
@eric-haibin-lin Please give a review when you have time, thanks!
@@ -573,12 +581,113 @@ inline void DotCsrDnsDnsImpl(const OpContext& ctx,
data_out.dptr<DType>(), data_l.dptr<DType>(), indptr_l.dptr<IType>(),
Since DotCsrTransDnsDnsWarpKernel, DotCsrTransDnsDnsThreadBlockKernel, and DotCsrTransDnsDnsWarpBlockKernel are nondeterministic and cannot be used, we might as well just remove them.
Sure I can do that.
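For background on why those kernels are non-deterministic: they accumulate partial products with parallel atomic float additions, whose ordering varies from run to run, and floating-point addition is not associative. A minimal Python illustration of this order dependence (the values below are chosen purely for illustration):

```python
# Floating-point addition is not associative: summing the same values
# in a different order (as concurrent atomicAdd updates effectively do)
# can change the result bit-for-bit.
vals = [1e16, 1.0, -1e16, 1.0]

left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]  # 1e16 absorbs the 1.0
reordered = ((vals[0] + vals[2]) + vals[1]) + vals[3]      # cancellation happens first

print(left_to_right)  # 1.0
print(reordered)      # 2.0
```

On a GPU the accumulation order depends on thread scheduling, so the same kernel launch can return slightly different floats on each run.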
Done.
Force-pushed from f63c106 to 498b150.
src/operator/tensor/dot-inl.cuh
Outdated
Tensor<gpu, 1, char> temp_storage(temp_storage_ptr, Shape1(temp_storage_bytes), s);
int num_bits = log2i(num_csr_cols - 1);
SortByKey(csc_cols, original_idx, true, &temp_storage, 0, num_bits);
Q: should the argument be num_bits - 1, based on the SortByKey function signature?
Pay attention to the num_csr_cols - 1 above.
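To unpack that reply: the sort keys are column indices in the range [0, num_csr_cols - 1], so the number of radix-sort bits is derived from the largest possible key, num_csr_cols - 1, not from num_csr_cols itself. A Python sketch of the helper's behavior (mirroring the C++ loop `int k = 1; while (a >>= 1) ++k;` quoted later in this thread):

```python
def log2i(a):
    # Number of bits needed to represent `a`, i.e. floor(log2(a)) + 1;
    # mirrors the C++ helper discussed in this thread.
    k = 1
    a >>= 1
    while a:
        k += 1
        a >>= 1
    return k

# With num_csr_cols = 128 the keys lie in [0, 127], which fits in 7 bits:
print(log2i(128 - 1))  # 7
```

Passing num_csr_cols itself would request one extra bit (log2i(128) is 8), so the `- 1` is deliberate.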
src/operator/tensor/dot-inl.cuh
Outdated
@@ -35,6 +35,13 @@
namespace mxnet {
namespace op {

// Returns integer log2(a) rounded up
inline int log2i(size_t a) {
I feel we should use a common utility function, with a more efficient implementation, for this type of computation.
This is only used in this file; plus, pay attention to the mismatch between the types of the input and output.
Although this is only used in this file for now, this function is quite generic and very likely to be used by other developers in the future. Making it a generic utility will allow us to change its implementation later without breaking existing code. Its output can be cast as needed; I don't see that as a reason to keep a separate function.
IMO a generic utility should be a templated one to support all data types instead of something like this.
I did a simple search and found that a log2i function (or one with a slightly different name) appears at least three times in this module. See: indexing_op.h, pooled_storage_manager.h and intrinsics.cuh. It would be nice to consolidate them.
For normal usages where the inputs and outputs have the same type, there are std implementations of this. On the other hand, this is already very optimized code for this functionality.
That one operates on unsigned int, and I'm operating on size_t here. Actually you can see that the code is the same; since there are no other usages, putting each copy in the file that uses it makes more sense.
Taking an integer log2 does sound like a common utility. We always need to calculate this before calling SortByKey. Can we have log2i(size_t) and log2i(unsigned int) in common/utils.h and replace the other occurrences?
src/operator/tensor/dot-inl.cuh
Outdated
// Returns integer log2(a) rounded up
inline int log2i(size_t a) {
  int k = 1;
  while (a >>= 1) k++;
nit: use ++k instead
Done.
Force-pushed from 498b150 to 1e91db6.
What's the performance impact of this change?
Force-pushed from 1e91db6 to 437da0e.
@@ -663,6 +663,18 @@ constexpr size_t MaxIntegerValue<mshadow::half::half_t>() {
  return size_t(2) << 10;
}

MSHADOW_XINLINE int ilog2ul(size_t a) {
Looks good. Please also refactor the other places to use the same utility, as we discussed.
Done.
src/operator/tensor/dot-inl.cuh
Outdated
@@ -510,6 +398,7 @@ inline void DotCsrDnsDnsImpl(const OpContext& ctx,
  return;
}

using namespace mshadow;
using mshadow::cuda::kBaseThreadNum;
You may not need the mshadow directive since you added the line above.
Not used now; deleted.
Force-pushed from 437da0e to 9baa493.
LGTM, and thanks for the refactoring.
Force-pushed from 9baa493 to e7f8ed7.
Benchmark script:

```python
import mxnet as mx
import sys
import os
import scipy
import numpy as np
from mxnet.test_utils import assert_almost_equal
import time


def measure_cost(repeat, f, *args, **kwargs):
    # start bench
    start = time.time()
    results = []
    for i in range(repeat):
        results.append(f(*args, **kwargs))
    for result in results:
        result.wait_to_read()
    end = time.time()
    diff = end - start
    return diff / repeat


def measure_fallback(repeat, a):
    # start bench
    start = time.time()
    results = []
    for i in range(repeat):
        results.append(a.tostype('default'))
    for result in results:
        result.wait_to_read()
    end = time.time()
    diff = end - start
    return diff / repeat


def main():
    shape_dns = (1000000, 512)
    shape_csr = (1000000, 128)
    dns = np.random.uniform(size=shape_dns)
    mx_dns = mx.nd.array(dns, ctx=mx.gpu())
    mx_dns_cpy = mx_dns.copy()
    for density in [0.01, 0.005, 0.001, 0.0005, 0.0001]:
        csr = scipy.sparse.random(shape_csr[0], shape_csr[1], density=density,
                                  format='csr', dtype=np.float32)
        mx_csr = mx.nd.sparse.csr_matrix((csr.data, csr.indices, csr.indptr),
                                         shape=shape_csr, ctx=mx.gpu())
        mx_csr_dns = mx_csr.tostype('default')
        sparse_cost = 0.0
        dns_cost = 0.0
        mx.nd.waitall()
        # warmup
        check = mx.nd.dot(mx_csr, mx_dns, transpose_a=True, forward_stype='default')
        check_np = np.dot(mx_csr_dns.asnumpy().T, dns)
        assert_almost_equal(check.asnumpy(), check_np, atol=1e-5, rtol=1e-4)
        print(check.shape)
        mx.nd.waitall()
        for i in range(50):
            sparse_cost += measure_cost(1, mx.nd.dot, mx_csr, mx_dns,
                                        transpose_a=True, forward_stype='default')
            dns_cost += measure_cost(1, mx.nd.dot, mx_csr_dns, mx_dns,
                                     transpose_a=True)
        print("%.2f %%" % (density * 100), dns_cost / sparse_cost)


if __name__ == "__main__":
    main()
```

Benchmark results on a single K80 GPU (all speedup values are relative to the same dense dot computation, dot((1M, 128).T, (1M, 512))):

Comparison of performance before and after on smaller workloads (dot((100k, 128).T, (100k, 512))):
Nice speed up
Force-pushed from e7f8ed7 to c561e6c.
@eric-haibin-lin File conflicts resolved, should be ready for merge once the build passes.
Great fix!
* fix non-determinism of dot(csr.T, dns) = dns with tests
* address code reviews
Description
Fix for #10709
Checklist
Essentials
Changes
Comments
The new determinism test passed 10000 times on a local machine.
The correctness check, which takes longer to run (~5 s/trial), has already passed more than 10000 times at this point.
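The determinism test described above follows a simple pattern: run the operator repeatedly on identical inputs and require bit-identical outputs. A sketch of that pattern, with `np.dot` as a CPU stand-in for the actual GPU `mx.nd.dot` call (`check_deterministic` is a hypothetical helper name, not the real test's):

```python
import numpy as np

def check_deterministic(f, repeat=10):
    # Require bit-identical results across runs; this is the property the
    # rewritten dot(csr.T, dns) kernel is expected to satisfy.
    baseline = f()
    for _ in range(repeat):
        assert np.array_equal(baseline, f()), "non-deterministic result"
    return baseline

a = np.random.rand(64, 32)
b = np.random.rand(64, 16)
out = check_deterministic(lambda: np.dot(a.T, b))
assert out.shape == (32, 16)
```

Note the comparison uses exact equality rather than a tolerance: the whole point of the fix is that repeated runs agree to the last bit.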