[mxnet 2.0] [item 2.4] Turning on large tensor support by default #17331
Inference Benchmarks comparing LT_MKL with just MKL Enabled.
Average percentage change over all numbers:
Training Benchmarks comparing LT_MKL with just MKL Enabled. Note: samples/second works in the opposite direction, so I have multiplied those percentages by -1. A quick explanation: that number should be going higher, so a positive raw percentage change means there are now fewer samples/second, and a negative raw percentage change means there are more samples/second.
Average percentage change:
@jonatan1626 thanks for the update. Does ...
@eric-haibin-lin Yes, I am calculating this as: 1 - (LT_MKL value / MKL value).
In your description, "A negative percentage change means there are more samples/second." Doesn't that mean a negative percentage is faster?
@apeforest Oh sorry, I'm multiplying by -1 only for the samples/second column, to keep the meaning consistent with everything else. The rest of the columns already depict improvement as a positive percentage and degradation as a negative percentage. For example, if MKL_LT gives 66 samples/sec and MKL gives 70 samples/sec, the raw value is 1 - 66/70 ≈ +5.7% even though that is a slowdown. On the other hand, if MKL_LT gives 74 samples/sec and MKL gives 70 samples/sec, the raw value is 1 - 74/70 ≈ -5.7% even though that is a speedup. So I multiply by -1 to give it the same meaning as the rest of the percentages, where positive is better and negative is worse.
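For concreteness, here is a minimal sketch of that sign-flip calculation; the function name and the 66/70 and 74/70 samples are only illustrations of this comment, not part of the benchmark scripts:

```cpp
#include <cstdio>

// Percentage change as described above: 1 - (LT_MKL value / MKL value).
// For throughput columns (samples/sec), the result is multiplied by -1 so
// that a positive number always means "better with large tensor support".
double pct_change(double lt_mkl, double mkl, bool higher_is_better) {
    double change = 1.0 - lt_mkl / mkl;
    return higher_is_better ? -change : change;
}

int main() {
    // Samples/sec examples from the comment above.
    std::printf("%.2f%%\n", 100 * pct_change(66.0, 70.0, /*higher_is_better=*/true));  // ~ -5.71% (worse)
    std::printf("%.2f%%\n", 100 * pct_change(74.0, 70.0, /*higher_is_better=*/true));  // ~ +5.71% (better)
    return 0;
}
```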
The slowdown for BERT (-22.98%) is quite significant. We will need to mitigate this before moving forward.
Thanks to @JonTanS for running the profiler, we have pinpointed the performance degradation in the broadcast_axis operator. Running the operator-level profiler we could also identify the performance drop in broadcast_axis without vs. with the USE_INT64_TENSOR_SIZE flag (roughly 2.8 ms vs. 6.3 ms average forward time; see the profiler output in the description). Also, as I look into the implementation of the broadcast_axis operator, many modulo and multiplication operations on the indices are involved. The next step will be to find an optimal implementation of broadcast_axis that reduces the index ALU work in the kernel.
@szha @eric-haibin-lin @apeforest With current master and the new broadcast_axis changes, on a p3.16xl single-GPU training run. BERT run command:
Results:
"new" refers to mxnet code with optimized broadcast_axis. |
@access2rohit This result is a little surprising. In the earlier benchmark results provided by @JonTanS, there is a ~18% degradation in BERT training when the large tensor (LT) compiler flag is turned on:
However, from your result, even without your latest speedup in the broadcast_axis operator, there is very little difference with the LT flag on:
Could you provide more insights?
@apeforest The profiling by @JonTanS was done a while back, using mxnet-1.6 in November. These results use the current master branch of MXNet, and the BERT scripts have changed too. If there are newer settings for running BERT on a single node, they are not available on the Gluon NLP site. If @eric-haibin-lin or @szhengac can verify whether my BERT setup is correct and provide proper tuning params for running BERT on a single node, I will re-run the benchmarks and update the results here.
PR #17882 fixes the regression in SSD. Following are the new results for the SSD run:
@apeforest @sandeep-krishnamurthy @szha @zheng-da The PRs to enable Large Tensor Support as the default in master are divided into two stages. Once the above 2 PRs are merged, MXNet will support large tensors on CPU/GPU (depending on global memory) on master.
Currently Large Tensor Support works with all operators implemented in MXNet, and MKLDNN also supports int64. Kernels written inside MXNet, both generic (cpu/gpu) and CUDA-specific (gpu only), support large tensors depending on device memory. BLAS and LAPACK libs were not considered while defining the scope of the project. MXNet currently supports several BLAS and LAPACK implementations, with openBLAS as the default. Upon investigation, openBLAS needs to be built with a specific flag to support int64_t signatures, and MKL supports long long int signatures (in which case reinterpret_cast<>() is needed for casting pointers, since int64_t maps to long int* as opposed to the long long int* expected by MKL). Additionally, the LAPACK and BLAS wrappers need to be updated from int to int64_t. Initially openBLAS can be supported, since it is used by default and in the pypi wheels as well, thus not breaking any default behaviour for users. Users on other BLAS and LAPACK implementations won't face issues as long as they don't use large tensors. Additional error messages will be added for the case where a large tensor is used but the BLAS implementation is not openBLAS, until that BLAS library is made to work with MXNet's large tensor support. NOTE: currently openBLAS works correctly with smaller inputs (within the range of int32) but will truncate parameters passed with larger values, and hence will result in either SIGSEGV (mostly) or garbage values (which will eventually cause a SIGSEGV in a larger script).
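As a rough illustration of the pointer-casting issue mentioned above, here is a minimal sketch. The ilp64_saxpy prototype is a hypothetical stand-in for an ILP64 BLAS routine whose integer arguments are declared as long long int (as in MKL's ILP64 interface); it is not the actual MKL or MXNet wrapper code:

```cpp
#include <cstdint>

// Hypothetical ILP64 BLAS prototype: integer arguments are 'long long int',
// not 'int64_t' (which is 'long int' on 64-bit Linux). Name and parameters
// are placeholders, not a real library signature.
extern "C" void ilp64_saxpy(const long long int* n, const float* alpha,
                            const float* x, const long long int* incx,
                            float* y, const long long int* incy);

// MXNet-side wrapper keeping indices as int64_t. Even though both types are
// 64 bits wide, int64_t* and long long int* are distinct pointer types, so a
// reinterpret_cast is needed when forwarding the arguments.
inline void saxpy_wrapper(int64_t n, float alpha, const float* x, int64_t incx,
                          float* y, int64_t incy) {
  ilp64_saxpy(reinterpret_cast<const long long int*>(&n), &alpha,
              x, reinterpret_cast<const long long int*>(&incx),
              y, reinterpret_cast<const long long int*>(&incy));
}
```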
Thanks @access2rohit for the summary. Is the plan to enable Large Tensor Support in the following order?
Does this order of execution look okay to you, @access2rohit @leezu @szha @zheng-da?
Has large tensor support for numpy arrays been added?
@access2rohit can correct me, but a few of them are supported, as they use the same kernels under the hood. The scope of this issue was mainly NDArray when it got started. After these are done, the remaining numpy ops will also be supported.
yes
Upon inspecting the numpy files inside MXNet: they use index_t for iterating over elements in their own kernels, and use the NDArray kernels for the rest, where we made sure index_t is used where required. For kernels using BLAS, I will update them in the same PR that makes the MXNet openBLAS wrappers int64 compatible.
I'm a little concerned that we don't have a correct integration of BLAS and LAPACK: users may hit BLAS kernels and get potential crashes or corrupt results. But I think @sandeep-krishnamurthy's point
refers to fixing this? If so, I'm fine with the order of execution. Thank you @access2rohit for the hard work on this feature.
@leezu yes, that's what I meant.
I think the numpy frontend doesn't support large tensors yet. I started working on it here: #18368, but I haven't found the time to finish migrating all the tests. @access2rohit would you be able to help out and take that over?
Description
Currently, MXNet only supports tensors with fewer than 2^31 elements. To support large tensors, users need to recompile MXNet with the USE_INT64_TENSOR_SIZE compiler flag set to ON.
Large tensors are often used in applications such as recommendation systems with sparse embedding matrices, and in graph neural networks such as DGL.
To provide a better user experience, we would like to turn on this compiler flag by default so that MXNet binary release will support large tensors.
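Roughly, the flag works by switching the element-index type used throughout the backend; a simplified sketch of that mechanism is below. The macro and typedef names follow the mshadow/MXNet convention but are paraphrased here, not copied from the source tree:

```cpp
#include <cstdint>

// Simplified sketch: when the build flag is ON, tensor sizes and indices use
// a 64-bit type; otherwise a 32-bit type, which caps tensor size at 2^31 - 1.
#if MSHADOW_INT64_TENSOR_SIZE == 1
typedef int64_t index_t;   // large tensor support enabled
#else
typedef int32_t index_t;   // historical default: sizes must stay below 2^31
#endif
```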
RFC: https://lists.apache.org/thread.html/df53b8c26e9e0433378dd803baba9fec4dd922728a5ce9135dc164b3@%3Cdev.mxnet.apache.org%3E
Current Status:
Large tensor support is already implemented in MXNet backend and C API. Over 80 operators have been tested and more are being tested.
There was a performance degradation in a few operators such as transpose; it has since been fixed (#16104).
Model Inference Performance
int64/int32 P50 records the 50th-percentile inference runtime.
% Diff: runtime speedup of the int64 build vs the int32 build.
Thus a positive value means inference time is reduced when using int64 as the tensor index type.
Model Training Performance
* measures speed instead of throughput
What Caused the Performance Drop in BERT
Thanks to @JonTanS for running the profiler, we have pinpointed the performance degradation in the operators broadcast_axis (from 138 ms to 177 ms) and MXNDArraySyncCopyToCPU (from 592 ms to 679 ms).
Running the operator-level profiler, we could identify the 2.2x performance drop in the broadcast_axis operator.
w/o USE_INT64_TENSOR_SIZE flag:
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 2.7753}]}]
w/ USE_INT64_TENSOR_SIZE flag:
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 6.3178}]}]
Why is the broadcast_axis Operator Affected
Too many div/mul/mod ALU operations on the indices, which changed from int32 to int64 (see the sketch below).
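To sketch why that hurts, below is a simplified stand-in for the per-element index arithmetic a broadcast kernel performs (not the actual MXNet kernel). Every output element pays for a division, a modulo and a multiplication on the index type, and all of these become 64-bit ALU operations once the index type widens to int64:

```cpp
#include <cstdint>

// Simplified broadcast along one axis: input shape (leading, 1, trailing) is
// expanded to (leading, bcast, trailing). Each flat output index is mapped
// back to a flat input index. With 64-bit indices, every '/', '%' and '*'
// below is a 64-bit ALU operation, noticeably slower than 32-bit on GPUs.
template <typename IndexT>
void broadcast_axis_1d(const float* in, float* out,
                       IndexT leading,   // product of dims before the axis
                       IndexT bcast,     // broadcast size along the axis
                       IndexT trailing)  // product of dims after the axis
{
  const IndexT out_size = leading * bcast * trailing;
  for (IndexT i = 0; i < out_size; ++i) {
    IndexT lead_idx  = i / (bcast * trailing);   // div on IndexT
    IndexT trail_idx = i % trailing;             // mod on IndexT
    out[i] = in[lead_idx * trailing + trail_idx];  // mul on IndexT
  }
}

// With the flag OFF, IndexT is effectively int32_t; with it ON, int64_t.
```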
TODO (@ChaiBapchya)
- Skipping tests that cannot fit in nightly CI machine #17450
- Re-Enabling Large Tensor and Vector Nightly on GPU #16164
- Enabling build stage gpu_int64 to enable large tensor nightly runs #17546
- Implement remaining nn_basic ops in opperf #17456
- Updated PartialSortSmallK for LT support #17462