
Add script file to benchmark cumsum. #334

Merged · Xreki merged 7 commits into PaddlePaddle:master on May 8, 2020

Conversation

@Xreki Xreki commented Mar 16, 2020

The GPU kernel of the cumsum op is implemented with Eigen, and its GPU performance is even far worse than running on CPU.

  • CPU profiling results
-------------------------       Event Summary       -------------------------

Event                  Calls       Total       Min.        Max.        Ave.        Ratio.
thread0::cumsum        100         309.388     2.97553     8.028       3.09388     0.622258
thread0::fetch         100         186.402     1.60643     6.46663     1.86402     0.374903
thread0::feed          100         1.41176     0.01086     0.032311    0.0141176   0.00283942
  • GPU profiling results
-------------------------       Event Summary       -------------------------

Event                         Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::fetch                100         9907.99     9770.541932 (0.986128)  137.445174 (0.013872)   98.6611     121.807     99.0799     0.507399
  GpuMemcpySync:GPU->CPU      100         9904.85     9767.399851 (0.986123)  137.445174 (0.013877)   98.6326     121.505     99.0485     0.507238
thread0::cumsum               100         9617.47     14.435434 (0.001501)    9603.031365 (0.998499)  95.8951     111.769     96.1747     0.492521
thread0::feed                 100         1.54803     1.548035 (1.000000)     0.000000 (0.000000)     0.013429    0.028579    0.0154803   7.92766e-05
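
For context, below is a minimal sketch of how a run like the one profiled above could be set up with the Fluid 1.x-style API. The input shape, repeat count, and profiler options are illustrative assumptions, not the exact configuration used by the benchmark script.

```python
# Hedged sketch: profile fluid.layers.cumsum on GPU (Fluid 1.x-style API).
# The shape [16, 600, 800] and the 100 repeats are assumptions for illustration.
import numpy as np
import paddle.fluid as fluid
from paddle.fluid import profiler

place = fluid.CUDAPlace(0)  # switch to fluid.CPUPlace() for the CPU profile
main_prog, startup_prog = fluid.Program(), fluid.Program()
with fluid.program_guard(main_prog, startup_prog):
    x = fluid.data(name='x', shape=[16, 600, 800], dtype='float32')
    out = fluid.layers.cumsum(x, axis=-1)

exe = fluid.Executor(place)
exe.run(startup_prog)
data = np.random.random([16, 600, 800]).astype('float32')

# Collect an Event Summary similar to the tables above.
with profiler.profiler('All', 'total'):
    for _ in range(100):
        exe.run(main_prog, feed={'x': data}, fetch_list=[out])
```

Running the same loop on CPUPlace versus CUDAPlace is what exposes the gap shown above: the Eigen-based GPU kernel is far slower than the CPU path.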

TensorFlow's GPU cumsum also used to be implemented with Eigen, and users complained that it was up to 4000x slower than PyTorch's (see the linked complaint thread).


@Xreki Xreki commented Apr 16, 2020

  • Paddle nvprof results:
{
  framework: "paddle",
  version: "0.0.0",
  name: "cumsum",
  device: "GPU",
  speed: { repeat: 100, start: 10, end: 90, total: 103.18988, feed: 0.00000, compute: 0.00000, fetch: 0.00000 }
}
==5449== Profiling application: python cumsum.py --task speed --framework paddle --dtype float32 --run_with_executor True --check_output False --profiler none --backward False --use_gpu True --repeat 100 --log_level 0
==5449== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   97.17%  9.59798s       100  95.980ms  95.774ms  107.53ms  void Eigen::ScanKernel<Eigen::TensorEvaluator<Eigen::TensorScanOp<Eigen::internal::SumReducer<float>, Eigen::TensorReshapingOp<Eigen::DSizes<long, int=1> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=0, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, Eigen::internal::SumReducer<float>>(float, long, Eigen::ScanKernel<Eigen::TensorEvaluator<Eigen::TensorScanOp<Eigen::internal::SumReducer<float>, Eigen::TensorReshapingOp<Eigen::DSizes<long, int=1> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=0, Eigen::MakePointer> const > const > const , Eigen::GpuDevice::CoeffReturnType>, Eigen::internal::SumReducer<float>>*)
                    1.48%  146.43ms       103  1.4217ms  1.8240us  1.5743ms  [CUDA memcpy HtoD]
                    1.35%  133.37ms       100  1.3337ms  1.2508ms  5.6956ms  [CUDA memcpy DtoH]
                    0.00%  7.0400us         4  1.7600us  1.6960us  1.9200us  [CUDA memset]
      API calls:   53.06%  9.91811s       203  48.858ms  21.230us  114.48ms  cudaMemcpy
                   24.61%  4.60058s         8  575.07ms  2.1930us  4.59993s  cudaStreamCreateWithFlags
                   10.69%  1.99792s         1  1.99792s  1.99792s  1.99792s  cudaStreamCreate
                    5.76%  1.07668s       430  2.5039ms  6.1020us  78.889ms  cuModuleUnload
                    5.66%  1.05839s         4  264.60ms  1.0220us  1.05838s  cudaFree
                    0.12%  21.895ms        16  1.3684ms  6.5630us  19.809ms  cudaMalloc
                    0.03%  5.2315ms       100  52.315us  42.076us  161.11us  cudaLaunchKernel
  • TensorFlow nvprof results:
{
  framework: "tensorflow",
  version: "1.15.0",
  name: "cumsum",
  device: "GPU",
  speed: { repeat: 100, start: 10, end: 90, total: 7.01281, feed: 0.00000, compute: 0.00000, fetch: 0.00000 }
}
==5240== Profiling application: python cumsum.py --task speed --framework tf --dtype float32 --run_with_executor True --check_output False --profiler none --backward False --use_gpu True --repeat 100 --log_level 0
==5240== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   45.63%  194.81ms       100  1.9481ms  1.8871ms  2.1991ms  void tensorflow::functor::scan_kernel<float, tensorflow::functor::Sum<float>, int=1024, int=4>(float const *, tensorflow::functor::scan_kernel<float, tensorflow::functor::Sum<float>, int=1024, int=4>*, int, int, int, bool, bool, float)
                   30.86%  131.74ms       100  1.3174ms  1.1182ms  1.8107ms  [CUDA memcpy HtoD]
                   23.51%  100.36ms       100  1.0036ms  1.0029ms  1.0054ms  [CUDA memcpy DtoH]
                    0.00%  2.4320us         1  2.4320us  2.4320us  2.4320us  [CUDA memset]
      API calls:   53.95%  1.61007s       400  4.0252ms  1.0710us  1.60765s  cudaPointerGetAttributes
                   17.47%  521.29ms       229  2.2764ms  7.4370us  60.034ms  cuModuleUnload
                   12.06%  360.01ms         1  360.01ms  360.01ms  360.01ms  cuDevicePrimaryCtxRetain
                    9.61%  286.91ms       102  2.8128ms  14.556us  3.1754ms  cuCtxSynchronize
                    4.68%  139.52ms       100  1.3952ms  1.0618ms  2.0871ms  cuMemcpyHtoDAsync
                    0.56%  16.847ms         1  16.847ms  16.847ms  16.847ms  cuMemAlloc
                    0.42%  12.390ms         3  4.1300ms  1.7378ms  5.4064ms  cuMemHostAlloc
                    0.35%  10.581ms       404  26.190us     222ns  1.0817ms  cuDeviceGetAttribute
                    0.26%  7.7055ms      4542  1.6960us     806ns  83.043us  cuEventQuery
                    0.19%  5.5688ms         9  618.76us  508.88us  827.07us  cuDeviceTotalMem
                    0.14%  4.1198ms         4  1.0300ms  915.16us  1.2995ms  cudaGetDeviceProperties
                    0.14%  4.1161ms       100  41.161us  26.112us  89.701us  cudaLaunchKernel
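
On the kernel side, TF's custom scan_kernel averages about 1.95 ms per call versus roughly 96 ms for Paddle's Eigen ScanKernel, a gap of about 50x. For reference, a minimal TF 1.15 graph-mode sketch of the TensorFlow side of this comparison (the input shape is an assumption for illustration, not the benchmark's actual setting):

```python
# Hedged sketch of the TF 1.15 baseline: tf.cumsum dispatches to the custom
# scan_kernel visible in the nvprof output above. The shape is assumed.
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[16, 600, 800], name='x')
out = tf.cumsum(x, axis=-1)

data = np.random.random([16, 600, 800]).astype('float32')
with tf.Session() as sess:
    for _ in range(100):
        sess.run(out, feed_dict={x: data})
```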

@wangchaochaohu wangchaochaohu left a comment

LGTM

@wangchaochaohu wangchaochaohu reopened this May 7, 2020
@Xreki Xreki merged commit 77bb96c into PaddlePaddle:master May 8, 2020
@Xreki Xreki deleted the api/cumsum branch May 8, 2020 02:30
Xreki added a commit to Xreki/benchmark that referenced this pull request Oct 20, 2020
* Add script file to benchmark cumsum.

* Update cumsum to the new framework.

* Update to the newest framework.

* Change the support of atol.