
Add script file to benchmark cumsum. #334

Merged · Xreki merged 7 commits into PaddlePaddle:master on May 8, 2020

Conversation

@Xreki Xreki commented Mar 16, 2020

The GPU kernel of the cumsum op is implemented with Eigen, and its GPU performance is even far worse than running on CPU.

  • CPU profiling results
-------------------------       Event Summary       -------------------------

Event                  Calls       Total       Min.        Max.        Ave.        Ratio.
thread0::cumsum        100         309.388     2.97553     8.028       3.09388     0.622258
thread0::fetch         100         186.402     1.60643     6.46663     1.86402     0.374903
thread0::feed          100         1.41176     0.01086     0.032311    0.0141176   0.00283942
  • GPU profiling results
-------------------------       Event Summary       -------------------------

Event                         Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::fetch                100         9907.99     9770.541932 (0.986128)  137.445174 (0.013872)   98.6611     121.807     99.0799     0.507399
  GpuMemcpySync:GPU->CPU      100         9904.85     9767.399851 (0.986123)  137.445174 (0.013877)   98.6326     121.505     99.0485     0.507238
thread0::cumsum               100         9617.47     14.435434 (0.001501)    9603.031365 (0.998499)  95.8951     111.769     96.1747     0.492521
thread0::feed                 100         1.54803     1.548035 (1.000000)     0.000000 (0.000000)     0.013429    0.028579    0.0154803   7.92766e-05
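
For context, below is a minimal sketch of how a run like the one profiled above could be set up with the Fluid 1.x-style API. The input shape, repeat count, and profiler options are illustrative assumptions, not the exact configuration used by the benchmark script.

```python
# Hedged sketch: profile fluid.layers.cumsum on GPU (Fluid 1.x-style API).
# The shape [16, 600, 800] and the 100 repeats are assumptions for illustration.
import numpy as np
import paddle.fluid as fluid
from paddle.fluid import profiler

place = fluid.CUDAPlace(0)  # switch to fluid.CPUPlace() for the CPU profile
main_prog, startup_prog = fluid.Program(), fluid.Program()
with fluid.program_guard(main_prog, startup_prog):
    x = fluid.data(name='x', shape=[16, 600, 800], dtype='float32')
    out = fluid.layers.cumsum(x, axis=-1)

exe = fluid.Executor(place)
exe.run(startup_prog)
data = np.random.random([16, 600, 800]).astype('float32')

# Collect an Event Summary similar to the tables above.
with profiler.profiler('All', 'total'):
    for _ in range(100):
        exe.run(main_prog, feed={'x': data}, fetch_list=[out])
```

Running the same loop on CPUPlace versus CUDAPlace is what exposes the gap shown above: the Eigen-based GPU kernel is far slower than the CPU path.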

TensorFlow's GPU cumsum also used to be implemented with Eigen, and users complained that it was up to 4000x slower than PyTorch's (see the linked complaint thread).


@Xreki Xreki commented Apr 16, 2020

  • Paddle nvprof results:
{
  framework: "paddle",
  version: "0.0.0",
  name: "cumsum",
  device: "GPU",
  speed: { repeat: 100, start: 10, end: 90, total: 103.18988, feed: 0.00000, compute: 0.00000, fetch: 0.00000 }
}
==5449== Profiling application: python cumsum.py --task speed --framework paddle --dtype float32 --run_with_executor True --check_output False --profiler none --backward False --use_gpu True --repeat 100 --log_level 0
==5449== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   97.17%  9.59798s       100  95.980ms  95.774ms  107.53ms  void Eigen::ScanKernel<Eigen::TensorEvaluator<Eigen::TensorScanOp<Eigen::internal::SumReducer<float>, Eigen::TensorReshapingOp<Eigen::DSizes<long, int=1> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=0, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, Eigen::internal::SumReducer<float>>(float, long, Eigen::ScanKernel<Eigen::TensorEvaluator<Eigen::TensorScanOp<Eigen::internal::SumReducer<float>, Eigen::TensorReshapingOp<Eigen::DSizes<long, int=1> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=0, Eigen::MakePointer> const > const > const , Eigen::GpuDevice::CoeffReturnType>, Eigen::internal::SumReducer<float>>*)
                    1.48%  146.43ms       103  1.4217ms  1.8240us  1.5743ms  [CUDA memcpy HtoD]
                    1.35%  133.37ms       100  1.3337ms  1.2508ms  5.6956ms  [CUDA memcpy DtoH]
                    0.00%  7.0400us         4  1.7600us  1.6960us  1.9200us  [CUDA memset]
      API calls:   53.06%  9.91811s       203  48.858ms  21.230us  114.48ms  cudaMemcpy
                   24.61%  4.60058s         8  575.07ms  2.1930us  4.59993s  cudaStreamCreateWithFlags
                   10.69%  1.99792s         1  1.99792s  1.99792s  1.99792s  cudaStreamCreate
                    5.76%  1.07668s       430  2.5039ms  6.1020us  78.889ms  cuModuleUnload
                    5.66%  1.05839s         4  264.60ms  1.0220us  1.05838s  cudaFree
                    0.12%  21.895ms        16  1.3684ms  6.5630us  19.809ms  cudaMalloc
                    0.03%  5.2315ms       100  52.315us  42.076us  161.11us  cudaLaunchKernel
  • TensorFlow nvprof results:
{
  framework: "tensorflow",
  version: "1.15.0",
  name: "cumsum",
  device: "GPU",
  speed: { repeat: 100, start: 10, end: 90, total: 7.01281, feed: 0.00000, compute: 0.00000, fetch: 0.00000 }
}
==5240== Profiling application: python cumsum.py --task speed --framework tf --dtype float32 --run_with_executor True --check_output False --profiler none --backward False --use_gpu True --repeat 100 --log_level 0
==5240== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   45.63%  194.81ms       100  1.9481ms  1.8871ms  2.1991ms  void tensorflow::functor::scan_kernel<float, tensorflow::functor::Sum<float>, int=1024, int=4>(float const *, tensorflow::functor::scan_kernel<float, tensorflow::functor::Sum<float>, int=1024, int=4>*, int, int, int, bool, bool, float)
                   30.86%  131.74ms       100  1.3174ms  1.1182ms  1.8107ms  [CUDA memcpy HtoD]
                   23.51%  100.36ms       100  1.0036ms  1.0029ms  1.0054ms  [CUDA memcpy DtoH]
                    0.00%  2.4320us         1  2.4320us  2.4320us  2.4320us  [CUDA memset]
      API calls:   53.95%  1.61007s       400  4.0252ms  1.0710us  1.60765s  cudaPointerGetAttributes
                   17.47%  521.29ms       229  2.2764ms  7.4370us  60.034ms  cuModuleUnload
                   12.06%  360.01ms         1  360.01ms  360.01ms  360.01ms  cuDevicePrimaryCtxRetain
                    9.61%  286.91ms       102  2.8128ms  14.556us  3.1754ms  cuCtxSynchronize
                    4.68%  139.52ms       100  1.3952ms  1.0618ms  2.0871ms  cuMemcpyHtoDAsync
                    0.56%  16.847ms         1  16.847ms  16.847ms  16.847ms  cuMemAlloc
                    0.42%  12.390ms         3  4.1300ms  1.7378ms  5.4064ms  cuMemHostAlloc
                    0.35%  10.581ms       404  26.190us     222ns  1.0817ms  cuDeviceGetAttribute
                    0.26%  7.7055ms      4542  1.6960us     806ns  83.043us  cuEventQuery
                    0.19%  5.5688ms         9  618.76us  508.88us  827.07us  cuDeviceTotalMem
                    0.14%  4.1198ms         4  1.0300ms  915.16us  1.2995ms  cudaGetDeviceProperties
                    0.14%  4.1161ms       100  41.161us  26.112us  89.701us  cudaLaunchKernel
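
On the kernel side, TF's custom scan_kernel averages about 1.95 ms per call versus roughly 96 ms for Paddle's Eigen ScanKernel, a gap of about 50x. For reference, a minimal TF 1.15 graph-mode sketch of the TensorFlow side of this comparison (the input shape is an assumption for illustration, not the benchmark's actual setting):

```python
# Hedged sketch of the TF 1.15 baseline: tf.cumsum dispatches to the custom
# scan_kernel visible in the nvprof output above. The shape is assumed.
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[16, 600, 800], name='x')
out = tf.cumsum(x, axis=-1)

data = np.random.random([16, 600, 800]).astype('float32')
with tf.Session() as sess:
    for _ in range(100):
        sess.run(out, feed_dict={x: data})
```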

@wangchaochaohu wangchaochaohu left a comment

LGTM

@wangchaochaohu wangchaochaohu reopened this May 7, 2020
@Xreki Xreki merged commit 77bb96c into PaddlePaddle:master May 8, 2020
@Xreki Xreki deleted the api/cumsum branch May 8, 2020 02:30
Xreki added a commit to Xreki/benchmark that referenced this pull request Oct 20, 2020
* Add script file to benchmark cumsum.

* Update cumsum to the new framework.

* Update to the newest framework.

* Change the support of atol.