[PERFORMANCE] [master] Layer normalization code from Marian for CPU #19602
Conversation
Experiment with OMP_NUM_THREADS=4, times in seconds, on a c5.12xlarge:

| batch x channel | New code | MKL |
|----------------:|---------:|----------:|
| 1 x 32 | 0.0000288 | 0.0000278 |
| 128 x 32 | 0.0000308 | 0.0000311 |
| 2560 x 32 | 0.0000712 | 0.0000672 |
| 4096 x 32 | 0.0000946 | 0.0000910 |
| 8192 x 32 | 0.0001597 | 0.0001523 |
| 16384 x 32 | 0.0002905 | 0.0002619 |
| 1 x 64 | 0.0000264 | 0.0000256 |
| 128 x 64 | 0.0000339 | 0.0000330 |
| 2560 x 64 | 0.0000829 | 0.0000972 |
| 4096 x 64 | 0.0001137 | 0.0001356 |
| 8192 x 64 | 0.0002027 | 0.0002435 |
| 16384 x 64 | 0.0003715 | 0.0004639 |
| 1 x 128 | 0.0000262 | 0.0000263 |
| 128 x 128 | 0.0000325 | 0.0000389 |
| 2560 x 128 | 0.0001074 | 0.0001580 |
| 4096 x 128 | 0.0001505 | 0.0002336 |
| 8192 x 128 | 0.0002861 | 0.0004481 |
| 16384 x 128 | 0.0005648 | 0.0008613 |
| 1 x 256 | 0.0000273 | 0.0000276 |
| 128 x 256 | 0.0000390 | 0.0000431 |
| 2560 x 256 | 0.0001533 | 0.0002811 |
| 4096 x 256 | 0.0002258 | 0.0004300 |
| 8192 x 256 | 0.0004300 | 0.0008464 |
| 16384 x 256 | 0.0010436 | 0.0017613 |
| 1 x 512 | 0.0000256 | 0.0000302 |
| 128 x 512 | 0.0000408 | 0.0000551 |
| 2560 x 512 | 0.0002444 | 0.0005225 |
| 4096 x 512 | 0.0003828 | 0.0008147 |
| 8192 x 512 | 0.0008832 | 0.0017192 |
| 16384 x 512 | 0.0058463 | 0.0074497 |
| 1 x 768 | 0.0000252 | 0.0000308 |
| 128 x 768 | 0.0000450 | 0.0000676 |
| 2560 x 768 | 0.0003440 | 0.0007719 |
| 4096 x 768 | 0.0005890 | 0.0013346 |
| 8192 x 768 | 0.0014946 | 0.0026145 |
| 16384 x 768 | 0.0089495 | 0.0113557 |
| 1 x 1024 | 0.0000285 | 0.0000308 |
| 128 x 1024 | 0.0000487 | 0.0000786 |
| 2560 x 1024 | 0.0004614 | 0.0010190 |
| 4096 x 1024 | 0.0008083 | 0.0017376 |
| 8192 x 1024 | 0.0059020 | 0.0075588 |
| 16384 x 1024 | 0.0116553 | 0.0146855 |

Benchmark program:

```python
import mxnet as mx
import time

def time_procedure(shape, count):
    data = mx.nd.random_uniform(shape=shape, low=-1.0, high=1.0)
    factors = mx.nd.random_uniform(shape=(shape[-1],))
    mx.nd.waitall()
    begin = time.time()
    for i in range(0, count):
        out = mx.nd.LayerNorm(data, factors, factors)
    mx.nd.waitall()
    return (time.time() - begin) / count

count = 200
for channel in [32, 64, 128, 256, 512, 768, 1024]:
    for batch in [1, 128, 2560, 4096, 8192, 16384]:
        s = (batch, channel)
        timing = time_procedure(s, count)
        print("{:5d}x{:5d} | {:.7f}".format(s[0], s[1], timing))
```
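As a quick sanity check that the op matches a plain NumPy reference (a sketch; it assumes the operator's default `eps=1e-5` and normalization over the last axis):

```python
import mxnet as mx
import numpy as np

x = mx.nd.random_uniform(shape=(128, 512), low=-1.0, high=1.0)
gamma = mx.nd.random_uniform(shape=(512,))
beta = mx.nd.random_uniform(shape=(512,))
out = mx.nd.LayerNorm(x, gamma, beta).asnumpy()

# NumPy reference: normalize over the last axis, then scale and shift.
xn = x.asnumpy()
mean = xn.mean(axis=-1, keepdims=True)
var = xn.var(axis=-1, keepdims=True)
ref = (xn - mean) / np.sqrt(var + 1e-5) * gamma.asnumpy() + beta.asnumpy()
print(np.abs(out - ref).max())  # expect ~1e-6 for float32
```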
Hey @kpuatamazon, thanks for submitting the PR!
CI supported jobs: [centos-cpu, clang, windows-gpu, sanity, centos-gpu, miscellaneous, unix-cpu, windows-cpu, website, unix-gpu, edge]
@mxnet-bot run ci [all] Sigh, everything is broken on some Python HTTP thing.
Jenkins CI successfully triggered: [edge, sanity, windows-gpu, unix-gpu, clang, centos-gpu, unix-cpu, miscellaneous, website, centos-cpu, windows-cpu]

@mxnet-bot run ci [unix-cpu, website, windows-cpu, windows-gpu] Playing more CI docker daemon lottery.

Jenkins CI successfully triggered: [website, windows-gpu, unix-cpu, windows-cpu]

@mxnet-bot run ci [unix-cpu] Memory gambling is annoying.

Jenkins CI successfully triggered: [unix-cpu]

@mxnet-bot run ci [unix-cpu] Still just running out of RAM compiling numpy kernels. https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-19602/19/pipeline/

Jenkins CI successfully triggered: [unix-cpu]
```cpp
          std::conditional<std::is_same<mshadow::half::half_t, Data>::value,
                           float,
                           Data>::type>
void LayerNormCPUKernel(size_t width,
```
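For context, the computation this kernel performs over each contiguous row can be sketched in NumPy as follows; the `std::conditional` default above selects `float` accumulation when `Data` is half precision, which the sketch mirrors (an illustration, not the PR's actual C++ code):

```python
import numpy as np

def layernorm_last_axis(data, gamma, beta, eps=1e-5):
    # Accumulate in float32 when the input is float16, mirroring the
    # std::conditional template default in the C++ kernel above.
    acc_type = np.float32 if data.dtype == np.float16 else data.dtype
    acc = data.astype(acc_type)
    mean = acc.mean(axis=-1, keepdims=True)
    var = acc.var(axis=-1, keepdims=True)
    out = (acc - mean) / np.sqrt(var + eps) * gamma + beta
    return out.astype(data.dtype)
```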
I would recommend changing the name to LayerNormContiguousCPUKernel or LayerNormLastAxisCPUKernel.
One naming issue. Looks good to me.
Minor issue (can actually be addressed later).
What are the next steps for this PR? Is this ready to be merged?

@fhieber I've just merged. Feel free to try it out.
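For anyone trying it out, a minimal usage sketch (the default `axis=-1` normalizes over the contiguous last axis, which is the CPU path this PR speeds up):

```python
import mxnet as mx

x = mx.nd.random_uniform(shape=(128, 512), low=-1.0, high=1.0)
gamma = mx.nd.ones((512,))
beta = mx.nd.zeros((512,))
# LayerNorm defaults to axis=-1, the contiguous case optimized here.
y = mx.nd.LayerNorm(x, gamma, beta)
print(y.shape)
```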
Description
This is the master version of #19601. There isn't much difference in the LayerNorm implementation between v1.x and master.
Checklist
Essentials
Changes
Comments
See #19601 for benchmarks.