Unify on NCHW and maybe support CHWN and HWCN tensor layout #132

Closed
mratsim opened this issue Oct 29, 2017 · 0 comments
mratsim commented Oct 29, 2017

N: batch size
C: Channel / convolution feature_map
H: Height
W: Width
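As a minimal sketch of what these layouts mean in memory (dimensions chosen arbitrarily for illustration): in a row-major NCHW buffer, the flat offset of element (n, c, h, w) is `n*C*H*W + c*H*W + h*W + w`.

```python
import numpy as np

# Hypothetical dimensions, for illustration only
N, C, H, W = 2, 3, 4, 5

# Row-major NCHW buffer: W is the innermost (fastest-varying) axis
x = np.arange(N * C * H * W).reshape(N, C, H, W)

# Flat offset of element (n, c, h, w) in NCHW:
n, c, h, w = 1, 2, 3, 4
offset = n * C * H * W + c * H * W + h * W + w
assert x.flat[offset] == x[n, c, h, w]
```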

NCHW

NCHW is the most widespread format and is the default format in:

  • CuDNN, Torch, PyTorch, Chainer

N is the first index, which is the most familiar convention for data scientists. However, it presents optimization challenges even on Nvidia's side. See soumith/convnet-benchmarks#93

CHWN

CHWN is the format used by Neon and by the pioneering (now defunct) cuda-convnet.
It is ideal for Winograd convolution. The main issue is that having N on the right is unfamiliar. Also, for RNNs it might be worse to have the batch as the innermost dimension.

Some models also concatenate feature maps, which can be done directly in this layout, similar to (C1 + C2 + C3)HWN.
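A small sketch of why this works (dimensions are arbitrary placeholders): with C as the outermost axis, concatenating along channels is just appending one buffer after the other, with no interleaving.

```python
import numpy as np

# Hypothetical dimensions, for illustration only
H, W, N = 2, 2, 3
C1, C2 = 2, 3

a = np.random.rand(C1, H, W, N)  # CHWN layout
b = np.random.rand(C2, H, W, N)  # CHWN layout

cat = np.concatenate([a, b], axis=0)  # shape (C1 + C2, H, W, N)

# Because C is the outermost axis, the concatenated buffer is simply
# a's buffer followed by b's buffer:
assert np.array_equal(cat.ravel()[: a.size], a.ravel())
assert np.array_equal(cat.ravel()[a.size :], b.ravel())
```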

  • CuDNN claims to support it according to the documentation:
    [screenshot 2017-10-29_10-03-54]
    but CuDNN v7's cudnn.h only defines:
```c
typedef enum
{
    CUDNN_TENSOR_NCHW = 0,        /* row major (wStride = 1, hStride = w) */
    CUDNN_TENSOR_NHWC = 1,        /* feature maps interleaved (cStride = 1) */
    CUDNN_TENSOR_NCHW_VECT_C = 2  /* each image point is a vector of C elements;
                                     the vector length is carried by the data type */
} cudnnTensorFormat_t;
```

HWCN

HWCN is the format used by Tensorflow. It is also better than NCHW for Winograd convolution (though not as good as CHWN), and it is the best format for implementing "Memory Efficient Convolution" (#131).

Format conversion

Converting between NCHW and CHWN can be done very efficiently by treating it as a transposition between a [N, CHW] matrix and a [CHW, N] matrix.

Implementation by Neon: NervanaSystems/neon@682dde6

Implementation by NVIDIA: https://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/

Paper: Optimizing memory efficiency for DCNN on GPUs
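The idea above can be sketched in a few lines (dimensions are arbitrary placeholders): collapsing C, H, W into a single axis turns an NCHW tensor into a [N, C*H*W] matrix, and a plain 2D transpose of that matrix yields the CHWN layout.

```python
import numpy as np

# Hypothetical dimensions, for illustration only
N, C, H, W = 2, 3, 4, 5
x = np.random.rand(N, C, H, W)  # NCHW layout

# View NCHW as a [N, C*H*W] matrix, transpose it to [C*H*W, N],
# then reinterpret the result as a (C, H, W, N) tensor.
as_matrix = x.reshape(N, C * H * W)
chwn = as_matrix.T.reshape(C, H, W, N)

# Same result as a full 4D axis permutation:
assert np.array_equal(chwn, x.transpose(1, 2, 3, 0))
```

Any cache-friendly 2D transpose kernel (such as the tiled CUDA transpose in the NVIDIA post above) therefore performs the layout conversion.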

[screenshot 2017-10-29_10-36-13: figure from the paper]
