
support slice on MKLDNN arrays better #12303

Closed
zheng-da opened this issue Aug 23, 2018 · 16 comments

@zheng-da
Contributor

Currently, slicing an MKLDNN array requires converting the array to the default layout before taking the slice. However, the MKLDNN library actually provides a view for MKLDNN memory. By taking advantage of the MKLDNN view, we don't really need to convert the data layout for slice.
For details, please see the discussion here: oneapi-src/oneDNN#306, oneapi-src/oneDNN#69, oneapi-src/oneDNN#290
@pengzhao-intel @TaoLv @azai91 @safrooze
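
For readers following along, here is a minimal sketch of the pattern in question (an editorial illustration, not from the original report; it assumes the mxnet-mkl build, and the shapes are arbitrary). The convolution output may be kept in an MKL-DNN internal layout, and slicing it today forces a conversion back to the default layout first:

from mxnet import nd

x = nd.random.uniform(shape=(1, 16, 32, 32))
w = nd.random.uniform(shape=(16, 16, 3, 3))
# The convolution output may be stored in an MKL-DNN blocked layout.
y = nd.Convolution(x, weight=w, no_bias=True, kernel=(3, 3), num_filter=16)
# Today this slice first converts y back to the default layout; with an
# MKL-DNN view it could operate on the blocked layout directly.
s = nd.slice_axis(y, axis=3, begin=0, end=8)
nd.waitall()
print(s.shape)  # (1, 16, 30, 8)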

@pengzhao-intel
Contributor

Yes, I think it's doable and worth doing.

In other words, we need an MKL-DNN based slice OP.

Do you need our engineer to help with this kind of functionality?

@zheng-da
Contributor Author

@pengzhao-intel if your team has the bandwidth to make it happen, that would be great.

@pengzhao-intel
Contributor

OK, we will take over this work and submit a PR later.

@ankkhedia
Contributor

@mxnet-label-bot : [MKLDNN, Feature Request]

@zheng-da
Contributor Author

@safrooze Could you provide a use case for @pengzhao-intel for testing?

@pengzhao-intel
Contributor

@safrooze we're starting the implementation of the slice OP.
Our work will be more focused if you can provide your use case for us.
It's also fine if that's not convenient on your side; in that case we will make the OP as general as possible.

@safrooze
Contributor

safrooze commented Sep 4, 2018

The use case is effectively implementing a circular buffer using concat+slice. Here is the code:


from mxnet import gluon, nd, profiler

profiler.set_config(profile_all=True, aggregate_stats=True,
                    filename='/home/ec2-user/src/mkl_slice_op_profile.json')


class TestBlock(gluon.HybridBlock):
    def __init__(self):
        super(TestBlock, self).__init__()
        with self.name_scope():
            self.conv = gluon.nn.Conv2D(512, kernel_size=(1, 3), dilation=512)

    def hybrid_forward(self, F, x):
        out = self.conv(x)
        x = F.concat(x, out, dim=3)
        x = F.slice_axis(x, axis=3, begin=-1025, end=None)
        # x = F.slice(x, begin=(None, None, None, -1025), end=(None, None, None, None))
        return x


x = nd.random.uniform(shape=(32, 512, 1, 1025))
net = TestBlock()
net.initialize()
net.hybridize(static_alloc=True, static_shape=True)
x = net(x)

profiler.set_state('run')
for _ in range(100):
    x = net(x)

nd.waitall()
profiler.set_state('stop')
profiler.dump()
print(profiler.dumps(reset=True))
exit(0)

And here are the interesting profiling results.

  1. Profile with the mxnet package and the slice_axis operator (no MKL)
operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
slice_axis                            200        4048.8311          20.1010          20.3790          20.2442
Concat                                200       17641.7461          88.0750          89.5890          88.2087
Convolution                           200        2944.2839          14.5890          14.8890          14.7214
DeleteVariable                        206         517.0800           0.0030           2.6670           2.5101
  2. Profile with the mxnet package and the slice operator (no MKL): in hybrid_forward(), uncomment slice and comment out slice_axis. (Consistently performs ~2% better than slice_axis!!)
operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
slice                                 200        3938.1279          19.5190          19.9520          19.6906
Concat                                200       17636.0566          88.0600          88.7120          88.1803
Convolution                           200        2945.0759          14.5760          14.8420          14.7254
DeleteVariable                        206         521.2870           0.0030           2.6960           2.5305
  3. Profile with mxnet-mkl package and slice_axis operator (with MKLDNN)
operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
Reorder                               202           2.9610           0.0000           1.3190           0.0147
slice_axis                            200        4979.5488          24.6100          26.1240          24.8977
Concat                                200         881.7350           4.3000           4.5370           4.4087
Convolution                           200        1231.0720           5.9080          11.6130           6.1554
DeleteVariable                        408         982.9400           0.0030           2.8100           2.4092
  4. Profile with mxnet-mkl package and slice operator (with MKLDNN)
operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
Reorder                               202           2.8510           0.0000           1.2710           0.0141
slice                                 200        5012.6240          24.8500          27.0280          25.0631
Concat                                200         880.1710           4.2900           4.5270           4.4009
Convolution                           200        1252.7841           5.9060          11.7800           6.2639
DeleteVariable                        408         970.0030           0.0040           2.8370           2.3775

@pengzhao-intel
Contributor

Thanks @safrooze :)
@fall4knight will follow up on your test case.

@fall4knight

fall4knight commented Sep 5, 2018

@safrooze Thanks for your use case. I have implemented a first version of an MKL-DNN-backed slice OP.
For nChw16c, which is the most widely used format, MKL-DNN turns out to speed up the slice OP considerably.
We also found that, in the nChw16c case, the larger the input, the bigger the improvement; a small benchmark sketch for checking that scaling follows the tables below.
Please see the profile logs below.

slice w/o MKL-DNN

Name                              Total Count      Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                              -----------      ---------    -------------    -------------    -------------
Reorder                                   202          2.808            0.000            1.318           0.0139
slice                                     202       1145.891            5.357            6.295           5.6727
Convolution                               202        518.247            2.423            5.015           2.5656
CopyCPU2CPU                                 4          4.495            0.020            2.228           1.1237
Concat                                    202        352.702            1.668            4.333           1.7460
_full                                       2          0.023            0.011            0.012           0.0115
_random_uniform                             4         19.740            0.386            9.484           4.9350
_zeros                                      8          6.206            0.003            2.733           0.7757
DeleteVariable                            408        102.104            0.003            0.349           0.2503
ResourceParallelRandomSetSeed               2          6.704            3.351            3.353           3.3520

slice w/ MKL-DNN

Name                              Total Count      Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                              -----------      ---------    -------------    -------------    -------------
Reorder                                   202          2.212            0.000            1.012           0.0110
slice                                     202        507.673            2.395            2.802           2.5132
Convolution                               202        520.934            2.372            4.951           2.5789
CopyCPU2CPU                                 4          5.424            0.023            2.689           1.3560
Concat                                    202        332.056            1.601            2.755           1.6438
_full                                       2          0.025            0.012            0.013           0.0125
_random_uniform                             4         19.853            0.413            9.515           4.9633
_zeros                                      8          8.877            0.004            4.090           1.1096
DeleteVariable                            408         37.766            0.005            0.217           0.1833
ResourceParallelRandomSetSeed               2          7.638            3.818            3.820           3.8190
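
To check the scaling observation above, here is a tiny benchmark sketch (editorial, not from the thread; it assumes the mxnet-mkl package, and the widths are illustrative). The channel count is a multiple of 16 so MKL-DNN is likely to pick a blocked layout such as nChw16c:

import time
from mxnet import nd

def time_slice(width, repeat=100):
    x = nd.random.uniform(shape=(32, 512, 1, width))
    w = nd.random.uniform(shape=(512, 512, 1, 3))
    # The convolution output stays in an MKL-DNN layout, so the slice below
    # exercises the code path this issue is about.
    y = nd.Convolution(x, weight=w, no_bias=True, kernel=(1, 3), num_filter=512)
    nd.waitall()
    start = time.time()
    for _ in range(repeat):
        s = nd.slice_axis(y, axis=3, begin=1, end=None)
    nd.waitall()
    return (time.time() - start) / repeat

for width in (256, 512, 1024, 2048):
    print(width, time_slice(width))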

@safrooze
Contributor

safrooze commented Sep 5, 2018

Great results @fall4knight! Does it make sense to you that slice is about 50% more expensive than concat and almost as expensive as convolution?

@fall4knight

@safrooze I think the reason is that your use case sets dilation=512, in which case the convolution is effectively skipped by the selected algorithm. You can set dilation to a more common value like 1 and see what happens. Thanks.
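
A quick way to see why that convolution is so cheap (an editorial sketch, not from the thread; it reuses the shapes from the use case above): with kernel (1, 3) and dilation 512, the effective kernel width is (3 - 1) * 512 + 1 = 1025, the full input width, so the output collapses to a single column.

from mxnet import gluon, nd

# Sketch: the dilated kernel spans (3 - 1) * 512 + 1 = 1025 columns, i.e. the
# whole input width, so the convolution produces only one output column.
conv = gluon.nn.Conv2D(512, kernel_size=(1, 3), dilation=512)
conv.initialize()
x = nd.random.uniform(shape=(32, 512, 1, 1025))
print(conv(x).shape)  # expected: (32, 512, 1, 1)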

@safrooze
Contributor

@fall4knight Any update on submitting a PR for this fix?

@pengzhao-intel
Contributor

Thanks @safrooze. We are still working on the different types of slice, such as SliceChannel, and on the backward path.

@pengzhao-intel
Contributor

#13730 @zheng-da @safrooze

@huangzhiyuan
Contributor

@pengzhao-intel, @safrooze
Updated profile results with #13730 [Add mkldnn OP for slice]:
the MKL-DNN implementation of slice speeds it up by a little more than 2x, which is consistent with @fall4knight's results.

slice w/o MKL-DNN

Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
Reorder                               202           3.2840           0.0000           1.5200           0.0163
slice                                 200         948.5400           4.3540           5.5340           4.7427
Concat                                200         258.6810           1.2110           1.4840           1.2934
Convolution                           200         474.5550           2.1420           3.8940           2.3728
DeleteVariable                        408         140.0790           0.0050           0.5690           0.3433

slice w/ MKL-DNN

Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
Reorder                               202           3.8630           0.0000           1.7790           0.0191
slice                                 200         437.4760           1.9620           2.4460           2.1874
Concat                                200         273.7890           1.2180           1.6770           1.3689
Convolution                           200         486.8030           2.1530           4.0300           2.4340
DeleteVariable                        206          47.9190           0.0050           0.4690           0.2326

@TaoLv
Member

TaoLv commented Jan 16, 2019

Closed via #13730.

@TaoLv TaoLv closed this as completed Jan 16, 2019