This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Simplified HybridBlock.forward commit made Sockeye 4% slower #18699

Open
kpuatamazon opened this issue Jul 13, 2020 · 12 comments
@kpuatamazon
Contributor

kpuatamazon commented Jul 13, 2020

I'm experiencing a 4% slowdown in Sockeye due to commit 83b5170 "Add simplified HybridBlock.forward without F (#17530)".

Commit Time (s)
08528c5 295.90
56e7985 295.73
d4052fd 295.69
9a355eb 295.28
3840786 293.25
83b5170 293.70
8e39518 281.37
b133899 282.58
2f358fd 281.95
f01dc80 283.60
3667e9a 283.79
f7c4323 282.60

But it's slightly more complicated. At the beginning (f7c4323), the build worked with MKLDNN at cb2cc7ac. Then 3667e9a broke the build with an MKLDNN upgrade, a series of commits landed while MKLDNN was broken (so they don't compile on their own), and 08528c5 fixed the build by downgrading MKLDNN back to cb2cc7ac.

Hence I wrote this script, which downgrades MKLDNN so that each commit builds, to bisect for the relevant commit:

#!/bin/bash
set -e -o pipefail
# Force the MKL BLAS backend.
export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so
export CXXFLAGS="-O3 -march=native -DUSE_MKL -I/opt/intel/mkl/include -pipe"
. ~/test/bin/activate
cd ~/mxnet
# Check out the commit under test and scrub the tree and all submodules.
git checkout --force "$1"
git clean -xdff
git reset --hard
git submodule foreach --recursive git clean -ffxd
git submodule foreach --recursive git reset --hard
git submodule update --init --recursive
# Pin MKLDNN to the known-good revision so every commit builds.
cd 3rdparty/mkldnn/
git checkout cb2cc7ac17ff4e2ef50805c7048d33256d82be4d
cd ../..
# Clean CPU-only release build.
rm -rf build
mkdir build
cd build
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
ninja -j 4
# Install the Python package and run the benchmark.
cd ../python
pip3 install -e .
~/benchmark.sh

Test conditions:

  • c5.2xlarge, specifically an "Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz"
  • OMP_NUM_THREADS=3
  • Forced MKL backend with export CXXFLAGS="-O3 -march=native -DUSE_MKL -I/opt/intel/mkl/include -pipe"
  • Sockeye 45d704a4
  • Batch size 1

More broadly, I'm trying to unpick performance differences seen in Sockeye as MXNet has changed since v1.5.x. This image shows commits since master diverged from v1.5.x. v1.5.x is on the left and cbbb864 is on the right.

[image: per-commit benchmark times on master, v1.5.x on the left, cbbb864 on the right]

The first big slowdown is an MKLDNN change on the left but that appears to have been fixed. Then there's a slowdown near the right that doesn't appear to be a single commit but rather a bunch of incremental changes. And this is the first of them I've been able to isolate.

@leezu
Contributor

leezu commented Jul 13, 2020

Can you share the benchmark script and point out where it uses the HybridBlock? How many forward / backward operations are run by the script? We need to identify if the regression is due to changes in frontend or backend.

If it's in the frontend, the overhead might be from supporting both the deprecated, to-be-removed Python APIs and the updated API (i.e. the compatibility / MX1-to-MX2 transition mode).

@mjdenkowski

This benchmark was run using a trained Sockeye model. This is a full sequence-to-sequence model with HybridBlocks glued together by Python code. We're running inference (translating a test set with beam search), so there shouldn't be any backward operations. We can share a model with input data and a run script if that will help with debugging.

Alternatively, do you have an idea of what good tests would be for determining frontend vs backend speed regressions or multi-API overhead? Are there build/runtime options for MXNet that we can try with our sample model? If we have a good idea of everything that changed in commit 83b5170, we can work through it by process of elimination.

@leezu
Contributor

leezu commented Jul 20, 2020

@mjdenkowski

This benchmark was run using a trained Sockeye model. This is a full sequence-to-sequence model with HybridBlocks glued together by Python code. We're running inference (translating a test set with beam search), so there shouldn't be any backward operations. We can share a model with input data and a run script if that will help with debugging.

Do you run multiple inference passes in a single run (multiple forward) or only a single one?

Alternatively, do you have an idea of what good tests would be for determining frontend vs backend speed regressions or multi-API overhead?

The commit in question contains changes in both the Python frontend and the C++ backend. To verify which part introduces the overhead, I recommend partially applying the commit, discarding all Python changes and keeping only the C++ changes.

In more detail, for the Python frontend, the following changes in the commit in question may introduce an overhead for your use-case:

  • The if self.hybrid_forward.__func__ is not HybridBlock.hybrid_forward: checks, which return True and thus trigger the MXNet 1 compatible code path. See block.py.
  • The C API call MXNDArraySetDeferredComputeVariable in parameter.py when getting a parameter's data array.
  • Calling op.cast instead of op.empty plus op.copyto in the array.astype implementation. See ndarray.py.
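As a side note, the first bullet relies on a standard Python dispatch trick: comparing a bound method's underlying function (__func__) against the base-class attribute reveals whether a subclass overrode it. A minimal self-contained sketch of the pattern, with illustrative stand-in classes rather than MXNet's actual code:

```python
class HybridBlock:
    """Illustrative stand-in for mxnet.gluon.HybridBlock (not the real class)."""

    def hybrid_forward(self, F, x):
        raise NotImplementedError

    def forward(self, x):
        raise NotImplementedError

    def __call__(self, x):
        # If a subclass overrode hybrid_forward, take the MXNet 1
        # compatibility path; otherwise use the new forward path.
        if self.hybrid_forward.__func__ is not HybridBlock.hybrid_forward:
            return self.hybrid_forward(None, x)  # legacy path; F unused here
        return self.forward(x)


class LegacyNet(HybridBlock):
    def hybrid_forward(self, F, x):
        return x * 2  # dispatched via the compatibility path


class NewNet(HybridBlock):
    def forward(self, x):
        return x * 3  # dispatched via the new path


print(LegacyNet()(5))  # 10
print(NewNet()(5))     # 15
```

The check itself is cheap, but it runs on every call, and the compatibility path it guards adds further per-call Python work.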

Note that these changes do not affect the 1.x branch, and your code will stop working on the master branch as we get closer to the MXNet 2 release due to API changes (#17676). Most importantly, the existing ndarray operators will be removed in favor of the numpy operators. Your code currently only runs thanks to backwards-compatibility conveniences provided during the early stages of MXNet 2 development, which will be removed before the release.

@kpuatamazon
Contributor Author

kpuatamazon commented Jul 24, 2020

I split commit 83b5170 into just the C++ changes and pushed the result to this branch so you can see precisely what it looks like: https://github.com/kpuatamazon/incubator-mxnet/tree/split-hybrid . The C++ changes made it faster; then the Python changes made it slower.

Similar experimental setup.

Time (s) Commit
288.385 83b5170 + downgrade MKLDNN to cb2cc7a
271.170 8e39518 + downgrade MKLDNN to cb2cc7a + only the C++ changes from 83b5170 at https://github.com/kpuatamazon/incubator-mxnet/tree/split-hybrid
281.465 8e39518 + downgrade MKLDNN to cb2cc7a

@leezu
Contributor

leezu commented Jul 24, 2020

Thank you for splitting up the commit and comparing the frontend to backend changes! To find out which part of the Python frontend changes is causing the overhead, I'd suggest applying all changes from 83b5170 minus, respectively: a) multiarray.py and ndarray.py, b) parameter.py, c) block.py. My assumption is that the extra C API call in parameter.py causes the overhead and that variant b) would resolve it. This works because Sockeye doesn't use HybridBlock.forward but HybridBlock.hybrid_forward.

@mjdenkowski

Thanks for your help with this, Leonard!

If this turns out to be the issue, are there any Sockeye-level changes that would avoid the extra overhead (forward vs hybrid_forward, etc.), or is this something that needs to be addressed in MXNet?

@leezu
Contributor

leezu commented Jul 24, 2020

@mjdenkowski when Sockeye adopts MXNet 2, you will need to rename hybrid_forward(F, ...) to forward(...). The overhead you currently experience could be from the additional C API call in parameter.py, whose implementation would need to be refactored to remove the overhead.

For verifying the hypothesis, one can simply revert the changes in parameter.py while continuing to use hybrid_forward(F, ...). But that won't work in the final MXNet 2 release.

@kpuatamazon
Contributor Author

kpuatamazon commented Jul 24, 2020

3 runs each, sorted by increasing time taken. All runs downgrade MKLDNN to cb2cc7a and force the MKL backend hack.

Run 1 (s) Run 2 (s) Run 3 (s) Files checked out from 8e39518
288.442   293.029   295.974   nothing, just keep 83b5170
287.809   299.730   307.538   a: python/mxnet/numpy/multiarray.py python/mxnet/ndarray/ndarray.py
289.946   290.054   292.440   b: python/mxnet/gluon/parameter.py
290.481   296.822   308.965   c: python/mxnet/gluon/block.py
275.382   282.285   284.385   justc++: python/mxnet/__init__.py python/mxnet/_deferred_compute.py python/mxnet/gluon/parameter.py python/mxnet/ndarray/ndarray.py python/mxnet/ndarray/sparse.py python/mxnet/numpy/multiarray.py python/mxnet/gluon/block.py (I had to delete files that didn't exist in 83b5170 instead of using git checkout)

@kpuatamazon
Contributor Author

kpuatamazon commented Jul 27, 2020

Ran these in a loop over the weekend with 136 samples.

keep-83b5170  a        b        c        justc++
301.324       298.468  294.094  301.243  289.476  Mean
19.978        20.590   19.759   20.746   19.351   Sample stddev
266.168       262.355  260.640  266.384  255.469  Min
287.478       284.945  279.012  288.021  278.632  Q1
307.157       303.188  297.966  306.039  294.594  Median

Running a one-tailed t-test assuming equal variance, we can say that the new version of python/mxnet/gluon/parameter.py (included in "keep 83b5170") is slower than the old version (in "b") with 99.8% confidence. There may be other effects too: a) seems to be faster than keeping 83b5170, but I can't say the difference is significant.
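For reference, the 99.8% figure can be reproduced from the summary statistics above. Here is a sketch of the pooled one-tailed two-sample t-test using the reported means, sample standard deviations, and 136 samples per group (the t CDF is approximated by the normal CDF, which is reasonable at 270 degrees of freedom):

```python
import math


def one_tailed_pooled_t(mean1, std1, n1, mean2, std2, n2):
    """One-tailed two-sample t-test (H1: mean1 > mean2) with pooled variance."""
    pooled_var = ((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / se
    # Normal approximation to the t CDF; fine for df = 270.
    p = 0.5 * math.erfc(t / math.sqrt(2))
    return t, p


# "keep 83b5170" vs variant b (parameter.py reverted), 136 samples each
t, p = one_tailed_pooled_t(301.324, 19.978, 136, 294.094, 19.759, 136)
print(round(t, 2), round(1 - p, 4))  # t ≈ 3.0, confidence ≈ 0.9987
```

With a t statistic near 3.0 on 270 degrees of freedom, the one-tailed confidence comes out just under 99.9%, consistent with the 99.8% claim.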

Raw data

@leezu
Contributor

leezu commented Jul 27, 2020

Thank you @kpuatamazon. I'll make sure to fix this regression before the stable release. One idea is to speed up the communication with the backend, i.e. by moving away from ctypes.

As you mentioned that this is one of many changes causing regressions: if you would like to look at the subsequent changes resulting in slowdowns, you should be able to disregard the Python changes of 83b5170 (at least in parameter.py, multiarray.py, and ndarray.py, which shouldn't cause any conflicts due to the simplicity of the respective changes).

For b), the minimum runtime is still 260.640 s vs 255.469 s for the C++-only changes. I think this may be due to the is_deferred_compute() calls in __getitem__, which also call the C API to figure out whether deferred compute is enabled.
Does your code use __getitem__ outside of HybridBlock.hybrid_forward? This can be fixed by getting rid of global state in the MXNet backend and managing the state in the language frontend, or by speeding up the calls to the backend.
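The proposed fix, managing the deferred-compute flag in the Python frontend instead of asking the backend on every indexing call, can be sketched with a toy model. Everything below (is_deferred_compute, the array classes, the timing harness) is an illustrative stand-in for the real ctypes round-trip, not MXNet code:

```python
import timeit

_backend_flag = False  # stands in for global state held in the C++ backend


def is_deferred_compute():
    """Stand-in for the per-call C API query made in __getitem__."""
    return _backend_flag


class QueriesBackend:
    """Asks the 'backend' on every indexing operation."""
    def __init__(self, data):
        self.data = data

    def __getitem__(self, i):
        if is_deferred_compute():
            raise RuntimeError("symbolic slicing path under deferred compute")
        return self.data[i]


class CachesFrontend:
    """Reads a frontend-managed flag instead (the suggested fix)."""
    deferred = False  # toggled by the Python frontend, no backend round-trip

    def __init__(self, data):
        self.data = data

    def __getitem__(self, i):
        if CachesFrontend.deferred:
            raise RuntimeError("symbolic slicing path under deferred compute")
        return self.data[i]


a, b = QueriesBackend(range(10)), CachesFrontend(range(10))
assert a[3] == b[3] == 3
slow = timeit.timeit(lambda: a[3], number=100_000)
fast = timeit.timeit(lambda: b[3], number=100_000)
print(f"backend query: {slow:.3f}s, frontend flag: {fast:.3f}s")
```

In the real hot path the query crosses the Python/C boundary through ctypes, so the per-call gap is far larger than this pure-Python toy suggests.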

@fhieber
Contributor

fhieber commented Jul 27, 2020

Hi @leezu thanks for the insight!
What __getitem__ are you referring to, for parameters?

@leezu
Contributor

leezu commented Jul 27, 2020

@fhieber Sorry for being unclear. I'm referring to the 5-second overhead that remains after reverting parameter.py. I think it may be introduced by the extra C API call via is_deferred_compute() in the __getitem__ method in multiarray.py / ndarray.py.

4 participants