Simplified HybridBlock.forward commit made Sockeye 4% slower #18699

kpuatamazon · 2020-07-13T19:03:32Z

I'm experiencing a 4% slowdown in Sockeye due to commit 83b5170 "Add simplified HybridBlock.forward without F (#17530)".

Commit	Time (s)
`08528c5`	295.90
`56e7985`	295.73
`d4052fd`	295.69
`9a355eb`	295.28
`3840786`	293.25
`83b5170`	293.70
`8e39518`	281.37
`b133899`	282.58
`2f358fd`	281.95
`f01dc80`	283.60
`3667e9a`	283.79
`f7c4323`	282.60

But it's slightly more complicated. At the beginning (f7c432), the build worked with MKLDNN at cb2cc7ac. Then 3667e9a broke the build with an MKLDNN upgrade, a bunch of commits went in with MKLDNN broken so they don't compile, and 08528c5 fixed it by downgrading MKL back to cb2cc7ac.

Hence I wrote this script that downgrades MKLDNN to make stuff build and find the relevant commit:

#!/bin/bash
export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so
export CXXFLAGS="-O3 -march=native -DUSE_MKL -I/opt/intel/mkl/include -pipe"
set -e -o pipefail
. ~/test/bin/activate
cd ~/mxnet
git reset --hard
git checkout --force $1
git clean -xdff
git reset --hard
git submodule foreach --recursive git clean -ffxd
git submodule foreach --recursive git reset --hard
git submodule update --init --recursive
cd 3rdparty/mkldnn/
git checkout cb2cc7ac17ff4e2ef50805c7048d33256d82be4d
cd ../..
rm -rf build
mkdir build
cd build
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
ninja -j 4
cd ../python
pip3 install -e .
~/benchmark.sh

Test conditions:

c5.2xlarge specifically a "Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz"
OMP_NUM_THREADS=3
Forced MKL backend with export CXXFLAGS="-O3 -march=native -DUSE_MKL -I/opt/intel/mkl/include -pipe"
Sockeye 45d704a4
Batch size 1

More broadly, I'm trying to unpick performance differences seen in Sockeye as MXNet has changed since v1.5.x. This image shows commits since master diverged from v1.5.x. v1.5.x is on the left and cbbb864 is on the right.

The first big slowdown is an MKLDNN change on the left but that appears to have been fixed. Then there's a slowdown near the right that doesn't appear to be a single commit but rather a bunch of incremental changes. And this is the first of them I've been able to isolate.

The text was updated successfully, but these errors were encountered:

leezu · 2020-07-13T19:20:28Z

Can you share the benchmark script and point out where it uses the HybridBlock? How many forward / backward operations are run by the script? We need to identify if the regression is due to changes in frontend or backend.

If it's in the frontend, the overhead might be from supporting both the deprecated and to be removed Python APIs as well as the updated API (ie. compatibility / MX1to MX2 transition mode).

mjdenkowski · 2020-07-19T19:05:57Z

This benchmark was run using a trained Sockeye model. This is a full sequence-to-sequence model with HybridBlocks glued together by Python code. We're running inference (translating a test set with beam search), so there shouldn't be any backward operations. We can share a model with input data and a run script if that will help with debugging.

Alternatively, do you have an idea of what good tests would be for determining frontend vs backend speed regressions or multi-API overhead? Are there build/runtime options for MXNet that we can try with our sample model? If we have a good idea of everything that changed in commit 83b5170, we can work through it by process of elimination.

leezu · 2020-07-20T17:56:08Z

@mjdenkowski

This benchmark was run using a trained Sockeye model. This is a full sequence-to-sequence model with HybridBlocks glued together by Python code. We're running inference (translating a test set with beam search), so there shouldn't be any backward operations. We can share a model with input data and a run script if that will help with debugging.

Do you run multiple inference passes in a single run (multiple forward) or only a single one?

Alternatively, do you have an idea of what good tests would be for determining frontend vs backend speed regressions or multi-API overhead?

The commit in question contains both changes in the Python frontend and in the C++ backend. To verify which part introduces an overhead, I recommend to partially apply the commit in question, discarding all Python changes and only applying the C++ changes.

In more detail, for the Python frontend, the following changes in the commit of question may introduce a overhead for your use-case:

if self.hybrid_forward.__func__ is not HybridBlock.hybrid_forward: checks which return True and thus trigger the MXNet 1 compatible code-path. You can refer to the block.py file.
C API call MXNDArraySetDeferredComputeVariable in parameter.py when getting a parameter data array.
Call op.cast instead of op.empty and op.copyto for array.astype implementation. See `ndarray.py

Note that these changes do not affect the 1.x branch and your code will stop working on the master branch the closer we get to MXNet 2 release due to API changes (#17676). Most importantly, the existing ndarray operators will be removed in favor of the numpy operators. Your code currently only runs due to convenience backwards compatibility provided during the early stages of MXNet 2 development, which will be removed before the release.

kpuatamazon · 2020-07-24T14:21:50Z

I split commit 83b5170 into just C++ changes and pushed to this branch so you can see precisely what it looks like: https://github.com/kpuatamazon/incubator-mxnet/tree/split-hybrid . The C++ changes made it faster then the Python changes made it slower.

Similar experimental setup.

Time (s)	Commit
288.385	`83b5170` + downgrade MKLDNN to cb2cc7a
271.170	`8e39518` + downgrade MKLDNN to cb2cc7a + only the C++ changes from `83b5170` at https://github.com/kpuatamazon/incubator-mxnet/tree/split-hybrid
281.465	`8e39518` + downgrade MKLDNN to cb2cc7a

leezu · 2020-07-24T16:24:44Z

Thank you for splitting up the commit and comparing the frontend to backend changes! To find out which part of the Python frontend changes is causing the overhead, I'd suggest to apply all changes from 83b5170 minus respectively a) multiarray.py and ndarray.py b) parameter.py c) block.py. My assumption is that the extra C API call in parameter.py causes the overhead and that variant b) would resolve the overhead. This works because Sockeye doesn't use HybridBlock.forward but HybridBlock.hybrid_forward

mjdenkowski · 2020-07-24T17:28:21Z

Thanks for your help with this, Leonard!

If this turns out to be the issue, are there any Sockeye-level changes that would avoid the extra overhead (forward vs hybrid_forward, etc.), or is this something that needs to be addressed in MXNet?

leezu · 2020-07-24T20:06:11Z

@mjdenkowski when Sockeye adopts MXNet 2, you will need to rename from hybrid_forward(F, ...) to forward(...). The overhead you currently experience could be from the additional C API call in parameter.py and the implementation in parameter.py would need to be refactored to remove the overhead.

For verifying the hypothesis, one can simply revert the changes in parameter.py while continuing to use hybrid_forward(F, ...). But that won't work in the final MXNet 2 release.

kpuatamazon · 2020-07-24T21:50:04Z

3 runs each sorted by increasing time taken. All downgrade MKLDNN to cb2cc7a + force MKL backend hack.

time	time	time	git checkout `8e39518` ...
288.442	293.029	295.974	nothing, just keep `83b5170`
287.809	299.730	307.538	a: `python/mxnet/numpy/multiarray.py python/mxnet/ndarray/ndarray.py`
289.946	290.054	292.440	b: `python/mxnet/gluon/parameter.py`
290.481	296.822	308.965	c: `python/mxnet/gluon/block.py`
275.382	282.285	284.385	justc++: `python/mxnet/__init__.py python/mxnet/_deferred_compute.py python/mxnet/gluon/parameter.py python/mxnet/ndarray/ndarray.py python/mxnet/ndarray/sparse.py python/mxnet/numpy/multiarray.py python/mxnet/gluon/block.py` (ok, I had to delete stuff that didn't exist in `83b5170` instead of `git checkout`)

kpuatamazon · 2020-07-27T10:01:53Z

Ran these in a loop over the weekend with 136 samples.

keep `83b5170`	a	b	c	justc++
301.324	298.468	294.094	301.243	289.476	Mean
19.978	20.590	19.759	20.746	19.351	Sample stddev
266.168	262.355	260.640	266.384	255.469	Min
287.478	284.945	279.012	288.021	278.632	Q1
307.157	303.188	297.966	306.039	294.594	Median

Running a 1-tailed t-test assuming the same variance, we can say that the new version of python/mxnet/gluon/parameter.py (included in "keep 83b5170") is slower than the old version (in "b") with 99.8% confidence. There may also be other effects like a) seems to be faster than keep but I can't say it's significantly slower.

Raw data

leezu · 2020-07-27T18:57:00Z

Thank you @kpuatamazon. I'll make sure to fix this regression before the stable release. One idea is to speed up the communication with the backend, ie by moving away from ctypes.

As you mentioned that this is one of many changes causing regressions: if you would like to look at the following changes resulting in slowdown, you should be able to disregard the Python changes of 83b5170 (at least in parameter.py, multiarray.py, ndarray.py which shouldn't cause any conflicts due to the simplicity of the respective changes)

For b), it's still at 260.640 vs 255.469 Min runtime of the C++ only changes.. I think this may be due to the is_deferred_compute() calls in __getitem__, which also calls the C API to figure out if deferred compute is enabled or not.
Does your code use __getitem__ outside of the HybridBlock.hybrid_forward? This can be fixed by getting rid of global state in the MXNet backend and managing the state in the language frontend or speeding up the calls to the backend.

fhieber · 2020-07-27T19:42:36Z

Hi @leezu thanks for the insight!
What __get_item__ are you referring to, for parameters?

leezu · 2020-07-27T21:17:16Z

@fhieber Sorry for being unclear. I'm referring to the 5 seconds overhead that remains after reverting parameter.py. I think it may be introduced by the extra C API call via is_deferred_compute() in __getitem__ method in multiarray.py / ndarray.py

kpuatamazon added the Bug label Jul 13, 2020

szha assigned leezu Jul 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplified HybridBlock.forward commit made Sockeye 4% slower #18699

Simplified HybridBlock.forward commit made Sockeye 4% slower #18699

kpuatamazon commented Jul 13, 2020 •

edited

Loading

leezu commented Jul 13, 2020

mjdenkowski commented Jul 19, 2020

leezu commented Jul 20, 2020

kpuatamazon commented Jul 24, 2020 •

edited

Loading

leezu commented Jul 24, 2020

mjdenkowski commented Jul 24, 2020

leezu commented Jul 24, 2020

kpuatamazon commented Jul 24, 2020 •

edited

Loading

kpuatamazon commented Jul 27, 2020 •

edited

Loading

leezu commented Jul 27, 2020 •

edited

Loading

fhieber commented Jul 27, 2020

leezu commented Jul 27, 2020

Simplified HybridBlock.forward commit made Sockeye 4% slower #18699

Simplified HybridBlock.forward commit made Sockeye 4% slower #18699

Comments

kpuatamazon commented Jul 13, 2020 • edited Loading

leezu commented Jul 13, 2020

mjdenkowski commented Jul 19, 2020

leezu commented Jul 20, 2020

kpuatamazon commented Jul 24, 2020 • edited Loading

leezu commented Jul 24, 2020

mjdenkowski commented Jul 24, 2020

leezu commented Jul 24, 2020

kpuatamazon commented Jul 24, 2020 • edited Loading

kpuatamazon commented Jul 27, 2020 • edited Loading

leezu commented Jul 27, 2020 • edited Loading

fhieber commented Jul 27, 2020

leezu commented Jul 27, 2020

kpuatamazon commented Jul 13, 2020 •

edited

Loading

kpuatamazon commented Jul 24, 2020 •

edited

Loading

kpuatamazon commented Jul 24, 2020 •

edited

Loading

kpuatamazon commented Jul 27, 2020 •

edited

Loading

leezu commented Jul 27, 2020 •

edited

Loading