This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

fix naive engine for multi-threaded inference #15574

Merged
merged 3 commits into apache:master on Aug 21, 2019

Conversation

arcadiaphy
Member

Description

In the cpp example of image classification, MXPredCreateMultiThread is used to perform multi-threaded inference with the naive engine. The req_completed_ variable in the naive engine is not thread-safe; this PR fixes that.

The problem is reported in #8966.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Make req_completed_ in the naive engine thread-safe (ultimately removed in favor of a stateless implementation)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@apeforest
Contributor

@anirudh2290 Could you please take a look?

@marcoabreu
Contributor

Thanks for the contribution, but MXNet's frontend is not considered thread-safe in general. Thus, I'd prefer that we hold off on this fix.

I feel like we should rather do a proper analysis of the thread-safety of MXNet and then act on that. Small patches won't really solve the problem but might introduce further issues.

@@ -238,7 +238,11 @@ class NaiveEngine final : public Engine {
     static_cast<NaiveEngine*>(engine)->req_completed_ = true;
   }
   // whether action is completed
-  bool req_completed_;
+#if DMLC_CXX11_THREAD_LOCAL
+  static thread_local bool req_completed_;

@pengzhao-intel
Contributor

@ZhennanQin could you help review the fix?

@ZhennanQin
Contributor

I just evaluated the thread-safety situation for NaiveEngine, and I agree with @marcoabreu that this point fix is not enough. There are too many data races when using MXNet from multiple threads; even for NaiveEngine we need to take care of all class members, not only req_completed_. Creating a private NaiveEngine for each thread is another option. So we need a whole-picture plan for MXNet multi-thread support.

+  static thread_local bool req_completed_;
+#else
+  static MX_THREAD_LOCAL bool req_completed_;
+#endif
Member

This introduces an additional overhead for thread local, and the number of ops in mxnet model can be in the scale of 100s. I am not sure we should add this without a build flag since we are adding overhead for single threaded use cases too.

@marcoabreu
Contributor
Jul 19, 2019

Please, not another build flag - at some point we're gonna spend more money on CI than NASA :D

Could we make some benchmarks to determine the impact here?

I'm not that deep inside the engine code, so excuse me if this is a stupid question. But why are we declaring the internal variables of the engine as thread-local instead of creating thread-local instantiations of the engine? Each thread gets its own engine (or multiple engines, or a single engine shared across threads and synchronized with locks), and we switch all static variables to private ones. The current design of the engine doesn't seem to have the creation of multiple instantiations in mind, so we should fix that instead of making all the static variables thread-local.
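A minimal sketch of the per-thread-instance idea (class and function names are hypothetical, not MXNet's actual engine API): the whole engine object is thread-local, so completion state can stay a plain member and no individual member needs to be static or thread_local.

```cpp
#include <cassert>

// Hypothetical engine with per-instance state only (no statics).
class PerThreadEngine {
 public:
  void PushSync() {
    req_completed_ = false;
    // ... run the operator synchronously here ...
    req_completed_ = true;
  }
  bool completed() const { return req_completed_; }

 private:
  bool req_completed_ = false;  // plain member: state is per instance
};

// One engine instance per thread: the object itself is thread_local,
// not its individual members, so threads never share mutable state.
PerThreadEngine& ThreadLocalEngine() {
  static thread_local PerThreadEngine engine;
  return engine;
}
```

Within one thread, repeated calls to `ThreadLocalEngine()` return the same instance; a different thread gets its own copy, constructed on first use.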

@anirudh2290
Member

@marcoabreu @ZhennanQin I agree we need a bigger solution for the problem at hand. Having said that, the scope of this PR looks limited to fixing the existing multi-threaded inference path in the C Predict API, which should be fine.

@arcadiaphy
Member Author

arcadiaphy commented Jul 18, 2019

@apeforest @ZhennanQin @marcoabreu @anirudh2290
Multi-threaded inference is very common in deployment, so we need a solution for this scenario and a workable example for users. Let me take some time to sum up what I know so far.

Since the threaded engine uses multi-thread scheduling for operator computation, a natural idea is to utilize it for parallel inference, which is covered in this proposal. There are two main problems with this kind of usage in MXNet:

  • The threaded engine is not thread-safe, so we need to restrict computation pushing and result pulling to one thread; otherwise we'll hit all kinds of errors in the engine.
  • For computation pushing, one thread is OK, but when we pull the result using functions like asnumpy, it blocks the thread and increases latency.

Due to these problems and the lack of a workable example, I focus on the naive engine instead. The only example in MXNet now is the cpp predict example with the MXPredCreateMultiThread API, which creates an executor for each thread while sharing model parameters, a very attractive option with regard to both speed and memory footprint. In a production environment, the naive engine looks like a more usable method, since it restricts one inference to a single thread. I also think this approach is the basis for more advanced methods: if you can't make it work in the naive engine, then very likely you can't make it work in the threaded engine either. There are two problems with this method too:

  • The naive engine is not thread-safe, which I want to fix in this PR. Without this fix, the cpp example is not usable at all. I agree that a more thorough analysis is needed.
  • There may also be some issues in MKLDNN calls; I've found a very strange phenomenon, described in Multi-threaded inference broken with MKLDNN #15576. I hope that this can be fixed too, since MKLDNN is so good.

@arcadiaphy
Member Author

@apeforest @anirudh2290
Do we really need req_completed_ in the naive engine? I'm inclined to remove it now.

@anirudh2290
Member

@arcadiaphy I guess that would be nice to explore. We can remove it only if we can make sure that every exec_fun pushed makes a call to the callback, since its original purpose is to prevent async calls from being pushed.
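The guard being described can be sketched like this (a simplified illustration with hypothetical names, not the actual NaiveEngine code): the flag exists so the engine can detect an exec_fun that defers its completion callback instead of completing synchronously.

```cpp
#include <cassert>
#include <functional>

using CallbackFn = std::function<void()>;
using ExecFn = std::function<void(CallbackFn)>;

struct NaiveEngineSketch {
  bool req_completed_ = false;

  void PushSync(const ExecFn& exec_fun) {
    req_completed_ = false;
    // The exec_fun is expected to invoke the callback before returning.
    exec_fun([this]() { req_completed_ = true; });
    // If exec_fun stashed the callback for later (an async push),
    // the flag is still false here and the check fires.
    assert(req_completed_ && "exec_fun must complete synchronously");
  }
};
```

Removing the flag is safe only if every pushed exec_fun is guaranteed to invoke its callback before returning, which is exactly the condition discussed here.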

@marcoabreu
Contributor

marcoabreu commented Jul 19, 2019

I understand that you're trying to keep the changes to a minimum, but judging from the static variable definition, it doesn't seem like the engine was designed to be thread-safe, or even to allow multiple instantiations. I'd prefer that we make the engine access only internal variables, get rid of static ones (aka proper object-oriented design), and add synchronization to the APIs.

But I don't really have a dog in that hunt, so take my opinion with a grain of salt here.

@arcadiaphy
Member Author

arcadiaphy commented Jul 21, 2019

@anirudh2290 Yes, in order to prevent async calls from being pushed, a single req_completed_ is not enough. We can add req_completed_ to NaiveOpr and check whether it's been set to true in the complete callback.

Edit:
I used a simpler implementation that removes the field req_completed_ entirely. Since the CPU stream is a dummy stream, for CPU mode the naive engine is now stateless, not counting the bool flag shutdown_phase_.

@piyushghai
Contributor

@apeforest @anirudh2290 Gentle ping for the review.

@anirudh2290 anirudh2290 merged commit d225074 into apache:master Aug 21, 2019
@arcadiaphy arcadiaphy deleted the pr_thread branch December 5, 2019 10:36
7 participants