Conversation

@snnn
Contributor

@snnn snnn commented Dec 11, 2018

This reverts PRs #86 and #115 to unblock our CI build.

 ./onnx_test_runner -e mkldnn /data/testdata/onnx/5af210ca8a1c73aa6bae8754c9346ec54d0a756e
2018-12-11 11:10:43.385313765 [E:onnxruntime:Default, runner.cc:131 ParallelRunTests] Running tests in parallel: at most 8 models at any time
2018-12-11 11:10:43.451823048 [E:onnxruntime:Default, runner.cc:457 RunTaskImpl] sum_two_inputs:output=result:expected 6, got 10, diff: 4, tol=0.00106. 3 of 3 differ
2018-12-11 11:10:43.451847970 [E:onnxruntime:Default, runner.h:72 finish] sum_two_inputs: result differs. Dataset:/data/testdata/onnx/5af210ca8a1c73aa6bae8754c9346ec54d0a756e/node/test_sum_two_inputs/test_data_set_0

The implementation is not thread safe. Please fix the memory errors found by valgrind.

==13813== Memcheck, a memory error detector
==13813== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==13813== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==13813== Command: ./onnx_test_runner -e mkldnn /data/testdata/onnx/5af210ca8a1c73aa6bae8754c9346ec54d0a756e
==13813== 
2018-12-11 10:47:39.989740200 [E:onnxruntime:Default, runner.cc:131 ParallelRunTests] Running tests in parallel: at most 8 models at any time
==13813== Thread 6:
==13813== Conditional jump or move depends on uninitialised value(s)
==13813==    at 0x67FD06A: vfprintf (vfprintf.c:1642)
==13813==    by 0x68251EF: vsnprintf (vsnprintf.c:114)
==13813==    by 0x680479E: snprintf (snprintf.c:33)
==13813==    by 0x4ECB81F: void mkldnn::impl::init_info_eltwise<mkldnn::impl::eltwise_fwd_pd_t>(mkldnn::impl::eltwise_fwd_pd_t*, char*) (verbose.hpp:166)
==13813==    by 0x4F042DC: init_info (eltwise_pd.hpp:44)
==13813==    by 0x4F042DC: mkldnn_status_t mkldnn_primitive_desc::create<mkldnn::impl::cpu::jit_uni_eltwise_fwd_t<(mkldnn::impl::cpu::cpu_isa_t)3>::pd_t>(mkldnn_primitive_desc**, mkldnn::impl::op_desc_t const*, mkldnn_primitive_attr const*, mkldnn_engine*, mkldnn_primitive_desc const*) (primitive_desc.hpp:93)
==13813==    by 0x4EA82F6: operator++ (primitive_iterator.hpp:51)
==13813==    by 0x4EA82F6: mkldnn_primitive_desc_iterator_create_v2 (primitive_iterator.cpp:39)
==13813==    by 0x494F77: primitive_desc (mkldnn.hpp:1262)
==13813==    by 0x494F77: primitive_desc (mkldnn.hpp:2389)
==13813==    by 0x494F77: onnxruntime::mkl_dnn::(anonymous namespace)::ReluPrimitive<float>::Initialize(onnxruntime::mkl_dnn::(anonymous namespace)::ReluParams const&) (activations.cc:125)
==13813==    by 0x498DDC: ReluPrimitive (activations.cc:46)
==13813==    by 0x498DDC: make_unique<onnxruntime::mkl_dnn::(anonymous namespace)::ReluPrimitive<float>, const onnxruntime::mkl_dnn::(anonymous namespace)::ReluParams&> (unique_ptr.h:825)
==13813==    by 0x498DDC: Get (activations.cc:161)
==13813==    by 0x498DDC: onnxruntime::mkl_dnn::Relu<float>::Compute(onnxruntime::OpKernelContext*) const (activations.cc:203)
==13813==    by 0x668600: onnxruntime::SequentialExecutor::Execute(onnxruntime::SessionState const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, onnxruntime::MLValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, onnxruntime::MLValue> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<onnxruntime::MLValue, std::allocator<onnxruntime::MLValue> >&, onnxruntime::logging::Logger const&) (sequential_executor.cc:102)
==13813==    by 0x461F08: onnxruntime::InferenceSession::Impl::Run(ONNXRuntimeRunOptions const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, onnxruntime::MLValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, onnxruntime::MLValue> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<onnxruntime::MLValue, std::allocator<onnxruntime::MLValue> >*) (inference_session.cc:777)
==13813==    by 0x45850C: onnxruntime::InferenceSession::Run(ONNXRuntimeRunOptions const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, onnxruntime::MLValue, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, onnxruntime::MLValue> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<onnxruntime::MLValue, std::allocator<onnxruntime::MLValue> >*) (inference_session.cc:1117)
==13813==    by 0x44C38C: ONNXRuntimeRunInference (onnxruntime_c_api.cc:440)
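
The trace above shows Relu<float>::Compute reaching a Get call that constructs a ReluPrimitive on demand. The thread does not say how that object is stored or shared between threads, so the following is only a generic sketch (in Python, for brevity) of one way to make lazy construction safe when several threads call Compute at once. PrimitiveCache, get_or_create and the cache key are hypothetical names, not onnxruntime or mkldnn APIs.

```python
import threading


class PrimitiveCache:
    """Hypothetical lazily filled cache that is safe to share across threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = {}

    def get_or_create(self, key, factory):
        # Lookup and insertion happen under one lock, so two threads asking for
        # the same key can never both run the factory or see a half-built entry.
        with self._lock:
            if key not in self._items:
                self._items[key] = factory()
            return self._items[key]


def demo(num_threads=8):
    cache = PrimitiveCache()
    created = []  # appended to only while the cache lock is held

    def factory():
        created.append(object())
        return created[-1]

    results = []
    results_lock = threading.Lock()

    def worker():
        # The key stands in for whatever identifies a primitive (op, type, shape).
        item = cache.get_or_create(("relu", "float", (3,)), factory)
        with results_lock:
            results.append(item)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # (number of objects built, number of distinct objects handed out)
    return len(created), len({id(r) for r in results})


print(demo())  # (1, 1)
```

A lock only protects construction and lookup. If the cached object itself holds per-call state, such as input and output buffers, sharing it between threads would still be unsafe, and a per-thread cache (threading.local in Python) would be the more likely fit.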

@snnn snnn requested a review from a team December 11, 2018 19:14
@snnn snnn changed the title from Revert "mkldnn activations.relu (#86)" to Revert "mkldnn Relu/Sum/BatchNormalization kernels" Dec 11, 2018
@snnn
Contributor Author

snnn commented Dec 11, 2018

@jywu-msft
Member

@pranavsharma fyi

@jywu-msft
Member

@sreekanth-yalachigere can you please investigate?

@snnn reports that valgrind finds an error in mkldnn relu, and that onnx_test_runner -e mkldnn intermittently fails the node test "test_sum_two_inputs":

/data/testdata/onnx/5af210ca8a1c73aa6bae8754c9346ec54d0a756e/node/test_sum_two_inputs/test_data_set_0

@sreekanth-yalachigere
Contributor

@snnn @jywu-msft working on it. Thanks.

@sreekanth-yalachigere
Contributor

@snnn @jywu-msft I am having a hard time reproducing the memory leak. My log is at
https://github.com/sreekanth-yalachigere/SYLog/blob/master/valgrind

Please let me know if I am running it differently.

@sreekanth-yalachigere
Contributor

For the sum error, I tried to reproduce the problem by running the test runner on sum_two_inputs. I get the following version mismatch, and the runner falls back to the CPU implementation.

Sum kernel is not supported in MKLDNNExecutionProvider Encountered following errors: Op: Sum Version mismatch. node_version: 8 kernel start version: 6 kernel_end_version: 2147483647

Complete log:

./onnx_test_runner -A  -r 1 -v  -e mkldnn /home/padma/mkldnn-data/data/sum2
2018-12-11 13:57:43.063640910 [I:onnxruntime:sum_two_inputs, inference_session.cc:309 Initialize] Initializing session.
2018-12-11 13:57:43.063678786 [I:onnxruntime:sum_two_inputs, inference_session.cc:323 Initialize] Adding default CPU execution provider.
2018-12-11 13:57:43.064624799 [I:onnxruntime:Default, kernel_registry.cc:220 TryFindKernel] Sum kernel is not supported in MKLDNNExecutionProvider Encountered following errors: Op: Sum Version mismatch. node_version: 8 kernel start version: 6 kernel_end_version: 2147483647
2018-12-11 13:57:43.064650915 [I:onnxruntime:Default, kernel_registry.cc:220 TryFindKernel] Sum kernel is not supported in MKLDNNExecutionProvider Encountered following errors: Op: Sum Execution provider mismatch. Expected: MKLDNNExecutionProvider Acutal: CPUExecutionProvider Op: Sum Execution provider mismatch. Expected: MKLDNNExecutionProvider Acutal: CPUExecutionProvider
2018-12-11 13:57:43.064683928 [I:onnxruntime:Default, kernel_registry.cc:220 TryFindKernel] Sum kernel is not supported in CPUExecutionProvider Encountered following errors: Op: Sum Execution provider mismatch. Expected: CPUExecutionProvider Acutal: MKLDNNExecutionProvider
2018-12-11 13:57:43.064758359 [I:onnxruntime:sum_two_inputs, session_state_initializer.cc:189 SaveMLValueNameIndexMapping] SaveMLValueNameIndexMapping
2018-12-11 13:57:43.064775842 [I:onnxruntime:sum_two_inputs, session_state_initializer.cc:235 SaveMLValueNameIndexMapping] Done saving MLValue mappings.
2018-12-11 13:57:43.064792385 [I:onnxruntime:Default, kernel_registry.cc:220 TryFindKernel] Sum kernel is not supported in  Encountered following errors:
2018-12-11 13:57:43.064808131 [I:onnxruntime:Default, kernel_registry.cc:220 TryFindKernel] Sum kernel is not supported in  Encountered following errors: Op: Sum Execution provider mismatch. Expected: CPUExecutionProvider Acutal: MKLDNNExecutionProvider
2018-12-11 13:57:43.064834382 [I:onnxruntime:sum_two_inputs, session_state_initializer.cc:304 SaveInitializedTensorsWithMemPattern] Saving initialized tensors.
2018-12-11 13:57:43.064851458 [I:onnxruntime:sum_two_inputs, session_state_initializer.cc:376 SaveInitializedTensorsWithMemPattern] Done saving initialized tensors
2018-12-11 13:57:43.064865269 [I:onnxruntime:sum_two_inputs, session_state_initializer.cc:464 SaveKernels] Saving kernels.
2018-12-11 13:57:43.064878646 [I:onnxruntime:Default, kernel_registry.cc:220 TryFindKernel] Sum kernel is not supported in CPUExecutionProvider Encountered following errors:
2018-12-11 13:57:43.064894332 [I:onnxruntime:Default, kernel_registry.cc:220 TryFindKernel] Sum kernel is not supported in CPUExecutionProvider Encountered following errors: Op: Sum Execution provider mismatch. Expected: CPUExecutionProvider Acutal: MKLDNNExecutionProvider
2018-12-11 13:57:43.064912268 [I:onnxruntime:sum_two_inputs, session_state_initializer.cc:473 SaveKernels] Done saving kernels.
2018-12-11 13:57:43.064925268 [I:onnxruntime:Default, kernel_registry.cc:220 TryFindKernel] Sum kernel is not supported in  Encountered following errors:
2018-12-11 13:57:43.064939245 [I:onnxruntime:Default, kernel_registry.cc:220 TryFindKernel] Sum kernel is not supported in  Encountered following errors: Op: Sum Execution provider mismatch. Expected: CPUExecutionProvider Acutal: MKLDNNExecutionProvider
2018-12-11 13:57:43.064955752 [I:onnxruntime:Default, kernel_registry.cc:220 TryFindKernel] Sum kernel is not supported in  Encountered following errors:
2018-12-11 13:57:43.064969749 [I:onnxruntime:Default, kernel_registry.cc:220 TryFindKernel] Sum kernel is not supported in  Encountered following errors: Op: Sum Execution provider mismatch. Expected: CPUExecutionProvider Acutal: MKLDNNExecutionProvider
2018-12-11 13:57:43.064986640 [I:onnxruntime:sum_two_inputs, inference_session.cc:357 Initialize] Session successfully initialized.
2018-12-11 13:57:43.065009937 [I:onnxruntime:Default, runner.cc:497 RunSingleTestCase] testing sum_two_inputs

2018-12-11 13:57:43.065146090 [I:onnxruntime:sum_two_inputs, sequential_executor.cc:38 Execute] Begin execution
result:
        Models: 1
        Total test cases: 1
                Succeeded: 1
                Not implemented: 0
                Failed: 0
        Stats by Operator type:
                Not implemented(0):
                Failed:
Failed Test Cases:

@sreekanth-yalachigere
Contributor

@snnn @jywu-msft

Update on the sum error:
I changed the Sum mkldnn kernel version from 6 to 8, and the test now passes. The mkldnn verbose output confirms that the mkldnn Sum operator executed sum_two_inputs.

2018-12-12 10:20:14.7101828 [I:onnxruntime:Default, runner.cc:497 RunSingleTestCase] testing sum_two_inputs
2018-12-12 10:20:14.7145605 [I:onnxruntime:Default, bfc_arena.cc:102 onnxruntime::BFCArena::Extend] Extending allocation by 1048576 bytes.
2018-12-12 10:20:14.7186621 [I:onnxruntime:Default, bfc_arena.cc:106 onnxruntime::BFCArena::Extend] Total allocated bytes: 1048576
2018-12-12 10:20:14.7223202 [I:onnxruntime:Default, bfc_arena.cc:109 onnxruntime::BFCArena::Extend] Allocated memory at 000002292D9E4040 to 000002292DAE4040
2018-12-12 10:20:14.7270113 [I:onnxruntime:sum_two_inputs, sequential_executor.cc:38 onnxruntime::SequentialExecutor::Execute] Begin execution
mkldnn_verbose,exec,sum,simple:any,undef,in:f32_x out:f32_x,num:2,3,3.96508
result:
        Models: 1
        Total test cases: 1
                Succeeded: 1
                Not implemented: 0
                Failed: 0
        Stats by Operator type:
                Not implemented(0):
                Failed:
Failed Test Cases:
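
The two runs above can be summarised with a small sketch. With a kernel start version of 6 and an end version of 2147483647, a node at version 8 was rejected; with a start version of 8 it was accepted. The rule below is an assumption that happens to fit both observations. The real check lives in kernel_registry.cc and may differ, and version_matches is a hypothetical name.

```python
INT_MAX = 2147483647  # printed in the log as kernel_end_version


def version_matches(node_version, kernel_start, kernel_end=INT_MAX):
    """Hypothetical matching rule, inferred from the logs in this thread.

    An open-ended kernel (kernel_end == INT_MAX) matches only when its start
    version equals the node's version. A bounded kernel matches any node
    version inside [kernel_start, kernel_end].
    """
    if kernel_start == node_version:
        return True
    return (
        kernel_start < node_version
        and kernel_end != INT_MAX
        and node_version <= kernel_end
    )


print(version_matches(node_version=8, kernel_start=6))  # False
print(version_matches(node_version=8, kernel_start=8))  # True
```

Under this assumed rule, a kernel registered at version 6 with no explicit end version is never selected for a version 8 node, which would explain the fallback to the CPU provider in the earlier run.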

@snnn
Contributor Author

snnn commented Dec 12, 2018

Hi @sreekanth-yalachigere
There are two issues here: one in relu, the other in sum.
The relu issue is not a memory leak. Valgrind says the code is reading uninitialised memory.
The sum issue is not about the kernel version. The kernel is found, but it sometimes produces an incorrect result:

sum_two_inputs:output=result:expected 6, got 10, diff: 4, 
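
For reference, the tol=0.00106 printed by the runner for expected=6 is consistent with a combined absolute and relative tolerance of 1e-3 + 1e-5 * |expected|. Those two constants are an assumption inferred from the log, not taken from the runner's source, and within_tolerance is a hypothetical helper. A minimal sketch of the comparison:

```python
def within_tolerance(expected, actual, abs_tol=1e-3, rel_tol=1e-5):
    """Return (ok, diff, tol) for one output element.

    abs_tol and rel_tol are assumed defaults, chosen because they reproduce
    the tol=0.00106 printed in the log for expected=6.
    """
    diff = abs(expected - actual)
    tol = abs_tol + rel_tol * abs(expected)
    return diff <= tol, diff, tol


ok, diff, tol = within_tolerance(expected=6.0, actual=10.0)
print(f"ok={ok} diff={diff} tol={tol:.5f}")  # ok=False diff=4.0 tol=0.00106
```

A difference of 4 is far outside any rounding tolerance, which supports reading this as a wrong result rather than a precision problem.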

@snnn
Contributor Author

snnn commented Dec 12, 2018

Let me revert these two PRs first, because we are sure they are causing the CI build failures. We can then create a new branch and diagnose the bugs there.

@snnn snnn merged commit 2e79597 into master Dec 12, 2018
@snnn snnn deleted the snnn/revert_relu branch December 12, 2018 22:28
souptc pushed a commit that referenced this pull request Dec 17, 2018
Implement transpose kernel

Related work items: #147
oliviajain pushed a commit that referenced this pull request Apr 21, 2022
Bug Fix for Multiple inputs/outputs scenario with OV-EP
TedThemistokleous pushed a commit to TedThemistokleous/onnxruntime that referenced this pull request Jul 9, 2025