Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize unordered_map in profiler #166

Merged

Conversation

zwxxx
Copy link
Contributor

@zwxxx zwxxx commented Dec 13, 2018

Arguments of Profiler::EndTimeAndRecordEvent are constructed even profiler is disabled. The unordered_map construction is very expensive in terms of both CPU and latency (dynamic allocation then memory contention). It costs 6.1% CPU in test. After this optimization, end-2-end avg latency reduced from 686us to 608us (11.4%), 1min CPU samples reduced from 19K to 13K (31,6%). operator new inc% reduced from 8.9% to 4.6%. (Absolute values subject to variance test by test)
The test is performed under 1000QPS, 10 threads on production V15 machine.

@zwxxx zwxxx requested a review from a team as a code owner December 13, 2018 04:06
@msftclas
Copy link

msftclas commented Dec 13, 2018

CLA assistant check
All CLA requirements met.

@RyanUnderhill
Copy link
Member

Looks good, thanks!

Copy link
Contributor

@pranavsharma pranavsharma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why even bother calling the EndTime... method if the profiler is disabled? This way you can tackle the problem at the source.

@zwxxx
Copy link
Contributor Author

zwxxx commented Dec 13, 2018

It's checked inside EndTime. If it's disabled it returned directly. We can move the check outside EndTime but it looks less nicer.

@tracysh
Copy link
Contributor

tracysh commented Dec 13, 2018

As it is, EndTime still constructs a throwaway string object (node_name + "_kernel_time") that is wasteful. We also hit this in a scenario in WinML where these calls were costing us. It would be nice to not have any heap allocations due to EndTime.

@zwxxx
Copy link
Contributor Author

zwxxx commented Dec 13, 2018

@tracysh exactly. std::string actually has small length optimization so if the length is short likely no allocation happens, although std::string construction is still a waste. At least I didn't see allocations from std::string from my CPU profiles. std::unordered_map is an allocation beast. Actually I'm not even sure why EventRecord (EndTime calls it internally) accepts unordered_map, it likely can be std::initializer_list.

@tracysh
Copy link
Contributor

tracysh commented Dec 13, 2018

I just stepped through my winml binary and I see these string calls hit the heap. I'm all for changing the map code too as it reduces the size of the executor function, but at some point, I also want to see these strings not constructed.

@zwxxx
Copy link
Contributor Author

zwxxx commented Dec 13, 2018

@tracysh I'm actually hunting all the dynamic allocations. operator new takes ~9% in my profile and end-2-end latency is very bad. Here it's just one of many. There's more severe allocations in other place: make_unique, and TensorShape itself is a vector etc. I'm trying to see what I can do with these. Will loop you in.

@pranavsharma
Copy link
Contributor

It's checked inside EndTime. If it's disabled it returned directly. We can move the check outside EndTime but it looks less nicer.

As you can see, it's too late to check inside EndTime. The API might require some rework for sure. I think for now we should avoid invoking it in the first place if the profiler is disabled.

@zwxxx
Copy link
Contributor Author

zwxxx commented Dec 13, 2018

@pranavsharma Actually I thought about reworking API. There're two options: 1) Expose IsEnabled() and check if profiler is enabled before EndTime(). This is less nicer as you have to do this before every profiler API call. 2) Create a macro that wraps IsEnabled() an profiler API. I'm not a super fan of abusing macro. Which option you prefer, or you have better idea?

@yufenglee yufenglee merged commit 2ffaa8a into microsoft:master Dec 14, 2018
@pranavsharma
Copy link
Contributor

@pranavsharma Actually I thought about reworking API. There're two options: 1) Expose IsEnabled() and check if profiler is enabled before EndTime(). This is less nicer as you have to do this before every profiler API call. 2) Create a macro that wraps IsEnabled() an profiler API. I'm not a super fan of abusing macro. Which option you prefer, or you have better idea?

The macro option is fine. You can do it in a separate PR since this one is already merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants