-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optimize unordered_map in profiler #166
optimize unordered_map in profiler #166
Conversation
Looks good, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why even bother calling the EndTime... method if the profiler is disabled? This way you can tackle the problem at the source.
It's checked inside EndTime. If it's disabled it returned directly. We can move the check outside EndTime but it looks less nicer. |
As it is, EndTime still constructs a throwaway string object (node_name + "_kernel_time") that is wasteful. We also hit this in a scenario in WinML where these calls were costing us. It would be nice to not have any heap allocations due to EndTime. |
@tracysh exactly. std::string actually has small length optimization so if the length is short likely no allocation happens, although std::string construction is still a waste. At least I didn't see allocations from std::string from my CPU profiles. std::unordered_map is an allocation beast. Actually I'm not even sure why EventRecord (EndTime calls it internally) accepts unordered_map, it likely can be std::initializer_list. |
I just stepped through my winml binary and I see these string calls hit the heap. I'm all for changing the map code too as it reduces the size of the executor function, but at some point, I also want to see these strings not constructed. |
@tracysh I'm actually hunting all the dynamic allocations. operator new takes ~9% in my profile and end-2-end latency is very bad. Here it's just one of many. There's more severe allocations in other place: make_unique, and TensorShape itself is a vector etc. I'm trying to see what I can do with these. Will loop you in. |
As you can see, it's too late to check inside EndTime. The API might require some rework for sure. I think for now we should avoid invoking it in the first place if the profiler is disabled. |
@pranavsharma Actually I thought about reworking API. There're two options: 1) Expose IsEnabled() and check if profiler is enabled before EndTime(). This is less nicer as you have to do this before every profiler API call. 2) Create a macro that wraps IsEnabled() an profiler API. I'm not a super fan of abusing macro. Which option you prefer, or you have better idea? |
The macro option is fine. You can do it in a separate PR since this one is already merged. |
Arguments of Profiler::EndTimeAndRecordEvent are constructed even profiler is disabled. The unordered_map construction is very expensive in terms of both CPU and latency (dynamic allocation then memory contention). It costs 6.1% CPU in test. After this optimization, end-2-end avg latency reduced from 686us to 608us (11.4%), 1min CPU samples reduced from 19K to 13K (31,6%). operator new inc% reduced from 8.9% to 4.6%. (Absolute values subject to variance test by test)
The test is performed under 1000QPS, 10 threads on production V15 machine.