Ability to generate JIT code with frame pointers to enable Linux perf stack unwinding #1232

Open
aalexand opened this issue Jan 15, 2022 · 21 comments
Assignees
Labels
enhancement A feature or an optimization request platform:cpu-x64 Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64

Comments

@aalexand

It seems that the JIT code generated by MKL DNN today does not set up frame pointers which means performance tools like Linux perf cannot unwind the stack by default.

It's understood that utilizing a frame pointer register can have negative performance impact due to one less register available for general purposes, but this may be an acceptable trade-off for some users.

It's also understood that the JIT code would not be symbolized unless further integration with the profiler is in place, but merely being able to unwind the code is useful since the caller functions would be symbolized and can provide enough context on what part of the call tree this JIT code is invoked by and hence what it does.

Would it be possible to have an opt-in mode where the generated JIT code would set up frame pointers to enable Linux perf stack unwinding in the default frame pointer mode?

@aalexand aalexand added the enhancement A feature or an optimization request label Jan 15, 2022
@dzarukin
Contributor

Hi @aalexand, thank you for the question. Did you have a chance to look through this article? Would following its steps and using the variables oneDNN supports cover your need? Thank you.

@aalexand
Author

@dzarukin Is there a value for ONEDNN_JIT_PROFILE that enables frame pointer generation while NOT enabling perf-pid or JIT dump file generation?

From the doc it is not clear that there is.

@dzarukin
Contributor

> Is there a value for ONEDNN_JIT_PROFILE that enables frame pointer generation while NOT enabling perf-pid or JIT dump file generation?

There's no such value and, most likely, no support for what you are asking about.

Could you explain what problem you are trying to solve, and why you need the exact approach you are asking about?

All JIT-ted implementations follow the same pattern: there's an execute(...) call, inside which there's a call to a parallel section (parallel(...), parallel_nd(...) or one of their variations). Each thread in such a section creates kernel arguments and invokes the kernel's function call operator operator(), which is defined in jit_generator.hpp.

I don't see what kind of stack you want to inspect with perf (rather than gdb) and what's the purpose. Sorry.

P.S. If you would like to see the stack chain I've just described, you may try to compile with -fno-omit-frame-pointer and see if it provides the necessary level of stack trace (though I have a feeling one may need to deal with threading first somehow). Thank you.

@aalexand
Author

Our profiling infrastructure collects periodic profiles using Linux perf on machines running production workloads. Each particular machine is visited rarely, so it's important that any kind of overhead outside the profiling sessions is low. Using either perf-pid or JIT dump files is thus not affordable, as that requires storing that data continuously for the full lifetime of the production workload, so that is out of the question, at least for the purpose of this discussion. What I was trying to understand is whether we can at least make perf able to unwind the stack, so that an engineer looking at the profile can understand what part of their code invoked the library function that runs the JIT code.

It sounds like you are saying that unwinding the stack would not help with that since all JIT code executes on worker threads, is that right? And the thread that invoked the library function does not take any share of the work and simply waits for the work to complete in the workers?

> you may try to compile with -fno-omit-frame-pointer

I would think that -fno-omit-frame-pointer would only affect the compiler-generated code in the binary, not the JIT code that the library generates?

@dzarukin
Contributor

It seems to me I don't understand what's going on...

Please correct me if I have misunderstood the case: there's a production application. For some reason, from time to time, this application may be started under the Linux perf utility to collect a profile. Based on that profile, one sees some hotspots. When oneDNN is used, the assumption is that JIT kernels will be at the top of the list, and now the question is: "Where do these JIT kernels come from?" Since this is JIT code, it's not easy to understand where it comes from, and the ultimate question is: "How do we match a hotspot from the profile to a specific spot in oneDNN, and then further to the higher-level application that calls oneDNN?"

Is it the right interpretation? Thank you.

@aalexand
Author

> For some reason, from time to time, this application may be started under the Linux perf utility to collect a profile.

Linux perf does not start the application. The application is a server that is continuously running on a production node. Linux perf is invoked periodically in the system-wide mode to collect 10 seconds of data for the whole machine to record statistical fleet-wide profiles for all applications in the data center. See https://research.google/pubs/pub36575/.

The rest of the description seems roughly correct. The goal is to understand the call path to the JIT code.

@jczaja
Contributor

jczaja commented Jan 19, 2022

@aalexand Have you tried using LBR-based stack unwinding in perf? It does not need -fno-omit-frame-pointer to be passed. On the other hand, LBR only captures up to 32 levels of the call stack.

@aalexand
Author

@jczaja Yes, we are using that. It helps some, but not too much since, as I said, continuous production profiling collects short sessions of data rather than profiling the application from the start. So the LBR call stack collection will only see the changes during the 10-second profiling session, which means it will only be able to gather the portion of the stack that changed during the session, not the full stack.

@dzarukin
Contributor

Hi @aalexand, I think I now have a general sense of the nature of the question.
May I ask you to provide specific commands or instructions so that I can reproduce the behavior on our side and see if there's anything I can do to improve what we have now?
It would also be good to know about any additional or specific environment needed to run those commands, if any. Thank you.

@aalexand
Author

aalexand commented Jan 20, 2022

I don't have a reproducer at hand as current examples are production code that I can't share. I think you can reproduce it easily on any example where, say, main calls foo calls bar which calls a DNN primitive. What we want is that with the JIT symbolization off we still get the callchain with main, foo and bar in it.

I think the crux of the request is whether you can confirm that the JIT code intentionally does not generate frame pointer setup today (i.e. there is no push %rbp; mov %rsp, %rbp sequence and the code can clobber %rbp), and whether there are any obstacles to having a mode where it would.

@igorsafo
Contributor

Hi @aalexand, unfortunately, due to the limited number of GPRs, we use rbp inside JIT kernels and overwrite it; as a result, rbp-based stack unwinding is not possible. Since all JIT kernels are independent, it might require a great effort to modify all of them to keep the rbp register untouched. In general, it seems to be a good idea to support it.
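
To illustrate why a kernel that overwrites rbp stops frame-pointer unwinding, here is a minimal sketch of a frame-pointer walk over a toy stack. It is not oneDNN or perf code: ToyStack, unwind, sample_in_kernel, the return addresses (0x100, 0x1000, ...) and the scratch value 0x200 are all made up for illustration. It models the main → foo → bar → kernel call chain mentioned earlier in the thread.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sentinel stored as the outermost "saved frame pointer" to end the chain.
constexpr std::uint64_t kNoFrame = 0;

// Toy model, not real unwinder code: slots of `mem` stand in for stack memory
// and slot indices stand in for addresses. The stack grows toward index 0.
struct ToyStack {
    std::vector<std::uint64_t> mem = std::vector<std::uint64_t>(64, 0);
    std::uint64_t sp = 64;        // simulated %rsp (index of the last pushed slot)
    std::uint64_t fp = kNoFrame;  // simulated %rbp

    void push(std::uint64_t value) {
        assert(sp > 1);  // slot 0 stays unused so index 0 can be the sentinel
        mem[--sp] = value;
    }
    // `call` pushes the return address, then the callee runs
    // `push %rbp; mov %rsp, %rbp`.
    void call_with_prologue(std::uint64_t return_address) {
        push(return_address);
        push(fp);
        fp = sp;
    }
    // Callee that skips the prologue and reuses %rbp as a scratch register.
    void call_clobbering_fp(std::uint64_t return_address, std::uint64_t scratch) {
        push(return_address);
        fp = scratch;
    }
};

// Frame-pointer walk: mem[fp] holds the caller's saved frame pointer and
// mem[fp + 1] holds the return address into the caller.
std::vector<std::uint64_t> unwind(const ToyStack &s) {
    std::vector<std::uint64_t> return_addresses;
    std::uint64_t fp = s.fp;
    while (fp != kNoFrame && fp < s.mem.size() - 1) {
        return_addresses.push_back(s.mem[fp + 1]);
        const std::uint64_t caller_fp = s.mem[fp];
        // A valid chain always moves toward the stack base (higher indices).
        if (caller_fp != kNoFrame && caller_fp <= fp) break;
        fp = caller_fp;
    }
    return return_addresses;
}

// Builds main -> foo -> bar -> kernel and "samples" inside the kernel.
std::vector<std::uint64_t> sample_in_kernel(bool kernel_keeps_fp) {
    ToyStack s;
    s.call_with_prologue(0x100);   // main, returning into startup code
    s.call_with_prologue(0x1000);  // foo, returning into main
    s.call_with_prologue(0x2000);  // bar, returning into foo
    if (kernel_keeps_fp)
        s.call_with_prologue(0x3000);         // kernel, returning into bar
    else
        s.call_clobbering_fp(0x3000, 0x200);  // kernel using %rbp for data
    return unwind(s);
}
```

Under these assumptions, sample_in_kernel(true) yields the four return addresses leading back through bar, foo and main, while sample_in_kernel(false) yields an empty chain because the walker cannot follow the overwritten register.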

@vpirogov
Member

Let's scope out the changes, though, to see how expensive this is. I also do not have a clear understanding of how high the register pressure is and whether reserving rbp will result in a material performance impact.

@dzarukin
Contributor

@aalexand, I was mostly asking about the exact perf commands you are using to collect those traces and report the stack unwinding. I can run benchdnn to validate the whole spectrum of kernels, but I would like to have the exact methodology you are using so that I can see that changing $rbp availability results in the desired behavior. Thanks.

@aalexand
Author

One thing we do need to confirm is that at least some stacks unwound from the JIT kernels are going to be useful as in they are going to lead to the user calling code. Previously in this thread it was mentioned that all JIT kernels might be executing on worker threads. If that's the case then the value of unwinding the stack might be marginal.

@aalexand
Author

@dzarukin The perf command would be perf record -a sleep 10 or something like that. Basically the application is started independently from the perf collection, the perf collection is system-wide, a session of 10 seconds, with the default FP-based stack unwinding.

@dzarukin
Contributor

> Previously in this thread it was mentioned that all JIT kernels might be executing on worker threads.

I forgot to answer this one. The parallel section utilizes as many threads as requested, including the master thread. That's why getting the stack from the master thread in order to reach user code may make sense. I haven't educated myself on how recording works in the multithreaded case, but I guess we can start from the single-thread scenario. Thanks.
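
The call pattern described in this thread (execute(...) → parallel section → kernel operator(), with the calling thread taking part in the work) can be sketched roughly as follows. This is an illustrative sketch, not oneDNN source: toy_kernel_t, toy_parallel and toy_execute are hypothetical names, and the kernel here is plain C++ rather than generated code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a JIT kernel object: callers fill an argument
// struct and invoke the function call operator.
struct toy_kernel_t {
    struct call_args_t {
        const float *src;
        float *dst;
        std::size_t len;
    };
    void operator()(const call_args_t *args) const {
        for (std::size_t i = 0; i < args->len; ++i)
            args->dst[i] = 2.0f * args->src[i];
    }
};

// Hypothetical parallel helper: runs f(ithr, nthr) once per thread. Share 0
// runs on the calling ("master") thread; the other shares run on new threads.
inline void toy_parallel(int nthr, const std::function<void(int, int)> &f) {
    assert(nthr >= 1);
    std::vector<std::thread> workers;
    for (int ithr = 1; ithr < nthr; ++ithr)
        workers.emplace_back([&f, ithr, nthr] { f(ithr, nthr); });
    f(0, nthr);
    for (auto &w : workers) w.join();
}

// Hypothetical execute(): splits the data, and each thread builds its own
// kernel arguments and calls the kernel. Returns the id of the thread that
// processed share 0 so the caller can check where that share ran.
inline std::thread::id toy_execute(const toy_kernel_t &kernel,
        const std::vector<float> &src, std::vector<float> &dst, int nthr) {
    assert(nthr >= 1);
    assert(src.size() == dst.size());
    std::vector<std::thread::id> ran_on(static_cast<std::size_t>(nthr));
    toy_parallel(nthr, [&](int ithr, int n) {
        const std::size_t total = src.size();
        const std::size_t count = static_cast<std::size_t>(n);
        const std::size_t chunk = (total + count - 1) / count;
        const std::size_t begin
                = std::min(total, static_cast<std::size_t>(ithr) * chunk);
        const std::size_t end = std::min(total, begin + chunk);
        toy_kernel_t::call_args_t args {
                src.data() + begin, dst.data() + begin, end - begin};
        kernel(&args);
        ran_on[static_cast<std::size_t>(ithr)] = std::this_thread::get_id();
    });
    return ran_on[0];
}
```

In this sketch, the share with ithr == 0 runs on the thread that called toy_execute, so a sample taken there would have the caller's frames beneath it; samples taken on the spawned threads would not.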

@dzarukin dzarukin self-assigned this Jan 20, 2022
@jczaja
Contributor

jczaja commented Jan 21, 2022

@aalexand Regarding JIT functions being part of the call stack: the procedure for perf (updating the perf map) is described in the manual:
https://oneapi-src.github.io/oneDNN/dev_guide_profilers.html as well as the recommendation to use a recent Linux kernel (5.x) so that perf works with JIT code. In my basic experience (CentOS 8), I got JIT functions as part of the call stack when using the LBR stack walker. The other methods (FP, DWARF) did not really work properly, but I think that was because the CentOS 8 perf client was not fully working for DWARF and FP. The use case was using call graphs to generate flame graphs (based on Brendan Gregg's scripts); based on the flame graphs, I checked whether JIT functions were properly placed and meaningful with FP, DWARF and LBR.

@aalexand
Author

@jczaja We are not interested in the JIT symbolization for now for the reasons I mentioned earlier. This issue is exclusively about the stack unwinding.

@dzarukin
Contributor

Preliminary analysis on rbp register usage is a bit discouraging:

  • The beginning of the kernel, known as preamble(), does not save the rsp state to rbp at the start of the JIT section. This is easy to fix.
  • The bigger problem, which should be solved first, is AVX-512 and newer architectures, where the rbp register (to be precise, its 32-bit counterpart ebp; I have no idea why exactly that one) is used for the so-called EVEX_compress method, which allows using strides of more than INT_MAX in certain kernels. That was very important for KNL/KNM performance. We are checking whether this is still important for the Xeon platform overall.
  • Since all kernels hard-code which registers they use, changing ebp to something else would require checking every kernel that utilizes compression to see whether the registers can be safely swapped.
  • Once swapped, the next stage would be to validate that all kernels using either rbp or ebp can use a different register (if there are unused ones) or else add spills in certain places to keep the kernel functionally correct.
  • Given the number of kernels, inspection and implementation may take quite a long time.

On top of that, I have some trouble with the AVX-512 setup. RHEL 7.4 with perf 3.10.0-693 does not show any stack no matter what, even when a reference kernel is used, while the AVX2 setup with RHEL 7.9 and perf 3.10.0-1160 shows the stack as expected. I don't know if this is a machine problem, a Linux kernel issue, or a perf issue, but unfortunately it slows the investigation down.

Given all that, @aalexand, we may try to expedite this and narrow down the scope if you could share which platforms are the main target (mostly from an ISA perspective) and what oneDNN functionality is used, so that we can focus on those kernels first and deliver them earlier than the others. Thanks.

@boulos

boulos commented Jan 21, 2022

I think getting AVX2 working with frame pointers (and not yet caring about AVX-512 / EVEX_compress) would be a big improvement already, even if AVX-512 kernels were still "blind". We could validate the improvement on AVX2 and then motivate the AVX-512 work.

Unfortunately, it seems like your 3rd point (kernels hard coded) is the larger effort. Lots of them seem to already have is_windows logic (e.g., this gemm one), so I'm not sure how many would need to get changed to support "good enough" visibility.

I think to support the performance vs visibility tradeoff, you'd probably want to support the equivalent of a -fno-omit-frame-pointer for the JIT.

Can you say more about getting your stacks just fine on the AVX2 setup? [Was this after fixing preamble to save frame pointers?]

@dzarukin
Contributor

Hi @boulos, sorry for a late response.

I got your point about AVX2. I also think it will be a great start to improving the situation. The team will prioritize the GEMM part and do its best to free the rbp register. We will keep you posted on the progress.

The part about is_windows is more about the ABI, and it's pretty consistent among different kernels, AFAIK; it's just written this way here.

> you'd probably want to support the equivalent of a -fno-omit-frame-pointer for the JIT

It would work if there were something like a resource manager for GPRs, VRs and other JIT-related objects. Until then, an option doesn't make much sense, since it would be really hard to maintain both versions, not to mention validating them both.

> Can you say more about getting your stacks just fine on the AVX2 setup? [Was this after fixing preamble to save frame pointers?]

Sorry, I was not detailed enough. I'm not sure what kind of tools are used to get a stack trace, but perf report shows me the stack from where the code was executed. It works on current master without any changes (just with the -fno-omit-frame-pointer compiler option enabled). But, again, I'm not sure I can completely reproduce the situation described above. This is the screen I usually get.
[image: screenshot of the perf report output]

I understand this is about keeping rbp uncorrupted, but having a reproducer for the behavior you are experiencing would help a lot, at least for validating the changes properly once we prepare them. This is the command I use for collecting: perf record -g --call-graph fp -k1 --delay=5000 -- benchdnn ..., and this is the one I use for reporting: perf report -i perf.data. Not sure what I'm doing wrong... Anyway, this is now on our radar; we will see how it goes. Thanks.

@vpirogov vpirogov added the platform:cpu-x64 Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64 label Mar 29, 2024