Ability to generate JIT code with frame pointers to enable Linux perf stack unwinding #1232
Comments
Hi @aalexand, thank you for the question. Did you have a chance to look through this article? If following steps and using certain variables oneDNN supports, would it cover your need? Thank you. |
@dzarukin Is there a value for ONEDNN_JIT_PROFILE that enables frame pointer generation while NOT enabling perf-pid or JIT dump file generation? From the doc it is not clear that there is. |
There's no such value and, likely, no support for what you are asking about. Could you explain what problem you are trying to solve and why you need the exact approach you are asking about? All of the JIT-ted implementations follow the same pattern - there's an I don't see what kind of stack you want to inspect with perf (rather than gdb) and what the purpose is. Sorry. P.S. If you would like to see the stack chain I've just described, you may try to compile with |
Our profiling infrastructure collects periodic profiles using Linux perf on machines running production workloads. Each particular machine is visited rarely, so it's important that any kind of overhead outside the profiling sessions is low. Using either perf-pid or JIT dump files is thus not affordable, as that requires storing the data continuously for the full lifetime of the production workload, so that is out of the question, at least for the purpose of this discussion. What I was trying to understand is whether we can at least make it so that perf can unwind the stack, so that for the JIT code the engineer looking at the profile can understand what part of their code invoked the library function. It sounds like you are saying that unwinding the stack would not help with that since all JIT code executes on worker threads, is that right? And the thread that invoked the library function does not take any share of the work and simply waits for the work to complete in the workers?
I would think that |
It seems to me I don't understand what's going on... Please tell me whether I understood the case correctly: there's a production application. For some reason, from time to time, the application may be started under Linux Is it the right interpretation? Thank you. |
Linux perf does not start the application. The application is a server that is continuously running on a production node. Linux perf is invoked periodically in the system-wide mode to collect 10 seconds of data for the whole machine to record statistical fleet-wide profiles for all applications in the data center. See https://research.google/pubs/pub36575/. The rest of the description seems roughly correct. The goal is to understand the call path to the JIT code. |
@aalexand Have you tried using LBR-based stack unwinding in perf? It does not need -fno-omit-frame-pointer to be passed. On the other hand, LBR only captures 32 levels of the call stack. |
@jczaja Yes, we are using that. It helps some, but not too much since as I said continuous production profiling collects short sessions of the data, not profiling application from the start. So the LBR call stack collection will only see the changes during the 10 second profiling session which means it will only be able to gather the portion of the stack that changed during the profiling session, not the full stack. |
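The limitation described above can be illustrated with a toy model. This is only a sketch of the behaviour as stated in this thread, not perf or hardware code: `ToyLbrCallStack` and the function names are hypothetical, the depth of 32 is the figure quoted earlier in the thread, and the push-on-call / pop-on-return behaviour is a simplifying assumption about how call-stack-mode branch records work.

```python
from collections import deque


class ToyLbrCallStack:
    """Toy model of a fixed-depth call-stack record (hypothetical, not perf)."""

    def __init__(self, depth=32):
        # Oldest entries fall off once the fixed depth is exceeded.
        self.records = deque(maxlen=depth)
        self.enabled = False

    def call(self, name):
        # Calls made while recording is off leave no trace.
        if self.enabled:
            self.records.append(name)

    def ret(self):
        if self.enabled and self.records:
            self.records.pop()


lbr = ToyLbrCallStack()

# The server entered these frames long before the profiling session began.
for fn in ["main", "serve_forever", "handle_request"]:
    lbr.call(fn)

lbr.enabled = True          # the short profiling session starts here
lbr.call("run_model")
lbr.call("jit_kernel")

visible = list(lbr.records)  # only frames entered during the session
```

In this model the frames that were already on the stack when recording started never appear, which matches the concern that a short session only sees the portion of the stack that changed while it was running.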
Hi @aalexand, I feel I have a general sense of the nature of the question. |
I don't have a reproducer at hand as current examples are production code that I can't share. I think you can reproduce it easily on any example where, say, I think the crux of the request that we need to get to is whether you can confirm that the JIT code intentionally does not generate frame pointer setup today (i.e. no |
Hi @aalexand , Unfortunately due to limited number of GPRs we use |
Let's scope out the changes though to see how expensive this is. I also do not have a clear understanding of how high the register pressure is and whether reserving |
@aalexand, I was mostly asking about exact |
One thing we do need to confirm is that at least some stacks unwound from the JIT kernels are going to be useful as in they are going to lead to the user calling code. Previously in this thread it was mentioned that all JIT kernels might be executing on worker threads. If that's the case then the value of unwinding the stack might be marginal. |
@dzarukin The perf command would be |
I forgot to answer this one. A parallel section would utilize as many threads as requested, including the master thread. That's why getting the stack from the master thread to reach the user code may make sense. I haven't educated myself on how recording works in the multithreaded case but, I guess, we can start from the single-thread scenario. Thanks. |
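As a rough analogy for why the master thread's stack is the interesting one, the sketch below uses plain Python threads rather than oneDNN or its threading runtime. All names (`kernel`, `parallel_section`, `user_code`) are hypothetical. It records the call stack seen by the calling thread and by the worker threads when they all run the same kernel.

```python
import threading
import traceback


def kernel(results, key):
    # Record the names of the functions on this thread's call stack.
    results[key] = [frame.name for frame in traceback.extract_stack()]


def parallel_section(results, num_workers=2):
    workers = [
        threading.Thread(target=kernel, args=(results, f"worker{i}"))
        for i in range(num_workers)
    ]
    for w in workers:
        w.start()
    # The calling ("master") thread also takes a share of the work.
    kernel(results, "master")
    for w in workers:
        w.join()


def user_code(results):
    parallel_section(results)


results = {}
user_code(results)
```

Here `results["master"]` contains `user_code`, while the worker stacks start at the thread entry point and never reach it. If the real kernels behave the same way, samples taken on the master thread are the ones that could lead back to the user's calling code.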
@aalexand Regarding JIT functions to be part of call stack. The procedure for perf (updating perf map) is in manual: |
@jczaja We are not interested in the JIT symbolization for now for the reasons I mentioned earlier. This issue is exclusively about the stack unwinding. |
Preliminary analysis on
On top of that, I have some trouble with the AVX-512 setup. RHEL 7.4 with perf Given all that, @aalexand, we may try to expedite this and narrow down the scope if you could share which platforms are the main target (mostly from an ISA perspective) and what functionality from oneDNN is used, so that we can focus on those kernels first and deliver them earlier than the others. Thanks. |
I think getting AVX2 working with frame pointers (and not yet caring about AVX-512 / EVEX_compress) would be a big improvement already, even if AVX-512 kernels were still "blind". We could validate the improvement on AVX2 and then motivate the AVX-512 work. Unfortunately, it seems like your 3rd point (kernels hard coded) is the larger effort. Lots of them seem to already have I think to support the performance vs visibility tradeoff, you'd probably want to support the equivalent of a Can you say more about getting your stacks just fine on the AVX2 setup? [Was this after fixing preamble to save frame pointers?] |
Hi @boulos, sorry for the late response. I got your point about AVX2. I also think it will be a great start to improving the situation. The team will prioritize the GEMM part and do its best to free The part about
It would work if there were something like a resource manager for GPRs, VRs and other JIT-related objects. Until then, an option doesn't make much sense, since it would be really hard to maintain both versions, not to mention validating them both.
Sorry I was not detailed enough. I'm not sure what kind of tools are used to get a stack trace, but I understand this is about keeping |
It seems that the JIT code generated by MKL DNN today does not set up frame pointers which means performance tools like Linux perf cannot unwind the stack by default.
It's understood that utilizing a frame pointer register can have a negative performance impact due to one less register being available for general purposes, but this may be an acceptable trade-off for some users.
It's also understood that the JIT code would not be symbolized unless further integration with the profiler is in place, but merely being able to unwind the code is useful since the caller functions would be symbolized and can provide enough context on what part of the call tree this JIT code is invoked by and hence what it does.
Would it be possible to have an opt-in mode where the generated JIT code would set up frame pointers to enable Linux perf stack unwinding in the default frame pointer mode?
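For readers unfamiliar with frame-pointer unwinding, here is a minimal toy model of the idea. It is not oneDNN or perf code. The memory layout, the addresses, and the `unwind` helper are all assumptions made for illustration: each frame that keeps the convention stores its caller's frame pointer at the frame-pointer address and the return address in the next slot.

```python
def unwind(memory, frame_pointer, max_depth=64):
    """Follow the saved-frame-pointer chain and collect return addresses."""
    return_addresses = []
    for _ in range(max_depth):
        if frame_pointer not in memory:   # 0 or garbage: the chain ends here
            break
        return_addresses.append(memory[frame_pointer + 1])
        frame_pointer = memory[frame_pointer]
    return return_addresses


# Hypothetical call chain: main -> user_fn -> library_fn -> jit_kernel.
# Each pair is (saved caller frame pointer, return address).
memory = {
    100: 0,   101: "ret_to_start",        # main's frame (outermost)
    200: 100, 201: "ret_to_main",         # user_fn's frame
    300: 200, 301: "ret_to_user_fn",      # library_fn's frame
    400: 300, 401: "ret_to_library_fn",   # jit_kernel's frame
}

# The kernel keeps the convention: the register points at its own frame.
with_fp = unwind(memory, 400)

# The kernel uses the register as a general-purpose register instead, so at
# sample time it holds an arbitrary data value and the chain cannot be followed.
without_fp = unwind(memory, 123456)
```

With the convention in place the walk recovers the return addresses of every caller, which is what lets a profiler attribute unsymbolized JIT samples to the symbolized functions above them. Without it, the walk stops immediately and the sample has no caller context.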