fix(python): recover .cold interpreter range from FDE #552
fix(python): recover .cold interpreter range from FDE #552christos68k merged 6 commits intoopen-telemetry:mainfrom
Conversation
fabled
left a comment
There was a problem hiding this comment.
Thanks for working on this! First round of comments added.
| // FindExternalJump decodes every instruction in the sym function and searches for | ||
| // a relative jump outside itself - to an address not covered by the sym. | ||
| // FindExternalJump returns the destination address of the relative jump outside the function or 0. | ||
| func FindExternalJump(ef *pfelf.File, f *libpf.Symbol) (libpf.Address, error) { |
There was a problem hiding this comment.
I would not involve pfelf.FIle in this layer. How about passing instead just the []byte code slice here?
| end = int64(f.Address) + int64(f.Size) | ||
| code []byte | ||
| ) | ||
| code, err = ef.VirtualMemory(rip, int(f.Size), math.MaxInt) |
There was a problem hiding this comment.
This code will move, but I'd just use SymbolData here. Though pfelf.File.SymbolData may need to be updated to return libpf.Symbol instead of just the libpf.SymbolValue.
And given this specific use case, the first outside jump is typically at the startup? How about just giving a hard maximum of 4kB or similar? The maximum cap is used only when mmap is not available (test cases) and prevents the agent on trying to do huge memory allocations.
There was a problem hiding this comment.
I can not restrict the size without sacrificing reliability. What if I cap the size to 4k and the rel jump is at 45k?
| if int(f.Size) != len(code) { | ||
| return 0, errors.New("truncated code") | ||
| } |
There was a problem hiding this comment.
I think truncation does not matter here.
| // Hash32 returns a 32 bits hash of the input. | ||
| // It's main purpose is to be used as key for caching. | ||
| func (fid FileID) Hash32() uint32 { | ||
| return uint32(fid) | ||
| } | ||
|
|
There was a problem hiding this comment.
This will not be needed, see below.
| func findColdRangeCached(fid host.FileID, ef *pfelf.File, interp *libpf.Symbol) util.Range { | ||
| if ef.Machine != elf.EM_X86_64 { | ||
| return util.Range{} | ||
| } | ||
| if cached, ok := coldRangeCache.Get(fid); ok { | ||
| return cached | ||
| } | ||
| coldRange, err := findColdRange(ef, interp) | ||
| coldRangeCache.Add(fid, coldRange) | ||
| if err != nil { | ||
| log.WithError(err).Errorf("failed to recover python ranges %s", | ||
| fid.StringNoQuotes()) | ||
| } | ||
| return coldRange | ||
| } | ||
|
|
||
| var coldRangeCache, _ = freelru.NewSynced[host.FileID, util.Range]( | ||
| 256, host.FileID.Hash32) |
There was a problem hiding this comment.
The Loader returned data pythonData gets cached by process manager. This is the layer that caches the interpreter ranges. This cache would never be queried in the hot path.
It is true that this data can get unloaded if all processes of using the ELF exit. But then its ok to just do the work here. Finding the cold range is fast anyways. And I'll be working on #532 to make sure these get cached even if the process exists.
Given the extra complexity of code, increased memory usage for little gain, I'd just remove this caching layer.
fabled
left a comment
There was a problem hiding this comment.
Good stuff! Added one more round of comments.
| if interp, code, err = ef.SymbolData("_PyEval_EvalFrameDefault", math.MaxInt64); err != nil { | ||
| interp, code, err = ef.SymbolData("PyEval_EvalFrameEx", math.MaxInt64) |
There was a problem hiding this comment.
using math.MaxInt64 should not be done here. It is used to limit heap allocations. If this function was to be huge, it'd kill the profiler with out-of-memory errors. I should be some kbs. Perhaps just 2 or 4kB. In general truncating the function machine code should not be a problem as the jump should be found from the early portions. Or even if allowing/wanting the full function, it should be a realistic upper cap such as 128kB.
Though, I understand that in this case the SymbolData should be adjusted to return libpf.Symbol with the actual length, so the bounds checking for what is outside of function jump works correctly.
| // FindExternalJump decodes every instruction in the sym function and searches for | ||
| // a relative jump outside itself - to an address not covered by the sym. | ||
| // FindExternalJump returns the destination address of the relative jump outside the function or 0. | ||
| func FindExternalJump(code []byte, f util.Range) (libpf.Address, error) { |
There was a problem hiding this comment.
I wonder if libpf.Symbol would make sense instead of util.Range. Though, the Symbol will have the extra/unused field of symbol name. But doing this would avoid unnecessary in the current (and possibly also callers).
| log.WithError(err).Errorf("failed to recover python ranges %s", | ||
| fid.StringNoQuotes()) |
There was a problem hiding this comment.
I think should be a warning or info message. Even if it means something went wrong instead of just not having a cold block. If it was an error, the operation would fail and err would be returned instead.
If logging something, it would probably make more sense to log the file name instead of the ID? So pass info.FileName() instead of FileID?
fabled
left a comment
There was a problem hiding this comment.
LGTM! Thanks! @christos68k or @florianl would you have time to review this?
I'll wrap it up today. |
Fixes #416
_PyEval_EvalFrameDefaultoutside itself by disassembling the whole symbol.elfunwindinfo.EhFrameTableto recover the.coldrange.