fix(python): recover .cold interpreter range from FDE #552

Merged
christos68k merged 6 commits into open-telemetry:main from grafana:recover_interpreter_ranges
Jun 27, 2025

@korniltsev
Contributor

Fixes #416

  • Find a relative jump out of _PyEval_EvalFrameDefault (to an address outside the symbol) by disassembling the whole symbol.
  • Use the new elfunwindinfo.EhFrameTable to recover the .cold range.
  • Cache recovered results.
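
The first step above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: it assumes the instructions have already been decoded into a hypothetical `insn` type (the real change disassembles the symbol's machine code), and `findExternalJump` here is a simplified stand-in that only shows the bounds check on the jump target.

```go
package main

import "fmt"

// insn is a hypothetical, already-decoded instruction. It only carries the
// fields needed for this sketch.
type insn struct {
	Len       int   // encoded length in bytes
	IsRelJump bool  // true for a relative jump
	Rel       int64 // signed displacement, relative to the next instruction
}

// findExternalJump walks the decoded instructions of a function that occupies
// [start, start+size) and returns the target of the first relative jump that
// lands outside that range, or 0 if there is none.
func findExternalJump(start, size uint64, insns []insn) uint64 {
	end := start + size
	pc := start
	for _, in := range insns {
		next := pc + uint64(in.Len)
		if in.IsRelJump {
			// Relative jump targets are computed from the address of the
			// following instruction.
			target := uint64(int64(next) + in.Rel)
			if target < start || target >= end {
				return target
			}
		}
		pc = next
	}
	return 0
}

func main() {
	insns := []insn{
		{Len: 1},
		{Len: 2, IsRelJump: true, Rel: 0x10},   // stays inside the function
		{Len: 6, IsRelJump: true, Rel: 0x5000}, // leaves the function
	}
	fmt.Printf("%#x\n", findExternalJump(0x1000, 0x20, insns)) // prints 0x6009
}
```

The returned address would then be the lookup key for the second step: finding the FDE that covers it gives the bounds of the .cold part.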

Contributor

@fabled left a comment

Thanks for working on this! First round of comments added.

Comment thread asm/amd/insn.go Outdated
// FindExternalJump decodes every instruction in the sym function and searches for
// a relative jump outside itself - to an address not covered by the sym.
// FindExternalJump returns the destination address of the relative jump outside the function or 0.
func FindExternalJump(ef *pfelf.File, f *libpf.Symbol) (libpf.Address, error) {
Contributor

I would not involve pfelf.File in this layer. How about passing just the []byte code slice here instead?

Comment thread asm/amd/insn.go Outdated
end = int64(f.Address) + int64(f.Size)
code []byte
)
code, err = ef.VirtualMemory(rip, int(f.Size), math.MaxInt)
Contributor

This code will move, but I'd just use SymbolData here. Though pfelf.File.SymbolData may need to be updated to return libpf.Symbol instead of just the libpf.SymbolValue.

And given this specific use case, isn't the first outside jump typically near the start of the function? How about just setting a hard maximum of 4 kB or similar? The maximum cap is used only when mmap is not available (test cases), and it prevents the agent from attempting huge memory allocations.

Contributor Author

I cannot restrict the size without sacrificing reliability. What if I cap the size at 4k and the relative jump is at 45k?

Comment thread asm/amd/insn.go Outdated
Comment on lines +46 to +48
if int(f.Size) != len(code) {
	return 0, errors.New("truncated code")
}
Contributor

I think truncation does not matter here.

Comment thread host/host.go Outdated
Comment on lines +37 to +42
// Hash32 returns a 32-bit hash of the input.
// Its main purpose is to be used as a key for caching.
func (fid FileID) Hash32() uint32 {
	return uint32(fid)
}

Contributor

This will not be needed, see below.

Comment thread interpreter/python/python.go Outdated
Comment on lines +887 to +904
func findColdRangeCached(fid host.FileID, ef *pfelf.File, interp *libpf.Symbol) util.Range {
	if ef.Machine != elf.EM_X86_64 {
		return util.Range{}
	}
	if cached, ok := coldRangeCache.Get(fid); ok {
		return cached
	}
	coldRange, err := findColdRange(ef, interp)
	coldRangeCache.Add(fid, coldRange)
	if err != nil {
		log.WithError(err).Errorf("failed to recover python ranges %s",
			fid.StringNoQuotes())
	}
	return coldRange
}

var coldRangeCache, _ = freelru.NewSynced[host.FileID, util.Range](
	256, host.FileID.Hash32)
Contributor

The pythonData returned by the Loader is cached by the process manager; that is the layer that caches the interpreter ranges. This new cache would never be queried in the hot path.

It is true that this data can get unloaded if all processes using the ELF exit. But then it's OK to just do the work here again. Finding the cold range is fast anyway. And I'll be working on #532 to make sure these get cached even if the process exits.

Given the extra code complexity and increased memory usage for little gain, I'd just remove this caching layer.

@korniltsev korniltsev marked this pull request as ready for review June 26, 2025 12:48
@korniltsev korniltsev requested review from a team as code owners June 26, 2025 12:48
@korniltsev korniltsev requested a review from fabled June 26, 2025 12:48
Contributor

@fabled left a comment

Good stuff! Added one more round of comments.

Comment thread interpreter/python/python.go Outdated
Comment on lines +847 to +848
if interp, code, err = ef.SymbolData("_PyEval_EvalFrameDefault", math.MaxInt64); err != nil {
interp, code, err = ef.SymbolData("PyEval_EvalFrameEx", math.MaxInt64)
Contributor

@fabled Jun 26, 2025

Using math.MaxInt64 should not be done here. The limit is used to cap heap allocations. If this function were huge, it would kill the profiler with out-of-memory errors. It should be a few kB, perhaps just 2 or 4 kB. In general, truncating the function's machine code should not be a problem, as the jump should be found in the early portion. Or, if the full function is allowed or wanted, there should be a realistic upper cap such as 128 kB.

Though, I understand that in this case SymbolData should be adjusted to return a libpf.Symbol with the actual length, so that the bounds check for what counts as a jump outside the function works correctly.

Comment thread asm/amd/insn.go Outdated
// FindExternalJump decodes every instruction in the sym function and searches for
// a relative jump outside itself - to an address not covered by the sym.
// FindExternalJump returns the destination address of the relative jump outside the function or 0.
func FindExternalJump(code []byte, f util.Range) (libpf.Address, error) {
Contributor

I wonder if libpf.Symbol would make sense instead of util.Range. Though, the Symbol will have the extra, unused symbol name field. But doing this would avoid unnecessary conversions in the current code (and possibly also in callers).

Comment thread interpreter/python/python.go Outdated
Comment on lines +861 to +862
log.WithError(err).Errorf("failed to recover python ranges %s",
fid.StringNoQuotes())
Contributor

I think this should be a warning or info message, even if it means something went wrong rather than there simply being no cold block. If it were an error, the operation would fail and err would be returned instead.

If logging something, it would probably make more sense to log the file name instead of the ID. So pass info.FileName() instead of the FileID?

@korniltsev korniltsev requested a review from fabled June 26, 2025 14:19
Contributor

@fabled left a comment

LGTM! Thanks! @christos68k or @florianl would you have time to review this?

@christos68k
Member

> LGTM! Thanks! @christos68k or @florianl would you have time to review this?

I'll wrap it up today.

Member

@christos68k left a comment

LGTM

Development

Successfully merging this pull request may close these issues.

python: handle cold interpreter func ranges

3 participants