Skip to content

Fix refcount bug in map input conversion that caused shutdown segfault#27413

Merged
tianleiwu merged 1 commit intomainfrom
tlwu/fix_pybind_mlvalue_deref
Feb 23, 2026
Merged

Fix refcount bug in map input conversion that caused shutdown segfault#27413
tianleiwu merged 1 commit intomainfrom
tlwu/fix_pybind_mlvalue_deref

Conversation

@tianleiwu
Copy link
Copy Markdown
Contributor

@tianleiwu tianleiwu commented Feb 22, 2026

Fix Python refcount bug in map input conversion that caused shutdown segfault in onnxruntime_test_python_mlops.py ( see #27392).

Summary

This PR fixes a Python reference-count ownership bug in the map conversion path in onnxruntime/python/onnxruntime_pybind_mlvalue.cc.

In Python 3.14 test runs, the bug could manifest as a segmentation fault after tests completed (typically at interpreter shutdown), even when test assertions passed. .

Root Cause

In CreateMapMLValue_LoopIntoMap, error paths decremented item unconditionally.

  • In single-map flow, item is a borrowed reference (must not be decref'd there).
  • In iterator/vector-map flow, item is an owned reference (must be decref'd).

The unconditional decref in borrowed-reference flow caused refcount corruption and eventually a crash.

Fix

Add explicit ownership handling for item:

  • CreateMapMLValue_LoopIntoMap(..., bool owns_item_ref, ...)
  • Pass owns_item_ref = false from single-map path (CreateMapMLValue_Map)
  • Pass owns_item_ref = true from vector-map path (CreateMapMLValue_VectorMap)
  • Only Py_XDECREF(item) on error when owns_item_ref is true

This preserves existing behavior while correcting reference ownership.

Validation

cd onnxruntime/test/python
python onnxruntime_test_python_mlops.py

Result:

  • OK
  • Exit code 0 (no shutdown segfault)

Notes

  • Although this became reproducible in Python 3.14, the underlying refcount bug is version-agnostic C-extension undefined behavior.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical Python reference counting bug in the map input conversion code that could cause segmentation faults at interpreter shutdown. The bug manifested when error paths in CreateMapMLValue_LoopIntoMap unconditionally decremented references that were sometimes borrowed (should not be decremented) and sometimes owned (should be decremented).

Changes:

  • Added explicit reference ownership tracking through a boolean parameter owns_item_ref in CreateMapMLValue_LoopIntoMap
  • Conditionally decrement item reference only when owned on error paths
  • Updated both caller sites to pass correct ownership flags (false for borrowed refs, true for owned refs)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@tianleiwu tianleiwu merged commit f501e1d into main Feb 23, 2026
106 of 112 checks passed
@tianleiwu tianleiwu deleted the tlwu/fix_pybind_mlvalue_deref branch February 23, 2026 17:05
tianleiwu added a commit that referenced this pull request Feb 26, 2026
#27413)

Fix Python refcount bug in map input conversion that caused shutdown
segfault in `onnxruntime_test_python_mlops.py` ( see
#27392).

## Summary
This PR fixes a Python reference-count ownership bug in the map
conversion path in `onnxruntime/python/onnxruntime_pybind_mlvalue.cc`.

In Python 3.14 test runs, the bug could manifest as a segmentation fault
after tests completed (typically at interpreter shutdown), even when
test assertions passed. .

## Root Cause
In `CreateMapMLValue_LoopIntoMap`, error paths decremented `item`
unconditionally.

- In single-map flow, `item` is a **borrowed reference** (must not be
decref'd there).
- In iterator/vector-map flow, `item` is an **owned reference** (must be
decref'd).

The unconditional decref in borrowed-reference flow caused refcount
corruption and eventually a crash.

## Fix
Add explicit ownership handling for `item`:

- `CreateMapMLValue_LoopIntoMap(..., bool owns_item_ref, ...)`
- Pass `owns_item_ref = false` from single-map path
(`CreateMapMLValue_Map`)
- Pass `owns_item_ref = true` from vector-map path
(`CreateMapMLValue_VectorMap`)
- Only `Py_XDECREF(item)` on error when `owns_item_ref` is true

This preserves existing behavior while correcting reference ownership.

## Validation
```bash
cd onnxruntime/test/python
python onnxruntime_test_python_mlops.py
```

Result:
- `OK`
- Exit code `0` (no shutdown segfault)

## Notes
- Although this became reproducible in Python 3.14, the underlying
refcount bug is version-agnostic C-extension undefined behavior.
tianleiwu added a commit that referenced this pull request Feb 27, 2026
This cherry-picks the following commits for the release:

| Commit ID | PR Number | Commit Title |
|-----------|-----------|-------------|
| decd177 | #27090 | Fix GatherND division by zero when batch
dimensions mismatch |
| 55f8234 | #27360 | Fix QMoE CPU Operator |
| df9146f | #27403 | [MLAS] Adding DynamicQGemm function pointers and
ukernel interface |
| 0f93853 | #27318 | [js/web] Use embedded WASM module in Blob URL
workers when wasmBinary is provided |
| b2a6e69 | #27364 | QMoE CPU Performance Update (Up to 4x on 4-bit)
|
| f501e1d | #27413 | Fix refcount bug in map input conversion that
caused shutdown segfault |
| b32b205 | #27421 | Fix error where bytes is not assigned for
dynamic qgemm pack b size |
| 426b006 | #27397 | Fix DllImportResolver |
| 0982844 | #27412 | MatmulNBits prepacking scales fix |
| 9afb0d2 | #27430 | Fix validation for external data paths for
models loaded from bytes |
| 71d2cd0 | #27401 | Enable Python 3.14 CI and Upgrade Dependencies |
| 79e0676 | #27419 | fix: out of bounds access for resize operation |
| 82eb99c | #27459 | Fix SkipLayerNorm fusion incorrectly applied
when gamma/beta are not 1D |
| 355278a | #27444 | Fix GatherCopyData Integer Truncation Leading to
Heap Out-of-Bounds Read/Write |
| cf96123 | #27411 | [web] fix usage of wasmBinary together with a
blob URL for .mjs |
| 1131a86 | #27399 | [web] remove the unhelpful "Unknown CPU vendor"
warning. |
| ffbbc4f | #27316 | Build Windows ARM64X binaries as part of
packaging pipeline |

---------

Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
Co-authored-by: patryk-kaiser-ARM <patryk.kaiser@arm.com>
Co-authored-by: don <70039285+0-don@users.noreply.github.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Lukas Folle <126877803+lukas-folle-snkeos@users.noreply.github.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Chaya <cha182350@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Erik <erscor@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants