Skip to content

Fix: libxml manual memory management#15906

Merged
straight-shoota merged 17 commits intocrystal-lang:masterfrom
ysbaddaden:fix/libxml-manual-memory-management
Jun 26, 2025
Merged

Fix: libxml manual memory management#15906
straight-shoota merged 17 commits intocrystal-lang:masterfrom
ysbaddaden:fix/libxml-manual-memory-management

Conversation

@ysbaddaden
Copy link
Collaborator

@ysbaddaden ysbaddaden commented Jun 17, 2025

libxml allows to customize the memory allocators, which we used to plug the GC.

Under certain conditions (e.g. MT, libxml 2.14), this integration leads to segfaults when a GC cycle happens within a libxml function.

I'm not quite sure what's happening. Maybe libxml keeps pointers somewhere that the GC doesn't scan (thread locals?) and the GC collects them... though that would create random segfaults, not segfaults during GC. Anyway: removing the GC integration fixes the issue.

In addition, the libxml2 distributed in macOS 15.4 is patched to remove the custom memory allocators API :trollface:

Other bindings to external libraries in Crystal don't plug the GC but use manual memory management to free the external allocations when not needed anymore (automated through finalizers).

The difficulty is that we allocate the whole DOM tree with libxml. We could use a libxml parser to build a DOM tree with Crystal objects, and it would be fantastic, but then we'd have to reimplement XPath 😰

This patch replaces the GC integration with explicit memory management instead.

For readers, writers, or xpath contexts, we merely free the libxml allocations in finalizers.

Tree nodes are more complex. The xml free functions will recursively free the whole subtree; a node might be unlinked and thus won't be freed with the rest of the document anymore, etc.

We can store a reference to the XML::Node of documents into the libxml node (using the _private struct member) because the XML::Node will live exactly as long as the libxml doc.

We can't do that for subtree nodes that may be collected at any time and would leave dangling pointers. We could not care (and allow multiple XML::Node to represent one libxml node for example, since collecting the doc will free the libxml node) but a XML::Node can be unlinked from the main tree, and must be manually freed. I thus introduced a dual reference:

  1. Every XML::Node has a strong reference to its document's XML::Node (can't free document while we have references to a subtree node);
  2. Every document XML::Node keeps a cache with weak references to all its subtree XML::Node (only those that have been instantiated).

A XML::Node is thus unique per libxml node, yet can be collected when unused without freeing some of the document, after which we can instantiate a new XML::Node.

NOTE: I was able to avoid breaking changes, especially in constructors, but I still undocumented the constructors that take an external libxml pointer, or expose internal details. I assume they should be internal API. We should extend the XML integration, for example with DOM manipulation (following the DOM spec, not the libxml API) so nobody has to extend the libxml integration themselves.

Depends on #15899.
Closes #15619 (to be verified on macOS).

On Linux: no more segfaults with libxml 2.14 when the crystal code is compiled, however the interpreter still segfaults and prints error messages to STDERR.

No issues with libxml 2.13 and below.

@ysbaddaden ysbaddaden self-assigned this Jun 17, 2025
@ysbaddaden ysbaddaden added kind:refactor breaking-change May have breaking effect in some edge cases. Mostly negligible, but still noteworthy. topic:stdlib:serialization labels Jun 17, 2025
@ysbaddaden ysbaddaden force-pushed the fix/libxml-manual-memory-management branch from 905d839 to 0100017 Compare June 19, 2025 12:48
@ysbaddaden
Copy link
Collaborator Author

It's probably not fully safe. There's nothing preventing to dup a XML::Node document. We can have multiple XML::Node instances for the same tree node, for example multiple #xpath requests, a call to #children and #next and so on.

What if we take a XML::Node twice: we call Node#unlink from one instance and we lose the reference, the GC will finalize and free the LibXML::Node, then we try to access the second node => 💥

@ysbaddaden
Copy link
Collaborator Author

What's the _private struct member for LibXML structs? it's always NULL... could we... use it for application data 🤔

@HertzDevil
Copy link
Contributor

HertzDevil commented Jun 19, 2025

It says:

    /**
     * @deprecated Use xmlCtxtGetPrivate() and xmlCtxtSetPrivate()
     *
     * For user data, libxml won't touch it
     */

Another one:

    /** Application data. Often used by language bindings. */

@ysbaddaden
Copy link
Collaborator Author

ysbaddaden commented Jun 19, 2025

I tried to use the _private struct member to store the corresponding XML::Node reference, so we only ever have one XML::Node instance per LibXML::Node* but we can't safely nullify the _private when we lose the XML::Node because xmlFreeDoc or xmlFreeNode can run before a finalizer and will have freed the LibXML::Node* and we get malloc errors or segfaults.

Edit: only a Hash(LibXML::Node*, WeakRef(XML::Node)) seems safe.

Edit: or just make sure we can't dup a the XML::Node of a document (so we don't free twice), or can't unlink a node twice... with the difficulty that different XML::Node for the same LibXML::Node* might try to unlink (or future proof: it got added to another doc) 😭

@ysbaddaden
Copy link
Collaborator Author

ysbaddaden commented Jun 19, 2025

I may have found a solution (no more issues):

  1. Allocate a document Node once and store it into _private; this is fine because the libxml node lives as long as the document XML::Node (the latter is the one that frees the former).

  2. Allocate each subtree Node once and keep a cache of libxml node to Node references per document, so we're sure that every unlinked node will share the same Node and will be unlinked once without having to access the libxml node that may have already been freed.

And... I'm just noticing an issue: collecting an unlinked node doesn't clean up the unlinked node (nor its subtree), and at best we're leaking some entries in the cache (not cool), and at worst new LibXML::Node* reusing the same malloc 💥

I could cleanup the cache in the finalizer, at least zero the Node address to avoid mutating the Hash, but there's no telling a thread wasn't mutating or rehashing the hash 😭

Oh, we keep the unlinked node reference to the document, so either we collect everything or we collect nothing and thus never collect an unlinked node before its document, so we're not leaking 🎉

The complexity will be when we adopt the node into another document (not supported): we'll have to move the unlinked node (and its subtree) from the original doc's cache into the new doc's cache.

Edit: but... if I don't need to do any cleanup from the cache, and we collect everything or nothing, then I can just put every instantiated Node in _private, and that would simplify adopting a Node into another document (only need to update the @document of sub nodes)

Edit 2: Nope. Without a strong reference from doc Node to subtree nodes we can lose a reference to a XML::Node!

  • With _private the XML::Node finalizer must nullify _private, but we can't because it may have been freed 💥.
  • With a blind cache (no pointers) the finalizer must clear each XML::Node from the cache (optimization: unless the libxml doc has been freed).
  • If the cache has pointers, Boehm complains that finalizers have dual references (node -> doc -> cache -> node).
  • This leaves us with WeakRef, maybe?

Edit 3: we indeed needed WeakRef 🎉

The main advantage is a much simpler interface. The disadvantages are:

1. `XML::NodeSet` can now only hold 2.1 billion entries because
   `Slice#size` is an Int32; it shouldn't be an issue in practice.

2. `XML::Node#children` will now allocate an `XML::Node` for every child
   node, but will only allocate once, not on every access.
@ysbaddaden ysbaddaden force-pushed the fix/libxml-manual-memory-management branch from 623c758 to 22edf3f Compare June 19, 2025 18:53
@ysbaddaden ysbaddaden force-pushed the fix/libxml-manual-memory-management branch from 5b55479 to 83bd1dc Compare June 20, 2025 10:21
@ysbaddaden ysbaddaden force-pushed the fix/libxml-manual-memory-management branch from 83bd1dc to 2790e21 Compare June 20, 2025 10:33
@ysbaddaden ysbaddaden marked this pull request as ready for review June 20, 2025 10:34
@ysbaddaden
Copy link
Collaborator Author

ysbaddaden commented Jun 20, 2025

I updated the PR description to reflect the current state.

I'm putting the PR back into DRAFT because there's still an issue: we can't free the libxml of an unlinked node because it will free its subtree, yet we might still have live XML::Node to them 😭

@ysbaddaden ysbaddaden marked this pull request as draft June 20, 2025 12:45
That would free its subtree nodes, but we might have live XML::Node
referencing these nodes. We thus have to delay until the document is
unreachable, at which point we can free everything at once.
@ysbaddaden ysbaddaden marked this pull request as ready for review June 20, 2025 13:31
@straight-shoota straight-shoota added this to the 1.17.0 milestone Jun 24, 2025
@straight-shoota straight-shoota merged commit 7f0f956 into crystal-lang:master Jun 26, 2025
38 checks passed
@straight-shoota straight-shoota added kind:breaking Intentional breaking change with significant impact. Shows up on top of the changelog. and removed breaking-change May have breaking effect in some edge cases. Mostly negligible, but still noteworthy. kind:refactor labels Jun 26, 2025
@ysbaddaden ysbaddaden deleted the fix/libxml-manual-memory-management branch June 26, 2025 08:14
@ysbaddaden
Copy link
Collaborator Author

ysbaddaden commented Jun 26, 2025

The interpreter still segfaulting may be due to the compiler being linked to libLLVM that links with libxml2.so.2, but the interpreter is trying to load libxml.so.16 (ABI bump in 2.14) and we end up with conflicts?

I got a warning when trying to link the std specs with both LLVM and libxml about this.

@crysbot
Copy link
Collaborator

crysbot commented Dec 30, 2025

This pull request has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/a-static-site-generator-using-crystal/4281/20

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind:breaking Intentional breaking change with significant impact. Shows up on top of the changelog. topic:stdlib:serialization

Projects

None yet

5 participants