-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-16673: [Java] C data interface: Allow ownership transferring for imported buffer #13248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
| @Override | ||
| public OwnershipTransferResult transferOwnership(ArrowBuf sourceBuffer, BufferAllocator targetAllocator) { | ||
| throw new UnsupportedOperationException(); | ||
| ArrowBuf targetArrowBuf = this.deriveBuffer(sourceBuffer, 0, sourceBuffer.capacity()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it expected that targetAllocator isn't involved in this method at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind of. I would understand this as the original design of CDataReferenceManager not to bind itself to any allocator.
To me it would be great if we take some time to find if it is possible to make some refactors on ReferenceManager and BufferLedger to decouple allocator instance from the base interface ReferenceManager. Sounds OK to me that a native memory block doesn't "belong to" any Java side allocator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It concerns me somewhat that the method doesn't perform its stated purpose: "Transfer the memory accounting ownership of this ArrowBuf to another allocator", but I also understand what you're saying about the referenceManager not binding itself to any allocator: the getAllocator() method in this class always returns null. Even so, I wonder if it's not better to give this operation a name that matches the functionality better.
I noticed also that this implementation is very similar to public ArrowBuf retain(ArrowBuf srcBuffer, BufferAllocator targetAllocator) which also has an unused bufferAllocator argument. The main difference between the two methods seems to be that the retain() implementation increments the reference count, while this method does not. It also returns the newly derived buffer directly, rather than wrapped in a OwnershipTransferResult. But since the flag in the result here is always true, it doesn't seem to add too much.
Are you trying to make the CDataReferenceManager usable in code that supports other ReferenceManager implementations? If not, I guess I would just ask that you confirm that a variation without the reference count increment is needed, and that it should have this signature instead of something more descriptive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you trying to make the CDataReferenceManager usable in code that supports other ReferenceManager implementations?
This would be helpful for a project I'm currently working on. For CDataRefereneceManager specifically, though, do you know if there's currently any way to copy a ValueVector? I'm essentially trying to clone a VectorSchemaRoot that's read from a Parquet file into a RootAllocator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adzcai Is the 'This' that would be helpful is the new method that doesn't increment the reference count, or some other aspect of the PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does seem like this might be a partial fix, that might have unintended consequences. It seems like this was already discussed on the main page of the PR was there any resolution here?
|
@lwhite1 Do you feel acquainted enough with the C data interface to give this a look? |
|
I will take a look, but it would probably be best to have someone with more
experience look at it also.
…On Tue, Jun 7, 2022 at 7:46 AM Antoine Pitrou ***@***.***> wrote:
@lwhite1 <https://github.com/lwhite1> Do you feel acquainted enough with
the C data interface to give this a look?
—
Reply to this email directly, view it on GitHub
<#13248 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA2FPAVRBFNOOME2RNXEC3DVN4ZCPANCNFSM5XDA4UBQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
|
@lidavidm Would you please take a look at new method here. I understand now that the method implements a previously unsupported method in the interface, so it's nice to have even with the unused argument, etc. I'm still uncertain about whether it's correct to add a buffer without calling retain(). |
|
The semantics of this method in the interface are rather confusing, but I think this operation does not make sense to implement with the current design. The way it works is that allocators form a hierarchy, with a RootAllocator at the…root, and child allocators deriving from that. An ArrowBuf is a slice of an underlying buffer whose reference count is managed by a BufferLedger (so when you manipulate an ArrowBuf's refcnt, it's actually updating the BufferLedger). Now, a buffer/BufferLedger can be referenced by multiple allocators, but is only "owned" by a single allocator. This method is just meant to swap which allocator owns the buffer. However, as implemented, it only works for allocators with the same root: arrow/java/memory/memory-core/src/main/java/org/apache/arrow/memory/AllocationManager.java Lines 96 to 97 in cc9b89a
So I take this to mean that the intent of this method is to shuffle the accounting of who is ultimately responsible for the reference count. Since C Data Interface buffers aren't owned by any allocator, they can't be transferred (or it doesn't make sense for them to be transferred). As an example: I think the intent of this method is that you could open a child allocator, then open a file with the child allocator, and read a batch. The batch's buffers would belong to the child allocator. Now you want to close the file, but keep the memory around, so then you would transfer ownership to the root allocator. That way, the child allocator can be closed without it complaining about unclosed buffers. (Or maybe instead of files, think about passing batches between nodes in a computation graph. Each graph could have its own child allocator so that at the end of the query, you can pinpoint any memory leaks by individual nodes; then you need to transfer ownership so that you don't get spurious errors.) I think what needs to happen is that C Data Interface allocations need to get associated with an actual allocator, even if the allocator isn't actually doing the allocation. The allocator should include the memory of these native buffers in its own accounting (I think this is a desirable property, so that you can properly track and limit memory consumption as before). Then it would make sense to have transferOwnership. |
This is how we used to handle native-allocated blocks in Java dataset codes before migrating to ABI:
d25660e#diff-177060b1e9fd73540eacd8d55a32a81702ddbc39b6f1f406702d345620fbd1ceR25 Although the codes binding it to allocator was more like a workaround: d25660e#diff-6f45bb0f8f4da15fddfe7027b1983fe16dda4332b1cba9a9f1a6f28a0fa66a26R101-R104 I am not sure if introducing a new API
However this may require for refactors on current allocation API which might introduce some breaking changes. |
|
Ok, thanks. In that case, it seems like we need the ability to associate the C Data reference manager with a Java allocator, and update APIs to require a BufferAllocator to associate with. I don't think that'll require breaking changes in the core APIs, but it would probably mean updating/deprecating APIs in the C Data interface. |
I guess this is reasonable given where we are, but associating with an allocator that isn't doing the allocation seems likely to confuse future developers if the memory appears to be managed in the Java code but isn't. Is the idea that all users of a given bit of memory keep track of their own use, but only the originator (C++ presumably in this case) actually frees it?
Since the API already has a bufferAllocator argument (that is ignored), I'm not sure what change you're suggesting unless it's simply to annotate the argument as non-nullable. If there was no bufferAllocator argument it would be simple enough to deprecate the current method and add a new one. |
@zhztheplayer correct me if I'm wrong, but what I see as the issue is this: data comes in via the C Data Interface. You then want to unload the data into an ArrowRecordBatch and load it into another root (that happens to have its own allocator) to perform some computation on it. But since transfer is not implemented, that currently fails. The problem is that this transfer, semantically, is to change the ownership of the buffer. So yes, an allocator would end up with memory it didn't technically allocate. But that's already the case - the whole point of these APIs is to make an allocator responsible for something it didn't allocate! The main implementation problem I see is that currently, the transfer assumes the memory was allocated by an allocator in the same allocator tree, i.e. that the allocator implementation is the same. That's the main assumption that would have to change. Perhaps 'allocator' is the wrong way to think about the API if/when we make the change; BufferAllocator is primarily an arena or nursery used for accounting of buffers, that can also allocate new buffers that are associated with itself. In this model, it makes perfect sense to transfer buffers between allocators: it's just a conscious choice to hand over memory ownership. (My memory of the details is fuzzy, but the underlying refcount doesn't really change; it's just changes whether the allocator will complain about a buffer when closed or not.) The Java library works by refcounting a buffer, and freeing it when the reference count reaches 0. From this perspective, memory from the C Data Interface is no different, it's just another kind of buffer to refcount. When the Java-side refcount reaches 0, the C-side release callback is invoked (this is already how it works).
Yes, this argument would become required. |
Yes this could be the original issue. If we all agree on associating an allocator with the imported buffer I can start to try making a implementation then. Actually I was trying to get the actual purpose of buffer transferring for some time. In c++ we don't have this design but things seem to go on well. Also it seems in netty or java nio they don't design transfer semantic for buffer too. |
|
C++ lets you specify local memory pools, but doesn't require it. It's like always having a static global BufferAllocator. Java is stricter about asking you to consider when and where memory is allocated. |
That sounds good to me. It would be great if you could add something to the class javadoc about this use case. |
|
Hoping to restart this thread. I'm beginning to wonder if we (including and maybe especially, me) haven't made this all too complicated. The proposed simple change makes the code work the same as other ReferenceManager implementations (or nearly so), and allows the original Vector's ArrowBuf to hand off its memory to another ArrowBuf. The code here around the management of imported data is pretty complicated but I think this change is pretty safe. From my review, I believe this is what is happening: In the ArrayImporter code, the CDataReferenceManager releases the memory as soon as the import is complete: So the ref count is already 0 when the transfer attempt occurs later. The Data class that provides the high-level interface for getting C-Data has methods like In spite of the naming/documentation of the methods involved, there is nothing that requires a TransferPair to actually change what Allocator the newly created ArrowBuf is associated with. There are numerous use-cases for "transferring" the memory to the same Allocator it started with. Maybe the most important existing, but currently unsupported, use cases are:
Right now, you can use data from C in the format it was received in, or not at all. It's not clear to me that the original change makes anything worse than it currently is, and it would provide support for these two important use-cases. Of course, I could be wrong again. What am I missing? |
That can't be right, ref count 0 is freed and so that would mean the code causes a use-after-free. The reference count should be nonzero, because doImport calls into FieldVector#loadFieldBuffers and the type-specific implementations call retainBuffer on the bufferse as necessary. |
| TransferPair transferPair = fieldVector.getTransferPair(targetAllocator); | ||
| transferPair.transfer(); | ||
| ValueVector to = transferPair.getTo(); | ||
| assertEquals(cnt, to.getValueCount()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does targetAllocator report for allocated bytes after the transfer?
|
My objection here is that the semantics of what is going on here are unclear. We can add code that makes things pass tests but if the intent is already hard to follow then we're just setting up future library users - and ourselves - for more confusion. In any case I don't have enough time really to look into things deeply. So if another maintainer wants to merge this then I'm not going to object. I still think we're too fixated on this being a memory 'allocator' when it seems more like a 'nursery' object. Also I guess going back to this
This makes sense to me from the 'nursery' or 'accountant' perspective. Closing a buffer delegates to the specific ReferenceManager implementation, which may use Java code or native code. It also updates the allocation metric in the BufferAllocator, which is separate from how the buffer is actually implemented. Then the application closes the BufferAllocator, which can check at runtime that the application actually closed all buffers it claims were associated with this scope. Transfer, in this scheme, is a way to change what object is responsible for allocated bytes, as opposed to allocating new ones or freeing allocated ones. |
|
eek, and right now we don't even know the buffer size so we can't actually do any proper accounting. hmm. arrow/java/c/src/main/java/org/apache/arrow/c/ArrayImporter.java Lines 144 to 146 in 83ad54c
|
Agreed, but the argument for moving ahead is not so much 'make tests pass' as 'unblock all uses of a vector or VectorSchemaRoot returned by the C-Data-interface that require a memory transfer'.
I appreciate all the time you put into this. I will go hunting for additional eyeballs.
Yes. The name is unfortunate. TBH, I'm not sure what role the Java allocator plays for data managed by native memory. It seems like it's mostly there because the API requires it.
I agree that the name "allocator" is somewhat confusing when it may not necessarily allocate, but I feel like a general cleanup of this code is probably beyond the scope of this PR. Maybe a ticket could be opened for that and better documentation would help in the short term? |
You're probably right. I didn't actually check, but rather assumed the importer would only have one reference. |
|
@emkornfield Any chance you could find the time to look at this? |
|
Yes, will try to get to it by end of week, feel free to pester me, i you don't hear from me by then. |
|
Hi @emkornfield, just checking in to see if this is still doable for you this week. |
|
@emkornfield I'm sure this qualifies as pestering now :). If you think that there is someone else who could look at it, lmk and I can try them. Thanks. |
|
Perhaps @lidavidm can give another look? :-) |
| targetArrowBuf.readerIndex(sourceBuffer.readerIndex()); | ||
| targetArrowBuf.writerIndex(sourceBuffer.writerIndex()); | ||
|
|
||
| return new OwnershipTransferResult() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can OwnershipTransferNOOP be used here?
|
Sorry for the delay, it seems like there is a question on overall design here? |
There is a question on the design. I was hoping that we could avoid trying to remedy underlying design issues. It may be the case that this solution has unintended downstream consequences; I'm not sure. The author hasn't had time to work on the issue for a couple months. My concern is that the current implementation is arguably already broken in that an attempt to transfer an otherwise normal-seeming vector or root results in an exception at runtime, and there's no forward motion. If this solution is indeed not satisfactory then maybe this PR should be closed? |
|
Sorry, I don't have the expertise on these APIs to evaluate the solution here. @jacques-n might recall the use-cases here and whether this is desirable, or @lidavidm |
|
Highly WIP, but #14506 explores what I think the 'proper' solution is: model C Data allocations as actual allocations in Java, and integrate them into the BufferAllocator hierarchy. The caveats are 1) we need to figure out the length of buffers to do this (not too hard, just tedious) and 2) we need to expose some APIs from the memory-core package (we could avoid this by providing a formal API for 'foreign allocations', IMO) |
|
Yeah, my main thoughts were people are trying to do a end run rather fully integrate. @lidavidm, I think your general goals are good. I assume you saw the large markdown doc that covers allocator internals? |
|
Yeah - I need to spend some time reconciling that with doc comments, Sphinx documentation, and the actual implementation - ideally something that important should be in a more visible place |
|
@lidavidm Thanks a lot for taking the work on since I was too occupied by some other stuffs to keep watching on this these days.
If we decide to follow this path, it probably be able to help that I had some design which I might mention once: This was to add API The interface public interface MemoryChunk {
/**
* The size (in bytes) of this chunk.
*
* @return size of this chunk (in bytes).
*/
long size();
/**
* The native address indicating this chunk.
*
* @return the native address of this chunk.
*/
long memoryAddress();
/**
* Destroy and reclaim all memory spaces of this chunk.
*/
void destroy();
}Would you think it could be helpful here? |
|
Thanks! That's the same API as AllocationManager. Though described as internal, it's effectively a public API already so I plan to reuse that. |
|
Should be resolved by #14506 |
No description provided.