ARROW-2998: [C++] Add unique_ptr versions of Allocate[Resizable]Buffer #2395
wesm
left a comment
Thanks for adding this. Adding some basic unit tests would be a good idea. These could be done with TYPED_TEST to avoid code duplication in the test suite, if desired.
cpp/src/arrow/buffer.cc
Outdated
This could be templated something like

template <typename BufferType, typename Container>
inline Status ReturnBufferSized(Container<BufferType>&& buffer, const int64_t size,
                                Container<Buffer>* out) {
  RETURN_NOT_OK(buffer->Resize(size));
  buffer->ZeroPadding();
  *out = std::move(buffer);
  return Status::OK();
}

This can also be used to address code duplication in AllocateResizableBuffer as it is now.
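As a rough illustration of the templating idea in the comment above, here is a minimal, self-contained sketch. Everything in it is a stand-in: `Status`, `Buffer`, and `ToyResizableBuffer` are hypothetical toy types, not Arrow's classes, and the helper deduces the output pointer type from the out-parameter rather than using `Container<BufferType>` (which, as written in the comment, would need a template template parameter to compile).

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; not Arrow's real classes.
struct Status {
  bool ok_flag;
  static Status OK() { return Status{true}; }
  bool ok() const { return ok_flag; }
};

struct Buffer {
  virtual ~Buffer() = default;  // needed so deletion through Buffer* is safe
  virtual int64_t size() const = 0;
};

struct ToyResizableBuffer : Buffer {
  std::vector<uint8_t> data;
  Status Resize(int64_t new_size) {
    data.resize(static_cast<std::size_t>(new_size));  // new bytes are zeroed
    return Status::OK();
  }
  void ZeroPadding() {}  // this toy buffer has no padding to clear
  int64_t size() const override { return static_cast<int64_t>(data.size()); }
};

// One helper serving both pointer kinds: OutPtr is deduced from `out`, so it
// can be std::unique_ptr<Buffer> or std::shared_ptr<Buffer>.
template <typename BufferType, typename OutPtr>
Status ReturnBufferSized(std::unique_ptr<BufferType> buffer, int64_t size,
                         OutPtr* out) {
  Status st = buffer->Resize(size);
  if (!st.ok()) return st;
  buffer->ZeroPadding();
  *out = std::move(buffer);  // unique_ptr converts to either owner type
  return Status::OK();
}
```

Because a `std::unique_ptr` can be move-assigned into either a `std::unique_ptr` or a `std::shared_ptr` of a base type, a single helper body can back both flavours of an allocation function in this sketch.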
Hmm, why is this useful? This is basically duplicating the existing API. If we start applying this pattern everywhere, we'll end up maintaining two mostly similar APIs... |
I don't think we should use |
@pitrou My rationale is that @wesm I don't see a thread-safe way to get a pointer out of a (*) the output parameters at the end of the param list, usually. |
The runtime overhead is only when copying a As for the fact that only one reference exists, it may not always be the case. For example, if you are asking for a 0-sized buffer, returning a shared 0-sized Buffer would be a valid optimization IMO. |
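The zero-size case mentioned in this comment can be sketched as follows. This is only an illustration of the idea under stated assumptions: `ToyBuffer` and `AllocateToyBuffer` are hypothetical names, not Arrow APIs, and nothing here reflects how Arrow actually allocates.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical toy buffer; not Arrow's Buffer class.
struct ToyBuffer {
  std::vector<uint8_t> bytes;
};

// Hypothetical allocator: every zero-sized request returns the same shared
// instance, so the caller is not guaranteed to hold the only reference.
inline std::shared_ptr<ToyBuffer> AllocateToyBuffer(std::size_t size) {
  static const std::shared_ptr<ToyBuffer> kEmpty = std::make_shared<ToyBuffer>();
  if (size == 0) return kEmpty;
  std::shared_ptr<ToyBuffer> buf = std::make_shared<ToyBuffer>();
  buf->bytes.resize(size);
  return buf;
}
```

An optimization like this is only expressible with shared ownership; a function returning `std::unique_ptr` has to hand out a distinct object on every call.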
I think there's a small amount of overhead when the pointer is dereferenced. Since memory allocation is the lowest level of the stack, I'm fine with having |
Needs a rebase |
100c7a3 to d8cbe2e
I doubt it's significantly different from |
@wesm in the libstdc++ that comes with gcc 5.4.0, there is no dereference overhead, but there is overhead in destruction, which must be thread-safe and so uses atomic operations.
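To make the ownership difference behind this point concrete, here is a small sketch using only the standard library. It does not measure anything; it only shows that a `std::shared_ptr` maintains a reference count that each copy and destruction must update, while a `std::unique_ptr` has no count and can only be moved. `Payload` and the two helper functions are hypothetical names for illustration. Whether the count updates are atomic depends on the standard library implementation and build configuration.

```cpp
#include <memory>
#include <type_traits>
#include <utility>

// Hypothetical payload type for illustration.
struct Payload {
  int value = 0;
};

// Copying a shared_ptr adds an owner; the count is visible via use_count().
inline long OwnersAfterCopy() {
  std::shared_ptr<Payload> a = std::make_shared<Payload>();
  std::shared_ptr<Payload> b = a;  // copy: bumps the shared reference count
  return b.use_count();
}

// A unique_ptr has no count to maintain: ownership is transferred by move,
// and the moved-from pointer is left null.
inline bool SourceEmptyAfterMove() {
  std::unique_ptr<Payload> a(new Payload);
  std::unique_ptr<Payload> b = std::move(a);
  return a == nullptr && b != nullptr;
}

// unique_ptr is move-only, which is what lets it skip reference counting.
static_assert(!std::is_copy_constructible<std::unique_ptr<Payload>>::value,
              "unique_ptr cannot be copied");
```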
I'm familiar with the guidelines. What I'm saying is that we have a project here that is oriented largely around hierarchical shared memory references (e.g. to memory maps, POSIX shared memory, payloads coming over the wire), which explains our general preference for using |
I think that API consistency trumps micro-optimization here. If we discover a hot internal code path where the |
I originally suggested to Jim to submit this patch in apache/parquet-cpp#432 |
Ah, right, parquet-cpp is using different internal conventions :-/ How do you plan to deal with that if we merge the codebases together? Would we migrate parquet-cpp to the same conventions as the Arrow C++ codebase? |
The use case there was a private buffer that would not be exported outside the scope of the Bloom filter. I think it is OK for components to use In the case of a lot of the rest of Arrow, e.g. the columnar data structures, the memory could be shared or reused in many cases, so we need to use |
It's ok, but is it worth adding to our API surface? You'd be saving a tiny bit of memory and a tiny bit of overhead. |
I'm in favor in this limited case. We have plenty of other APIs that could return both kinds of pointers, e.g. https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.h#L48. I wouldn't be in favor of adding both variants of the functions in such cases. I agree that there is the slippery slope possibility of trying to "be all things to all people", but memory allocation is about as close to the metal as we get. I would rather see people reuse these abstractions (particularly since we deal with jemalloc interop, padding/alignment, and other issues) rather than rolling their own. |
Maybe we should use a similar pattern directly in |
Rebased. The flaked build (from a package manager timeout) should hopefully pass now. +1 -- per above discussion I think we should be conservative going forward about adding too many duplicate unique_ptr/shared_ptr APIs |
Ok with me. |
Codecov Report
@@            Coverage Diff             @@
##           master    #2395      +/-   ##
==========================================
+ Coverage    84.8%   86.89%   +2.09%
==========================================
  Files         296      237      -59
  Lines       45641    42706    -2935
==========================================
- Hits        38705    37110    -1595
+ Misses       6891     5596    -1295
+ Partials       45        0      -45
thanks @jbapple-cloudera! |
This could be improved in a couple of ways:
Remove duplication. I didn't do this yet because there already is duplication in buffer.cc and I wanted some feedback before proceeding.
Add tests. I didn't do this yet because the testing of the existing shared_ptr AllocateBuffer functions is quite slim, so I wanted some feedback before proceeding.