Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-2586: [C++] Changing the type of ListBuilder's and StructBuilder's children from unique_ptr to shared_ptr so that it can support deserialization from Parquet to Arrow with arbitrary nesting #2034

Closed

Conversation

joshuastorck
Copy link
Contributor

This allows the class that is responsible for deserializing to hold a shared_ptr to the concrete type without having to static_cast from the (List|Struct)Builder's getters.

This was needed for apache/parquet-cpp#462.

@wesm
Copy link
Member

wesm commented May 12, 2018

Do the classes that use these builders in parquet-cpp need to retain ownership of the builder objects in some way? I'm wondering if bare pointers would not be sufficient vs. having shared_ptr

@joshuastorck
Copy link
Contributor Author

In parquet-cpp, the graph of nested nodes is built depth first. If you wanted to use raw pointers, the child would have to hold onto a raw pointer in the construction process and then pass ownership to the parent. That makes things fairly complicated in the presence of failures in construction of the tree. Basically, the child nodes would have to hold an extra variable to indicate whether or not they have transferred ownership to their parent. In their destructors, they would need to check to see if they had passed ownership and conditionally destruct the builder if they hadn't. That basically is a poor man's version of a shared_ptr.

Conceptually, I think the building is "shared" between the (List|Struct)Builders and the classes that are appending values and don't see why we shouldn't be using shared_ptr's.

@pitrou
Copy link
Member

pitrou commented May 15, 2018

On the principle it sounds reasonable to me. Two things:

  1. You should prepend the JIRA ticket number and category to the PR's title.
  2. You should check that this doesn't degrade performance (it probably does). If the shared pointer is referenced in a loop, it will add overhead, so we need to store the raw pointer as well and use it where performance is sensitive.

@joshuastorck
Copy link
Contributor Author

In regards to performance, there should be no difference between a unique_ptr and a shared_ptr when using operator->. When I generated ".i" files, here are the respective implementations of unique_ptr and shared_ptr:

template<typename _Tp>
class shared_ptr : public __shared_ptr<_Tp>
{
 public:

  element_type*
  operator->() const throw()
  {
    return _M_ptr;
  }
  
};

template <typename _Tp, typename _Dp = default_delete<_Tp> >
class unique_ptr
{
 public:

 pointer
 operator->() const noexcept
 {
   return get();
 }

 pointer
 get() const noexcept
 {
   return std::get<0>(_M_t);
 }
 
};

I'm confident that all of these are inlined in an optimized build. If you are worried about that not happening, you could argue that unique_ptr would be more suspect than shared_ptr, since that call to get <0> makes two or three more function calls.

@pitrou
Copy link
Member

pitrou commented May 15, 2018

No, that looks ok. I was worried about a shared_ptr being copied around, not about the cost of dereferencing (but that would have failed with unique_ptr, so changing the unique_ptr to shared_ptr shouldn't be able to trigger such a situation).

@joshuastorck joshuastorck changed the title Changing the type of ListBuilder's and StructBuilder's children from unique_ptr to shared_ptr so that it can support deserialization from Parquet to Arrow with arbitrary nesting ARROW-2586: [C++] Changing the type of ListBuilder's and StructBuilder's children from unique_ptr to shared_ptr so that it can support deserialization from Parquet to Arrow with arbitrary nesting May 15, 2018
@codecov-io
Copy link

codecov-io commented May 15, 2018

Codecov Report

Merging #2034 into master will increase coverage by 2.36%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2034      +/-   ##
==========================================
+ Coverage   84.39%   86.75%   +2.36%     
==========================================
  Files         293      237      -56     
  Lines       44789    42046    -2743     
==========================================
- Hits        37799    36479    -1320     
+ Misses       6963     5567    -1396     
+ Partials       27        0      -27
Impacted Files Coverage Δ
cpp/src/arrow/builder.h 97.26% <ø> (ø) ⬆️
cpp/src/arrow/builder.cc 79.71% <100%> (-0.05%) ⬇️
go/arrow/array/bufferbuilder.go
rust/src/builder.rs
go/arrow/array/numericbuilder.gen.go
go/arrow/math/int64_sse4_amd64.go
go/arrow/math/uint64.go
go/arrow/memory/go_allocator.go
go/arrow/datatype_numeric.gen.go
go/arrow/array/bufferbuilder_byte.go
... and 49 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 491114b...d5aac22. Read the comment docs.

@@ -1279,14 +1279,15 @@ Status Decimal128Builder::FinishInternal(std::shared_ptr<ArrayData>* out) {
// ----------------------------------------------------------------------
// ListBuilder

ListBuilder::ListBuilder(MemoryPool* pool, std::unique_ptr<ArrayBuilder> value_builder,
ListBuilder::ListBuilder(MemoryPool* pool,
std::shared_ptr<ArrayBuilder> const& value_builder,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we're using the const PtrType& spelling elsewhere, not PtrType const&.

@wesm
Copy link
Member

wesm commented May 17, 2018

I'm fine with changing to shared_ptr for the extra flexibility -- I think this was the case in the past, and at some point I changed to unique_ptr. Users will need to be mindful about referencing child builders inside loops, but if performance is an issue they will probably want to grab raw pointers to the child builders and store them somewhere else.

@wesm
Copy link
Member

wesm commented Jul 19, 2018

I added this to 0.10.0. I will rebase and make sure it doesn't cause any issues

Joshua Storck added 2 commits July 23, 2018 19:09
…unique_ptr to shared_ptr so that it can support deserialization from Parquet to Arrow with arbitrary nesting. This allows the class that is responsible for deserializing to hold a shared_ptr to the concrete type without having to static_cast from the (List|Struct)Builder's getters.
Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I confirmed that this does not cause any issues in parquet-cpp master

@wesm wesm closed this in 566e398 Jul 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants