-
Notifications
You must be signed in to change notification settings - Fork 787
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide access to inner Write for parquet writers #5471
Provide access to inner Write for parquet writers #5471
Conversation
|
||
/// Returns a mutable reference to the underlying writer. | ||
/// | ||
/// It is inadvisable to directly write to the underlying writer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is copied from the docs on std::io::BufWriter
, which is effectively what this is.
One could argue that exposing this is dangerous as they could potentially mess with the writer underneath us, but I think this is fine because the standard library does this, and there isn't anything to stop them using a writer with interior mutability anyway (as AsyncArrowWriter in fact did)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can add some additional commentary about why it is inadvisable and maybe when it is ok (clearly there is some usecase otherwise we wouldn't have it) for example
/// It is inadvisable to directly write to the underlying writer. | |
/// WARNING: directly writing to the underlying writer | |
/// prior to closing this ArrowWriter can result in corrupted | |
/// parquet files. | |
/// | |
/// This API can be used to flush the underlying writer |
I think the actual usecase in this PR is more complicated (the async writer writes directly into the buffer) so I probably don't fully understand what is going on / the use case. Perhaps all the more reason to improve the documentation 🤔
FYI @ShiKaiWi |
a70270c
to
96de92b
Compare
96de92b
to
9195e9d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thank you @tustvold
I had some comment suggestions but nothing that would prevent merging
I think it is worth considering if we really need to provide inner_mut()
, though perhaps some more documentation would make it clearer what was going on / when it was ok to do so
|
||
/// Returns a mutable reference to the underlying writer. | ||
/// | ||
/// It is inadvisable to directly write to the underlying writer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can add some additional commentary about why it is inadvisable and maybe when it is ok (clearly there is some usecase otherwise we wouldn't have it) for example
/// It is inadvisable to directly write to the underlying writer. | |
/// WARNING: directly writing to the underlying writer | |
/// prior to closing this ArrowWriter can result in corrupted | |
/// parquet files. | |
/// | |
/// This API can be used to flush the underlying writer |
I think the actual usecase in this PR is more complicated (the async writer writes directly into the buffer) so I probably don't fully understand what is going on / the use case. Perhaps all the more reason to improve the documentation 🤔
/// Flushes any outstanding data and returns the underlying writer. | ||
pub fn into_inner(mut self) -> Result<W> { | ||
self.flush()?; | ||
self.writer.into_inner() | ||
} | ||
|
||
/// Close and finalize the underlying Parquet writer | ||
pub fn close(mut self) -> Result<crate::format::FileMetaData> { | ||
/// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// | |
/// returning the completed [`crate::formatFileMetaData`] for the written file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is sufficiently obvious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My rationale for this suggestion was to make it easier for someone to quickly determine "what is the difference between the very similarly sounding finish()
and close()
methods
self.writer.finish() | ||
} | ||
|
||
/// Close and finalize the underlying Parquet writer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Close and finalize the underlying Parquet writer | |
/// Close and finalize the underlying Parquet writer, consuming self | |
/// | |
/// Returns the metadata for the newly created Parquet file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again I think this is sufficiently obvious
/// Trigger forced flushing once buffer size reaches this value | ||
buffer_size: usize, | ||
} | ||
|
||
impl<W: AsyncWrite + Unpin + Send> AsyncArrowWriter<W> { | ||
/// Try to create a new Async Arrow Writer. | ||
/// | ||
/// `buffer_size` determines the number of bytes to buffer before flushing | ||
/// `buffer_size` determines the minimum number of bytes to buffer before flushing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the comment below says "self::write" will flush the intermediate buffer it it is at least half full, but this comment seems to imply that buffer_size
determines when to flush. This seems inconsistent to me, but maybe I don't undersand
The same comments applies to try_new_with_options
below
|
||
/// Returns a mutable reference to the underlying writer. | ||
/// | ||
/// It is inadvisable to directly write to the underlying writer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, I think it would be good to explain why it is inadvisable and document when one would use this API
let err = file_writer.next_row_group().err().unwrap().to_string(); | ||
assert_eq!(err, "Parquet error: SerializedFileWriter already finished"); | ||
|
||
drop(file_writer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the drop necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file_writer has a borrow of the vec
…-inner-parquet-writers
This reverts commit e2b1f22.
Which issue does this PR close?
Closes #5253
Rationale for this change
Whilst working on #5458 I wanted to explore how we could do IO better when writing parquet data to object storage. By allowing mutable access to the underlying writer, we make it easier to integrate with systems that need to flush data asynchronously.
This additionally came up in #5450.
#4122 added similar methods to the IPC readers/writers.
What changes are included in this PR?
Are there any user-facing changes?