ARROW-11773: [Rust] Support writing well formed JSON arrays as well as newline delimited json streams #9575
//! ```
//!
//! Serialize record batches into line-delimited JSON bytes:
//! ## Writing JSON formatted byte streams
The doc examples are probably the best way to see how to use this structure and what the changes looked like
cc @houqp
houqp
left a comment
LGTM!
I personally prefer the more composable formatter trait approach.
It's streaming JSON or NDJSON (https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON)
nevi-me
left a comment
Yeah, it's sometimes inconvenient when trying to read the JSON files in applications that don't support the streaming format/structure. I'm fine with this change.
writer,
started: false,
finished: false,
format: F::default(),
What's the default format? I can't tell from the code
There isn't a default format for Writer (I couldn't figure out how to make one).
This line makes an instance of the formatter. Though come to think of it, none of the formatters actually have state now 🤔 I could move some of the state into JsonArray maybe to make that clearer
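To make the exchange above concrete, here is a minimal sketch of why `F::default()` takes no arguments when the formatter is a stateless unit struct, and how type aliases can stand in for a "default format". The names `Format`, `LineDelimited`, `JsonArray`, `LineDelimitedWriter`, and `ArrayWriter` are assumptions for illustration, not necessarily the definitions in this PR.

```rust
use std::io::Write;
use std::mem::size_of;

/// Hypothetical marker trait for an output format (illustration only).
pub trait Format: Default {
    /// Human-readable name, used only by this sketch.
    fn name(&self) -> &'static str;
}

/// Unit struct: carries no state, so `Default` is trivial.
#[derive(Default)]
pub struct LineDelimited;

/// Unit struct for the array format, also stateless.
#[derive(Default)]
pub struct JsonArray;

impl Format for LineDelimited {
    fn name(&self) -> &'static str {
        "line-delimited"
    }
}

impl Format for JsonArray {
    fn name(&self) -> &'static str {
        "array"
    }
}

/// Sketch of a writer whose format is chosen by the type parameter `F`.
#[allow(dead_code)]
pub struct Writer<W, F> {
    writer: W,
    started: bool,
    finished: bool,
    format: F,
}

impl<W: Write, F: Format> Writer<W, F> {
    pub fn new(writer: W) -> Self {
        Self {
            writer,
            started: false,
            finished: false,
            // Builds an instance of whichever formatter `F` names.
            format: F::default(),
        }
    }

    pub fn format_name(&self) -> &'static str {
        self.format.name()
    }
}

/// Aliases give callers a ready-made choice without a runtime default.
pub type LineDelimitedWriter<W> = Writer<W, LineDelimited>;
pub type ArrayWriter<W> = Writer<W, JsonArray>;

/// Both formatters are zero-sized, so storing one costs nothing.
pub fn formatter_sizes() -> (usize, usize) {
    (size_of::<LineDelimited>(), size_of::<JsonArray>())
}
```

With this shape the caller always names the format through the type, for example `LineDelimitedWriter::new(Vec::new())`, which is one way to live without a default.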
The integration test failure seems like it is unrelated to the changes in this PR: https://issues.apache.org/jira/browse/ARROW-11717
Rationale
Currently the Arrow JSON writer produces output that looks like this (one record per line):
{"foo":1}
{"bar":1}
This is not technically valid JSON. A valid document would wrap the records in an array, something like this:
[{"foo":1},{"bar":1}]
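As a rough illustration of the two layouts, the sketch below joins rows that are already serialized as JSON objects, first one per line and then as a single array. The function names `to_line_delimited` and `to_json_array` are hypothetical and are not part of the Arrow API.

```rust
/// Emit each already-serialized JSON object on its own line (NDJSON style).
pub fn to_line_delimited(rows: &[&str]) -> String {
    let mut out = String::new();
    for row in rows {
        out.push_str(row);
        out.push('\n');
    }
    out
}

/// Wrap the same objects in one well-formed JSON array.
pub fn to_json_array(rows: &[&str]) -> String {
    format!("[{}]", rows.join(","))
}
```

The first output is a stream of separate JSON values, which a strict single-document parser rejects. The second parses as one JSON value.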
New Features
This PR parameterizes the JSON writer so it can write in either format. (Note: I needed this feature in IOx, in https://github.com/influxdata/influxdb_iox/pull/870, and I want to propose contributing it back here.)
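A minimal sketch of what parameterizing a writer over a format trait can look like, using only the standard library and rows that are already serialized as strings. The trait name `JsonFormat`, its hook methods, and the `render` helper are assumptions for illustration. The real writer operates on record batches and may be structured differently.

```rust
use std::io::{self, Write};

/// Hypothetical formatting hooks; every hook defaults to writing nothing.
pub trait JsonFormat: Default {
    fn start_stream<W: Write>(&self, _w: &mut W) -> io::Result<()> { Ok(()) }
    fn start_row<W: Write>(&self, _w: &mut W, _is_first: bool) -> io::Result<()> { Ok(()) }
    fn end_row<W: Write>(&self, _w: &mut W) -> io::Result<()> { Ok(()) }
    fn end_stream<W: Write>(&self, _w: &mut W) -> io::Result<()> { Ok(()) }
}

/// One object per line.
#[derive(Default)]
pub struct LineDelimited;

impl JsonFormat for LineDelimited {
    fn end_row<W: Write>(&self, w: &mut W) -> io::Result<()> {
        w.write_all(b"\n")
    }
}

/// A single JSON array: `[`, comma-separated rows, `]`.
#[derive(Default)]
pub struct JsonArray;

impl JsonFormat for JsonArray {
    fn start_stream<W: Write>(&self, w: &mut W) -> io::Result<()> {
        w.write_all(b"[")
    }
    fn start_row<W: Write>(&self, w: &mut W, is_first: bool) -> io::Result<()> {
        if is_first { Ok(()) } else { w.write_all(b",") }
    }
    fn end_stream<W: Write>(&self, w: &mut W) -> io::Result<()> {
        w.write_all(b"]")
    }
}

pub struct Writer<W: Write, F: JsonFormat> {
    writer: W,
    started: bool,
    finished: bool,
    format: F,
}

impl<W: Write, F: JsonFormat> Writer<W, F> {
    pub fn new(writer: W) -> Self {
        Self { writer, started: false, finished: false, format: F::default() }
    }

    /// Write one already-serialized JSON object.
    pub fn write_row(&mut self, row: &str) -> io::Result<()> {
        let is_first = !self.started;
        if is_first {
            self.format.start_stream(&mut self.writer)?;
            self.started = true;
        }
        self.format.start_row(&mut self.writer, is_first)?;
        self.writer.write_all(row.as_bytes())?;
        self.format.end_row(&mut self.writer)
    }

    /// Close the stream; an empty array still gets its brackets.
    pub fn finish(&mut self) -> io::Result<()> {
        if !self.started {
            self.format.start_stream(&mut self.writer)?;
            self.started = true;
        }
        if !self.finished {
            self.format.end_stream(&mut self.writer)?;
            self.finished = true;
        }
        Ok(())
    }

    pub fn into_inner(self) -> W {
        self.writer
    }
}

/// Helper for the sketch: render rows with the format chosen by `F`.
pub fn render<F: JsonFormat>(rows: &[&str]) -> io::Result<String> {
    let mut w = Writer::<Vec<u8>, F>::new(Vec::new());
    for row in rows {
        w.write_row(row)?;
    }
    w.finish()?;
    Ok(String::from_utf8(w.into_inner()).expect("rows are valid UTF-8"))
}
```

Because `F` is a type parameter, the compiler generates a separate writer for each format and the hook calls can be inlined, which is the static-dispatch trade-off this PR opts for.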
Other Changes:
- Added the function `into_inner()` to retrieve the inner writer from the JSON writer, following the model of the Rust standard library (e.g. `BufReader::into_inner`).
- Per the standard Rust pattern, I changed the JSON writer so that it doesn't add any buffering (via `BufWriter`) itself, and instead leaves the caller the choice of what type of buffering, if any, is needed.
- Added / cleaned up a bunch of documentation and comments.
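The buffering and `into_inner()` changes can be sketched as follows, assuming a hypothetical `RowWriter` that writes pre-serialized rows and adds no buffering itself. The caller decides whether to wrap the sink in `std::io::BufWriter`, and unwraps it afterwards.

```rust
use std::io::{self, BufWriter, Write};

/// Hypothetical minimal writer; it writes straight through to `W`.
pub struct RowWriter<W: Write> {
    writer: W,
}

impl<W: Write> RowWriter<W> {
    pub fn new(writer: W) -> Self {
        Self { writer }
    }

    /// Write one already-serialized JSON object followed by a newline.
    pub fn write_row(&mut self, row: &str) -> io::Result<()> {
        self.writer.write_all(row.as_bytes())?;
        self.writer.write_all(b"\n")
    }

    /// Hand the wrapped writer back to the caller.
    pub fn into_inner(self) -> W {
        self.writer
    }
}

/// Caller opts in to buffering by wrapping the sink in `BufWriter`.
pub fn write_buffered(rows: &[&str]) -> io::Result<Vec<u8>> {
    let mut w = RowWriter::new(BufWriter::new(Vec::new()));
    for row in rows {
        w.write_row(row)?;
    }
    // First unwrap returns the BufWriter; the second flushes it and
    // returns the underlying Vec<u8>.
    let buffered: BufWriter<Vec<u8>> = w.into_inner();
    buffered.into_inner().map_err(io::Error::from)
}

/// Caller writes directly into the sink, with no buffering layer.
pub fn write_unbuffered(rows: &[&str]) -> io::Result<Vec<u8>> {
    let mut w = RowWriter::new(Vec::new());
    for row in rows {
        w.write_row(row)?;
    }
    Ok(w.into_inner())
}
```

Both paths produce the same bytes. The difference is only who owns the buffering decision, which matters when the sink is a file or socket rather than a `Vec<u8>`.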
Questions
I went with parameterizing the `Writer` output as a trait rather than runtime dispatch, for performance. This shouldn't cause backwards-compatibility issues, given that the writer has not been released yet (introduced by @houqp in #9256).
However, would people prefer a single `Writer` that took an `Options` struct or something to determine how it wrote out data?
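For comparison, the alternative raised in the question could look roughly like this: a single writer type that takes an options struct and branches on the format at runtime. `Options`, `OutputFormat`, and `render` are hypothetical names for this sketch, not an existing API.

```rust
use std::io::{self, Write};

/// Hypothetical runtime choice of output layout.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum OutputFormat {
    LineDelimited,
    Array,
}

/// Hypothetical options struct passed to the writer at construction.
#[derive(Clone, Copy, Debug)]
pub struct Options {
    pub format: OutputFormat,
}

pub struct Writer<W: Write> {
    writer: W,
    options: Options,
    rows_written: usize,
}

impl<W: Write> Writer<W> {
    pub fn new(writer: W, options: Options) -> Self {
        Self { writer, options, rows_written: 0 }
    }

    /// Write one already-serialized JSON object; the format is checked per row.
    pub fn write_row(&mut self, row: &str) -> io::Result<()> {
        match self.options.format {
            OutputFormat::LineDelimited => {
                self.writer.write_all(row.as_bytes())?;
                self.writer.write_all(b"\n")?;
            }
            OutputFormat::Array => {
                // Open the array before the first row, separate later rows.
                let prefix: &[u8] = if self.rows_written == 0 { b"[" } else { b"," };
                self.writer.write_all(prefix)?;
                self.writer.write_all(row.as_bytes())?;
            }
        }
        self.rows_written += 1;
        Ok(())
    }

    /// Close the stream and return the inner writer.
    pub fn finish(mut self) -> io::Result<W> {
        if self.options.format == OutputFormat::Array {
            if self.rows_written == 0 {
                self.writer.write_all(b"[")?;
            }
            self.writer.write_all(b"]")?;
        }
        Ok(self.writer)
    }
}

/// Helper for the sketch: render rows with a format picked at runtime.
pub fn render(rows: &[&str], format: OutputFormat) -> io::Result<String> {
    let mut w = Writer::new(Vec::new(), Options { format });
    for row in rows {
        w.write_row(row)?;
    }
    let bytes = w.finish()?;
    Ok(String::from_utf8(bytes).expect("rows are valid UTF-8"))
}
```

This version keeps one concrete `Writer<W>` type and lets the format come from configuration, at the cost of a branch per row and of every new format having to be added to the enum.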