Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Feb 25, 2021

Rationale

Currently the Arrow json writer makes JSON that looks like this (one record per line):

{"foo":1}
{"bar":1}

Which is not technically valid JSON, which would look something like this:

[
  {"foo":1},
  {"bar":1}
]

New Features

This PR parameterizes the JSON writer so it can write in either format. Note I needed this feature for in IOx, in https://github.com/influxdata/influxdb_iox/pull/870, and I want to propose contributing it back here).

Other Changes:

  1. Added the function into_inner() to retrieve the inner writer from the JSON writer, following the model of the Rust standard library (e.g. BufReader::into_inner

  2. Per Rust standard pattern, I change the JSON writer so that it doesn't add any Buffering (via BufReader) itself, and instead allows the caller the choice of what type of buffering, if any, is needed.

  3. Added / cleaned up a bunch of documentation and comments.

Questions

I went with parameterizing the Writer output as a trait rather than runtime dispatch, for performance. This shouldn't have backwards compatible issues Given the writer has not yet been released yet (introduced by @houqp #9256)

However would people prefer a single Writer that took an Options struct or something to determine how it wrote out data?

//! ```
//!
//! Serialize record batches into line-delimited JSON bytes:
//! ## Writing JSON formatted byte streams
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc examples are probably the best way to see how to use this structure and what the changes looked like

@alamb
Copy link
Contributor Author

alamb commented Feb 25, 2021

cc @houqp

Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@houqp
Copy link
Member

houqp commented Feb 25, 2021

I personally prefer the more composable formatter trait approach.

@github-actions
Copy link

@nevi-me
Copy link
Contributor

nevi-me commented Feb 25, 2021

Which is not technically valid JSON, which would look something like this:

It's streaming JSON or NDJSON (https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON)

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's sometimes inconvenient when trying to read the JSON files in applications that don't support the streaming format/structure. I'm fine with this change

writer,
started: false,
finished: false,
format: F::default(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the default format? I can't tell from the code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't a default format for Writer (I couldn't figure out how to make one).

This line makes an instance of the formatter. Though come to think of it, none of the formatters actually have state now 🤔 I could move some of the state into JsonArray maybe to make that clearer

@alamb alamb changed the title ARROW-11773: [Rust] Support writing well formed JSON arrays ARROW-11773: [Rust] Support writing well formed JSON arrays as well as newline delimited json streams Feb 25, 2021
@alamb
Copy link
Contributor Author

alamb commented Feb 26, 2021

The integration test failure seems like it is unrelated to the changes in this PR: https://issues.apache.org/jira/browse/ARROW-11717

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants