JSON support in Arrow and DataFusion #9103

spencerwilson · 2024-02-01T19:18:13Z

Two questions for this community:

Is anyone aware of work on JSON in the Arrow format, or in turn in DataFusion? I'm referring to doing actual analytics queries on JSON data.
How much demand is there for that?

I came across https://www.durner.dev/app/media/papers/json-tiles-sigmod21.pdf which describes techniques used by the Umbra RDBMS for its JSON support. Their experiments suggest excellent performance for analytic computations over semi-structured data. There is not yet an open source implementation AFAICT, but the paper is detailed enough that an implementation may be possible based on the information within.

I'd describe what I'm picturing as: First, specify a serialization format for JSON tiles. Then write the compute kernel functions to operate on serialized JSON tiles. I imagine it might make sense to designate this as a Binary-based Arrow extension type, too.
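To make the two steps above a little more concrete, here is a minimal, standard-library-only Python sketch. It is a toy under stated assumptions, not the algorithm from the paper and not an Arrow or DataFusion API: the names `build_tile` and `sum_column` and the `min_fraction` threshold are hypothetical, only top-level keys are considered, and a missing key and a JSON `null` are not distinguished. It treats a chunk of documents as one "tile", pulls frequently occurring keys out into column-like lists, and keeps the remainder of each document as compact JSON bytes (the part a Binary-backed extension type could hold).

```python
import json
from collections import Counter


def build_tile(docs, min_fraction=0.8):
    """Split a chunk ("tile") of JSON objects into columns plus leftover bytes.

    Hypothetical helper: top-level keys present in at least `min_fraction`
    of the documents become columns; everything else stays serialized.
    """
    counts = Counter(key for doc in docs for key in doc)
    hot_keys = sorted(k for k, n in counts.items() if n / len(docs) >= min_fraction)

    # One list per extracted key; None marks a row where the key is absent.
    columns = {k: [doc.get(k) for doc in docs] for k in hot_keys}

    # Keys that were not extracted are kept per row as compact JSON bytes.
    rest = [
        json.dumps(
            {k: v for k, v in doc.items() if k not in columns},
            sort_keys=True,
            separators=(",", ":"),
        ).encode("utf-8")
        for doc in docs
    ]
    return {"columns": columns, "rest": rest}


def sum_column(tile, key):
    """Toy 'kernel': sum an extracted numeric column, skipping missing rows."""
    return sum(v for v in tile["columns"][key] if v is not None)


docs = [
    {"id": 1, "price": 10, "tags": ["a"]},
    {"id": 2, "price": 5},
    {"id": 3, "price": 7, "note": "x"},
]
tile = build_tile(docs)
print(tile["columns"]["price"])   # [10, 5, 7]
print(sum_column(tile, "price"))  # 22
print(tile["rest"][0])            # b'{"tags":["a"]}'
```

The point of the split is that an aggregate such as `sum_column` only touches the extracted column and never re-parses JSON; only queries on rare keys would need to decode the leftover bytes.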

(I might also ask on the Arrow mailing list; will link here if I do so && it's possible to link)

alamb · 2024-02-01T20:23:35Z

Collaborator

> Is anyone aware of work on JSON in the Arrow format, or in turn in DataFusion? I'm referring to doing actual analytics queries on JSON data.

I am not aware of any active work on this.

> How much demand is there for that?

I think there are several contributors who would be interested in helping.

Interestingly, there is a discussion of a similar topic here: #7845. I suggest we move the conversation there.

kesavkolla · 2024-02-02T15:48:58Z

Demand is high. I work in healthcare and there is a lot of data in deeply nested JSON format. Today I have to depend on query engine capabilities to deal with it through struct and map functions. They are often slow and don't vectorize the processing. Queries often end in OOM errors when processing large data.

In general, the usage of JSON data has increased nowadays, and people are dropping relational modeling in favor of JSON. In particular, databases like PostgreSQL offer JSON columns, so anyone can combine relational concepts with JSON very easily.

Having native support in Arrow and vectorized processing of JSON would help tremendously.
