JSON support in Arrow and DataFusion #9103
Replies: 2 comments
-
I am not aware of anything actively
I think there are several contributors who would be interested in helping. Interestingly, there is a discussion of a similar topic here: #7845. I suggest we move the conversation there.
-
Demand is high. I work in healthcare, and there is a lot of data in deeply nested JSON format. Today I have to depend on the query engine's struct and map functions to deal with it. They are often slow and don't vectorize the processing, and large datasets often end in OOM. In general, JSON usage has increased, and people are dropping relational modeling in favor of JSON. Databases like PostgreSQL in particular offer JSON columns, so anyone can combine relational concepts with JSON very easily. Native support in Arrow and vectorized processing of JSON would help tremendously.
-
Two questions for this community:
I came across https://www.durner.dev/app/media/papers/json-tiles-sigmod21.pdf which describes techniques used by the Umbra RDBMS for its JSON support. Their experiments suggest excellent performance for analytic computations over semi-structured data. There is not yet an open source implementation AFAICT, but the paper is detailed enough that an impl may be possible based on the information within.
I'd describe what I'm picturing as: First, specify a serialization format for JSON tiles. Then write the compute kernel functions to operate on serialized JSON tiles. I imagine it might make sense to designate this as a Binary-based Arrow extension type, too. (I might also ask on the Arrow mailing list; will link here if I do so && it's possible to link)
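To make the "serialization format plus kernels" split concrete, here is a rough, stdlib-only Python sketch. It is not the JSON tiles format from the paper and not an Arrow or DataFusion API: `BinaryColumn`, `serialize_json_column`, and `json_get` are hypothetical names, and the per-row encoding is plain UTF-8 JSON text standing in for whatever tile format would actually be specified. The only thing borrowed from Arrow is the general shape of a variable-length binary column (one data buffer plus offsets).

```python
import json
from dataclasses import dataclass
from typing import Any, List, Optional, Sequence


@dataclass
class BinaryColumn:
    """Toy variable-length binary column: one data buffer plus n + 1 offsets."""
    offsets: List[int]  # row i occupies data[offsets[i]:offsets[i + 1]]
    data: bytes

    def __len__(self) -> int:
        return len(self.offsets) - 1

    def value(self, i: int) -> bytes:
        return self.data[self.offsets[i]:self.offsets[i + 1]]


def serialize_json_column(docs: Sequence[Any]) -> BinaryColumn:
    """Placeholder serializer (assumption): compact UTF-8 JSON text per row."""
    offsets = [0]
    chunks = []
    for doc in docs:
        encoded = json.dumps(doc, separators=(",", ":")).encode("utf-8")
        chunks.append(encoded)
        offsets.append(offsets[-1] + len(encoded))
    return BinaryColumn(offsets=offsets, data=b"".join(chunks))


def json_get(column: BinaryColumn, path: Sequence[str]) -> List[Optional[Any]]:
    """Hypothetical kernel: follow a key path in every row; None if it is missing."""
    out: List[Optional[Any]] = []
    for i in range(len(column)):
        node = json.loads(column.value(i))
        for key in path:
            if isinstance(node, dict) and key in node:
                node = node[key]
            else:
                node = None
                break
        out.append(node)
    return out


col = serialize_json_column([{"patient": {"age": 41}}, {"patient": {}}, {"other": 1}])
print(json_get(col, ["patient", "age"]))  # prints [41, None, None]
```

A real implementation would presumably replace the per-row `json.loads` with a format that kernels can scan without reparsing, which is the part the paper addresses; the sketch only shows where the format ends and the kernels begin.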