Supports JSON format #3686

Open
WenyXu opened this issue Apr 10, 2024 · 1 comment
Labels
C-feature Category Features

WenyXu commented Apr 10, 2024

What problem does the new feature solve?

Support inserting and querying JSON data.

What does the feature do?

Support for JSON data format will be divided into several stages:

  1. Add the necessary data type support, including lists and nested data
  2. Support inserting JSON data
  3. Support querying JSON data
  4. Support schemaless JSON

Implementation challenges

See also:

WenyXu added the C-feature (Category Features) and Difficulty: Hard labels on Apr 11, 2024
WenyXu changed the title from "Supports JSON formats" to "Supports JSON format" on Apr 19, 2024
WenyXu commented Apr 21, 2024

As mentioned in apache/datafusion#7845 (comment), I was greatly inspired by JSONA and proposed a JSONA variant (maybe we can call it JSONC, JSON for columnar storage formats 😙) that may benefit from the high compression rates of columnar storage formats. BTW, if we decide to build our own JSON storage implementation, it's definitely an excellent opportunity to evaluate various JSON storage implementations in the OLAP scenario.

A naive proposal for a JSONA variant:

For example, the JSON value [false, 10, {"k":"v"}, null] can be stored as the following struct.

Struct {
    Nodes: [StartArray, FALSE, Number, StartObject, Key, String, EndObject, NULL, EndArray]
    Offsets: [NULL, NULL, 0, NULL, 0, 0, NULL,…]
    Keys: ["k"]
    Strings: ["v"]
    Numbers: [10]
}
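
To make the layout concrete, here is a minimal Python sketch of how a parsed JSON value could be flattened into those five lists and rebuilt again. The function names `flatten` and `unflatten` are hypothetical, the `TRUE` tag is assumed by symmetry with `FALSE`, and the reading of Offsets (an index into Keys, Strings, or Numbers, and NULL for nodes without a payload) is my interpretation of the example rather than a settled design.

```python
import json


def flatten(value):
    """Flatten a parsed JSON value into five parallel lists (hypothetical helper)."""
    out = {"Nodes": [], "Offsets": [], "Keys": [], "Strings": [], "Numbers": []}

    def emit(tag, pool=None, payload=None):
        # Every node gets a tag; only Key/String/Number nodes get an offset
        # pointing into their payload list.
        out["Nodes"].append(tag)
        if pool is None:
            out["Offsets"].append(None)
        else:
            out["Offsets"].append(len(out[pool]))
            out[pool].append(payload)

    def walk(v):
        # Check booleans by identity first: bool is a subclass of int in Python.
        if v is None:
            emit("NULL")
        elif v is True:
            emit("TRUE")  # assumed tag, not shown in the proposal
        elif v is False:
            emit("FALSE")
        elif isinstance(v, (int, float)):
            emit("Number", "Numbers", v)
        elif isinstance(v, str):
            emit("String", "Strings", v)
        elif isinstance(v, list):
            emit("StartArray")
            for item in v:
                walk(item)
            emit("EndArray")
        elif isinstance(v, dict):
            emit("StartObject")
            for key, item in v.items():
                emit("Key", "Keys", key)
                walk(item)
            emit("EndObject")
        else:
            raise TypeError(f"not a JSON value: {type(v).__name__}")

    walk(value)
    return out


def unflatten(s):
    """Rebuild the JSON value from the flattened lists (inverse of flatten)."""
    pos = 0

    def read():
        nonlocal pos
        tag, off = s["Nodes"][pos], s["Offsets"][pos]
        pos += 1
        if tag == "NULL":
            return None
        if tag == "TRUE":
            return True
        if tag == "FALSE":
            return False
        if tag == "Number":
            return s["Numbers"][off]
        if tag == "String":
            return s["Strings"][off]
        if tag == "StartArray":
            items = []
            while s["Nodes"][pos] != "EndArray":
                items.append(read())
            pos += 1  # consume EndArray
            return items
        if tag == "StartObject":
            obj = {}
            while s["Nodes"][pos] != "EndObject":
                key = s["Keys"][s["Offsets"][pos]]
                pos += 1  # consume Key
                obj[key] = read()
            pos += 1  # consume EndObject
            return obj
        raise ValueError(f"unexpected node tag: {tag}")

    return read()


doc = json.loads('[false, 10, {"k":"v"}, null]')
flat = flatten(doc)
assert unflatten(flat) == doc  # the layout is lossless for this input
```

Under these assumptions, `flat["Nodes"]` matches the Nodes list above, and the payload lists come out as Keys `["k"]`, Strings `["v"]`, and Numbers `[10]`.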

The struct can be encoded efficiently into compact files by the underlying file format. In our scenario, we use Parquet as the underlying file format. For instance, the Nodes field can be represented as UINT8 and efficiently encoded using the default dictionary encoding.
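
As a rough illustration of why the Nodes field suits UINT8 plus dictionary encoding, the sketch below packs node tags into one byte each and runs a toy dictionary encoder over many repeated documents. The `NodeTag` codes, `pack_tags`, and `dictionary_encode` are all hypothetical names chosen for this example, and the encoder only mimics the idea; it is not Parquet's actual on-disk encoding.

```python
from enum import IntEnum


class NodeTag(IntEnum):
    # Hypothetical numeric codes; any assignment below 256 fits in UINT8.
    NULL = 0
    FALSE = 1
    TRUE = 2  # assumed tag, not shown in the proposal
    NUMBER = 3
    STRING = 4
    KEY = 5
    START_OBJECT = 6
    END_OBJECT = 7
    START_ARRAY = 8
    END_ARRAY = 9


def pack_tags(tags):
    """Pack a sequence of NodeTag values into one byte per node."""
    return bytes(tags)


def dictionary_encode(values):
    """Toy dictionary encoding: list of distinct values plus one index per entry."""
    dictionary, index_of, indices = [], {}, []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


# Node tags for [false, 10, {"k":"v"}, null]
nodes = [
    NodeTag.START_ARRAY, NodeTag.FALSE, NodeTag.NUMBER,
    NodeTag.START_OBJECT, NodeTag.KEY, NodeTag.STRING, NodeTag.END_OBJECT,
    NodeTag.NULL, NodeTag.END_ARRAY,
]
packed = pack_tags(nodes)

# Simulate a column holding 1000 rows with the same shape.
dictionary, indices = dictionary_encode(packed * 1000)
bits_per_index = max(indices).bit_length()
```

However many rows are stored, the dictionary can never hold more than the ten tag values, so each index needs only a few bits. That small, repetitive index stream is the kind of input that dictionary encoding followed by run-length or bit-packing handles well.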
