|
| 1 | +# JSON Storage Design Document |
| 2 | + |
| 3 | +## 1. Data Model Design |
| 4 | + |
| 5 | +### 1.1 Data Layering |
| 6 | + |
| 7 | +#### Dense Part |
| 8 | +A set of "core fields" (such as primary keys and commonly used metadata) that are present in most records. |
| 9 | + |
| 10 | +#### Sparse Part |
| 11 | +Additional attributes that appear only in some records, potentially involving unstructured or dynamically extended information. |
| 12 | + |
| 13 | +### 1.2 JSON Splitting and Mapping |
| 14 | + |
| 15 | +#### Dense Field Extraction |
| 16 | +When parsing JSON, predefined dense fields are extracted and mapped to independent columns in Parquet. A method similar to Parquet Variant Shredding is used to flatten nested data. |
| 17 | + |
| 18 | +#### Sparse Data Preservation |
| 19 | +Fields not included in the dense part are stored in a sparse data field. They are serialized using BSON (Binary JSON) format, leveraging its efficient binary representation and rich data type support, with the result stored in a Parquet BINARY type field. |
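
The split described above can be illustrated with a minimal sketch. It assumes a hypothetical set of predefined dense fields (`DENSE_FIELDS`) and uses JSON bytes as a stand-in for BSON so that it runs with the standard library only; a real implementation would call a BSON encoder at that point, and flattening of nested dense fields is not shown.

```python
import json

# Hypothetical set of predefined dense fields (not taken from the design).
DENSE_FIELDS = {"id", "title"}


def split_record(record, dense_fields=DENSE_FIELDS):
    """Split one parsed JSON object into (dense, sparse) dictionaries."""
    # A dense field missing from the record becomes None (a null column value).
    dense = {key: record.get(key) for key in dense_fields}
    # Everything that is not a dense field goes to the sparse part.
    sparse = {key: value for key, value in record.items() if key not in dense_fields}
    return dense, sparse


def encode_sparse(sparse):
    """Stand-in for BSON serialization: emits UTF-8 JSON bytes, NOT real BSON."""
    return json.dumps(sparse, sort_keys=True).encode("utf-8")


record = json.loads('{"id": 7, "title": "doc", "lang": "en", "score": 0.5}')
dense, sparse = split_record(record)
payload = encode_sparse(sparse)  # bytes destined for the binary sparse column
```

Here `dense` feeds the typed Parquet columns and `payload` is what would be written to the single binary column.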
| 20 | + |
| 21 | +## 2. Storage Strategy |
| 22 | + |
| 23 | +### 2.1 Columnar Storage for Dense Data |
| 24 | +- **Schema Definition**: Create an independent Parquet column for each dense field and explicitly specify its data type (such as numeric, string, or list). |
| 25 | +- **Query Performance**: Columnar storage suits large scans and aggregations because a query reads only the columns it references, which improves efficiency especially for vectors, indexes, and frequently queried fields. |
| 26 | + |
| 27 | +### 2.2 Row Storage for Sparse Data |
| 28 | +- **BSON Storage**: |
| 29 | + - Serialize sparse data as BSON binary format and store it in a single binary column of the Parquet file. |
| 30 | + - BSON offers a compact binary encoding and preserves the complete type information of the original data. Keeping all sparse fields in one column also avoids the numerous null values and file fragmentation that per-field columns would cause. |
| 31 | + |
| 32 | +## 3. Parquet Schema Construction |
| 33 | +- **Columnar Part**: Build a fixed schema based on dense fields, with each field having a clear data type definition. |
| 34 | +- **Row Part**: Define a dedicated field (e.g., `sparse_data`) for storing sparse data, with type set to BINARY, directly storing BSON data. |
| 35 | +- **Hybrid Mode**: On write, dense data is written to its respective columns, and the remaining sparse data is serialized as BSON and written to the `sparse_data` field, balancing query efficiency against storage flexibility. |
| 36 | + |
| 37 | +## 4. Integration and Implementation Considerations |
| 38 | + |
| 39 | +### 4.1 Data Classification Strategy |
| 40 | +- **Density Classification**: |
| 41 | + - Classify fields as dense or sparse based on how often they occur in records (e.g., a field present in more than 30% of records is dense), while also considering data type consistency. If a field takes multiple data types, each type that appears in more than 30% of records is stored as a dense column, and values of the remaining types are stored in the sparse part. |
| 42 | +- **Dynamic Extension**: |
| 43 | + - For dynamically extended fields, regardless of frequency, store them in the BSON-formatted sparse part to simplify schema evolution. |
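
As a rough illustration of the density rule, the sketch below counts how often each (field, type) pair occurs and marks a pair as dense when it exceeds the 30% example threshold. The function name, the use of Python type names as type labels, and the sample records are assumptions made for this example; the dynamic-extension rule is not modeled.

```python
from collections import Counter

DENSE_THRESHOLD = 0.30  # the 30% example threshold from the text


def classify_fields(records, threshold=DENSE_THRESHOLD):
    """Return (dense, sparse) sets of (field, type_name) pairs."""
    counts = Counter()
    for record in records:
        for key, value in record.items():
            # Count each field separately per observed value type.
            counts[(key, type(value).__name__)] += 1
    total = len(records)
    dense = {pair for pair, n in counts.items() if n / total > threshold}
    sparse = set(counts) - dense
    return dense, sparse


# Hypothetical sample: "price" is usually a float but occasionally a string.
records = (
    [{"id": i, "price": float(i)} for i in range(8)]
    + [{"id": 8, "price": "n/a"}, {"id": 9, "tag": "x"}]
)
dense, sparse = classify_fields(records)
```

With this sample, `price` is split by type: its float values qualify for a dense column, while the rare string value stays in the sparse part.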
| 44 | + |
| 45 | +### 4.2 Indexing for Sparse Data Access |
| 46 | + |
| 47 | +#### Sparse Column Key Indexing |
| 48 | +To accelerate BSON parsing, an inverted index maps each BSON key to the offset and size of its value, or to the value itself when the value is numeric. |
| 49 | + |
| 50 | +##### Value Data Structure Diagram |
| 51 | +| Valid | Type | Row ID | Offset/Value | |
| 52 | +|:-----:|:-----:|:------:|:------------:| |
| 53 | +| 1 bit | 4 bits | 27 bits | 32 bits (inline value, or 16-bit offset + 16-bit size) |
| 54 | + |
| 55 | +- **64-bit Structure Breakdown**: |
| 56 | +  - **Bit 1 (Valid)**: 1 bit indicating whether the last 32 bits hold a valid inline value (1) or an offset/size pair (0). |
| 57 | +  - **Bits 2-5 (Type)**: 4 bits representing the data type. |
| 58 | +  - **Bits 6-32 (Row ID)**: 27 bits for the row ID, uniquely identifying the data row. |
| 59 | +  - **Bits 33-64 (Last 32 bits)**: |
| 60 | +    - If **Valid = 1**: The last 32 bits store the actual data value. |
| 61 | +    - If **Valid = 0**: The last 32 bits are split into: |
| 62 | +      - **First 16 bits (Offset)**: The offset of the data. |
| 63 | +      - **Last 16 bits (Size)**: The size of the data. |
| 64 | + |
| 65 | +The column key index is optional, and can be configured at table creation time or modified later through field properties. |
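
The 64-bit entry layout described above can be sketched as bit packing on a Python integer. This is an illustrative sketch only: it assumes bit 1 is the most significant bit, treats the inline value as a raw unsigned 32-bit payload, and uses hypothetical function names and type codes, none of which the design specifies.

```python
VALID_SHIFT = 63          # bit 1 (most significant, by assumption)
TYPE_SHIFT = 59           # 4 type bits
ROW_SHIFT = 32            # 27 row ID bits
ROW_MASK = (1 << 27) - 1


def _check_header(type_code, row_id):
    if not 0 <= type_code < 16:
        raise ValueError("type code must fit in 4 bits")
    if not 0 <= row_id <= ROW_MASK:
        raise ValueError("row ID must fit in 27 bits")


def pack_inline(type_code, row_id, value):
    """Entry with Valid = 1: the low 32 bits carry the value itself."""
    _check_header(type_code, row_id)
    if not 0 <= value < (1 << 32):
        raise ValueError("inline value must fit in 32 bits")
    return (1 << VALID_SHIFT) | (type_code << TYPE_SHIFT) | (row_id << ROW_SHIFT) | value


def pack_offset(type_code, row_id, offset, size):
    """Entry with Valid = 0: the low 32 bits carry a 16-bit offset and a 16-bit size."""
    _check_header(type_code, row_id)
    if not (0 <= offset < (1 << 16) and 0 <= size < (1 << 16)):
        raise ValueError("offset and size must each fit in 16 bits")
    return (type_code << TYPE_SHIFT) | (row_id << ROW_SHIFT) | (offset << 16) | size


def unpack(entry):
    """Decode a 64-bit entry back into its fields."""
    fields = {
        "valid": (entry >> VALID_SHIFT) & 1,
        "type": (entry >> TYPE_SHIFT) & 0xF,
        "row_id": (entry >> ROW_SHIFT) & ROW_MASK,
    }
    low = entry & 0xFFFFFFFF
    if fields["valid"]:
        fields["value"] = low
    else:
        fields["offset"] = low >> 16
        fields["size"] = low & 0xFFFF
    return fields


inline_entry = pack_inline(type_code=3, row_id=42, value=100)
offset_entry = pack_offset(type_code=5, row_id=42, offset=10, size=4)
```

Calling `unpack` on either entry recovers the original fields, so a reader can locate a key's value for a given row without parsing the whole BSON document.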
| 66 | + |
| 67 | +## 5. Example Data |
| 68 | + |
| 69 | +### 5.1 Example JSON Records |
| 70 | + |
| 71 | +```json |
| 72 | +[ |
| 73 | + {"id": 1, "attr1": "value1", "attr2": 100}, |
| 74 | + {"id": 2, "attr1": "value2", "attr3": true}, |
| 75 | + {"id": 3, "attr1": "value3", "attr4": "extra", "attr5": 3.14} |
| 76 | +] |
| 77 | +``` |
| 78 | + |
| 79 | +- **Dense Data:** |
| 80 | + - In this example, only the field `id` is treated as dense. |
| 81 | +- **Sparse Data:** |
| 82 | + - Record 1: `attr1`, `attr2` |
| 83 | + - Record 2: `attr1`, `attr3` |
| 84 | + - Record 3: `attr1`, `attr4`, `attr5` |
| 85 | + |
| 86 | +### 5.2 Parquet File Storage |
| 87 | + |
| 88 | +#### Schema Representation |
| 89 | + |
| 90 | +| Column Name | Data Type | Description | |
| 91 | +|--------------|-----------|-------------| |
| 92 | +| **id** | int64 | Dense column storing the integer identifier. | |
| 93 | +| **sparse_data** | binary | Sparse column storing BSON-serialized data of all remaining fields. | |
| 94 | +| **sparse_index** | binary | Index column storing key offsets for efficient parsing. | |
| 95 | + |
| 96 | +#### Stored Data Breakdown |
| 97 | + |
| 98 | +- **Dense Column (`id`)**: |
| 99 | + - Row 1: `1` |
| 100 | + - Row 2: `2` |
| 101 | + - Row 3: `3` |
| 102 | + |
| 103 | +- **Sparse Column (`sparse_data`)**: |
| 104 | + - **Row 1:** BSON representation of `{"attr1": "value1", "attr2": 100}` |
| 105 | + - **Row 2:** BSON representation of `{"attr1": "value2", "attr3": true}` |
| 106 | + - **Row 3:** BSON representation of `{"attr1": "value3", "attr4": "extra", "attr5": 3.14}` |
| 107 | + |
| 108 | +- **Sparse Index (`sparse_index`)**: |
| 109 | + - **Row 1:** Index entries mapping `attr1` and `attr2` to their respective positions in `sparse_data`. |
| 110 | + - **Row 2:** Index entries mapping `attr1` and `attr3`. |
| 111 | + - **Row 3:** Index entries mapping `attr1`, `attr4`, and `attr5`. |
| 112 | + |
| 113 | +In an actual system, the sparse data would be serialized using a BSON library (e.g., bsoncxx) for a compact binary format. The example above demonstrates the logical mapping of JSON data to the Parquet storage format. |
| 114 | + |
| 115 | +--- |
| 116 | + |