Commit 73bda08
doc: json storage format (milvus-io#40479)
The design doc for the JSON storage improvement. Signed-off-by: xiaofanluan <[email protected]>
1 parent c348e61 commit 73bda08

1 file changed (+116 -0 lines)

docs/design_docs/json_storage.md
@@ -0,0 +1,116 @@
# JSON Storage Design Document

## 1. Data Model Design

### 1.1 Data Layering

#### Dense Part
A set of "core fields" (such as primary keys and commonly used metadata) that are present in most records.

#### Sparse Part
Additional attributes that appear only in some records, potentially involving unstructured or dynamically extended information.

### 1.2 JSON Splitting and Mapping

#### Dense Field Extraction
When parsing JSON, predefined dense fields are extracted and mapped to independent columns in Parquet. A method similar to Parquet Variant Shredding is used to flatten nested data.

#### Sparse Data Preservation
Fields not included in the dense part are stored in a sparse data field. They are serialized using BSON (Binary JSON) format, leveraging its efficient binary representation and rich data type support, with the result stored in a Parquet BINARY type field.
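
The splitting step can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the dotted path naming (`meta.author`), the `DENSE_PATHS` set, and the helper names `flatten` and `split_record` are all assumptions made for this example, and BSON serialization of the sparse part is left out.

```python
from typing import Any

# Hypothetical set of predefined dense columns; dotted names are an assumption.
DENSE_PATHS = {"id", "meta.author"}


def flatten(obj: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested objects into dotted paths, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: dict[str, Any] = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict) and value:
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value  # leaf value (empty objects are kept as leaves)
    return flat


def split_record(record: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (dense, sparse): dense paths become columns, the rest stays together."""
    dense: dict[str, Any] = {}
    sparse: dict[str, Any] = {}
    for path, value in flatten(record).items():
        (dense if path in DENSE_PATHS else sparse)[path] = value
    return dense, sparse


dense, sparse = split_record(
    {"id": 1, "meta": {"author": "ann", "lang": "en"}, "attr2": 100}
)
print(dense)   # values destined for independent Parquet columns
print(sparse)  # values destined for the BSON-encoded sparse field
```

In this sketch the sparse part is also flattened; a real implementation might instead keep the original nesting inside the BSON document.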

## 2. Storage Strategy

### 2.1 Columnar Storage for Dense Data
- **Schema Definition**: Create independent columns in Parquet for each dense field, explicitly specifying data types (numeric, string, list, and so on).
- **Query Performance**: Columnar format is suitable for large data scanning and aggregation operations, improving query efficiency, especially for vectors, indexes, and frequently queried fields.

### 2.2 Row Storage for Sparse Data
- **BSON Storage**:
  - Serialize sparse data as BSON binary format and store it in a single binary column of the Parquet file.
  - BSON is a compact binary encoding that preserves the complete data type information of the original data. Keeping all sparse fields in one column also avoids the numerous null values and file fragmentation that a separate column per sparse field would cause.

## 3. Parquet Schema Construction
- **Columnar Part**: Build a fixed schema based on dense fields, with each field having a clear data type definition.
- **Row Part**: Define a dedicated field (e.g., `sparse_data`) for storing sparse data, with type set to BINARY, directly storing BSON data.
- **Hybrid Mode**: When writing, dense data is filled into respective columns, and remaining sparse data is serialized as BSON and written to the `sparse_data` field, achieving a balance between query efficiency and storage flexibility.
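
As a rough illustration of the hybrid write path, the sketch below fills in-memory column buffers from a batch of records. It makes several assumptions: `DENSE_FIELDS`, `to_columns`, and `json_stand_in` are hypothetical names, plain Python lists stand in for Parquet column writers, and UTF-8 JSON bytes stand in for BSON so that the example needs no third-party library.

```python
import json
from typing import Any, Callable

# Hypothetical fixed columnar part of the schema.
DENSE_FIELDS = ["id"]


def to_columns(
    records: list[dict[str, Any]],
    encode_sparse: Callable[[dict[str, Any]], bytes],
) -> dict[str, list]:
    """Fill one buffer per dense field plus a binary `sparse_data` buffer."""
    columns: dict[str, list] = {name: [] for name in DENSE_FIELDS}
    columns["sparse_data"] = []
    for record in records:
        for name in DENSE_FIELDS:
            columns[name].append(record.get(name))  # None marks a missing value
        rest = {k: v for k, v in record.items() if k not in DENSE_FIELDS}
        columns["sparse_data"].append(encode_sparse(rest))
    return columns


def json_stand_in(doc: dict[str, Any]) -> bytes:
    """Stand-in encoder: JSON bytes here, BSON bytes in the actual design."""
    return json.dumps(doc, sort_keys=True).encode("utf-8")


columns = to_columns(
    [
        {"id": 1, "attr1": "value1", "attr2": 100},
        {"id": 2, "attr1": "value2", "attr3": True},
    ],
    json_stand_in,
)
print(columns["id"])
print(columns["sparse_data"])
```

Passing the encoder as a parameter keeps the column-filling logic independent of the serialization format, so a BSON encoder could be swapped in without touching `to_columns`.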

## 4. Integration and Implementation Considerations

### 4.1 Data Classification Strategy
- **Density Classification**:
  - Classify fields as dense or sparse based on how often they occur in records (e.g., a field present in more than 30% of records is dense), while also considering data type consistency. If a field takes multiple data types, each type that appears in more than 30% of records is stored as a dense field, and the remaining types are stored in the sparse part.
- **Dynamic Extension**:
  - For dynamically extended fields, regardless of frequency, store them in the BSON-formatted sparse part to simplify schema evolution.
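
The density rule above could be sketched as a frequency count per field and per value type. This is only one possible reading of the rule: the function name `classify_fields`, the use of Python type names as type labels, and the sample records are assumptions for illustration, and the 30% threshold is the example figure from the text.

```python
from collections import Counter, defaultdict
from typing import Any

DENSE_THRESHOLD = 0.30  # the "more than 30%" example threshold


def classify_fields(
    records: list[dict[str, Any]], threshold: float = DENSE_THRESHOLD
) -> dict[str, set[str]]:
    """Map each dense field name to the value types that qualify as dense.

    Fields (or field types) missing from the result belong to the sparse part.
    """
    counts: dict[str, Counter] = defaultdict(Counter)
    for record in records:
        for key, value in record.items():
            counts[key][type(value).__name__] += 1

    total = len(records)
    dense: dict[str, set[str]] = {}
    for key, type_counts in counts.items():
        dense_types = {t for t, n in type_counts.items() if n / total > threshold}
        if dense_types:
            dense[key] = dense_types
    return dense


# Ten records: `id` is always an int, `tag` is a str in four records and an
# int in one, and `note` appears in only two records.
records: list[dict[str, Any]] = [{"id": i} for i in range(10)]
for i in range(4):
    records[i]["tag"] = "x"
records[4]["tag"] = 7
records[5]["note"] = "n"
records[6]["note"] = "m"

print(classify_fields(records))
```

Here `tag` is dense only for its string values; the single integer `tag` and both `note` values would fall into the sparse part.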

### 4.2 Indexing for Sparse Data Access

#### Sparse Column Key Indexing
To accelerate BSON parsing, an inverted index stores each BSON key together with either the offset and size of its value in the BSON data or, for numeric types, the value itself.

##### Value Data Structure Diagram
| Valid | Type | Row ID | Offset/Value |
|:-----:|:-----:|:------:|:------------:|
| 1 bit | 4 bits | 27 bits | 32 bits: inline value, or 16-bit offset + 16-bit size |

- **64-bit Structure Breakdown**:
  - **Bit 1 (Valid)**: 1 bit indicating whether the last 32 bits hold a valid inline value (1 = valid, 0 = invalid).
  - **Bits 2-5 (Type)**: 4 bits representing the data type.
  - **Bits 6-32 (Row ID)**: 27 bits for the row ID, uniquely identifying the data row.
  - **Bits 33-64 (Last 32 bits)**:
    - If **Valid = 1**: Last 32 bits store the actual data value.
    - If **Valid = 0**: Last 32 bits are split into:
      - **First 16 bits (Offset)**: Indicates the data offset position.
      - **Last 16 bits (Size)**: Indicates the data size.
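
A minimal sketch of packing and unpacking such a 64-bit entry is shown below. It assumes that bit 1 is the most significant bit of the word, which the text does not state explicitly; the function names and the numeric type codes used in the example are hypothetical.

```python
TYPE_BITS = 4
ROW_ID_BITS = 27


def _pack(valid: int, type_code: int, row_id: int, low32: int) -> int:
    """Assemble valid (1 bit) | type (4 bits) | row ID (27 bits) | 32 low bits."""
    if not 0 <= type_code < (1 << TYPE_BITS):
        raise ValueError("type code must fit in 4 bits")
    if not 0 <= row_id < (1 << ROW_ID_BITS):
        raise ValueError("row ID must fit in 27 bits")
    return (valid << 63) | (type_code << 59) | (row_id << 32) | low32


def pack_inline(type_code: int, row_id: int, value: int) -> int:
    """Valid = 1: the low 32 bits hold the value as an unsigned 32-bit pattern."""
    return _pack(1, type_code, row_id, value & 0xFFFFFFFF)


def pack_offset(type_code: int, row_id: int, offset: int, size: int) -> int:
    """Valid = 0: the low 32 bits hold a 16-bit offset, then a 16-bit size."""
    if not (0 <= offset < (1 << 16) and 0 <= size < (1 << 16)):
        raise ValueError("offset and size must each fit in 16 bits")
    return _pack(0, type_code, row_id, (offset << 16) | size)


def unpack(entry: int) -> dict[str, int]:
    """Decode an entry back into its named fields."""
    valid = entry >> 63
    fields = {
        "valid": valid,
        "type": (entry >> 59) & 0xF,
        "row_id": (entry >> 32) & ((1 << ROW_ID_BITS) - 1),
    }
    low32 = entry & 0xFFFFFFFF
    if valid:
        fields["value"] = low32
    else:
        fields["offset"] = low32 >> 16
        fields["size"] = low32 & 0xFFFF
    return fields


inline_entry = pack_inline(type_code=1, row_id=5, value=100)
offset_entry = pack_offset(type_code=2, row_id=5, offset=12, size=7)
print(unpack(inline_entry))
print(unpack(offset_entry))
```

Under this layout a row ID is limited to 27 bits, and an offset or size to 16 bits each, so the range checks in the sketch reject anything that would overflow into a neighboring field.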

The column key index is optional and can be configured at table creation time or modified later through field properties.

## 5. Example Data

### 5.1 Example JSON Records

```json
[
  {"id": 1, "attr1": "value1", "attr2": 100},
  {"id": 2, "attr1": "value2", "attr3": true},
  {"id": 3, "attr1": "value3", "attr4": "extra", "attr5": 3.14}
]
```

- **Dense Data:**
  - In this example, only the field `id` is treated as dense.
- **Sparse Data:**
  - Record 1: `attr1`, `attr2`
  - Record 2: `attr1`, `attr3`
  - Record 3: `attr1`, `attr4`, `attr5`

### 5.2 Parquet File Storage

#### Schema Representation

| Column Name | Data Type | Description |
|--------------|-----------|-------------|
| **id** | int64 | Dense column storing the integer identifier. |
| **sparse_data** | binary | Sparse column storing BSON-serialized data of all remaining fields. |
| **sparse_index** | binary | Index column storing key offsets for efficient parsing. |

#### Stored Data Breakdown

- **Dense Column (`id`)**:
  - Row 1: `1`
  - Row 2: `2`
  - Row 3: `3`

- **Sparse Column (`sparse_data`)**:
  - **Row 1:** BSON representation of `{"attr1": "value1", "attr2": 100}`
  - **Row 2:** BSON representation of `{"attr1": "value2", "attr3": true}`
  - **Row 3:** BSON representation of `{"attr1": "value3", "attr4": "extra", "attr5": 3.14}`

- **Sparse Index (`sparse_index`)**:
  - **Row 1:** Index entries mapping `attr1` and `attr2` to their respective positions in `sparse_data`.
  - **Row 2:** Index entries mapping `attr1` and `attr3`.
  - **Row 3:** Index entries mapping `attr1`, `attr4`, and `attr5`.

In an actual system, the sparse data would be serialized using a BSON library (e.g., bsoncxx) for a compact binary format. The example above demonstrates the logical mapping of JSON data to the Parquet storage format.
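
To make the Row 1 mapping concrete, the sketch below hand-encodes a flat document following the BSON element layout (type byte, null-terminated key, value) and records where each value sits. It is a toy encoder for illustration only: it handles just strings, booleans, integers (always written as int64 here, whereas a library may choose int32), and doubles. The names `encode_element` and `encode_document` are hypothetical, and treating the index position as the offset of the value rather than of the whole element is an assumption.

```python
import struct


def encode_element(key: str, value) -> tuple[bytes, int]:
    """Encode one BSON element; also return where its value starts inside it."""
    name = key.encode("utf-8") + b"\x00"
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        tag, payload = b"\x08", (b"\x01" if value else b"\x00")
    elif isinstance(value, int):
        tag, payload = b"\x12", struct.pack("<q", value)  # int64 in this sketch
    elif isinstance(value, float):
        tag, payload = b"\x01", struct.pack("<d", value)
    elif isinstance(value, str):
        raw = value.encode("utf-8") + b"\x00"
        tag, payload = b"\x02", struct.pack("<i", len(raw)) + raw
    else:
        raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")
    return tag + name + payload, len(tag) + len(name)


def encode_document(doc: dict) -> tuple[bytes, dict[str, tuple[int, int]]]:
    """Return the BSON bytes and a map of key -> (value offset, value size)."""
    body = b""
    positions: dict[str, tuple[int, int]] = {}
    for key, value in doc.items():
        element, value_start = encode_element(key, value)
        # 4 bytes of document length precede the first element.
        positions[key] = (4 + len(body) + value_start, len(element) - value_start)
        body += element
    total = 4 + len(body) + 1  # length prefix + elements + trailing null byte
    return struct.pack("<i", total) + body + b"\x00", positions


data, positions = encode_document({"attr1": "value1", "attr2": 100})
offset, size = positions["attr2"]
print(len(data), positions)
print(struct.unpack("<q", data[offset:offset + size])[0])  # reads attr2 directly
```

With the recorded offset and size, a reader can jump straight to `attr2` without walking the elements that precede it, which is the kind of shortcut the `sparse_index` column is meant to provide.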

---