This repository was archived by the owner on Mar 21, 2024. It is now read-only.

Commit e67eef6

gmourier and bidoubiwa committed
Indexing NDJSONs (#29)
* initialize a draft for json lines indexation support specification
* update filename number to match related pull-request
* update specs
* update link to CSV spec
* update spec name
* Apply typos correction from code review
  Co-authored-by: cvermand <[email protected]>
* fix typo
* update impact on documentation part
* replace file by data
* add information about giving application/json content-type or not for a json payload
* updates error codes, curl instructions
* moved behavior about missing content-type in explanation part

Co-authored-by: cvermand <[email protected]>
1 parent e270da6 commit e67eef6

1 file changed: +136 -0 lines changed

text/0029-indexing-ndjson.md

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
- Title: Indexing NDJSON
- Start Date: 2021-04-12
- Specification PR: [PR-#29](https://github.com/meilisearch/specifications/pull/29)
- MeiliSearch Tracking-Issues: TBD

# Indexing NDJSON

## 1. Feature Description and Interaction

### I. Summary

To index documents, the body of the add documents request has to match a specific format. MeiliSearch parses and tokenizes that payload, after which the added documents join the pool of searchable and returnable documents.

The [NDJSON](http://ndjson.org/) data format is easier to use than CSV because it offers a convenient way to store structured data.

### II. Motivation

Currently, the engine only accepts JSON as a data source. We want to offer users another simple data format, giving them more versatility in their choice of data source for the indexing step.

Write performance is also a motivation, since parsing JSON Lines data is less CPU- and memory-intensive than parsing standard JSON. Because each new line represents a separate entry, NDJSON data is streamable and therefore better suited to indexing a large data set.

While we give MeiliSearch the ability to ingest CSV data for indexing in this [specification](https://github.com/meilisearch/specifications/pull/28), we are aware of the limitations of CSV, so we also want to provide a format that is easy to validate. Handling the validity of a CSV can be frustrating and difficult. A CSV can only hold strings. In addition, there is no official specification other than [RFC 4180](https://tools.ietf.org/html/rfc4180), which is not sufficient for all data schemas.

By contrast, representing nested structures in a JSON object is easy and convenient.

### III. Additional Materials

TBD

### IV. Explanation

Newline-delimited JSON (`ndjson`), line-delimited JSON (`ldjson`), and JSON lines (`jsonl`) are three terms for the same format, primarily intended for JSON streaming.

From here on, we use `ndjson` to refer to a data format that represents JSON entries separated by a newline character.

- Each entry represents a document for MeiliSearch.
- Each entry should be a valid JSON object.
- The data should be encoded in UTF-8.

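The three rules above can be sketched as a small validating parser. This is only an illustration in Python, not the engine's implementation: the function name `parse_ndjson` is hypothetical, and skipping blank lines (such as one left by a trailing newline) is an assumption that the specification does not state.

```python
import json


def parse_ndjson(payload: bytes) -> list[dict]:
    """Parse an NDJSON payload into a list of documents.

    Raises ValueError when the payload is not UTF-8 or when a line
    is not a JSON object. (Hypothetical helper, for illustration.)
    """
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError as err:
        raise ValueError(f"payload is not valid UTF-8: {err}") from err

    documents = []
    # Split on "\n" only: entries are separated by a newline character.
    for line_number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            # Assumption: blank lines are ignored rather than rejected.
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number}: invalid JSON ({err.msg})") from err
        if not isinstance(entry, dict):
            raise ValueError(f"line {line_number}: entry is not a JSON object")
        documents.append(entry)
    return documents
```

Each returned dictionary corresponds to one document. A line holding a JSON array or a bare number is rejected, because every entry has to be a JSON object.
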
#### Example of valid NDJSON

Given the NDJSON payload
```
{"id":1, "label": "t-shirt", "price": 4.99, "colors": ["red", "green", "blue"]}
{"id":499, "label": "hoodie", "price": 19.99, "colors": ["purple"]}
```
the search result should be displayed as
```json
{
  "hits": [
    {
      "id": 1,
      "label": "t-shirt",
      "price": 4.99,
      "colors": [
        "red",
        "green",
        "blue"
      ]
    },
    {
      "id": 499,
      "label": "hoodie",
      "price": 19.99,
      "colors": [
        "purple"
      ]
    }
  ],
  ...
}
```

#### API Endpoints

> Each API endpoint mentioned below will now require `application/x-ndjson` as the `Content-Type` header for its payload to be processed as NDJSON data.
>
> ⚠ A missing Content-Type will be interpreted as `application/json` since that is the current behavior. Giving an `application/json` Content-Type leads to the same behavior.

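The header rule above could be sketched as follows. This is a hedged Python illustration rather than the server's code: `resolve_payload_format` is a hypothetical name, ignoring parameters such as `charset` is an assumption, and raising an error for any other media type is only one possible reading of the `unsupported_media_type` case.

```python
from typing import Optional

JSON_MEDIA_TYPE = "application/json"
NDJSON_MEDIA_TYPE = "application/x-ndjson"


def resolve_payload_format(content_type: Optional[str]) -> str:
    """Return "json" or "ndjson" for a Content-Type header value."""
    if content_type is None or not content_type.strip():
        # A missing Content-Type is interpreted as application/json.
        return "json"
    # Assumption: parameters such as "; charset=utf-8" are ignored.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == JSON_MEDIA_TYPE:
        return "json"
    if media_type == NDJSON_MEDIA_TYPE:
        return "ndjson"
    # Assumption: anything else is treated as an unsupported media type.
    raise ValueError(f"unsupported media type: {media_type}")
```

With this sketch, a request without the header and a request with `application/json` both take the JSON path, which matches the behavior described in the note.
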
#### Add or Replace Documents [📎](https://docs.meilisearch.com/reference/api/documents.html#add-or-replace-documents)

```curl
curl \
  -X POST 'http://localhost:7700/indexes/movies/documents' \
  -H 'Content-Type: application/x-ndjson' \
  --data-binary '{"id":1, "label": "t-shirt", "price": 4.99, "colors": ["red", "green", "blue"]}
{"id":499, "label": "hoodie", "price": 19.99, "colors": ["purple"]}'
```
> Response code: 202 Accepted

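For readers scripting the same call, here is a minimal Python sketch that builds an equivalent request with the standard library. The helper name `build_ndjson_request` is hypothetical; the URL and header come from the curl example, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_ndjson_request(
    documents,
    url="http://localhost:7700/indexes/movies/documents",
    method="POST",
):
    """Build an HTTP request whose body is one JSON object per line."""
    # json.dumps without indent never emits a newline, so each
    # document stays on its own line.
    body = "\n".join(json.dumps(doc) for doc in documents).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
        method=method,
    )


request = build_ndjson_request(
    [
        {"id": 1, "label": "t-shirt", "price": 4.99, "colors": ["red", "green", "blue"]},
        {"id": 499, "label": "hoodie", "price": 19.99, "colors": ["purple"]},
    ]
)
```

Passing `request` to `urllib.request.urlopen` would send it to a running instance. Passing `method="PUT"` would target the add or update route instead.
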
##### Error codes

> - Sending a payload that does not match the `Content-Type` header should return a `400 bad_request` error.
> - A payload larger than the payload size limit should return a `413 payload_too_large` error.
> - A wrong encoding should return a `400 bad_request` error.
> - Invalid NDJSON data should return a `400 bad_request` error.

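One way to read the list above is as an ordered series of checks. The Python sketch below is an assumption-laden illustration: `check_ndjson_payload` is a hypothetical name, the order of the checks is a guess, and `payload_limit` is a parameter because this section does not state the limit's value.

```python
import json
from typing import Optional


def check_ndjson_payload(payload: bytes, payload_limit: int) -> Optional[tuple[int, str]]:
    """Return (status, error code) for a rejected payload, or None if it passes."""
    # Assumption: the size check runs before any decoding or parsing.
    if len(payload) > payload_limit:
        return 413, "payload_too_large"
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError:
        # Wrong encoding.
        return 400, "bad_request"
    for line in text.split("\n"):
        if not line.strip():
            # Assumption: blank lines are ignored.
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Invalid NDJSON data.
            return 400, "bad_request"
        if not isinstance(entry, dict):
            # Every entry should be a JSON object.
            return 400, "bad_request"
    return None
```

A `None` result corresponds to the `202 Accepted` path. A standard JSON array sent with the `application/x-ndjson` header fails the object check, which is one way a payload can differ from its `Content-Type`.
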
#### Add or Update Documents [📎](https://docs.meilisearch.com/reference/api/documents.html#add-or-update-documents)

```curl
curl \
  -X PUT 'http://localhost:7700/indexes/movies/documents' \
  -H 'Content-Type: application/x-ndjson' \
  --data-binary '{"id":1, "label": "t-shirt", "price": 4.99, "colors": ["red", "green", "blue"]}
{"id":499, "label": "hoodie", "price": 19.99, "colors": ["purple"]}'
```
> Response code: 202 Accepted

##### Error codes

> - Sending a payload that does not match the `Content-Type` header should return a `400 bad_request` error.
> - A payload larger than the payload size limit should return a `413 payload_too_large` error.
> - A wrong encoding should return a `400 bad_request` error.
> - Invalid NDJSON data should return a `400 bad_request` error.

### V. Impact on documentation

This feature should impact the MeiliSearch user documentation by adding `ndjson` as an accepted format in the Documents scope, at [Add or replace documents](https://docs.meilisearch.com/reference/api/documents.html#add-or-replace-documents) and [Add or update documents](https://docs.meilisearch.com/reference/api/documents.html#add-or-update-documents). The documentation should also mention that a missing Content-Type will be interpreted as `application/json` since that is the current behavior. Giving an `application/json` Content-Type leads to the same behavior.

The `unsupported_media_type` section on the [errors page](https://docs.meilisearch.com/errors/#unsupported_media_type) should no longer mention only the JSON format; it should also list `ndjson`. The documentation currently says "Currently, MeiliSearch supports only JSON payloads."

The documentation should also guide users on how to properly format and send ndjson data. Adding a dedicated page for this purpose should be considered.

### VI. Impact on SDKs

This feature should impact the MeiliSearch SDKs in the future by adding the possibility to send ndjson data to MeiliSearch on the endpoints described above.

## 2. Technical Aspects
N/A

## 3. Future possibilities
- Provide an interface in the future dashboard to upload NDJSON data into an index.
- Set a payload limit directly related to the type of data format. Currently, the payload size limit is the same as the [JSON payload size limit](https://docs.meilisearch.com/reference/features/configuration.html#payload-limit-size). Metrics on feature usage and configuration updates should help choose a better-suited value for this type of data.
