This repository was archived by the owner on Mar 21, 2024. It is now read-only.
Indexing NDJSONs #29
Merged
Commits
d4ed754
initialize a draft for json lines indexation support specification
gmourier fa74d41
update filename number to match related pull-request
gmourier 244539a
update specs
gmourier 0a6fae8
update link to CSV spec
gmourier d3d10a3
update spec name
gmourier 6805346
Apply typos correction from code review
gmourier 174ef48
fix typo
gmourier dcc6674
update impact on documentation part
gmourier 327e15a
replace file by data
gmourier 77aee97
add information about giving application/json content-type or not for…
gmourier 93885e9
updates error codes, curl instructions
gmourier ed14253
moved behavior about missing content-type in explanation part
gmourier
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,138 @@ | ||
| - Title: Indexing NDJSON | ||
| - Start Date: 2021-04-12 | ||
| - Specification PR: [PR-#29](https://github.com/meilisearch/specifications/pull/29) | ||
| - MeiliSearch Tracking-Issues: TBD | ||
|
|
||
| # Indexing NDJSON | ||
|
|
||
| ## 1. Feature Description and Interaction | ||
|
|
||
| ### I. Summary | ||
|
|
||
| To index documents, the body of the add documents request has to match a specific format. MeiliSearch parses and tokenizes that payload, after which the added documents join the pool of searchable and returnable documents. | ||
|
|
||
| The [NDJSON](http://ndjson.org/) data format is easier to use than CSV because it offers a convenient way to store structured data. | ||
|
|
||
| ### II. Motivation | ||
|
|
||
| Currently, the engine only accepts JSON as a data source. We want to offer users another simple data format, giving them more flexibility when choosing a data source for the indexing step. | ||
|
|
||
| Write performance is also a motivation, since parsing NDJSON data is less CPU- and memory-intensive than parsing standard JSON. Because each line represents a separate entry, NDJSON data can be streamed, which makes it better suited to indexing a large data set. | ||
|
|
||
| While this [specification](https://github.com/meilisearch/specifications/pull/28) gives MeiliSearch the ability to ingest CSV data for indexing, we are aware of the limitations of CSV, so we also want to provide a format that is easy to validate. Checking the validity of a CSV can be frustrating and difficult. Only strings can be represented within a CSV. In addition, there is no official specification except [RFC 4180](https://tools.ietf.org/html/rfc4180), which is not sufficient for all data schemes. | ||
|
|
||
| Representing nested structures in a JSON object is easy and convenient. | ||
|
|
||
| ### III. Additional Materials | ||
|
|
||
| TBD | ||
|
|
||
| ### IV. Explanation | ||
|
|
||
| Newline-delimited JSON (`ndjson`), line-delimited JSON (`ldjson`), and JSON Lines (`jsonl`) are three terms for the same format, primarily intended for JSON streaming. | ||
|
|
||
| From now on, we use `ndjson` to refer to a data format that represents JSON entries separated by a newline character. | ||
|
|
||
| - Each entry represents a document for MeiliSearch. | ||
| - Each entry should be a valid JSON object. | ||
| - The data should be encoded in UTF-8. | ||
|
|
||
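The three rules above can be illustrated with a minimal Python sketch. `parse_ndjson` is a hypothetical helper written for this illustration, not MeiliSearch's actual parser, and skipping blank lines is an assumption the specification does not state.

```python
import json
from typing import Any, Dict, List


def parse_ndjson(payload: bytes) -> List[Dict[str, Any]]:
    """Parse an NDJSON payload into a list of documents (hypothetical helper)."""
    # The data must be UTF-8; decode() raises UnicodeDecodeError otherwise.
    text = payload.decode("utf-8")
    documents = []
    for line_number, line in enumerate(text.split("\n"), start=1):
        # Assumption of this sketch: blank lines (e.g. after a trailing newline) are skipped.
        if not line.strip():
            continue
        # json.loads raises json.JSONDecodeError (a ValueError) on invalid JSON.
        entry = json.loads(line)
        # Each entry must be a JSON object, not an array or a scalar.
        if not isinstance(entry, dict):
            raise ValueError(f"line {line_number}: entry is not a JSON object")
        documents.append(entry)
    return documents


payload = (
    b'{"id":1, "label": "t-shirt", "price": 4.99, "colors": ["red", "green", "blue"]}\n'
    b'{"id":499, "label": "hoodie", "price": 19.99, "colors": ["purple"]}\n'
)
documents = parse_ndjson(payload)
```

Each element of `documents` corresponds to one line of the payload, which is what makes the format straightforward to validate entry by entry.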
| #### Example of a valid NDJSON | ||
|
|
||
| Given the NDJSON payload | ||
| ``` | ||
| {"id":1, "label": "t-shirt", "price": 4.99, "colors": ["red", "green", "blue"]} | ||
| {"id":499, "label": "hoodie", "price": 19.99, "colors": ["purple"]} | ||
| ``` | ||
| the search result should be displayed as | ||
| ```json | ||
| { | ||
|   "hits": [ | ||
|     { | ||
|       "id": 1, | ||
|       "label": "t-shirt", | ||
|       "price": 4.99, | ||
|       "colors": [ | ||
|         "red", | ||
|         "green", | ||
|         "blue" | ||
|       ] | ||
|     }, | ||
|     { | ||
|       "id": 499, | ||
|       "label": "hoodie", | ||
|       "price": 19.99, | ||
|       "colors": [ | ||
|         "purple" | ||
|       ] | ||
|     } | ||
|   ], | ||
|   ... | ||
| } | ||
| ``` | ||
|
|
||
| #### API Endpoints | ||
|
|
||
| > Each API endpoint mentioned below requires `application/x-ndjson` as the `Content-Type` header for the payload to be processed as NDJSON data. | ||
|
|
||
| #### Add or Replace Documents [📎](https://docs.meilisearch.com/reference/api/documents.html#add-or-replace-documents) | ||
|
|
||
| ```curl | ||
| curl \ | ||
| -X POST 'http://localhost:7700/indexes/movies/documents' \ | ||
| -H 'Content-Type: application/x-ndjson' \ | ||
| --data-binary '{"id":1, "label": "t-shirt", "price": 4.99, "colors": ["red", "green", "blue"]} | ||
| {"id":499, "label": "hoodie", "price": 19.99, "colors": ["purple"]} | ||
| ' | ||
| ``` | ||
| > Response code: 202 Accepted | ||
|
|
||
| ##### Error codes | ||
|
|
||
| > - Sending a payload that does not match the `Content-Type` header should return a `400 bad_request` error. | ||
| > - A payload exceeding the payload size limit should return a `413 payload_too_large` error. | ||
| > - A wrong encoding should return a `400 bad_request` error. | ||
| > - Invalid NDJSON data should return a `400 bad_request` error. | ||
|
|
||
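As a rough illustration of the error cases listed above, the following Python sketch maps a raw payload to the status and error code named in the list. `check_ndjson_payload` and `payload_limit` are hypothetical names, and the order of the checks, like the skipping of blank lines, is an assumption of this sketch rather than something the specification defines.

```python
import json
from typing import Optional, Tuple


def check_ndjson_payload(body: bytes, payload_limit: int) -> Optional[Tuple[int, str]]:
    """Return (status, error code) for a rejected payload, or None if it is acceptable."""
    # Too large payload according to the limit.
    if len(body) > payload_limit:
        return (413, "payload_too_large")
    # Wrong encoding: the data must be UTF-8.
    try:
        text = body.decode("utf-8")
    except UnicodeDecodeError:
        return (400, "bad_request")
    for line in text.split("\n"):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        # Invalid NDJSON data: every line must parse as JSON...
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            return (400, "bad_request")
        # ...and be a JSON object (a plain JSON array sent as NDJSON fails here).
        if not isinstance(entry, dict):
            return (400, "bad_request")
    return None
```

For example, `check_ndjson_payload(b'[{"id": 1}]', 1024)` is rejected with `bad_request` because the payload is a JSON array rather than one JSON object per line.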
| #### Add or Update Documents [📎](https://docs.meilisearch.com/reference/api/documents.html#add-or-update-documents) | ||
|
|
||
| ```curl | ||
| curl \ | ||
| -X PUT 'http://localhost:7700/indexes/movies/documents' \ | ||
| -H 'Content-Type: application/x-ndjson' \ | ||
| --data-binary '{"id":1, "label": "t-shirt", "price": 4.99, "colors": ["red", "green", "blue"]} | ||
| {"id":499, "label": "hoodie", "price": 19.99, "colors": ["purple"]} | ||
| ' | ||
| ``` | ||
| > Response code: 202 Accepted | ||
|
|
||
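For readers who prefer a script over curl, the two requests above can be sketched in Python with the standard library. `build_ndjson_request` is a hypothetical helper, not part of any MeiliSearch SDK; it only builds the request object, and actually sending it assumes an instance listening on `localhost:7700`.

```python
import json
import urllib.request
from typing import Any, Dict, Iterable


def build_ndjson_request(
    url: str, documents: Iterable[Dict[str, Any]], method: str = "POST"
) -> urllib.request.Request:
    """Build an add-documents request with an NDJSON body (hypothetical helper)."""
    # One compact JSON object per line, each terminated by a newline character.
    body = "".join(json.dumps(doc) + "\n" for doc in documents).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
        method=method,  # "POST" for add or replace, "PUT" for add or update
    )


request = build_ndjson_request(
    "http://localhost:7700/indexes/movies/documents",
    [{"id": 1, "label": "t-shirt"}, {"id": 499, "label": "hoodie"}],
    method="PUT",
)
# urllib.request.urlopen(request) would send it; a running instance is assumed.
```

Serializing each document with `json.dumps` keeps every entry on a single line, since the default output contains no raw newline characters.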
| ##### Error codes | ||
|
|
||
| > - Sending a payload that does not match the `Content-Type` header should return a `400 bad_request` error. | ||
| > - A payload exceeding the payload size limit should return a `413 payload_too_large` error. | ||
| > - A wrong encoding should return a `400 bad_request` error. | ||
| > - Invalid NDJSON data should return a `400 bad_request` error. | ||
|
|
||
| ### V. Impact on documentation | ||
|
|
||
| This feature should impact the MeiliSearch user documentation by adding `ndjson` as an accepted format in the Documents scope at [Add or replace documents](https://docs.meilisearch.com/reference/api/documents.html#add-or-replace-documents) and [Add or update documents](https://docs.meilisearch.com/reference/api/documents.html#add-or-update-documents). It should also mention that a missing Content-Type will be interpreted as `application/json` since it's the current behavior. Giving an `application/json` Content-Type leads to the same behavior. | ||
|
|
||
| The `unsupported_media_type` section on the [errors page](https://docs.meilisearch.com/errors/#unsupported_media_type) should no longer mention only the JSON format; it should also list the `ndjson` format. The documentation currently says "Currently, MeiliSearch supports only JSON payloads." | ||
|
|
||
| The documentation should also guide users on how to properly format and send ndjson data. Adding a dedicated page for this purpose should be considered. | ||
|
|
||
| ### VI. Impact on SDKs | ||
|
|
||
| This feature should impact MeiliSearch SDKs in the future by adding the possibility to send ndjson data to MeiliSearch on the endpoints described above. | ||
|
|
||
| ## 2. Technical Aspects | ||
|
|
||
| ### I. Technical details | ||
|
|
||
| ⚠ A missing Content-Type will be interpreted as `application/json` since it's the current behavior. Giving an `application/json` Content-Type leads to the same behavior. | ||
|
|
||
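The Content-Type behavior described above could be sketched as a small dispatch function. `resolve_payload_format` is a hypothetical name; ignoring parameters such as `; charset=utf-8` and raising an error for any other media type are assumptions of this sketch, the latter loosely based on the `unsupported_media_type` error mentioned in the documentation section.

```python
from typing import Optional


def resolve_payload_format(content_type: Optional[str]) -> str:
    """Pick the parser for a request based on its Content-Type header (hypothetical)."""
    # A missing Content-Type is interpreted as application/json (current behavior).
    if content_type is None:
        return "json"
    # Assumption: parameters after ';' and letter case are ignored.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/json":
        return "json"
    if media_type == "application/x-ndjson":
        return "ndjson"
    # Assumption: anything else is rejected as an unsupported media type.
    raise ValueError(f"unsupported_media_type: {content_type}")
```

With this shape, `resolve_payload_format(None)` and `resolve_payload_format("application/json")` lead to the same parser, matching the behavior stated above.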
| ## 3. Future possibilities | ||
| - Provide an interface in the future dashboard to upload NDJSON data into an index. | ||
| - Set a payload limit directly related to the type of data format. Currently, the payload size limit is the same as the [JSON payload size](https://docs.meilisearch.com/reference/features/configuration.html#payload-limit-size). Metrics on feature usage and configuration updates should help choose a better-suited value for this type of data. | ||
Is it a technical detail? The users and the documentation should be aware of this. Technical details are for internal implementation from my POV.
This information is as important as any information in Explanation.
Nice point @curquiza. I will update the specification according to this feedback!