This repository was archived by the owner on Mar 21, 2024. It is now read-only.
Indexing NDJSONs #29
Merged
Commits
12 commits
- d4ed754 (gmourier): initialize a draft for json lines indexation support specification
- fa74d41 (gmourier): update filename number to match related pull-request
- 244539a (gmourier): update specs
- 0a6fae8 (gmourier): update link to CSV spec
- d3d10a3 (gmourier): update spec name
- 6805346 (gmourier): Apply typos correction from code review
- 174ef48 (gmourier): fix typo
- dcc6674 (gmourier): update impact on documentation part
- 327e15a (gmourier): replace file by data
- 77aee97 (gmourier): add information about giving application/json content-type or not for…
- 93885e9 (gmourier): updates error codes, curl instructions
- ed14253 (gmourier): moved behavior about missing content-type in explanation part
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,132 @@ | ||
| - Title: Indexing NDJSON | ||
| - Start Date: 2021-04-12 | ||
| - Specification PR: [PR-#29](https://github.com/meilisearch/specifications/pull/29) | ||
| - MeiliSearch Tracking-Issues: TBD | ||
|
|
||
| # Indexing NDJSON | ||
|
|
||
| ## 1. Feature Description and Interaction | ||
|
|
||
| ### I. Summary | ||
|
|
||
| The first step of document indexing is to send data in a supported format so that it can be parsed and tokenized, which in turn makes it searchable by end users. An [NDJSON](http://ndjson.org/) file is easier to use than a CSV file because it provides a convenient format for storing structured data. | ||
|
|
||
| ### II. Motivation | ||
|
|
||
| Currently, the engine only accepts JSON as a data source. We want to offer users another simple format, giving them more flexibility in choosing a data source for the indexing step. | ||
|
|
||
| Write performance is also a motivation: parsing JSON Lines is less CPU- and memory-intensive than parsing standard JSON because each line is a separate entry, which makes an NDJSON file streamable. It is therefore better suited to indexing a large data set. | ||
|
||
|
|
||
| While this [specification](https://github.com/meilisearch/specifications/pull/28) gives MeiliSearch the ability to ingest CSV files for indexing, we are aware of the limitations of CSV, so we also want to provide a format that is easy to validate. Checking the validity of a CSV file can be frustrating and difficult. A CSV file can only hold strings. In addition, there is no official specification except [RFC 4180](https://tools.ietf.org/html/rfc4180), which is not sufficient for all data schemas. | ||
|
|
||
| Representing nested structures in a JSON object is easy and convenient. | ||
|
|
||
| ### III. Additional Materials | ||
|
|
||
| TBD | ||
|
|
||
| ### IV. Explanation | ||
|
|
||
| Newline-delimited JSON (`ndjson`), line-delimited JSON (`ldjson`), and JSON Lines (`jsonl`) are three names for the same format, which is primarily intended for JSON streaming. | ||
|
|
||
| From here on, we use `ndjson` to refer to data made of JSON entries separated by a newline character. | ||
|
|
||
| - Each entry represents a document for MeiliSearch. | ||
| - Each entry should be a valid JSON object. | ||
| - The file should be encoded in UTF-8. | ||
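The three rules above can be sketched as a small validating reader. This is only an illustration, not the engine's parser: the function name `iter_ndjson_documents` is hypothetical, and skipping blank lines is an assumption that this specification does not state.

```python
import json
from typing import Any, Dict, Iterator


def iter_ndjson_documents(payload: bytes) -> Iterator[Dict[str, Any]]:
    """Yield one document per non-empty line of a UTF-8 NDJSON payload."""
    # Rule: the data must be encoded in UTF-8.
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"payload is not valid UTF-8: {exc}") from exc

    # Split on "\n" only: entries are separated by a newline character.
    for line_number, line in enumerate(text.split("\n"), start=1):
        line = line.strip()
        if not line:
            # Assumption: blank lines are skipped rather than rejected.
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc
        # Rule: each entry must be a JSON object, not an array or a scalar.
        if not isinstance(entry, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        yield entry
```

Because each line is parsed on its own, a reader of this shape only ever holds one document in memory at a time once the input is read line by line from a stream.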
|
|
||
| #### Example of a valid NDJSON | ||
|
|
||
| Given the NDJSON payload | ||
| ``` | ||
| {"id":1, "label": "t-shirt", "price": 4.99, "colors": ["red", "green", "blue"]} | ||
| {"id":499, "label": "hoodie", "price": 19.99, "colors": ["purple"]} | ||
| ``` | ||
| the search result should be displayed as | ||
| ```json | ||
| { | ||
| "hits": [ | ||
| { | ||
| "id": 1, | ||
| "label": "t-shirt", | ||
| "price": 4.99, | ||
| "colors": [ | ||
| "red", | ||
| "green", | ||
| "blue" | ||
| ] | ||
| }, | ||
| { | ||
| "id": 499, | ||
| "label": "hoodie", | ||
| "price": 19.99, | ||
| "colors": [ | ||
| "purple" | ||
| ] | ||
| } | ||
| ], | ||
| ... | ||
| } | ||
| ``` | ||
|
|
||
| #### API Endpoints | ||
|
|
||
| > Each API endpoint mentioned below requires `application/x-ndjson` as the `Content-Type` header to process NDJSON data. | ||
|
||
|
|
||
| #### Add or Replace Documents [📎](https://docs.meilisearch.com/reference/api/documents.html#add-or-replace-documents) | ||
|
|
||
| ```curl | ||
| curl \ | ||
| -X POST 'http://localhost:7700/indexes/movies/documents' \ | ||
| -H 'Content-Type: application/x-ndjson' \ | ||
| --data-binary '{"id":1, "label": "t-shirt", "price": 4.99, "colors": ["red", "green", "blue"]} | ||
| {"id":499, "label": "hoodie", "price": 19.99, "colors": ["purple"]}' | ||
| ``` | ||
| > Response code: 202 Accepted | ||
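For readers who call the endpoint from code rather than from curl, the same request can be sketched with Python's standard library. The helper name `build_ndjson_request` is hypothetical. The sketch only builds the request; sending it assumes a MeiliSearch instance is listening on `localhost:7700`, as in the curl example.

```python
import json
import urllib.request
from typing import Any, Dict, Iterable


def build_ndjson_request(
    url: str, documents: Iterable[Dict[str, Any]], method: str = "POST"
) -> urllib.request.Request:
    """Serialize documents as NDJSON and wrap them in an HTTP request."""
    # json.dumps (without indent) never emits raw newlines, so each
    # document stays on a single line of the payload.
    body = "\n".join(json.dumps(doc, ensure_ascii=False) for doc in documents)
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),  # the payload must be UTF-8
        headers={"Content-Type": "application/x-ndjson"},
        method=method,
    )


request = build_ndjson_request(
    "http://localhost:7700/indexes/movies/documents",
    [
        {"id": 1, "label": "t-shirt", "price": 4.99, "colors": ["red", "green", "blue"]},
        {"id": 499, "label": "hoodie", "price": 19.99, "colors": ["purple"]},
    ],
)
# urllib.request.urlopen(request) would send it to a running instance.
```

Passing `method="PUT"` builds the equivalent request for the add-or-update endpoint.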
|
|
||
| ##### Error codes | ||
|
|
||
| > - Sending a payload whose format does not match the `Content-Type` header should return a `415 unsupported_media_type` error. | ||
| > - A payload exceeding the payload size limit should return a `413 payload_too_large` error. | ||
| > - Wrong file encoding should return a `422 unprocessable_entity` error. | ||
| > - Invalid NDJSON data should return a `422 unprocessable_entity` error. | ||
|
||
|
|
||
| #### Add or Update Documents [📎](https://docs.meilisearch.com/reference/api/documents.html#add-or-update-documents) | ||
|
|
||
| ```curl | ||
| curl \ | ||
| -X PUT 'http://localhost:7700/indexes/movies/documents' \ | ||
| -H 'Content-Type: application/x-ndjson' \ | ||
| --data-binary '{"id":1, "label": "t-shirt", "price": 4.99, "colors": ["red", "green", "blue"]} | ||
| {"id":499, "label": "hoodie", "price": 19.99, "colors": ["purple"]}' | ||
| ``` | ||
| > Response code: 202 Accepted | ||
|
|
||
| ##### Error codes | ||
|
|
||
| > - Sending a payload whose format does not match the `Content-Type` header should return a `415 unsupported_media_type` error. | ||
| > - A payload exceeding the payload size limit should return a `413 payload_too_large` error. | ||
| > - Wrong file encoding should return a `422 unprocessable_entity` error. | ||
| > - Invalid NDJSON data should return a `422 unprocessable_entity` error. | ||
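The error cases listed above could be checked in roughly the following way. This is a sketch under stated assumptions, not the engine's implementation: the function name `check_ndjson_payload`, the order of the checks, and the handling of blank lines are all hypothetical. It covers only the NDJSON path and returns the error names used in the list.

```python
import json
from typing import Optional

NDJSON_CONTENT_TYPE = "application/x-ndjson"


def check_ndjson_payload(
    content_type: Optional[str], body: bytes, payload_limit: int
) -> Optional[str]:
    """Return the error name for a rejected payload, or None if it is acceptable."""
    # Assumption: the header is checked first, then the size, then the content.
    if content_type != NDJSON_CONTENT_TYPE:
        return "unsupported_media_type"
    if len(body) > payload_limit:
        return "payload_too_large"
    try:
        text = body.decode("utf-8")
    except UnicodeDecodeError:
        return "unprocessable_entity"  # wrong encoding
    for line in text.split("\n"):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            return "unprocessable_entity"  # invalid NDJSON
        if not isinstance(entry, dict):
            return "unprocessable_entity"  # entries must be JSON objects
    return None
```

Returning the error name rather than a status code keeps the sketch independent of how the HTTP layer maps names to responses.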
|
|
||
| ### V. Impact on documentation | ||
|
|
||
| This feature should impact the MeiliSearch user documentation: the `ndjson` capability should be mentioned in the Documents scope at [Add or replace documents](https://docs.meilisearch.com/reference/api/documents.html#add-or-replace-documents) and [Add or update documents](https://docs.meilisearch.com/reference/api/documents.html#add-or-update-documents). | ||
|
||
|
|
||
| The `unsupported_media_type` section on the [errors page](https://docs.meilisearch.com/errors/#unsupported_media_type) should no longer mention only the JSON format; it should also list the `ndjson` format. | ||
|
||
|
|
||
| The documentation should also explain how to format and send NDJSON data correctly. A dedicated page on formatting and sending NDJSON data should be considered. | ||
|
|
||
| ### VI. Impact on SDKs | ||
|
|
||
| This feature should impact the MeiliSearch SDKs in the future by adding the ability to send NDJSON data to MeiliSearch on the endpoints described above. | ||
|
|
||
| ## 2. Technical Aspects | ||
|
|
||
| ## 3. Future possibilities | ||
| - Provide an interface in the future dashboard to upload NDJSON data into an index. | ||
| - Set a payload limit specific to each file type. Currently, the payload size limit is the same as the [JSON payload size](https://docs.meilisearch.com/reference/features/configuration.html#payload-limit-size). Metrics on feature usage and configuration changes should help choose a better-suited value for this type of file. | ||