External Index API

Maxence Lange edited this page Jan 11, 2023 · 1 revision

Externally indexing your Nextcloud

FullTextSearch comes with a range of tools to allow an administrator to index the content of Nextcloud in an external search engine, and keep it up-to-date.

Collections

A collection is the list of all documents available on your Nextcloud and their current index status.
Each document is identified by the document provider and the id of the document itself.
The status is updated when:

  • The document is indexed by your external search engine (marked as indexed)
  • The document is modified locally on your Nextcloud (marked as not indexed)

You can create as many collections as you need to link external search engines.
You manage your collections with the occ command:

To list all collections:

 ./occ fulltextsearch:collection:list

To create a new collection:

 ./occ fulltextsearch:collection:init <collectionName>

To destroy a collection:

 ./occ fulltextsearch:collection:delete <collectionName>

API

Important note: Using the OCS API requires authenticating as a Nextcloud account with admin rights.

Once a collection is created, you can start asking your Nextcloud to:

  • get a list of documents that need to be indexed by your external search engine,
  • get the content of a document,
  • mark the document as indexed.

In the following examples, test is the name used when the collection was created.

Get list of documents that need to be indexed (or re-indexed):

curl -X GET "https://cloud.example.net/ocs/v2.php/apps/fulltextsearch/collection/test/index?format=json&length=50" -H "OCS-APIRequest: true" -u "admin:password"
{
  "ocs": {
    "meta": {
      "status": "ok",
      "statuscode": 200,
      "message": "OK"
    },
    "data": [
      {
        "url": "https://cloud.example.net/ocs/v2.php/apps/fulltextsearch/collection/test/document/files/597996",
        "status": 28
      }
    ]
  }
}
  • length sets a limit on the number of results returned by the request.
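
As a rough illustration of how the listing request above is put together, here is a minimal Python sketch that builds the same URL and headers as the curl call. The helper names (build_index_url, ocs_headers) are hypothetical and are not part of FullTextSearch; only the URL layout, the OCS-APIRequest header and the basic authentication come from the example above.

```python
import base64
from typing import Dict
from urllib.parse import quote, urlencode


def build_index_url(base_url: str, collection: str, length: int = 50) -> str:
    """Build the URL that lists the documents waiting to be indexed."""
    path = f"/ocs/v2.php/apps/fulltextsearch/collection/{quote(collection, safe='')}/index"
    query = urlencode({"format": "json", "length": length})
    return f"{base_url.rstrip('/')}{path}?{query}"


def ocs_headers(user: str, password: str) -> Dict[str, str]:
    """Headers equivalent to curl's -H "OCS-APIRequest: true" and -u "user:password"."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"OCS-APIRequest": "true", "Authorization": f"Basic {token}"}


if __name__ == "__main__":
    print(build_index_url("https://cloud.example.net", "test", 50))
```

The URL and headers can then be passed to any HTTP client; the account used must have admin rights, as noted earlier.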

Each entry from data returned by this request represents a document that needs to be indexed:

  • url is the URL to request to get the content and metadata of the document (see next step),
  • status is a bit flag describing what has changed since the document was last indexed:
    • 4 means metadata has been modified,
    • 8 means content has been modified,
    • 16 means sub-parts have been modified,
    • 28 (4 + 8 + 16) means all data should be re-indexed,
    • 32 means the document is no longer available and its index should be removed from the external search engine.
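
Because status is a bit flag, several values can be combined in a single number. The sketch below shows one possible way to test each bit in Python. The labels ("meta", "content", "parts", "remove") are hypothetical names chosen for this example; only the numeric values come from the list above.

```python
from typing import List

# Numeric values are taken from the list above; the labels are illustrative only.
STATUS_FLAGS = {
    4: "meta",      # metadata modified
    8: "content",   # content modified
    16: "parts",    # sub-parts modified
    32: "remove",   # document no longer available
}


def describe_status(status: int) -> List[str]:
    """Return the label of every flag that is set in the status value."""
    return [label for bit, label in STATUS_FLAGS.items() if status & bit]


if __name__ == "__main__":
    print(describe_status(28))
    print(describe_status(32))
```

A status of 28 sets the three bits 4, 8 and 16 at once, which is why it means that everything should be re-indexed.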

Get details about a document

Running a GET request on the url from the previous step returns the metadata, sub-parts and content of a document.

curl -X GET "https://cloud.example.net/ocs/v2.php/apps/fulltextsearch/collection/test/document/files/597996" -H "OCS-APIRequest: true" -u "admin:password"
{
  "ocs": {
    "meta": {
      "status": "ok",
      "statuscode": 200,
      "message": "OK"
    },
    "data": {
      "id": "597996",
      "providerId": "files",
      "access": {
        "ownerId": "cult",
        "viewerId": "",
        "users": ["test1", "test2"],
        "groups": [],
        "circles": [],
        "links": []
      },
      "index": {
        "ownerId": "cult",
        "providerId": "files",
        "collection": "test",
        "source": "files_local",
        "documentId": "597996",
        "lastIndex": 0,
        "errors": [],
        "errorCount": 0,
        "status": 28,
        "options": []
      },
      "title": "640-240-max.png",
      "link": "http://cloud.example.net/index.php/f/597996",
      "parts": {
        "comments": "<test1> This is a comment !"
      },
      "content": "VGhlIHF1aWNrIGJyb3duIGZveApqdW1wcyBvdmVyCnRoZSBsYXp5IGRvZy4=",
      "isContentEncoded": 1
    }
  }
}

Notes:

  • if isContentEncoded is 1, then content is encoded in base64.
  • for an Office document, the whole content of the file is sent; it is up to your search engine to extract its text content.
  • for an image, if files_fulltextsearch_tesseract is installed and configured, the image is processed with OCR and the text content is returned.
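
To illustrate the first note, here is a small Python sketch that recovers the raw bytes of a document from the data object returned above. The function name document_bytes is hypothetical. It returns bytes rather than text because, for an Office document, the decoded content is the file itself and not plain text.

```python
import base64


def document_bytes(data: dict) -> bytes:
    """Return the raw content of a document, decoding base64 when flagged."""
    content = data.get("content", "")
    if data.get("isContentEncoded") == 1:
        return base64.b64decode(content)
    # Content that is not encoded is assumed here to be plain text.
    return content.encode("utf-8")


if __name__ == "__main__":
    sample = {
        "content": "VGhlIHF1aWNrIGJyb3duIGZveApqdW1wcyBvdmVyCnRoZSBsYXp5IGRvZy4=",
        "isContentEncoded": 1,
    }
    print(document_bytes(sample).decode("utf-8"))
```

The sample uses the content value from the response above, which happens to decode to plain text.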

Set document as indexed:

Once the document is indexed on the external search engine, you need to tell FullTextSearch about it;
run a POST request on the url from the first step, with /done appended:

curl -X POST "https://cloud.example.net/ocs/v2.php/apps/fulltextsearch/collection/test/document/files/597996/done" -H "OCS-APIRequest: true" -u "admin:password"
{
  "ocs": {
    "meta": {
      "status": "ok",
      "statuscode": 200,
      "message": "OK"
    },
    "data": []
  }
}

After this, the document is no longer listed when retrieving the list of documents that need to be indexed, unless the document is modified again.
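
Putting the three requests together, an external indexer could follow a loop like the one sketched below. This is only an outline under several assumptions, not a reference client: the names http_fetch, sync_collection, index_document and remove_document are hypothetical, the Accept header is an assumption about how to ask OCS for JSON on URLs that carry no format=json parameter, and calling /done after removing a document with status 32 is also an assumption, since the page does not say what to do in that case.

```python
import base64
import json
import urllib.request
from typing import Any, Callable

BASE = "https://cloud.example.net/ocs/v2.php/apps/fulltextsearch/collection"
STATUS_REMOVE = 32  # document is not available anymore


def http_fetch(method: str, url: str, user: str = "admin", password: str = "password") -> Any:
    """Send one OCS request and return the "data" part of the JSON response."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method=method, headers={
        "OCS-APIRequest": "true",
        "Authorization": f"Basic {token}",
        "Accept": "application/json",  # assumption: requests a JSON response
    })
    with urllib.request.urlopen(request) as response:
        return json.load(response)["ocs"]["data"]


def sync_collection(
    collection: str,
    fetch: Callable[[str, str], Any],
    index_document: Callable[[dict], None],
    remove_document: Callable[[str], None],
    length: int = 50,
) -> int:
    """Process one batch of pending documents and return how many were handled."""
    entries = fetch("GET", f"{BASE}/{collection}/index?format=json&length={length}")
    for entry in entries:
        url = entry["url"]
        if entry["status"] & STATUS_REMOVE:
            # The document is gone: drop it from the external search engine.
            remove_document(url)
        else:
            # Fetch metadata, sub-parts and content, then hand them to the engine.
            index_document(fetch("GET", url))
        # Tell FullTextSearch this entry has been dealt with (assumed for removals too).
        fetch("POST", url + "/done")
    return len(entries)
```

A caller would pass http_fetch as fetch, together with two callbacks that talk to its own search engine, and repeat the call until it returns 0. Keeping fetch as a parameter also makes it possible to try the loop against canned responses without a running Nextcloud.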