This repository handles the vectorization of data, its insertion into OpenSearch, and searching over it.

wizeline/clone-vector-search

Clone Vector Search

Overview

This service provides endpoints for processing and vectorizing S3 objects, and for populating and querying an OpenSearch index. Its design is inspired by hexagonal architecture principles.

Project structure

  • service: Contains the logic for accessing third-party services.
  • usecase: Contains the business logic layer.
  • controller: Contains the Flask API endpoint handlers.
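
The three layers above can be sketched as follows. This is only an illustration of the layering, not the repository's actual code: the class and function names (VectorStore, InMemoryVectorStore, VectorizeUseCase, vectorize_controller) are hypothetical, the "embedding" is a placeholder, and a plain function stands in for a Flask handler so the sketch runs with the standard library alone.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


class VectorStore(Protocol):
    """Port: what the use case needs from a third-party search backend."""

    def index(self, doc_id: str, vector: list[float]) -> None: ...


@dataclass
class InMemoryVectorStore:
    """service layer: an adapter. A real one would talk to OpenSearch."""

    docs: dict[str, list[float]] = field(default_factory=dict)

    def index(self, doc_id: str, vector: list[float]) -> None:
        self.docs[doc_id] = vector


@dataclass
class VectorizeUseCase:
    """usecase layer: business logic, unaware of Flask or OpenSearch."""

    store: VectorStore

    def execute(self, doc_id: str, text: str) -> int:
        # Placeholder "embedding" (assumption): one number per word, its length.
        vector = [float(len(word)) for word in text.split()]
        self.store.index(doc_id, vector)
        return len(vector)


def vectorize_controller(payload: dict, usecase: VectorizeUseCase) -> tuple[dict, int]:
    """controller layer: validates input and shapes an HTTP-style response."""
    if "doc_id" not in payload or "text" not in payload:
        return {"error": "doc_id and text are required"}, 400
    dims = usecase.execute(payload["doc_id"], payload["text"])
    return {"doc_id": payload["doc_id"], "dimensions": dims}, 200
```

Because the use case depends only on the VectorStore protocol, the in-memory adapter can be swapped for an OpenSearch-backed one without touching the business logic or the controller.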

Tech Stack

Installation

  1. Clone the repository:
git clone git@github.com:wizeline/clone-vector-search.git
  2. Create a Python virtual environment (recommended):
python3 -m venv env
source env/bin/activate
  3. Install dependencies:
pip install -r requirements.txt

Running the Service

  1. Set environment variables (if applicable) in the .env and .flaskenv files.
  2. Create the OpenSearch index. The application will create the needed mapping.
  3. To run this service locally, you need LocalStack to mock some AWS services.
    • Once LocalStack is installed and running, create a clone-ingestion-messages bucket: aws --endpoint-url=http://localhost:4566 s3 mb s3://clone-ingestion-messages
    • Add the required test files by running: aws --endpoint-url=http://localhost:4566 s3 cp /path/to/your/file/filename.json s3://clone-ingestion-messages/key/to/file.json
  4. Start the Flask server:
flask run
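
The LocalStack setup in step 3 could also be scripted from Python. The sketch below assumes boto3 is installed and LocalStack is listening on its default endpoint. The S3_ENDPOINT_URL variable name and both helper functions are hypothetical and are not defined by this repository.

```python
import os

# LocalStack endpoint and bucket name, as used in the CLI commands above.
DEFAULT_ENDPOINT = "http://localhost:4566"
BUCKET = "clone-ingestion-messages"


def s3_client_kwargs(env=None):
    """Build keyword arguments for boto3.client("s3", ...) aimed at LocalStack.

    S3_ENDPOINT_URL is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "endpoint_url": env.get("S3_ENDPOINT_URL", DEFAULT_ENDPOINT),
        "region_name": "us-east-1",
        # Dummy credentials, as is conventional when talking to LocalStack.
        "aws_access_key_id": "test",
        "aws_secret_access_key": "test",
    }


def seed_bucket(local_path, key, bucket=BUCKET):
    """Create the bucket and upload one test file (needs boto3 and LocalStack)."""
    import boto3  # imported lazily so s3_client_kwargs works without boto3

    s3 = boto3.client("s3", **s3_client_kwargs())
    s3.create_bucket(Bucket=bucket)
    s3.upload_file(local_path, bucket, key)
```

Calling seed_bucket("/path/to/your/file/filename.json", "key/to/file.json") would mirror the two aws commands from step 3.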

OpenSearch index

An OpenSearch index is required to run this service. You can create the index with the following mapping:

// PUT /clone-vector-index 
{
    "aliases": {},
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "embedding": {
                "type": "knn_vector",
                "dimension": 384
            },
            "metadata": {
                "properties": {
                    "_node_content": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "_node_type": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "doc_id": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "document_id": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "file_uuid": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "processed_user": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "raw_text": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "ref_doc_id": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "source_name": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "twin_id": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "user_name": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    }
                }
            }
        }
    },
    "settings": {
        "index": {
            "replication": {
                "type": "DOCUMENT"
            },
            "number_of_shards": "1",
            "number_of_replicas": "1"
        }
    }
}
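
Every text field in the mapping above repeats the same text-plus-keyword definition, so the request body can also be generated rather than written by hand. The following is a minimal sketch that rebuilds the same body. The helper names text_with_keyword and build_index_body are hypothetical and not part of this repository.

```python
import json

# Metadata field names, copied from the mapping above.
METADATA_FIELDS = [
    "_node_content", "_node_type", "doc_id", "document_id", "file_uuid",
    "processed_user", "raw_text", "ref_doc_id", "source_name", "twin_id",
    "user_name",
]


def text_with_keyword():
    """A text field with a keyword sub-field, as repeated in the mapping."""
    return {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    }


def build_index_body(dimension=384):
    """Return the index creation body with a knn_vector of the given dimension."""
    return {
        "aliases": {},
        "mappings": {
            "properties": {
                "content": text_with_keyword(),
                "embedding": {"type": "knn_vector", "dimension": dimension},
                "metadata": {
                    "properties": {
                        name: text_with_keyword() for name in METADATA_FIELDS
                    }
                },
            }
        },
        "settings": {
            "index": {
                "replication": {"type": "DOCUMENT"},
                "number_of_shards": "1",
                "number_of_replicas": "1",
            }
        },
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into a PUT /clone-vector-index request.
    print(json.dumps(build_index_body(), indent=4))
```

If the opensearch-py client is available, the same dictionary could be passed as the body argument of client.indices.create(index="clone-vector-index", body=...). The embedding dimension must match the output size of whichever embedding model the service uses.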

Building the Docker Image

docker compose up --build

Code Contribution

Ensure you adhere to the following conventions when working with code in the Clone Vector Search project:

  • Relate every commit to a ticket: If a commit does not reference a ticket itself, the branch name must contain the related ticket.
  • Work on one feature for each PR: Do not crowd unrelated features in one PR.
  • Every line of code in your commits must be production-ready: Do not create incomplete, work-in-progress commits.
  • Ensure the branching strategy is simple:
    • Create a feature branch and then merge it into the main branch.
    • Do not create extra branches besides the feature or fix branches that merge into the main branch.
    • Remove any feature or fix branches after you merge the changes.
