chrislevn/vector_database_study

Vector Database from scratch study

This project studies different index methods used in vector databases and shows how to build a simple vector database from scratch. Currently supported index methods: FLAT, IVF, HNSW, PQ, SQ.

Data is currently loaded from the SICK2014 dataset. You can change the data source in helper.py.

Blog post

Features:

  • ⭐ Supports FLAT, IVF, HNSW, PQ, SQ, and DiskANN (WIP) indexes.
  • ⭐ A vector database built from scratch, with a cache and an embedding model from Transformers.
  • ⭐ Includes unit tests with time and memory breakdowns.
  • ⭐ Supports a simple test UI built with Streamlit.
  • ⭐ An easy-to-get-started repo for studying vector databases and indexes.

Takeaways:

  • FLAT is accurate but slow, since it compares the query against every stored vector.
  • HNSW is the fastest, but it is less accurate than IVF.
  • I don't use wrappers like Langchain. I want to study the concepts, not build a production-level tool. The upside is full control of the code.
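
To make the FLAT versus IVF tradeoff above concrete, here is a minimal pure-Python sketch. It is not the code in index.py: the function names, the random data, and the shortcut of using the first four vectors as centroids (instead of training them with k-means) are all assumptions made for illustration.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def flat_search(vectors, query, k=3):
    """Exact search: score every stored vector against the query."""
    scored = sorted(((cosine(v, query), i) for i, v in enumerate(vectors)), reverse=True)
    return [i for _, i in scored[:k]]


def build_ivf(vectors, centroids):
    """Put each vector id into the inverted list of its most similar centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        best = max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
        lists[best].append(i)
    return lists


def ivf_search(vectors, centroids, lists, query, k=3, nprobe=1):
    """Approximate search: scan only the nprobe most similar inverted lists."""
    order = sorted(range(len(centroids)),
                   key=lambda c: cosine(query, centroids[c]), reverse=True)
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    scored = sorted(((cosine(vectors[i], query), i) for i in candidates), reverse=True)
    return [i for _, i in scored[:k]]


random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(200)]
centroids = vectors[:4]  # assumption: a real IVF index would train these with k-means
lists = build_ivf(vectors, centroids)

query = vectors[42]
exact = flat_search(vectors, query)
approx = ivf_search(vectors, centroids, lists, query, nprobe=1)
approx_all = ivf_search(vectors, centroids, lists, query, nprobe=len(centroids))
```

With nprobe=1 the IVF search scores only one inverted list, so it does less work than FLAT but can miss neighbours stored in other lists. Probing every list makes it return the same results as FLAT.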

Current issues:

  • Loading time is long. I handled this with a cache in helper.py. In return, accuracy might be affected; I still need to figure out why.
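
As a rough illustration of the caching idea, the sketch below stores embeddings on disk under a key derived from the model name and the input texts. It is not the cache in helper.py: `cache_key`, `load_or_compute`, and `fake_embed` are hypothetical names, and `fake_embed` stands in for a real embedding model. Keying on both the model and the data is one way to rule out stale entries as a source of accuracy changes, though that may not be the cause here.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cache_key(model_name, texts):
    """Hash the model name and every text so different inputs get different files."""
    h = hashlib.sha256()
    h.update(model_name.encode("utf-8"))
    for t in texts:
        h.update(b"\x00")  # separator so ["ab", "c"] and ["a", "bc"] differ
        h.update(t.encode("utf-8"))
    return h.hexdigest()


def load_or_compute(cache_dir, model_name, texts, embed_fn):
    """Return cached embeddings if present, otherwise compute and store them."""
    path = Path(cache_dir) / f"{cache_key(model_name, texts)}.pkl"
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    embeddings = embed_fn(texts)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(embeddings, f)
    return embeddings


calls = []


def fake_embed(texts):
    """Hypothetical stand-in for a real model: two toy features per sentence."""
    calls.append(len(texts))
    return [[float(len(t)), float(t.count(" "))] for t in texts]


cache_dir = tempfile.mkdtemp()
sentences = ["a man is playing a guitar", "a dog runs"]
first = load_or_compute(cache_dir, "demo-model", sentences, fake_embed)
second = load_or_compute(cache_dir, "demo-model", sentences, fake_embed)   # cache hit
third = load_or_compute(cache_dir, "other-model", sentences, fake_embed)   # new key
```

The second call reads from disk and never invokes the embedding function, while changing the model name forces a recompute.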

Components

  • index.py: contains the implementation of the vector database indexes.
  • main.py: contains the implementation of the vector database.
  • app.py: contains the implementation of the UI.
  • helper.py: contains helper functions, such as preparing the data and checking the time elapsed.
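
A time and memory helper of the kind mentioned for helper.py could look like the sketch below. This is an assumption about its shape, not the repository's actual code: `measure` is a hypothetical name, and it relies only on the standard library's `time.perf_counter` and `tracemalloc`.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measure(label, results):
    """Record wall-clock time and peak traced memory for the wrapped block.

    Not meant to be nested: tracemalloc.stop() would end the outer trace too.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results[label] = {"seconds": elapsed, "peak_bytes": peak}


results = {}
with measure("build_list", results):
    data = [float(i) for i in range(100_000)]

print(results["build_list"])
```

Wrapping index construction and search in separate `measure` blocks would give a per-step breakdown similar to the one the unit tests report.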

How to use

  1. Install the dependencies.
pip install -r requirements.txt
  2. Run the Streamlit UI.
make run
  3. Optional: run the tests. They also include some time and memory measurements.
make test

Future development

  • Find ways to make the vector database faster and less memory-hungry. There are tradeoffs, and the right balance depends on the use case, so this is a work in progress.
  • Implement this in a low-level language like C++ or Rust.
  • Finish DiskANN implementation.