This project studies different index methods used in vector databases and shows how to build a simple vector database from scratch. Currently supported index methods: FLAT, IVF, HNSW, PQ, SQ.
Currently, data is loaded from the SICK2014 dataset. You can change the data source in `helper.py`.
- ⭐ Supports FLAT, IVF, HNSW, PQ, SQ, and DiskANN (WIP) indexes.
- ⭐ Vector database built from scratch, with a cache and an embedding model from Transformers.
- ⭐ Includes unit tests with time and memory breakdowns.
- ⭐ Supports simple UI testing with Streamlit.
- ⭐ Easy-to-get-started repo for studying vector databases and indexes.
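
As a point of reference for the index types listed above, a FLAT index is simply an exhaustive scan over every stored vector. The sketch below is a minimal pure-Python illustration, not the code in `index.py`; the class name `FlatIndex`, its methods, and the choice of cosine similarity are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class FlatIndex:
    """Hypothetical exhaustive-search index: stores every vector, scans all of them."""

    def __init__(self) -> None:
        self.vectors: List[List[float]] = []

    def add(self, vector: Sequence[float]) -> int:
        """Store a vector and return its integer id."""
        self.vectors.append(list(vector))
        return len(self.vectors) - 1

    def search(self, query: Sequence[float], k: int = 3) -> List[Tuple[int, float]]:
        """Score the query against every stored vector and return the top k."""
        scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(self.vectors)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


# Usage: three toy 2-D vectors, then a query close to the first and third.
index = FlatIndex()
for vec in ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]):
    index.add(vec)
results = index.search([1.0, 0.0], k=2)
```

Every query touches every vector, so the result is exact but the cost grows linearly with the size of the collection. The other index types trade some of that exactness for speed or memory.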
- FLAT is accurate (it performs an exhaustive search) but slow.
- HNSW is the fastest, but its accuracy is not as good as IVF's.
- I don't use wrappers like LangChain. The goal is to study the concepts, not to build a production-level tool. The upside is full control over the code.
- Loading time is long. I handle this with a cache in `helper.py`. In return, accuracy might be affected; I still need to figure out why.
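
To make the speed/accuracy trade-off in the notes above concrete, here is a minimal IVF sketch in pure Python. It is not the implementation in `index.py`: the class name `IVFIndex`, the parameters `n_lists`, `n_probe`, `n_iters`, and `seed`, and the plain k-means clustering are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids: List[List[float]], vector: Sequence[float]) -> int:
    """Index of the centroid closest to the given vector."""
    return min(range(len(centroids)), key=lambda c: squared_l2(centroids[c], vector))


class IVFIndex:
    """Hypothetical inverted-file index: cluster vectors, search only nearby clusters."""

    def __init__(self, n_lists: int, n_iters: int = 10, seed: int = 0) -> None:
        self.n_lists = n_lists
        self.n_iters = n_iters
        self.rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
        self.vectors: List[List[float]] = []
        self.centroids: List[List[float]] = []
        self.lists: Dict[int, List[int]] = {}

    def _assign(self) -> Dict[int, List[int]]:
        """Group vector ids by their nearest centroid."""
        groups: Dict[int, List[int]] = {c: [] for c in range(self.n_lists)}
        for i, v in enumerate(self.vectors):
            groups[nearest(self.centroids, v)].append(i)
        return groups

    def build(self, vectors: Sequence[Sequence[float]]) -> None:
        """Run a few k-means iterations, then store one posting list per centroid."""
        self.vectors = [list(v) for v in vectors]
        dim = len(self.vectors[0])
        self.centroids = [list(v) for v in self.rng.sample(self.vectors, self.n_lists)]
        for _ in range(self.n_iters):
            for c, members in self._assign().items():
                if members:  # an empty cluster keeps its previous centroid
                    self.centroids[c] = [
                        sum(self.vectors[i][d] for i in members) / len(members)
                        for d in range(dim)
                    ]
        self.lists = self._assign()

    def search(self, query: Sequence[float], k: int = 1, n_probe: int = 1) -> List[int]:
        """Scan only the n_probe posting lists whose centroids are closest to the query."""
        order = sorted(range(self.n_lists), key=lambda c: squared_l2(self.centroids[c], query))
        candidates = [i for c in order[:n_probe] for i in self.lists[c]]
        candidates.sort(key=lambda i: squared_l2(self.vectors[i], query))
        return candidates[:k]


# Usage: random 2-D points; compare a narrow probe, a full probe, and exact search.
data_rng = random.Random(42)
data = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]
ivf = IVFIndex(n_lists=8)
ivf.build(data)
query = [0.1, -0.2]
exact = sorted(range(len(data)), key=lambda i: squared_l2(data[i], query))[:5]
approx = ivf.search(query, k=5, n_probe=1)
full = ivf.search(query, k=5, n_probe=8)
```

With `n_probe` equal to `n_lists`, every posting list is scanned and the search degenerates to FLAT. Smaller values of `n_probe` scan fewer vectors, which is faster but can miss true neighbours that fall in an unprobed cluster.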
- `index.py`: contains the implementation of the vector database indexes.
- `main.py`: contains the implementation of the vector database.
- `app.py`: contains the implementation of the UI.
- `helper.py`: contains helper functions such as preparing the data and measuring elapsed time.
- Install the dependencies: `pip install -r requirements.txt`
- Run the UI (this starts the `streamlit` UI): `make run`
- Optional: run the tests, which also include some time and memory measurements: `make test`
- Find ways to make the vector database faster and less memory-hungry. There are trade-offs that depend on the use case, so this is a work in progress.
- Implement this in a lower-level language like C++ or Rust.
- Finish the DiskANN implementation.