Vector Database from scratch study

This project studies different index methods used in vector databases and shows how to build a simple vector database from scratch. Currently supported index methods: FLAT, IVF, HNSW, PQ, SQ.

Data is currently loaded from the SICK2014 dataset. You can change the data source in helper.py.

Blog post

Features:

⭐ Supports FLAT, IVF, HNSW, PQ, SQ, and DiskANN (WIP) indexes.
⭐ Vector database built from scratch, with a cache and an embedding model from Transformers.
⭐ Unit tests with time and memory breakdowns.
⭐ Simple UI for testing, built with Streamlit.
⭐ Easy-to-get-started repo for studying vector databases and indexes.
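
To give a feel for what one of these indexes does, here is a minimal sketch of scalar quantization (SQ) with NumPy. It is not the code from index.py; the function names `sq_train`, `sq_encode`, and `sq_decode` are made up for illustration, and it assumes a simple per-dimension 8-bit min/max scheme.

```python
import numpy as np

def sq_train(vectors):
    """Learn a per-dimension offset and step size for 8-bit quantization."""
    lo = vectors.min(axis=0)
    hi = vectors.max(axis=0)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0  # constant dimensions: avoid dividing by zero
    return lo, scale

def sq_encode(vectors, lo, scale):
    """Map each float to the nearest of 256 levels and store it as one byte."""
    codes = np.round((vectors - lo) / scale)
    return np.clip(codes, 0, 255).astype(np.uint8)

def sq_decode(codes, lo, scale):
    """Reconstruct approximate float vectors from the byte codes."""
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 64)).astype(np.float32)

lo, scale = sq_train(data)
codes = sq_encode(data, lo, scale)
approx = sq_decode(codes, lo, scale)

# float32 -> uint8 stores each value in 1 byte instead of 4
print("bytes before:", data.nbytes, "bytes after:", codes.nbytes)
max_error = float(np.abs(data - approx).max())
print("max reconstruction error:", max_error)
```

The trade is memory for precision: each value takes a quarter of the space, and the reconstruction error is bounded by about half a quantization step per dimension.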

Takeaways:

FLAT gives good results but is slow, because it compares the query against every stored vector.
HNSW is the fastest, but its accuracy is not as good as IVF's.
I don't use wrappers like LangChain. I want to study the concepts, not build a production-level tool. The upside is full control of the code.
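
The speed versus accuracy trade-off can be sketched with a tiny FLAT and IVF comparison. This is an illustrative sketch, not the implementation in index.py: the function names, the few rounds of plain k-means, and the `n_lists` and `n_probe` parameters are assumptions chosen for the example. HNSW is left out to keep it short.

```python
import numpy as np

def flat_search(data, query, k):
    """Exact search: measure the distance to every stored vector."""
    dists = np.linalg.norm(data - query, axis=1)
    return np.argsort(dists)[:k]

def assign_to_centroids(data, centroids):
    """Return the index of the nearest centroid for each vector."""
    diffs = data[:, None, :] - centroids[None, :, :]
    return np.argmin(np.linalg.norm(diffs, axis=2), axis=1)

def ivf_build(data, n_lists, n_iters=10, seed=0):
    """Cluster the vectors with a few k-means rounds and build inverted lists."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), n_lists, replace=False)].copy()
    for _ in range(n_iters):
        assign = assign_to_centroids(data, centroids)
        for c in range(n_lists):
            members = data[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    assign = assign_to_centroids(data, centroids)
    lists = [np.where(assign == c)[0] for c in range(n_lists)]
    return centroids, lists

def ivf_search(data, centroids, lists, query, k, n_probe):
    """Approximate search: only scan the n_probe lists closest to the query."""
    nearest = np.argsort(np.linalg.norm(centroids - query, axis=1))[:n_probe]
    candidates = np.concatenate([lists[c] for c in nearest])
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 32)).astype(np.float32)
query = rng.normal(size=32).astype(np.float32)

centroids, lists = ivf_build(data, n_lists=16)
exact = flat_search(data, query, k=5)
approx = ivf_search(data, centroids, lists, query, k=5, n_probe=4)

# Recall: fraction of the exact top-k that the approximate search also found
recall = len(set(exact.tolist()) & set(approx.tolist())) / 5
print("recall@5 with n_probe=4:", recall)
```

Raising `n_probe` scans more lists, which moves IVF closer to FLAT in both accuracy and cost; probing every list makes the two return the same neighbors.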

Current issues:

Loading time is long. I handled this with a cache in helper.py. In return, accuracy might be affected; I still need to figure out why.
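
One thing worth ruling out with any embedding cache is a stale entry being reused after the data or model changes. The sketch below is not the cache in helper.py and is not a diagnosis of the accuracy issue. It only shows one way to key a cache on the input texts and model name, with hypothetical names (`cache_key`, `load_or_compute`, `embed_fn`) and a fake embedding function standing in for the real model.

```python
import hashlib
import os
import pickle
import tempfile

def cache_key(texts, model_name):
    """Hash the model name and every text, so any change yields a new key."""
    h = hashlib.sha256(model_name.encode("utf-8"))
    for t in texts:
        h.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
        h.update(t.encode("utf-8"))
    return h.hexdigest()

def load_or_compute(texts, model_name, embed_fn, cache_dir):
    """Return cached embeddings if this exact input was seen, else compute them."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, cache_key(texts, model_name) + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    embeddings = embed_fn(texts)
    with open(path, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings

# Fake embedding function (assumption for the demo): records each call
calls = []
def fake_embed(texts):
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]

cache_dir = tempfile.mkdtemp()
first = load_or_compute(["a", "bb"], "demo-model", fake_embed, cache_dir)
second = load_or_compute(["a", "bb"], "demo-model", fake_embed, cache_dir)   # cache hit
third = load_or_compute(["a", "ccc"], "demo-model", fake_embed, cache_dir)   # new data, recomputed
print("embed calls:", len(calls))
```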

Components

index.py: contains the implementation of the vector database indexes.
main.py: contains the implementation of the vector database.
app.py: contains the implementation of the UI.
helper.py: contains helper functions, such as preparing data and checking elapsed time.
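
As a rough idea of what an elapsed-time helper can look like, here is a small decorator built on `time.perf_counter`. The name `timed` and the `last_elapsed` attribute are assumptions for this sketch; the actual helper in helper.py may be written differently.

```python
import functools
import time

def timed(func):
    """Print how long each call takes and keep the last duration on the wrapper."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {wrapper.last_elapsed:.4f}s")
        return result
    wrapper.last_elapsed = None
    return wrapper

@timed
def build_squares(n):
    # Stand-in for a slow step such as building an index
    return [i * i for i in range(n)]

squares = build_squares(100_000)
```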

How to use

Install the dependencies

pip install -r requirements.txt

Run the UI. This starts the Streamlit app.

make run

Optional: Run the tests. The tests also include some time and memory testing.

make test
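
For readers curious how a memory check can sit inside a unit test, here is a minimal sketch using the standard library's `tracemalloc` and `unittest`. It is not taken from unit_test.py; the helper `measure_peak_memory` and the test class are hypothetical, and the lists stand in for real index builds.

```python
import tracemalloc
import unittest

def measure_peak_memory(func, *args, **kwargs):
    """Run func and return (result, peak bytes allocated by Python during the call)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

class MemoryBreakdownTest(unittest.TestCase):
    def test_peak_grows_with_size(self):
        # A larger structure should show a larger traced peak
        _, small = measure_peak_memory(lambda: list(range(1_000)))
        _, large = measure_peak_memory(lambda: list(range(100_000)))
        self.assertGreater(large, small)
```

Note that `tracemalloc` only sees allocations made through Python's allocators, so memory held by native libraries may not be fully reflected.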

Future development

Figure out ways to make VectorDB faster and less memory-consuming (there are tradeoffs, and this is a work in progress that depends on the use case).
Implement this in a low-level language like C++ or Rust!
Finish DiskANN implementation.
