This is an open-source modular RAG (Retrieval-Augmented Generation) system developed by PureAI, to which I contributed during my internship. The work involved integrating C++ and Python, deploying to PyPI, and optimizing for high-performance vector processing and querying, together with careful memory-control strategies to maximize performance, and it culminated in a functional and scalable vector database engine. In addition to its modular architecture, the project features orchestration and optimization through build-and-deploy pipeline scripts, reducing processes that originally took ~3 hours down to about 30 minutes.
Note
As this project is extensive and contains many modular components, this documentation will initially focus on explaining the parts I developed in the ChunkDefault and ChunkQuery modules, as well as their integration with the VectorDatabase.
Later sections will extend the documentation to cover:
- The C++ <-> Python bindings (🔗 Purecpp Refactor: Scratch Pad);
- The modular CMake architecture (🔗 Purecpp Refactor: Scratch Pad);
- 🔗 Automated PIP build and legacy deployment system.
- Implemented nearly all C++ <-> Python bindings, enabling seamless integration between the C++ core and Python.
- Implemented and optimized the build-and-deploy pipeline, reducing process time from ~3 hours to just 30 minutes.
- Created a Docker environment for isolated, reproducible development and usage.
- Restructured the repository for better modularity, scalability, and maintainability.
- Designed and implemented the core vector database engine, ensuring scalable, functional vector storage and retrieval.
- Developed cosine similarity query implementations, integrated with FAISS for optimized similarity search (see the sketch after this list).
- Developed the backbone architecture of the core query system, focusing on efficient construction and high-performance execution.
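FAISS does not expose cosine similarity as a dedicated metric, so the usual approach is to L2-normalize the embeddings and run an inner-product search, since the inner product of unit vectors equals their cosine. The snippet below is a minimal, self-contained sketch of that idea using the public FAISS C++ API (IndexFlatIP, fvec_renorm_L2) on random toy data; it is illustrative only and is not the actual purecpp query code.

```cpp
// Minimal sketch: cosine similarity via FAISS inner-product search.
// Toy data and sizes; not the purecpp implementation.
#include <faiss/IndexFlat.h>          // faiss::IndexFlatIP
#include <faiss/utils/distances.h>    // faiss::fvec_renorm_L2
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
    const int d = 8;    // embedding dimension (toy value)
    const int n = 100;  // number of stored vectors
    const int k = 5;    // neighbors to retrieve

    // Fake embeddings; in a real system these come from an embedding model.
    std::vector<float> base(n * d), query(d);
    for (auto& v : base)  v = float(std::rand()) / RAND_MAX;
    for (auto& v : query) v = float(std::rand()) / RAND_MAX;

    // L2-normalize so that inner product == cosine similarity.
    faiss::fvec_renorm_L2(d, n, base.data());
    faiss::fvec_renorm_L2(d, 1, query.data());

    faiss::IndexFlatIP index(d);   // exact inner-product index
    index.add(n, base.data());     // store the normalized vectors

    std::vector<float> scores(k);
    std::vector<faiss::idx_t> ids(k);
    index.search(1, query.data(), k, scores.data(), ids.data());

    for (int i = 0; i < k; ++i)
        std::printf("rank %d: id=%lld cosine=%.4f\n",
                    i, (long long)ids[i], scores[i]);
    return 0;
}
```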
Preview: 🔗 Purecpp Refactor: Scratch Pad, currently under testing before its official release within the PURE Ecosystem.
- All PIP_* folders: Scripts and configurations for Python packaging and deployment.
- All automation scripts (*.sh): Build, deploy, and environment orchestration scripts.
- Core modules:
  - VectorDatabase
  - ChunkDefault
  - ChunkQuery
- Documentation: All documentation related to my files/directories.
- Bindings: All files related to C++/Python bindings integration.
- CMake: All files and configurations with uppercase CMake in their name, including modular build configurations.
- CMake — Modular build system configuration.
- Conan — C++ package and dependency management.
- Torch / PyTorch — Machine learning and tensor operations.
- C++ — Core implementation language.
- Rust — Auxiliary or experimental components.
- Docker — Containerization for reproducible environments.
- ManyLinux — Multi-Python-version builds and compatibility for pip wheels.
- Ninja — Fast build system used with CMake for compilation.
- PyBind11 — C++ <-> Python bindings (see the binding sketch after this list).
- FAISS — Vector similarity search (used in cosine similarity, L2, inner product queries).
- RE2 — High-performance regular expressions.
- OpenAI API (EmbeddingOpenAI) — For generating embeddings via models like text-embedding-ada-002.
- OpenMP — Parallelization for multi-threaded C++ code.
- OpenBLAS — Optimized linear algebra backend for matrix operations.
- ONNX — Open Neural Network Exchange for model interoperability.
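To make the PyBind11 entry above a bit more concrete, here is a minimal, hypothetical binding sketch: a tiny chunk-like struct and a cosine helper exposed to Python. The module and symbol names (toy_bindings, ChunkToy, cosine) are invented for illustration and are not the real purecpp bindings.

```cpp
// Hypothetical PyBind11 sketch: exposing a small C++ type and function to Python.
// Names are illustrative only; this is not the real purecpp module.
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>   // automatic std::vector <-> Python list conversion
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

namespace py = pybind11;

struct ChunkToy {
    std::string content;
    std::vector<float> embedding;
};

// Plain cosine similarity between two equally sized embeddings.
static float cosine(const std::vector<float>& a, const std::vector<float>& b) {
    float dot = 0.f, na = 0.f, nb = 0.f;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12f);
}

PYBIND11_MODULE(toy_bindings, m) {
    m.doc() = "Illustrative C++ <-> Python binding, not the real purecpp module";

    py::class_<ChunkToy>(m, "ChunkToy")
        .def(py::init<>())
        .def_readwrite("content", &ChunkToy::content)
        .def_readwrite("embedding", &ChunkToy::embedding);

    m.def("cosine", &cosine, py::arg("a"), py::arg("b"),
          "Cosine similarity between two float vectors");
}
```

Once compiled, Python code can simply import toy_bindings and call toy_bindings.cosine(...); packaging such compiled modules into pip wheels across Python versions is the role of the CMake, ManyLinux, and PIP_* pieces listed above.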
I engineered and integrated the fundamental data structures behind the engine powering the Vector Database, giving it a scalable, high-performance core.
The work focused on:
ChunkDefault
ChunkQuery
VectorDatabase
I have created a Jupyter Notebook that explains the high-level functionality of these components that I implemented.
This notebook demonstrates how these components work together and highlights the key aspects of the three main libraries involved.
Alternatively, click here to access the notebook directory.
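As a rough mental model of how the three components relate (only a sketch: the real ChunkDefault, ChunkQuery, and VectorDatabase interfaces are richer and differ in naming), the code below shows a chunk record carrying an embedding, a query object, and a store that ranks chunks by inner product over normalized embeddings.

```cpp
// Conceptual sketch only: how chunking, querying, and the vector store fit together.
// The real ChunkDefault / ChunkQuery / VectorDatabase interfaces are different.
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

struct Chunk {                      // ChunkDefault-like record
    std::string text;
    std::vector<float> embedding;   // assumed L2-normalized, same dimension everywhere
};

struct Query {                      // ChunkQuery-like request
    std::vector<float> embedding;   // assumed L2-normalized
    std::size_t top_k = 3;
};

class ToyVectorStore {              // VectorDatabase-like container
public:
    void add(Chunk c) { chunks_.push_back(std::move(c)); }

    // Rank stored chunks by dot product (== cosine for normalized embeddings).
    std::vector<std::pair<float, const Chunk*>> search(const Query& q) const {
        std::vector<std::pair<float, const Chunk*>> scored;
        for (const Chunk& c : chunks_) {
            float score = std::inner_product(
                q.embedding.begin(), q.embedding.end(), c.embedding.begin(), 0.0f);
            scored.emplace_back(score, &c);
        }
        const std::size_t k = std::min(q.top_k, scored.size());
        std::partial_sort(scored.begin(), scored.begin() + k, scored.end(),
                          [](const auto& a, const auto& b) { return a.first > b.first; });
        scored.resize(k);
        return scored;
    }

private:
    std::vector<Chunk> chunks_;
};
```

In the actual engine, similarity search is handled by FAISS-backed indexes (cosine, L2, inner product), as noted in the technology list above; this sketch only illustrates the data flow between the three components.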
NOTE:
Initially, I created these structures and functions in Chunk/ChunkCommons/ChunkCommons.h.
I will provide further details and explanations about this part later.
These contributions reflect technical solutions I personally implemented, outside the formal scope of my internship contract, using no internal company resources. No confidential or proprietary information is disclosed here; all content is based on public open-source work.
This repository contains code I developed during my time collaborating with PureAI. Officially, my role was focused on testing tasks; however, over time, I contributed far beyond what was initially requested, as PureAI is a startup.
I developed:
- the backbone architecture for the VDB engine,
- C++ <-> Python bindings,
- modular chunk and embedding systems,
- all PyPI integration,
- and automation scripts for build and deployment.
While formally an intern, I undertook responsibilities typically associated with intermediate-level engineering roles.
License & Visibility
The repository is public under the MIT license. Everything shared here reflects parts I personally worked on or played a significant role in developing. There is no confidential or proprietary information being revealed here:
- The code is already publicly available,
- My intent is to explain and clarify the parts I directly contributed to,
- I aim to provide technical transparency and highlight the complexities often hidden behind raw code.