Code Engine is a search platform for finding well-known coding problems by topic. It is designed to help problem solvers learn more effectively by surfacing relevant problems quickly.
It is a semantic search engine built with Flask and a custom-trained Word2Vec model. Users can run context-aware searches over a document corpus, and results are ranked by semantic similarity rather than simple keyword matching.
- Semantic Search: Uses Word2Vec embeddings for understanding context and synonyms.
- Custom Model: Trained on your own document corpus for domain-specific relevance.
- Flask Web Interface: User-friendly search form and API endpoint.
- Fast Results: Precomputed document vectors enable rapid similarity search.
- Easy Setup: Automated environment and dependency installation scripts.
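The similarity search described above can be sketched as follows. This is a minimal illustration, not the project's code. It assumes that a document or query vector is the mean of its word vectors, which is a common choice that this README does not confirm. The function names and the toy 2-dimensional embeddings are hypothetical, and the sketch uses only the standard library, whereas the project itself lists NumPy and scikit-learn for this step.

```python
import math


def average_vector(tokens, embeddings):
    """Average the embeddings of known tokens; return None if none are known."""
    known = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(query_tokens, embeddings, doc_vectors, top_k=3):
    """Return up to top_k (document index, score) pairs, best match first."""
    query_vec = average_vector(query_tokens, embeddings)
    if query_vec is None:
        return []  # no query word is in the vocabulary
    scores = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vectors)]
    scores.sort(reverse=True)
    return [(idx, score) for score, idx in scores[:top_k]]


# Toy embeddings invented for illustration; a trained model would supply these.
toy_embeddings = {
    "tree": [1.0, 0.0],
    "graph": [0.9, 0.1],
    "sort": [0.0, 1.0],
    "array": [0.1, 0.9],
}
toy_docs = [["tree", "graph"], ["sort", "array"]]

# Document vectors are computed once, ahead of any query.
doc_vectors = [average_vector(doc, toy_embeddings) for doc in toy_docs]
results = rank_documents(["graph"], toy_embeddings, doc_vectors)
```

Because the document vectors are computed ahead of time, each query costs one vector average plus one similarity computation per document.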
Key technologies used in this project:
- Python 3.11+: The primary programming language for all backend, data processing, and machine learning components.
- Flask: A lightweight Python web framework used to build the search engine's web interface and API endpoints.
- Gensim: Used to train and apply Word2Vec models, which provide semantic understanding and vectorization of text data.
- scikit-learn: Used for computing cosine similarity between query and document vectors, as well as for other machine learning utilities.
- NLTK (Natural Language Toolkit): Used for text preprocessing, including tokenization and stopword removal.
- NumPy: Provides efficient numerical operations and array handling for the vector computations.
- Conda: Environment and dependency management for reproducible scientific Python development.
- Jinja2 & WTForms (via Flask): Used for HTML templating and form handling in the web interface.
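The preprocessing step mentioned above (tokenization and stopword removal) might look roughly like the sketch below. It is a stand-in that uses a regular expression and a small hand-written stopword set so that it runs without downloading any NLTK data. The function name `preprocess` and the stopword set are assumptions. In the project itself, NLTK's `word_tokenize` and `stopwords.words("english")` would be the likely equivalents.

```python
import re

# Small stand-in stopword set (assumption); NLTK's English list is much longer.
STOPWORDS = {"a", "an", "the", "of", "in", "to", "and", "is", "for"}


def preprocess(text):
    """Lowercase the text, split it into alphanumeric tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


tokens = preprocess("Find the longest path in a binary tree")
```

The same function should be applied to both the corpus and incoming queries, so that query tokens match the vocabulary the model was trained on.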
AZ-Hackathon/
├── LC Data/
│ ├── final_data.txt
│ └── final_links.txt
├── app.py
├── word2vec_utils.py
├── setup_word2vec.py
├── document_vectors.pkl
├── word2vec_model.bin
├── requirements.txt
├── environment.yml
├── setup.sh
├── setup.bat
├── sample.html
└── README.md
This section provides a brief description of the key files and their roles in the Word2Vec-based Flask Search Engine project:
- app.py: The main Flask application file that initializes the web server, loads the Word2Vec model and document vectors, handles search queries, and renders the web interface.
- word2vec_utils.py: Contains utility functions for loading and preprocessing documents, training the Word2Vec model, computing document vectors, and saving/loading these vectors.
- setup_word2vec.py: A setup script to train the Word2Vec model on the document corpus and precompute document vectors. It automates the initial model training and vector preparation.
- document_vectors.pkl: A pickle file storing the precomputed document vectors for fast similarity calculations during search.
- word2vec_model.bin: The saved Word2Vec model file trained on the document corpus.
- LC Data/final_data.txt: The text corpus file containing one document per line, used for training the Word2Vec model.
- LC Data/final_links.txt: A file containing URLs or links corresponding to each document in final_data.txt, used to return search results.
- requirements.txt: Lists all Python package dependencies required to run the project.
- environment.yml: Conda environment specification file to create a reproducible environment with all necessary packages.
- setup.sh and setup.bat: Shell and batch scripts to automate environment setup and dependency installation on Unix-like and Windows systems respectively.
- sample.html: The HTML template used by Flask to render the search form and display search results.
- README.md: Project documentation including setup instructions, usage, and technical details.
These files collectively enable the training, deployment, and usage of the semantic search engine powered by Word2Vec embeddings.
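As an illustration of how precomputed vectors can be stored in and restored from a pickle file such as document_vectors.pkl, here is a minimal sketch. The helper names `save_vectors` and `load_vectors` are hypothetical, and the real layout of the data inside the project's pickle file is not documented here, so a plain list of lists is assumed.

```python
import os
import pickle
import tempfile


def save_vectors(vectors, path):
    """Write the vectors to disk so they need not be recomputed at startup."""
    with open(path, "wb") as fh:
        pickle.dump(vectors, fh)


def load_vectors(path):
    """Read back vectors previously written by save_vectors."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Round trip in a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "document_vectors.pkl")
    save_vectors([[0.1, 0.2], [0.3, 0.4]], path)
    restored = load_vectors(path)
```

Pickle files can execute arbitrary code when loaded, so only load a document_vectors.pkl that you generated yourself or obtained from a trusted source.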
- Clone the repository
git clone https://github.com/Sayak0504/DSA-Search-engine.git
- Change to the project directory
cd DSA-Search-engine
- Setup Environment
bash setup.sh
- Train the Word2Vec Model (if not already present)
conda activate az-hackathon
python setup_word2vec.py
- Run the application
python app.py
- Open a web browser and navigate to http://localhost:5000 to access the application
- Corpus: Place your documents (one per line) in LC Data/final_data.txt.
- Links: Ensure LC Data/final_links.txt contains a corresponding link for each document.
- Model Parameters: Adjust vector size, window, min_count, and epochs in setup_word2vec.py as needed.
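Since the corpus and the links file are matched line by line, a quick consistency check before training can catch a missing or extra line. The sketch below is one way to do that. The function name `pair_documents_with_links` and the example.com URLs are made up for illustration, and in-memory text stands in for the two files.

```python
import io


def pair_documents_with_links(data_file, links_file):
    """Pair line i of the corpus with line i of the links file.

    Raises ValueError if the two files have different numbers of lines.
    """
    docs = [line.rstrip("\n") for line in data_file]
    links = [line.rstrip("\n") for line in links_file]
    if len(docs) != len(links):
        raise ValueError(f"{len(docs)} documents but {len(links)} links")
    return list(zip(docs, links))


# In-memory stand-ins for LC Data/final_data.txt and LC Data/final_links.txt.
data = io.StringIO("two sum array hash map\nreverse linked list\n")
links = io.StringIO(
    "https://example.com/two-sum\nhttps://example.com/reverse-list\n"
)
pairs = pair_documents_with_links(data, links)
```

With real files, the same function can be called on two handles opened with `open(..., encoding="utf-8")`.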
Python 3.11+
Flask
Gensim
scikit-learn
NLTK
NumPy
(See requirements.txt and environment.yml for complete list)