Code Engine

About

Code Engine is a platform for searching well-known coding problems by topic. It is designed to help problem solvers learn more effectively by surfacing relevant problems quickly.

It is a semantic search engine built with Flask and a custom-trained Word2Vec model. Users can run context-aware searches over a document corpus, and results are ranked by semantic similarity rather than simple keyword matching.

Features

Semantic Search: Uses Word2Vec embeddings for understanding context and synonyms.
Custom Model: Trained on your own document corpus for domain-specific relevance.
Flask Web Interface: User-friendly search form and API endpoint.
Fast Results: Precomputed document vectors enable rapid similarity search.
Easy Setup: Automated environment and dependency installation scripts.
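
The combination of Word2Vec embeddings and precomputed document vectors can be illustrated with a minimal, dependency-free sketch. It assumes that a document vector is the average of its word vectors and that ranking uses cosine similarity. The tiny `EMBEDDINGS` table, the sample documents, and the function names are hypothetical stand-ins for the trained model and are not taken from the project code.

```python
import math

# Hypothetical 2-D embeddings standing in for a trained Word2Vec model.
EMBEDDINGS = {
    "tree": [1.0, 0.0],
    "graph": [0.9, 0.1],
    "sort": [0.0, 1.0],
    "array": [0.1, 0.9],
}

def document_vector(tokens):
    """Average the vectors of known tokens; return None if none are known."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return None
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_tokens, doc_vectors, top_k=3):
    """Return (index, score) pairs for the top_k most similar documents."""
    query = document_vector(query_tokens)
    if query is None:
        return []
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(doc_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Document vectors are computed once, ahead of any query.
DOCS = [["sort", "array"], ["tree", "graph"]]
DOC_VECTORS = [document_vector(doc) for doc in DOCS]
print(search(["graph"], DOC_VECTORS, top_k=1))
```

Because the document vectors are built ahead of time, each query only needs one vector average and one similarity pass over the stored vectors.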

Key technologies used in this project:

Python 3.11+: The primary programming language for all backend, data processing, and machine learning components.
Flask: A lightweight Python web framework used to build the search engine’s web interface and API endpoints.
Gensim: For training and utilizing Word2Vec models, enabling semantic understanding and vectorization of text data.
scikit-learn: Used for computing cosine similarity between query and document vectors, as well as for other machine learning utilities.
NLTK (Natural Language Toolkit): For advanced text preprocessing, including tokenization and stopword removal.
NumPy: Provides efficient numerical operations and array handling, crucial for vector computations.
Conda: Environment and dependency management to ensure reproducibility and smooth scientific Python development.
Jinja2 & WTForms (via Flask): For HTML templating and form handling in the web interface.
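
The preprocessing role described for NLTK (tokenization and stopword removal) can be sketched without external dependencies. The regular expression and the short `STOPWORDS` set below are assumptions chosen for illustration; the project itself presumably relies on NLTK's tokenizer and stopword list instead.

```python
import re

# Small illustrative stopword set; NLTK's English list is much longer.
STOPWORDS = {"a", "an", "the", "of", "in", "to", "and", "is", "for"}

def preprocess(text):
    """Lowercase the text, split it into alphanumeric tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]

print(preprocess("Find the longest path in a Binary Tree"))
```

The same function would be applied to both corpus documents and user queries so that they are tokenized consistently.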

Directory Structure

AZ-Hackathon/
├── LC Data/
│   ├── final_data.txt
│   └── final_links.txt
├── app.py
├── word2vec_utils.py
├── setup_word2vec.py
├── document_vectors.pkl
├── word2vec_model.bin
├── requirements.txt
├── environment.yml
├── setup.sh
├── setup.bat
├── sample.html
└── README.md

Files

This section provides a brief description of the key files and their roles in the Word2Vec-based Flask Search Engine project:

app.py: The main Flask application file that initializes the web server, loads the Word2Vec model and document vectors, handles search queries, and renders the web interface.
word2vec_utils.py: Contains utility functions for loading and preprocessing documents, training the Word2Vec model, computing document vectors, and saving/loading these vectors.
setup_word2vec.py: A setup script to train the Word2Vec model on the document corpus and precompute document vectors. It automates the initial model training and vector preparation.
document_vectors.pkl: A pickle file storing the precomputed document vectors for fast similarity calculations during search.
word2vec_model.bin: The saved Word2Vec model file trained on the document corpus.
LC Data/final_data.txt: The text corpus file containing one document per line, used for training the Word2Vec model.
LC Data/final_links.txt: A file containing URLs or links corresponding to each document in final_data.txt, used to return search results.
requirements.txt: Lists all Python package dependencies required to run the project.
environment.yml: Conda environment specification file to create a reproducible environment with all necessary packages.
setup.sh and setup.bat: Shell and batch scripts to automate environment setup and dependency installation on Unix-like and Windows systems respectively.
sample.html: The HTML template used by Flask to render the search form and display search results.
README.md: Project documentation including setup instructions, usage, and technical details.

These files collectively enable the training, deployment, and usage of the semantic search engine powered by Word2Vec embeddings.
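
Since `document_vectors.pkl` is described as a pickle file, storing and reloading the vectors could look roughly like the sketch below. The function names and the list-of-lists layout are assumptions; the actual helpers in `word2vec_utils.py` may differ.

```python
import os
import pickle
import tempfile

def save_document_vectors(vectors, path):
    """Serialize the precomputed vectors to disk."""
    with open(path, "wb") as handle:
        pickle.dump(vectors, handle)

def load_document_vectors(path):
    """Load vectors previously written by save_document_vectors."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

# Round-trip demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "document_vectors.pkl")
    save_document_vectors([[0.1, 0.2], [0.3, 0.4]], path)
    restored = load_document_vectors(path)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.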

Installation

Clone the repository
```
git clone https://github.com/Sayak0504/DSA-Search-engine.git
```
Change to the project directory
```
cd DSA-Search-engine
```
Set up the environment
```
bash setup.sh
```
Train the Word2Vec model (if not already present)
```
conda activate az-hackathon
python setup_word2vec.py
```
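
As a rough idea of what a training script such as `setup_word2vec.py` might do, the sketch below gathers the parameters named under Customization (vector size, window, min_count, epochs) and passes them to Gensim's `Word2Vec`. The parameter values, the `train_model` helper, and the use of `model.save` are assumptions, not the project's settings. Gensim is imported only when the helper is called.

```python
# Illustrative values only; the project's actual settings live in setup_word2vec.py.
WORD2VEC_PARAMS = {
    "vector_size": 100,  # dimensionality of each word vector
    "window": 5,         # context words considered on each side
    "min_count": 1,      # ignore words rarer than this
    "epochs": 10,        # passes over the corpus
}

def train_model(tokenized_docs, output_path="word2vec_model.bin"):
    """Train a Word2Vec model on lists of tokens and save it (hypothetical helper)."""
    from gensim.models import Word2Vec  # keyword names below follow Gensim 4.x

    model = Word2Vec(sentences=tokenized_docs, **WORD2VEC_PARAMS)
    model.save(output_path)
    return model
```

Here `tokenized_docs` is expected to be a list of token lists, one per line of the corpus.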

Usage

Run the application
```
python app.py
```
Open a web browser and navigate to http://localhost:5000 to access the application.

Customization

Corpus: Place your documents (one per line) in LC Data/final_data.txt.
Links: Ensure LC Data/final_links.txt contains a corresponding link for each document.
Model Parameters: Adjust vector size, window, min_count, and epochs in setup_word2vec.py as needed.
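
Because each line of the corpus must have a matching line in the links file, a quick consistency check before retraining can catch misaligned data. The sketch below is one way to do this; `load_corpus` is a hypothetical helper and is not part of the project code.

```python
from pathlib import Path

def load_corpus(data_path, links_path):
    """Read documents and links line by line and pair them up."""
    documents = Path(data_path).read_text(encoding="utf-8").splitlines()
    links = Path(links_path).read_text(encoding="utf-8").splitlines()
    if len(documents) != len(links):
        raise ValueError(
            f"{len(documents)} documents but {len(links)} links; counts must match"
        )
    return list(zip(documents, links))
```

For this project it would be called as `load_corpus("LC Data/final_data.txt", "LC Data/final_links.txt")`.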

Dependencies

Python 3.11+
Flask
Gensim
scikit-learn
NLTK
NumPy

(See requirements.txt and environment.yml for the complete list.)
