Similarity Vector Embedding

Overview

The Similarity Vector Embedding project utilizes Natural Language Processing (NLP) and vector databases to efficiently identify and recommend similar movies based on their descriptions and metadata. By leveraging PostgreSQL with the pgvector extension and advanced NLP models like BERT and Sentence Transformers, this project offers a scalable solution for performing similarity searches within large movie datasets. This system is ideal for enhancing recommendation engines, improving content discovery, and organizing extensive media collections.

Figure 1: Gradio app example.

Figure 2: Embedding generation process using Sentence Transformers.

Figure 3: Similarity search and recommendation pipeline using Qdrant.

Getting Started

Prerequisites

  • Python 3.8
  • PostgreSQL
  • pgvector Extension
  • Jupyter Notebook

Installation

  1. Clone the Repository:

    git clone https://github.com/AlgoETS/SimilityVectorEmbedding.git
    cd SimilityVectorEmbedding
  2. Install Dependencies:

    pip install -r requirements.txt
  3. Set Up PostgreSQL with pgvector:

    • Install PostgreSQL: Download here
    • Install pgvector Extension:
      sudo apt install postgresql-14-pgvector
      Or build from source:
      git clone https://github.com/pgvector/pgvector.git
      cd pgvector
      make
      sudo make install
  4. Create Database and Enable pgvector:

    CREATE DATABASE movies_db;
    \c movies_db
    CREATE EXTENSION vector;
  5. Run the Jupyter Notebook:

    jupyter notebook

    Open Similarity_Vector_Embedding.ipynb and follow the instructions to generate embeddings, insert data, and perform similarity queries.
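
Once PostgreSQL and pgvector are installed, a quick way to confirm the extension is enabled is to query pg_extension. This is a minimal sketch; the connection parameters are placeholders and should match your own setup:

import psycopg2

# Placeholder credentials -- replace with your own connection settings
conn = psycopg2.connect(
    dbname="movies_db",
    user="your_username",
    password="your_password",
    host="localhost"
)
cursor = conn.cursor()

# pg_extension lists installed extensions; 'vector' should appear
# once CREATE EXTENSION vector has been run in movies_db
cursor.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector';")
print(cursor.fetchone())  # a version tuple if enabled, None otherwise

cursor.close()
conn.close()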

Usage

Figure 4: System architecture integrating PostgreSQL, pgvector, and NLP models.

Figure 5: Example of cosine similarity results for the movie "Inception".

Implementing Cosine Similarity in PostgreSQL with pgvector

pgvector supports several distance metrics, including cosine distance (the <=> operator in SQL, equal to 1 minus the cosine similarity). Using this operator, cosine distances can be computed directly inside SQL queries, which is what makes efficient similarity searches possible. Here’s how you can find the movies most similar to a given title:

SELECT title, embedding
FROM movies
ORDER BY embedding <=> (SELECT embedding FROM movies WHERE title = %s) ASC
LIMIT 10;

This SQL command retrieves the ten most similar movies to a given movie based on their embeddings' cosine similarity.

Other Distance Functions Supported by pgvector

pgvector also supports other distance metrics, such as L2 (Euclidean), L1 (Manhattan), and the inner (dot) product. The right metric depends on the query and on how the embeddings were produced. Here’s how you might use these metrics:

  • L2 Distance (Euclidean): Suitable for measuring the absolute geometric differences between vectors.
  • L1 Distance (Manhattan): Useful in high-dimensional data spaces.
  • Inner Product (Dot Product): Equivalent to cosine similarity when the embeddings are normalized to unit length.
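
As an illustration, here is a minimal sketch of the same top-10 query using the Euclidean and inner-product operators. It assumes the movies table defined below; in pgvector, <-> is the L2 operator and <#> returns the negative inner product, so ascending order still ranks the closest matches first (newer pgvector releases also provide an L1, or taxicab, operator):

import psycopg2

conn = psycopg2.connect(
    dbname="movies_db",
    user="your_username",
    password="your_password",
    host="localhost"
)
cursor = conn.cursor()

# L2 (Euclidean) distance: order by absolute geometric distance
cursor.execute("""
    SELECT title
    FROM movies
    ORDER BY embedding <-> (SELECT embedding FROM movies WHERE title = %s)
    LIMIT 10;
""", ("Inception",))
print(cursor.fetchall())

# Inner product: <#> returns the *negative* inner product, so ascending
# order still puts the most similar vectors first
cursor.execute("""
    SELECT title
    FROM movies
    ORDER BY embedding <#> (SELECT embedding FROM movies WHERE title = %s)
    LIMIT 10;
""", ("Inception",))
print(cursor.fetchall())

cursor.close()
conn.close()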

Database Schema

CREATE TABLE movies (
    id SERIAL PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    year INT,
    country VARCHAR(100),
    language VARCHAR(100),
    duration INT,
    summary TEXT,
    genres TEXT[],
    director JSONB,
    screenwriters TEXT[],
    roles JSONB,
    poster_url TEXT,
    embedding VECTOR(768) -- Adjust dimension to match the NLP model (e.g., 384 for all-MiniLM-L6-v2, 768 for BERT-base)
);

Data Example

{
  "title": "Inception",
  "year": "2010",
  "country": "USA",
  "language": "English",
  "duration": "148",
  "summary": "A skilled thief is given a chance at redemption if he can successfully perform an inception.",
  "genres": ["Action", "Sci-Fi", "Thriller"],
  "director": {"_id": "123456", "__text": "Christopher Nolan"},
  "screenwriters": ["Christopher Nolan"],
  "roles": [
    {"actor": {"_id": "78910", "__text": "Leonardo DiCaprio"}, "character": "Cobb"},
    {"actor": {"_id": "111213", "__text": "Joseph Gordon-Levitt"}, "character": "Arthur"}
  ],
  "poster_url": "https://m.media-amazon.com/images/I/51G8J1XnFQL._AC_SY445_.jpg",
  "id": "54321"
}

Generating Embeddings

Use Sentence Transformers to generate embeddings for movie descriptions:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def generate_embedding(text):
    return model.encode(text).tolist()
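
Note that all-MiniLM-L6-v2 produces 384-dimensional embeddings, so the VECTOR(768) column shown in the schema above would need to be declared as VECTOR(384) for this model. A quick sanity check, using the generate_embedding helper just defined:

# The embedding length must match the dimension declared for the embedding column
emb = generate_embedding("A skilled thief is given a chance at redemption.")
print(len(emb))  # 384 for all-MiniLM-L6-v2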

Inserting Data into the Database

Populate the movies table with movie data and their embeddings:

import json
import psycopg2

# Connect to PostgreSQL
conn = psycopg2.connect(
    dbname="movies_db",
    user="your_username",
    password="your_password",
    host="localhost"
)
cursor = conn.cursor()

# Load movie data
with open('movies.json', 'r') as file:
    movies = json.load(file)

# Insert movies into the database
for movie in movies:
    cursor.execute("""
        INSERT INTO movies (title, year, country, language, duration, summary, genres, director, screenwriters, roles, poster_url, embedding)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
    """, (
        movie['title'],
        movie['year'],
        movie['country'],
        movie['language'],
        movie['duration'],
        movie['summary'],
        movie['genres'],                    # Python list maps to the TEXT[] column
        json.dumps(movie['director']),      # dict serialized for the JSONB column
        movie['screenwriters'],             # Python list maps to the TEXT[] column
        json.dumps(movie['roles']),         # list of dicts serialized for the JSONB column
        movie['poster_url'],
        generate_embedding(movie['summary'])
    ))

conn.commit()
cursor.close()
conn.close()

Finding Similar Movies

Retrieve movies similar to a given title using cosine similarity:

import psycopg2

def find_similar_movies(movie_title, top_k=10):
    conn = psycopg2.connect(
        dbname="movies_db",
        user="your_username",
        password="your_password",
        host="localhost"
    )
    cursor = conn.cursor()
    query = """
    SELECT title
    FROM movies
    WHERE title != %s
    ORDER BY embedding <=> (
        SELECT embedding FROM movies WHERE title = %s
    ) ASC
    LIMIT %s;
    """
    cursor.execute(query, (movie_title, movie_title, top_k))
    results = cursor.fetchall()
    cursor.close()
    conn.close()
    return [movie[0] for movie in results]

# Example usage
similar_movies = find_similar_movies("Inception")
print(similar_movies)

IMDb Datasets

Movie metadata can be sourced from IMDb's non-commercial datasets: https://developer.imdb.com/non-commercial-datasets/

Figure 6: IMDb datasets.

Language Models Used

Each model is listed with a short description and its source:

  • BERT: Bidirectional Encoder Representations from Transformers (BERT on Hugging Face)
  • Sentence Transformers: Models optimized for generating sentence-level embeddings (Sentence Transformers)
  • all-MiniLM-L6-v2: A lightweight and efficient Sentence Transformer model (all-MiniLM-L6-v2)
  • RoBERTa: A robustly optimized BERT pretraining approach (RoBERTa on Hugging Face)
  • DistilBERT: A distilled version of BERT, smaller and faster while retaining performance (DistilBERT on Hugging Face)
  • XLNet: Generalized autoregressive pretraining for language understanding (XLNet on Hugging Face)
  • T5: Text-to-Text Transfer Transformer for various NLP tasks (T5 on Hugging Face)
  • Electra: Efficient pretraining approach replacing masked tokens with generators (Electra on Hugging Face)
  • Longformer: Transformer model optimized for long documents (Longformer on Hugging Face)
  • MiniLM-L12-v2: A compact and efficient model for sentence embeddings (MiniLM-L12-v2 on Hugging Face)
  • SBERT DistilRoBERTa: A distilled version of RoBERTa for efficient sentence embeddings (SBERT DistilRoBERTa on Hugging Face)
  • MPNet: Masked and Permuted Pre-training for Language Understanding (MPNet on Hugging Face)
  • ERNIE: Enhanced Representation through Knowledge Integration (ERNIE on Hugging Face)
  • DeBERTa: Decoding-enhanced BERT with disentangled attention (DeBERTa on Hugging Face)
  • SBERT paraphrase-MiniLM-L6-v2: A Sentence Transformer model fine-tuned for paraphrase identification (paraphrase-MiniLM-L6-v2 on Hugging Face)

Personal Preference:

I personally prefer using T5-small and the MiniLM series models due to their excellent balance between performance and computational efficiency.
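
For example, a minimal sketch of swapping in a different Sentence Transformers checkpoint (the model name below is a published sentence-transformers model, used here purely as an illustration; remember that the embedding dimension, and therefore the VECTOR(...) column, can change with the model):

from sentence_transformers import SentenceTransformer

# A slightly larger MiniLM variant; trades a bit of speed for quality
model = SentenceTransformer('all-MiniLM-L12-v2')

# Check the embedding size so the VECTOR(...) column can be declared to match
print(model.get_sentence_embedding_dimension())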

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For any questions or support, please open an issue on the GitHub repository or contact [email protected].
