This document provides a high-level overview of how Model2Vec works.
The base Model2Vec technique works by passing a vocabulary through a Sentence Transformer model, reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using SIF weighting (previously Zipf weighting). During inference, we simply take the mean of all token embeddings occurring in a sentence.
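The sketch below illustrates that recipe end to end on toy data. It is not the actual Model2Vec implementation: the token embeddings, token probabilities, and dimensions are random placeholders standing in for the real Sentence Transformer outputs and corpus statistics.

```python
import numpy as np

# Toy sketch of the base recipe. In practice, token_embeddings would come from
# passing every vocabulary token through a Sentence Transformer, and token_probas
# from token counts in a corpus; here both are random placeholders.
rng = np.random.default_rng(0)
vocab_size, dim, pca_dims = 1_000, 768, 256
token_embeddings = rng.normal(size=(vocab_size, dim))
token_probas = rng.dirichlet(np.ones(vocab_size))

# 1) Reduce dimensionality with PCA (via SVD of the centered embeddings).
centered = token_embeddings - token_embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
static_embeddings = centered @ vt[:pca_dims].T

# 2) Down-weight frequent tokens with SIF weighting: w = 1e-3 / (1e-3 + proba).
weights = 1e-3 / (1e-3 + token_probas)
static_embeddings *= weights[:, None]

# 3) Inference: the embedding of a sentence is the mean of its token embeddings.
token_ids = [3, 141, 592]  # ids a tokenizer would produce for some sentence
sentence_embedding = static_embeddings[token_ids].mean(axis=0)
print(sentence_embedding.shape)  # (256,)
```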
Our potion models are pre-trained using tokenlearn, a technique for pre-training Model2Vec distillation models. These models are created with the following steps:
- Distillation: We distill a Model2Vec model from a Sentence Transformer model, using the method described above.
- Sentence Transformer inference: We use the Sentence Transformer model to create mean embeddings for a large number of texts from a corpus.
- Training: We train a model to minimize the cosine distance between the mean embeddings generated by the Sentence Transformer model and the mean embeddings generated by the Model2Vec model (a minimal sketch of this objective follows the list).
- Post-training re-regularization: We re-regularize the trained embeddings by first performing PCA, and then weighting the embeddings using smooth inverse frequency (SIF) weighting with the formula `w = 1e-3 / (1e-3 + proba)`, where `proba` is the probability of the token in the corpus we used for training.