This document provides a high-level overview of how Model2Vec works.
The base Model2Vec technique works by passing a vocabulary through a Sentence Transformer model, reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using SIF weighting (previously Zipf weighting). During inference, we simply take the mean of all token embeddings occurring in a sentence.
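The sketch below illustrates that recipe end to end on toy data. It is not the actual Model2Vec implementation: the token embeddings, token probabilities, and dimensions are random placeholders standing in for the real Sentence Transformer outputs and corpus statistics.

```python
import numpy as np

# Toy sketch of the base recipe. In practice, token_embeddings would come from
# passing every vocabulary token through a Sentence Transformer, and token_probas
# from token counts in a corpus; here both are random placeholders.
rng = np.random.default_rng(0)
vocab_size, dim, pca_dims = 1_000, 768, 256
token_embeddings = rng.normal(size=(vocab_size, dim))
token_probas = rng.dirichlet(np.ones(vocab_size))

# 1) Reduce dimensionality with PCA (via SVD of the centered embeddings).
centered = token_embeddings - token_embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
static_embeddings = centered @ vt[:pca_dims].T

# 2) Down-weight frequent tokens with SIF weighting: w = 1e-3 / (1e-3 + proba).
weights = 1e-3 / (1e-3 + token_probas)
static_embeddings *= weights[:, None]

# 3) Inference: the embedding of a sentence is the mean of its token embeddings.
token_ids = [3, 141, 592]  # ids a tokenizer would produce for some sentence
sentence_embedding = static_embeddings[token_ids].mean(axis=0)
print(sentence_embedding.shape)  # (256,)
```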
Our potion models are pre-trained using tokenlearn, a technique for pre-training Model2Vec distillation models. These models are created with the following steps:
- Distillation: We distill a Model2Vec model from a Sentence Transformer model, using the method described above.
- Sentence Transformer inference: We use the Sentence Transformer model to create mean embeddings for a large number of texts from a corpus.
- Training: We train a model to minimize the cosine distance between the mean embeddings generated by the Sentence Transformer model and the mean embeddings generated by the Model2Vec model (a minimal sketch of this objective follows the list).
- Post-training re-regularization: We re-regularize the trained embeddings by first performing PCA, and then weighting the embeddings using smooth inverse frequency (SIF) weighting with the formula `w = 1e-3 / (1e-3 + proba)`, where `proba` is the probability of the token in the corpus we used for training.