Spanish Sentence Embeddings trained using sent2vec on the Spanish Unannotated Corpora.
The data was already preprocessed in the Spanish Unannotated Corpora: lowercased, with URLs, repeated spaces, and other noise removed. We also used the punctuation-splitting script included in that repository (a rough sketch of this pipeline is shown below). With that tokenization, the 2.6B-word corpus yields 3.4B tokens.
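As an illustration only, the following Python sketch approximates the cleaning and punctuation-splitting described above; the actual script lives in the Spanish Unannotated Corpora repository and may use different rules, so the regular expressions here are assumptions.

```python
import re

def preprocess(line: str) -> str:
    # Approximate the corpus cleaning: lowercase, drop URLs, collapse repeated spaces.
    line = line.lower()
    line = re.sub(r"https?://\S+|www\.\S+", "", line)  # remove URLs (approximate pattern)
    line = re.sub(r"\s+", " ", line).strip()           # collapse multiple spaces
    return line

def split_on_punctuation(line: str) -> str:
    # Separate punctuation marks into standalone tokens before training.
    return re.sub(r'([.,;:!?¿¡()"])', r" \1 ", line).strip()

print(split_on_punctuation(preprocess("Visita https://example.com  ¡Hola,   mundo!")))
```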
We trained a unigram + bigram model using sent2vec's default parameters.
Spanish sent2vec (700-dimensional sentence embeddings, unigram + bigram model, 14.4 GB)
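Once downloaded, the model can be queried with the Python bindings from the sent2vec repository. This is a minimal sketch: the model filename is a placeholder, and input sentences are assumed to be preprocessed and punctuation-split in the same way as the training corpus.

```python
import sent2vec  # Python bindings from the sent2vec repository

model = sent2vec.Sent2vecModel()
model.load_model("sent2vec_spanish_700d.bin")  # placeholder name for the downloaded model

# Sentences should be lowercased and tokenized like the training data.
embeddings = model.embed_sentences(["el nuevo modelo funciona bien ."])
print(embeddings.shape)  # (1, 700): one 700-dimensional sentence embedding
```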
Matteo Pagliardini, Prakhar Gupta, Martin Jaggi. Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features. NAACL 2018.