Official repo for our AAAI 2024 paper: DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning.
To prepare the environment, run:
pip install -r requirements.txt
Use the script from the SimCSE repo to download the datasets for SentEval evaluation:
cd SentEval/data/downstream/
bash download_dataset.sh
Both our model checkpoint and dataset are available on 🤗 Hugging Face.
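For example, the training data can be fetched with the datasets library (a minimal sketch; the split name "train" is an assumption, check the dataset page on the Hub if it differs):

from datasets import load_dataset

# Download the DenoSent training data from the Hugging Face Hub.
# The split name "train" is an assumption; adjust if the Hub page differs.
data = load_dataset("Singhoo/denosent_data", split="train")
print(data)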
Generate embeddings with DenoSent:
from transformers import AutoModel
model = AutoModel.from_pretrained("Singhoo/denosent-bert-base", trust_remote_code=True)
sentences = [
"The curious cat tiptoed across the creaky wooden floor, pausing to inspect a fluttering curtain.",
"A lone hiker stood atop the misty mountain, marveling at the tapestry of stars unfolding above."
]
embeddings = model.encode(sentences)
print(embeddings)
# Expected output
# tensor([[ 0.3314, -0.2520, 0.4150, ..., 0.1575, -0.1235, -0.1226],
# [ 0.5128, -0.0051, 0.2179, ..., 0.1010, 0.1654, -0.3872]])
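A common next step is to score the two sentences with cosine similarity; the sketch below assumes, as the expected output suggests, that encode returns a 2-D torch tensor with one row per sentence:

import torch.nn.functional as F

# Cosine similarity between the two sentence embeddings above;
# values closer to 1 indicate more similar sentences.
similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity.item())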
Evaluate on the STS tasks with SentEval:
python eval_senteval.py \
--model_name_or_path Singhoo/denosent-bert-base \
--task_set sts \
--mode test
This checkpoint has slightly higher STS results than those reported in the paper.
------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 75.48 | 83.82 | 77.54 | 84.76 | 80.16 | 81.20 | 73.97 | 79.56 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
Evaluate on MTEB:
python eval_mteb.py \
--model_name_or_path Singhoo/denosent-bert-base
Evaluation results for MTEB will appear in a separate directory, mteb_results.
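The files in that directory can be inspected with a few lines of Python (a minimal sketch; the exact layout and JSON schema inside mteb_results depend on the installed MTEB version, so the glob pattern here is an assumption):

import json
from pathlib import Path

# Walk the MTEB output directory and print each task's raw result dict.
# The recursive pattern and per-task JSON naming are assumptions about
# how the installed MTEB version writes its results.
for path in sorted(Path("mteb_results").glob("**/*.json")):
    with open(path) as f:
        result = json.load(f)
    print(path.stem, result)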
Run the following command to train your own models, and feel free to experiment with different hyperparameters. The dataset will be downloaded automatically from Hugging Face.
python train.py \
--train_dataset Singhoo/denosent_data \
--torch_compile True \
--model_name_or_path bert-base-uncased \
--max_length 32 \
--decoder_num_layers 16 \
--decoder_num_heads 1 \
--decoder_target_dropout 0.825 \
--pooler mask \
--output_dir results \
--overwrite_output_dir \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 256 \
--learning_rate 4e-5 \
--lr_scheduler_type constant_with_warmup \
--do_train \
--do_eval \
--evaluation_strategy steps \
--eval_steps 50 \
--save_strategy steps \
--save_steps 50 \
--num_train_epochs 1 \
--metric_for_best_model eval_avg_sts \
--prompt_format '"[X]" means [MASK].' \
--do_contrastive \
--do_generative \
--save_total_limit 1 \
--contrastive_temp 0.05 \
--warmup_steps 500 \
--contrastive_weight 5 \
--generative_weight 7 \
--max_steps 5000 \
--load_best_model_at_end
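For reference, the flags --contrastive_weight, --generative_weight, and --contrastive_temp control how the two objectives are mixed. The sketch below only illustrates that weighting, using a standard in-batch InfoNCE loss and a placeholder for the denoising reconstruction loss; it is not the repository's actual training code:

import torch
import torch.nn.functional as F

def info_nce(z1, z2, temp=0.05):
    # Standard in-batch InfoNCE: matching rows of z1/z2 are positives,
    # all other rows in the batch serve as negatives.
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temp
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(sim, labels)

# Random tensors stand in for pooled sentence embeddings of two views.
z1, z2 = torch.randn(64, 768), torch.randn(64, 768)
contrastive_loss = info_nce(z1, z2, temp=0.05)      # --contrastive_temp
generative_loss = torch.tensor(1.0)                 # placeholder for the denoising loss
loss = 5 * contrastive_loss + 7 * generative_loss   # --contrastive_weight / --generative_weight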
We use the SentEval and MTEB toolkits for evaluation, adopting the modified version of SentEval from the SimCSE repository.