Zun Wang, Jialu Li, Yicong Hong, Songze Li, Kunchang Li, Shoubin Yu, Yi Wang, Yu Qiao, Yali Wang, Mohit Bansal, Limin Wang
Please note that the original full dataset used in our final experiments is no longer recoverable: the first author's institutional account and personal storage were accidentally and completely deleted during internal system maintenance. This deletion occurred after the author's departure (which was prior to paper submission) and before the paper was accepted.
As a result, we can only release an intermediate version of the dataset saved during the development phase. Although this version may yield slightly lower performance when training the best model (~85% SR and ~78% SPL on R2R val unseen, roughly a 1-point drop) compared to the results reported in the paper, it still significantly outperforms strong baselines such as ScaleVLN (81% SR and 70% SPL). We have also released our final pretrained model.
Available Data:
- `mantis.hm3d_round0_topk.3_enc.json` – Generated instructions via sampling on HM3D in the first round.
- `mantis.hm3d_round3_greedy_ndtw0.9_ranked_414k_rouge0.85.json` – A subset of generated instructions via greedy decoding from the final-round generator (25.7 SPICE).
Unfortunately, the generated instructions on MP3D and the final refined dataset are no longer available.
We sincerely appreciate your understanding and are happy to address any questions regarding reproducibility.
Please follow ScaleVLN to set up the environment and obtain the training source code.
We release our final pretrained model and available data here. Details:
Model:
- `model_step_170000.pt` – The final pretrained model for downstream finetuning.
Data:
- `mantis.hm3d_round0_topk.3_enc.json` – Generated instructions via sampling on HM3D in the first round.
- `mantis.hm3d_round3_greedy_ndtw0.9_ranked_414k_rouge0.85.json` – A subset of generated instructions via greedy decoding from the final-round generator (25.7 SPICE).
Features:
- `internvit_6b_fov60_mp3d.hdf5` – InternViT features on MP3D environments.
- `internvit_6b_fov60_mp3d_panogen.hdf5` – InternViT features on PanoGen-augmented MP3D environments.
- `scans_internvit_6b/` – InternViT features for all HM3D + MP3D environments.
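Before wiring these files into training, it can help to sanity-check them. The snippet below is a minimal sketch using h5py; the per-key layout (one dataset per `scan_viewpoint` pair, with one 3200-dim InternViT feature per view) follows ScaleVLN's feature convention and is an assumption here, so verify it against your copy of the data.

```python
# Minimal sanity check of a released feature file (assumed layout: one dataset
# per "scan_viewpoint" key, shaped [num_views, 3200]).
import h5py

with h5py.File("features/internvit_6b_fov60_mp3d.hdf5", "r") as f:
    keys = list(f.keys())
    print(f"{len(keys)} entries; first key: {keys[0]}")
    # Expect a 3200-dim feature per view (see the finetuning notes below).
    print("feature shape:", f[keys[0]][...].shape)
```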
Our training process follows ScaleVLN with minimal modifications.
Pretraining:
Update the pretraining config file to use:
- InternViT features from `features/scans_internvit_6b/`
- Our sampling-generated instructions: `data/mantis.hm3d_round0_topk.3_enc.json`
Empirically, training for 1–2 epochs is sufficient.
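How exactly these paths are set depends on your copy of the ScaleVLN config. As a purely illustrative sketch, a JSON-style config could be patched programmatically as below; the config path and both key names are placeholders rather than ScaleVLN's actual field names, and the instruction file is referenced in its converted JSONL form (see the note that follows).

```python
# Hypothetical config patch: "img_db_dir" and "train_instr_files" are
# placeholder keys -- substitute the fields your pretraining config actually
# uses for visual features and instruction annotations.
import json

cfg_path = "pretrain_src/config/pretrain_config.json"  # placeholder path
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["img_db_dir"] = "features/scans_internvit_6b/"
cfg["train_instr_files"] = ["data/mantis.hm3d_round0_topk.3_enc.jsonl"]

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```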
Note: you may need to convert `data/mantis.hm3d_round0_topk.3_enc.json` into a JSONL file in which each item is aligned with ScaleVLN's pretraining input format (e.g., `R2R/annotations/pretrain_map/R2R_hm3d_aug_envdrop_generated_enc.jsonl` in ScaleVLN's processed data); see the sketch below.
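A minimal conversion sketch follows. Writing one JSON object per line is the only part guaranteed here; the per-item field names are assumptions, so compare a sample item against `R2R_hm3d_aug_envdrop_generated_enc.jsonl` and remap keys as needed.

```python
# Convert the released JSON list into a JSONL file (one sample per line).
import json

with open("data/mantis.hm3d_round0_topk.3_enc.json") as f:
    items = json.load(f)

with open("data/mantis.hm3d_round0_topk.3_enc.jsonl", "w") as f:
    for item in items:
        # If the schemas differ, remap keys here (e.g., instruction text,
        # instruction encoding, scan id, viewpoint path) before dumping.
        f.write(json.dumps(item) + "\n")
```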
Finetuning:
To finetune on downstream tasks, modify:
- The augmented environment path to `features/internvit_6b_fov60_mp3d_panogen.hdf5`
- `args.features` to take in `features/internvit_6b_fov60_mp3d.hdf5`
- The training script to:
  - Set the feature dim to 3200
  - Use our pretrained checkpoint `model/model_step_170000.pt` or your own pretrained checkpoint
  - Use our augmented data (only for R2R finetuning): `data/mantis.hm3d_round3_greedy_ndtw0.9_ranked_414k_rouge0.85.json`
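Before launching finetuning, it may be worth confirming that the checkpoint loads and that its image-embedding parameters match the 3200-dim InternViT features. The sketch below makes assumptions about the checkpoint's wrapping and parameter naming (`model`, `img`/`visual`), so inspect the actual keys on your copy.

```python
# Load the pretrained checkpoint and print shapes of likely visual-embedding
# parameters; the key/name patterns here are guesses, not ScaleVLN's exact names.
import torch

ckpt = torch.load("model/model_step_170000.pt", map_location="cpu")
# Checkpoints are often wrapped in a dict; unwrap a "model" entry if present.
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
for name, value in state_dict.items():
    if hasattr(value, "shape") and ("img" in name or "visual" in name):
        print(name, tuple(value.shape))
```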
If you find our project useful in your research, please cite the following paper:
@article{zun2024srdf,
  author  = {Wang, Zun and Li, Jialu and Hong, Yicong and Li, Songze and Li, Kunchang and Yu, Shoubin and Wang, Yi and Qiao, Yu and Wang, Yali and Bansal, Mohit and Wang, Limin},
  title   = {Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel},
  journal = {arXiv preprint arXiv:2412.08467},
  year    = {2024},
  url     = {https://arxiv.org/abs/2412.08467}
}