MINT-1T: A one trillion token multimodal interleaved dataset.

mlfoundations/MINT-1T

MINT-1T:
Scaling Open-Source Multimodal Data by 10x:
A Multimodal Dataset with One Trillion Tokens

MINT-1T is an open-source Multimodal INTerleaved dataset with one trillion text tokens and three billion images, a 10x scale-up from existing open-source datasets. It also includes previously untapped sources such as PDFs and arXiv papers. We are putting the final touches on MINT-1T and will open-source the dataset soon!

Updates

Citation

If you found our work useful, please consider citing:

@article{awadalla2024mint1t,
      title={MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens}, 
      author={Anas Awadalla and Le Xue and Oscar Lo and Manli Shu and Hannah Lee and Etash Kumar Guha and Matt Jordan and Sheng Shen and Mohamed Awadalla and Silvio Savarese and Caiming Xiong and Ran Xu and Yejin Choi and Ludwig Schmidt},
      year={2024}
}
