Train data #25

Open
APiaoG opened this issue Feb 27, 2024 · 1 comment
Comments

@APiaoG
APiaoG commented Feb 27, 2024

Hello, and thank you for open-sourcing this excellent work! I have a question about SEED/MultiModalLLM/configs/data/caption_torchdata_preprocess.yaml, which contains:
data_dir:

  • ${oc.env:PROJECT_ROOT}/data/unsplash_resize/webdataset
  • CC3M/webdataset/gcc3m_shards

Where can these datasets be downloaded? I noticed that the paper says: “We filtered the samples in these datasets based on image resolution, aspect ratio, and visual-textual similarity. We randomly place images or text at the forefront, in order to achieve the generation of captions based on images and vice versa.”
If possible, could you release the training data? Thank you very much!
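
For readers who want to reproduce the preprocessing described in the quoted passage, here is a minimal sketch of the two steps: filtering by resolution, aspect ratio, and image-text similarity, then randomly putting the image or the text first. This is not the authors' code. The `Sample` type, the function names, and all three thresholds are assumptions made for illustration, and the similarity score is assumed to be precomputed by some image-text model.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    width: int          # image width in pixels
    height: int         # image height in pixels
    caption: str        # paired text
    similarity: float   # hypothetical precomputed image-text similarity score


# Illustrative thresholds only; the paper does not state the values used.
MIN_SIDE = 256
MAX_ASPECT_RATIO = 2.0
MIN_SIMILARITY = 0.25


def keep_sample(sample: Sample) -> bool:
    """Return True if the sample passes all three filters."""
    short_side = min(sample.width, sample.height)
    long_side = max(sample.width, sample.height)
    if short_side < MIN_SIDE:                      # resolution filter
        return False
    if long_side / short_side > MAX_ASPECT_RATIO:  # aspect-ratio filter
        return False
    return sample.similarity >= MIN_SIMILARITY     # visual-textual similarity filter


def order_modalities(rng: random.Random) -> List[str]:
    """Randomly place the image or the text first, with equal probability."""
    if rng.random() < 0.5:
        return ["image", "text"]   # image first: caption generation from the image
    return ["text", "image"]       # text first: image generation from the caption
```

Passing an explicit `random.Random` instance keeps the ordering reproducible when a seed is fixed.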

@geyuying
Collaborator

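We do not own the copyright to these datasets, so we cannot provide downloaded copies. These are public datasets; please download them from their respective official websites.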
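
Once the public datasets are downloaded, the `webdataset` paths in the config suggest the data is expected as tar shards. Below is a rough sketch of packing image-caption pairs into one such shard with the Python standard library, following the general WebDataset convention that files belonging to one sample share a basename. The function name `write_shard` and the `.jpg`/`.txt` suffixes are assumptions; the keys and layout that this repository's loader actually expects should be checked against its code.

```python
import io
import tarfile
from typing import Iterable, Tuple


def write_shard(pairs: Iterable[Tuple[bytes, str]], shard_path: str) -> int:
    """Write (image_bytes, caption) pairs into a single tar shard.

    Each sample becomes two tar members sharing a zero-padded key,
    e.g. 000000.jpg and 000000.txt. Returns the number of samples written.
    """
    count = 0
    with tarfile.open(shard_path, "w") as tar:
        for index, (image_bytes, caption) in enumerate(pairs):
            key = f"{index:06d}"
            members = (
                (".jpg", image_bytes),              # assumed image suffix
                (".txt", caption.encode("utf-8")),  # assumed caption suffix
            )
            for suffix, payload in members:
                info = tarfile.TarInfo(name=key + suffix)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
            count += 1
    return count
```

A call such as `write_shard(pairs, "shard-000000.tar")` would produce one shard; the shard filename here is also only an example.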
