download_and_clean_data_scripts

Feb 23, 2019

Name	Last commit message	Last commit date
pixabay	dataset scripts	Feb 23, 2019
yahoo_fcc100m_dataset	dataset scripts	Feb 23, 2019
Copy_50_files_to_another_folder.ipynb	dataset scripts	Feb 23, 2019
Create_random_cropped_images_from_large_images.ipynb	dataset scripts	Feb 23, 2019
README.md	Update README.md	Feb 23, 2019
Remove_Black_and_White_Photos.ipynb	dataset scripts	Feb 23, 2019
Remove_clipart_images_from_dataset.ipynb	dataset scripts	Feb 23, 2019
Remove_dups.ipynb	dataset scripts	Feb 23, 2019
Remove_too_small_images_from_dataset.ipynb	dataset scripts	Feb 23, 2019
multi_crop_images.py	dataset scripts	Feb 23, 2019
multi_resize.py	dataset scripts	Feb 23, 2019

README.md

yahoo_fcc100m_dataset

Request access here: https://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67
Once you register, you will immediately receive an email explaining how to download the image URLs (check your spam folder)
Once you have the text file, split it into pieces of 1M lines each: split -l 1000000 mybigfile.txt
Edit the 'parse100m.py' file to choose which keywords to download and to set how many CPUs you have
Run the script
Known bug: I ran into a memory problem after downloading about 150K images
Download speed is high: roughly 10-20M images per day
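
If the split utility is not available (for example on Windows), the splitting step above can be approximated in Python. This is only a minimal sketch: the function name split_by_lines, the chunk_ file prefix, and the output folder are assumptions made for illustration and are not part of the scripts in this folder. It streams the input line by line, so memory use stays flat regardless of file size.

```python
from pathlib import Path


def split_by_lines(src, out_dir, lines_per_chunk=1_000_000, prefix="chunk_"):
    """Stream src into numbered files holding at most lines_per_chunk lines each.

    Hypothetical helper, roughly equivalent to `split -l`; names are illustrative.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    fout = None
    try:
        # newline="" keeps the original line endings untouched
        with open(src, "r", encoding="utf-8", errors="replace", newline="") as fin:
            for i, line in enumerate(fin):
                if i % lines_per_chunk == 0:
                    # Start a new chunk file every lines_per_chunk lines
                    if fout is not None:
                        fout.close()
                    path = out_dir / f"{prefix}{len(chunk_paths):04d}.txt"
                    fout = open(path, "w", encoding="utf-8", newline="")
                    chunk_paths.append(path)
                fout.write(line)
    finally:
        if fout is not None:
            fout.close()
    return chunk_paths
```

Calling split_by_lines("mybigfile.txt", "chunks") would write chunks/chunk_0000.txt, chunks/chunk_0001.txt, and so on, and return the list of paths.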

pixabay

Install the dependencies: pip install beautifulsoup4 tqdm
Edit the 'pixabay_main_custom.py' file to choose which keywords to download
Download speed is slow: roughly 30K images per day
To speed it up, either write a multiprocess version or run several copies of the script with different keywords, for example in separate tmux sessions
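
One possible way to run several keywords at once from a single script is sketched below. It is not the repository's code: download_keyword is a hypothetical placeholder for whatever per-keyword logic 'pixabay_main_custom.py' performs, and run_keywords is an illustrative name. The sketch uses a thread pool rather than separate processes, on the assumption that the work is mostly waiting on the network; ProcessPoolExecutor from the same module has the same submit interface if the work turns out to be CPU-bound.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_keyword(keyword):
    """Hypothetical placeholder: put the per-keyword download logic here."""
    raise NotImplementedError


def run_keywords(keywords, worker=download_keyword, max_workers=4):
    """Run worker(keyword) concurrently; return (results, errors) keyed by keyword."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, kw): kw for kw in keywords}
        for fut in as_completed(futures):
            kw = futures[fut]
            try:
                results[kw] = fut.result()
            except Exception as exc:
                # One failing keyword should not stop the others
                errors[kw] = exc
    return results, errors
```

Keeping max_workers small is a reasonable default here, since too many parallel requests may get the client throttled or blocked by the site.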