`data_pipeline.py` needs more changes than suggested in README to support `ImageFolder` datasets #8
Comments
Hi @josephrocca, great to hear that you are using the repository for training!
Thanks for your work on this repo! To reproduce, download the image below and name it
Then, per the stylegan2 training readme, run:
It goes from about 3.5GB to more than 200GB.
Hi @josephrocca, thanks for the info!
@matthias-wright Yep, I suspected it was because they're being stored as raw data. That seems like a bad idea for large image datasets though, given the huge size inflation compared to JPG? Since JPG decoders on modern hardware are really fast, would the JPG decoding step actually be a bottleneck in training? In any case, the image folder approach works nicely for me with the changes I mentioned in the original post. This issue was more about some changes needed to the stylegan2 training code (specifically
Thanks again for your work on this repo! (Crossing my fingers that you'll work on stylegan3-flax next :P)
I agree that this is not very efficient for storage purposes. I guess most people are willing to trade storage capacity for training speed to some extent. Great to hear that you got it working! I linked from the readme to this thread, thanks for that! Haha, I hope that I will find some time to work on stylegan3, but not sure yet.
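To put rough numbers on the size inflation discussed above, here is a minimal back-of-the-envelope sketch. It assumes the records hold uncompressed 8-bit RGB pixels (the thread only suspects this), and the image count and resolution are hypothetical figures, not the dataset from this issue.

```python
def raw_size_bytes(num_images, height, width, channels=3, bytes_per_value=1):
    """Bytes needed to store every pixel uncompressed (assumes uint8 values)."""
    return num_images * height * width * channels * bytes_per_value


def inflation_ratio(raw_bytes, compressed_bytes):
    """How many times larger the raw storage is than the compressed storage."""
    return raw_bytes / compressed_bytes


# Hypothetical dataset: 10,000 RGB images at 512x512.
raw = raw_size_bytes(10_000, 512, 512)
print(f"raw size: {raw / 1e9:.1f} GB")

# Ratio implied by the figures reported in this thread (about 3.5GB -> 200GB).
print(f"inflation: {inflation_ratio(200e9, 3.5e9):.1f}x")
```

The raw size depends only on image count and resolution, not on how compressible the images are, which is why a well-compressed JPG dataset can grow by a large factor.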
Some problems I ran into:

- Getting `tfds.ImageFolder` working with a "flat" folder of images. I had to nest a dummy label folder inside a dummy split folder. I followed the instructions here: https://www.tensorflow.org/datasets/api_docs/python/tfds/folder_dataset/ImageFolder
- There is no `num_examples` property in `tfds.core.DatasetInfo`, so I had to use `builder.info.splits['fake_split'].num_examples`, where `fake_split` is the name of my dummy split folder. It does look like there's a `total_num_examples` property, but I'm not sure how to access it - maybe it's a private field (though I'm not sure if those are possible in Python)?
- I had to change `pre_process`, because it was expecting protobufs instead of `{image, label}` objects.

Note that the reason I am using the `ImageFolder` approach is that the tfrecords approach blew my 3GB dataset up to 200GB, since I think it's storing the raw tensor data? I'm new to this, but it seems like it'd make more sense to just store the data in JPG format, since JPG decoding is so fast? That said, even if the tfrecords approach used a reasonable amount of space, I'd probably still prefer the `ImageFolder` approach since it just seems nicer and more portable. Even better, from my (newbie) perspective, would be the ability to load a `tar` of images with any internal directory structure.

Below is my new `data_pipeline.py` so far. It seems to work okay now, but I haven't got training to work yet as I'm still debugging some stuff. Will update this post if I run into any more problems with `data_pipeline.py`.
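As an illustration of the folder nesting described in the first list item (this is not the `data_pipeline.py` from the post, just a hypothetical helper), the following sketch copies a flat folder of images into a `root/split/label/` layout using only the standard library. The function name `nest_flat_folder`, the label name `fake_label`, and the set of accepted file suffixes are assumptions made for the example.

```python
import shutil
from pathlib import Path

# Assumed set of image suffixes to pick up; adjust to match your dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def nest_flat_folder(flat_dir, out_root, split="fake_split", label="fake_label"):
    """Copy images from a flat folder into out_root/<split>/<label>/.

    Returns the number of images copied.
    """
    target = Path(out_root) / split / label
    target.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(flat_dir).iterdir()):
        # Skip sub-directories and anything that is not an image file.
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            shutil.copy2(path, target / path.name)
            count += 1
    return count
```

The resulting `out_root` could then be passed to `tfds.ImageFolder`, with the example count read via `builder.info.splits['fake_split'].num_examples` as described above. Copying doubles disk usage; symlinking may work instead, but that is untested here.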