Cannot load the cache when mapping the dataset #7261

zhangn77 · 2024-10-29T08:29:40Z

Describe the bug

I'm training the Flux ControlNet. The train_dataset.map() call takes a long time to finish. However, when I kill a training process and restart training with the same dataset, the mapped result is not reused, even though I set a cache dir for the dataset.

with accelerator.main_process_first():
    from datasets.fingerprint import Hasher

    # fingerprint used by the cache for the other processes to load the result
    # details: https://github.com/huggingface/diffusers/pull/4038#discussion_r1266078401
    new_fingerprint = Hasher.hash(args)
    train_dataset = train_dataset.map(
        compute_embeddings_fn,
        batched=True,
        new_fingerprint=new_fingerprint,
        batch_size=10,
    )
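
Since the snippet above passes `new_fingerprint` explicitly, the mapped result can presumably only be found again if `Hasher.hash(args)` yields the same value after the restart. The sketch below is not the `datasets` implementation: it uses `hashlib` and a hypothetical helper, `stable_fingerprint`, with made-up argument names, only to illustrate that any argument that differs between two runs produces a different fingerprint. Printing `new_fingerprint` in both runs would be one way to check whether that is what happens here.

```python
import hashlib
import json
from argparse import Namespace


def stable_fingerprint(args: Namespace) -> str:
    """Hypothetical helper: hash the arguments in a deterministic key order."""
    payload = json.dumps(vars(args), sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Two runs with identical arguments (names are illustrative only).
run_a = Namespace(dataset_name="my_dataset", resolution=512, output_dir="out")
run_b = Namespace(dataset_name="my_dataset", resolution=512, output_dir="out")
same = stable_fingerprint(run_a) == stable_fingerprint(run_b)

# A restart where one argument differs, even one unrelated to the embeddings.
run_c = Namespace(dataset_name="my_dataset", resolution=512, output_dir="out_v2")
changed = stable_fingerprint(run_a) != stable_fingerprint(run_c)

print(same, changed)
```

If the two fingerprints do match and the map still reruns, the fingerprint is probably not the cause and the cache location itself would be the next thing to inspect.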

Steps to reproduce the bug

Train the Flux ControlNet, kill the process after train_dataset.map() has finished, then start training again with the same dataset.

Expected behavior

The second run should load the cached map result instead of mapping the dataset again.

Environment info

latest diffusers
