Question: How to construct wids-meta.json for SanaWebDataset? #121

Pevernow · 2024-12-29T11:00:46Z

Could you please tell me how to use non-square images for training in this project or the original project?

The official example only has the ImgDataset type, but according to an issue, this does not support non-square images.

The official repository also provides no dataset example or documentation for SanaWebDatasetMS.

I spent a whole day trying to construct wids-meta.json and the dataset structure by reading the code,
but I got stuck on Error detail: ".json" and could not solve it no matter what I tried.

Can you help me? Thank you very much.

Pevernow · 2024-12-30T12:17:22Z

Anyone here?

lawrence-cj · 2025-01-02T13:29:14Z

We will update the wids-meta json related code soon.

Pevernow · 2025-01-04T12:02:01Z

Quoting lawrence-cj: "We will update the wids-meta json related code soon."

@lawrence-cj

Thank you so much.
I have been working on this problem for a week and have asked many people for help, but I keep running into problems, whether I try to construct the dataset myself or use a third-party trainer.
To be honest, this is definitely the most difficult model to train that I have encountered.
I even got a little discouraged about it.

My original plan was simply to train on a plain image-text format like ImgDataset (since that is how my data is stored locally), but I am stuck with WebDataset. I had to try various undocumented methods to convert the local data into WebDataset format while still meeting Sana's reading requirements.

It's very frustrating. Sana is such a good project, but the details are dizzying.

lawrence-cj · 2025-01-05T09:54:50Z

Is there any problem with ImgDataset which turned you to Webdataset? @Pevernow

Pevernow · 2025-01-05T09:58:42Z

@lawrence-cj

Solved.

After reading the source code and with the help of enthusiastic developers, I constructed the correct format of the multi-scale training set.

Here is an example.

wids-meta.json

{
  "wids_version": 1,
  "name": "",
  "description": "WIDS metafile for tar archives in ./",
  "shardlist": [
    {
      "url": "output.tar",
      "nsamples": 25453
    }
  ]
}
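A file like this could also be generated rather than written by hand. The following is a minimal sketch, not code from the Sana repository: the helper names `count_samples`, `build_wids_meta`, and `write_wids_meta` are hypothetical, and it assumes the usual WebDataset convention that files sharing the same base name (the text before the first dot) form one sample, and that the metafile is stored next to the tar archives.

```python
import json
import os
import tarfile


def count_samples(tar_path):
    """Count samples in a shard; each distinct base name counts as one sample.

    Assumption: the base name is the part of the file name before the first dot.
    """
    keys = set()
    with tarfile.open(tar_path, "r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                name = os.path.basename(member.name)
                keys.add(name.split(".", 1)[0])
    return len(keys)


def build_wids_meta(tar_paths, name="", description="WIDS metafile for tar archives in ./"):
    """Build a dict with the same fields as the wids-meta.json example above."""
    return {
        "wids_version": 1,
        "name": name,
        "description": description,
        "shardlist": [
            # Assumption: the metafile sits in the same directory as the shards,
            # so only the bare file name is stored as the URL.
            {"url": os.path.basename(p), "nsamples": count_samples(p)}
            for p in tar_paths
        ],
    }


def write_wids_meta(tar_paths, out_path="wids-meta.json"):
    """Write the metafile to disk and return the dict that was written."""
    meta = build_wids_meta(tar_paths)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(meta, f, indent=2)
    return meta
```

Calling `write_wids_meta(["output.tar"])` would then produce a file with the same shape as the example, with `nsamples` computed from the archive instead of typed in manually.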

Internal structure of a Tar file:

Internal structure of Json annotation:

Note: the screenshot is outdated. Use "prompt" instead of "caption" as the key in the JSON annotation!
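Judging from the conversion script in this comment, each sample in the tar appears to consist of files sharing one base name (for example a `.png` plus a `.json` annotation holding `file_name`, `prompt`, `width`, and `height`). A minimal sketch of a sanity check for such a shard follows. The function name `check_shard` and the list of required keys are assumptions drawn from that script, not from Sana's loader.

```python
import json
import os
import tarfile

# Assumption: these keys come from the conversion script in this thread.
REQUIRED_KEYS = ("prompt", "width", "height")


def check_shard(tar_path):
    """Return a list of human-readable problems found in the shard (empty if none)."""
    problems = []
    images = set()
    annotations = {}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            base, ext = os.path.splitext(os.path.basename(member.name))
            ext = ext.lower()
            if ext == ".png":
                images.add(base)
            elif ext == ".json":
                # extractfile returns a binary file object; json.load accepts it
                annotations[base] = json.load(tar.extractfile(member))
    for base in sorted(images):
        if base not in annotations:
            problems.append(f"{base}: missing .json annotation")
            continue
        for key in REQUIRED_KEYS:
            if key not in annotations[base]:
                problems.append(f"{base}: annotation lacks '{key}'")
    return problems
```

Running `check_shard("output.tar")` before training would flag, for instance, annotations that still use the old "caption" key.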

I also provide a simple script to convert an image-text dataset into the SanaWebDatasetMS format:

import json
import os
import tarfile

from PIL import Image, PngImagePlugin

PngImagePlugin.MAX_TEXT_CHUNK = 100 * 1024 * 1024  # Increase maximum size for text chunks

def process_data(input_dir, output_tar_name="output.tar"):
    """
    Processes a directory containing PNG files, generates corresponding JSON files,
    and packages all files into a TAR file. It also counts the number of processed PNG images,
    and saves the height and width of each PNG file to the JSON.

    Args:
        input_dir (str): The input directory containing PNG files.
        output_tar_name (str): The name of the output TAR file (default is "output.tar").
    """
    png_count = 0
    json_files_created = []

    for filename in os.listdir(input_dir):
        if filename.lower().endswith(".png"):
            png_count += 1
            base_name = filename[:-4]  # Remove the ".png" extension
            txt_filename = os.path.join(input_dir, base_name + ".txt")
            json_filename = base_name + ".json"
            json_filepath = os.path.join(input_dir, json_filename)
            png_filepath = os.path.join(input_dir, filename)

            if os.path.exists(txt_filename):
                try:
                    # Get the dimensions of the PNG image
                    with Image.open(png_filepath) as img:
                        width, height = img.size

                    with open(txt_filename, 'r', encoding='utf-8') as f:
                        caption_content = f.read().strip()

                    data = {
                        "file_name": filename,
                        "prompt": caption_content,
                        "width": width,
                        "height": height
                    }

                    with open(json_filepath, 'w', encoding='utf-8') as outfile:
                        json.dump(data, outfile, indent=4, ensure_ascii=False)

                    print(f"Generated: {json_filename}")
                    json_files_created.append(json_filepath)

                except Exception as e:
                    print(f"Error processing file {filename}: {e}")
            else:
                print(f"Warning: No corresponding TXT file found for {filename}.")

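    # Create a TAR file containing every regular file in the directory.
    # Sorted order keeps files that belong to the same sample adjacent in the archive.
    output_tar_path = os.path.abspath(output_tar_name)
    with tarfile.open(output_tar_name, 'w') as tar:
        for item in sorted(os.listdir(input_dir)):
            item_path = os.path.join(input_dir, item)
            # Skip subdirectories and the output archive itself (if it is written into input_dir)
            if not os.path.isfile(item_path) or os.path.abspath(item_path) == output_tar_path:
                continue
            tar.add(item_path, arcname=item)  # arcname stores the bare file name, without the directory prefix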

    print(f"\nAll files have been packaged into: {output_tar_name}")
    print(f"Number of PNG images processed: {png_count}")

if __name__ == "__main__":
    input_directory = input("Please enter the directory path containing PNG and TXT files: ")
    output_tar_filename = input("Please enter the name of the output TAR file (default is output.tar): ") or "output.tar"
    process_data(input_directory, output_tar_filename)

I hope that the week I wasted can bring convenience and practicality to other developers.

lawrence-cj · 2025-01-05T10:04:17Z

Thanks @Pevernow. If you want, we would appreciate it if you could open a PR with your work.

We will also update the metadata.json construction script later, but the two do not conflict. It would be great to have a script that converts an image-text dataset into SanaWebDatasetMS, which we currently lack.

lawrence-cj added the "working" label (working on this issue) on Jan 2, 2025

Pevernow mentioned this issue Jan 5, 2025

[Tool]ImgDataset2WebDatasetMS #130

Merged
