Question: How to construct wids-meta.json for SanaWebDataset? #121

Open
Pevernow opened this issue Dec 29, 2024 · 6 comments

@Pevernow
Contributor

Pevernow commented Dec 29, 2024

Could you please tell me how to train on non-square images in this project or the original project?

The official example only covers the ImgDataset type, which, according to an issue, does not support non-square images.

There is also no official dataset example or documentation for SanaWebDatasetMS.

I spent a whole day trying to construct wids-meta.json and the dataset structure by reading the code,
but I got stuck on Error detail: ".json" and could not solve it no matter how hard I tried.

Can you help me? Thank you very much.

@Pevernow
Contributor Author

Anyone here?

@lawrence-cj
Collaborator

We will update the wids-meta json related code soon.

@lawrence-cj added the "working" label (working on this issue) on Jan 2, 2025
@Pevernow
Contributor Author

Pevernow commented Jan 4, 2025

> We will update the wids-meta json related code soon.

@lawrence-cj

Thank you so much.
I have been working on this problem for a week and have asked many people for help, but I keep running into problems whether I try to construct a dataset or use a third-party trainer.
To be honest, this is definitely the most difficult model to train that I have encountered.
I even got a little discouraged about it.

My original plan was just to train on a simple image-text format like ImgDataset (since that's how my data is stored locally), but I'm stuck with WebDataset. I had to try various methods, without documentation, to convert this local data into WebDataset format while still meeting Sana's reading needs.

It's very frustrating: Sana is such a good project, but the details are dizzying.

@lawrence-cj
Collaborator

lawrence-cj commented Jan 5, 2025

Was there a problem with ImgDataset that made you switch to WebDataset? @Pevernow

@Pevernow
Contributor Author

Pevernow commented Jan 5, 2025

@lawrence-cj

Solved.

After reading the source code and with the help of enthusiastic developers, I constructed the correct format of the multi-scale training set.

Here is an example.

wids-meta.json

{
  "wids_version": 1,
  "name": "",
  "description": "WIDS metafile for tar archives in ./",
  "shardlist": [
    {
      "url": "output.tar",
      "nsamples": 25453
    }
  ]
}
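
A metafile like this can also be generated rather than written by hand. The following is only a minimal sketch, not part of Sana or the wids library: the helper names `count_samples` and `write_wids_meta` are made up for illustration, and it assumes that one sample corresponds to one distinct file name up to the first dot and that the metafile sits in the same directory as the tar archives.

```python
import json
import os
import tarfile


def count_samples(tar_path):
    """Count distinct sample keys in a tar (member name up to the first dot)."""
    keys = set()
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if member.isfile():
                directory, base = os.path.split(member.name)
                keys.add(os.path.join(directory, base.split(".", 1)[0]))
    return len(keys)


def write_wids_meta(tar_paths, meta_path="wids-meta.json"):
    """Write a metafile with the same fields as the example above."""
    meta = {
        "wids_version": 1,
        "name": "",
        "description": "WIDS metafile for tar archives in ./",
        # Assumption: the shards are referenced by bare file name,
        # i.e. the metafile is stored next to them.
        "shardlist": [
            {"url": os.path.basename(path), "nsamples": count_samples(path)}
            for path in tar_paths
        ],
    }
    with open(meta_path, "w", encoding="utf-8") as f:
        json.dump(meta, f, indent=2)
    return meta
```

Calling `write_wids_meta(["output.tar"])` would then produce one shardlist entry per archive, with `nsamples` taken from the archive itself instead of being typed in manually.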

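Internal structure of a Tar file:
[screenshot of the tar file contents; image not preserved]

Internal structure of Json annotation:

Note: the screenshot is outdated. Use "prompt" instead of "caption" as the key!
[screenshot of a JSON annotation; image not preserved]

I also provide a simple script that converts an image-text dataset (PNG files with matching TXT captions) into the format read by SanaWebDatasetMS: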

import os
import json
import tarfile

from PIL import Image, PngImagePlugin

# Raise Pillow's limit on PNG text chunks so images with large embedded metadata still load
PngImagePlugin.MAX_TEXT_CHUNK = 100 * 1024 * 1024

def process_data(input_dir, output_tar_name="output.tar"):
    """
    Processes a directory containing PNG files, generates corresponding JSON files,
    and packages all files into a TAR file. It also counts the number of processed PNG images,
    and saves the height and width of each PNG file to the JSON.

    Args:
        input_dir (str): The input directory containing PNG files.
        output_tar_name (str): The name of the output TAR file (default is "output.tar").
    """
    png_count = 0
    json_files_created = []

    for filename in os.listdir(input_dir):
        if filename.lower().endswith(".png"):
            png_count += 1
            base_name = filename[:-4]  # Remove the ".png" extension
            txt_filename = os.path.join(input_dir, base_name + ".txt")
            json_filename = base_name + ".json"
            json_filepath = os.path.join(input_dir, json_filename)
            png_filepath = os.path.join(input_dir, filename)

            if os.path.exists(txt_filename):
                try:
                    # Get the dimensions of the PNG image
                    with Image.open(png_filepath) as img:
                        width, height = img.size

                    with open(txt_filename, 'r', encoding='utf-8') as f:
                        caption_content = f.read().strip()

                    data = {
                        "file_name": filename,
                        "prompt": caption_content,
                        "width": width,
                        "height": height
                    }

                    with open(json_filepath, 'w', encoding='utf-8') as outfile:
                        json.dump(data, outfile, indent=4, ensure_ascii=False)

                    print(f"Generated: {json_filename}")
                    json_files_created.append(json_filepath)

                except Exception as e:
                    print(f"Error processing file {filename}: {e}")
            else:
                print(f"Warning: No corresponding TXT file found for {filename}.")

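    # Create a TAR file and include all files.
    # Sort the names so that files sharing a base name (one sample) sit next to
    # each other in the archive; os.listdir() returns them in arbitrary order.
    with tarfile.open(output_tar_name, 'w') as tar:
        for item in sorted(os.listdir(input_dir)):
            item_path = os.path.join(input_dir, item)
            tar.add(item_path, arcname=item)  # arcname stores the file under its bare name at the archive root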

    print(f"\nAll files have been packaged into: {output_tar_name}")
    print(f"Number of PNG images processed: {png_count}")

if __name__ == "__main__":
    input_directory = input("Please enter the directory path containing PNG and TXT files: ")
    output_tar_filename = input("Please enter the name of the output TAR file (default is output.tar): ") or "output.tar"
    process_data(input_directory, output_tar_filename)
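
Before training, it may help to check that every sample in the archive actually has the layout described above. The following is a minimal sketch, not part of Sana: `check_tar` and `REQUIRED_KEYS` are hypothetical names, and the only assumption is the layout from this comment (one .png plus one .json per base name, with the annotation holding "prompt", "width" and "height").

```python
import json
import os
import tarfile
from collections import defaultdict

# Keys written by the conversion script above (assumed to be what the loader needs).
REQUIRED_KEYS = ("prompt", "width", "height")


def check_tar(tar_path):
    """Return a list of human-readable problems found in the archive."""
    samples = defaultdict(dict)
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            stem, ext = os.path.splitext(member.name)
            ext = ext.lower()
            if ext == ".json":
                # Parse the annotation so its keys can be checked below.
                samples[stem][ext] = json.load(tar.extractfile(member))
            else:
                samples[stem][ext] = None

    problems = []
    for stem, files in sorted(samples.items()):
        if ".png" not in files:
            problems.append(f"{stem}: missing .png")
        if ".json" not in files:
            problems.append(f"{stem}: missing .json")
            continue
        missing = [key for key in REQUIRED_KEYS if key not in files[".json"]]
        if missing:
            problems.append(f"{stem}: annotation lacks {missing}")
    return problems
```

An empty list from `check_tar("output.tar")` means every image has an annotation with the expected keys. A PNG that had no TXT caption, or an annotation still using "caption", shows up as an entry in the list.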

I hope that the week I wasted can bring convenience and practicality to other developers.

@lawrence-cj
Collaborator

Thanks @Pevernow. If you want, we would appreciate it if you could open a PR with your work.

We will also update the metadata.json construction script later, but the two do not conflict. It would be great to have a script that converts an image-text dataset into SanaWebDatasetMS format, which we don't have yet.
