addressing issue #5, SD3 training #11

Draft · wants to merge 7 commits into main
Conversation

@bigbraindump commented Nov 19, 2024

I'm attempting to use the existing training setup with Stable Diffusion 3. Here's the history of errors encountered so far; please let me know how to proceed with Error 4 -

Error 1:
TypeError: argument of type 'PosixPath' is not iterable

Solution 1:
Added str conversion in dataset.py
output_path = str(output_path) # after mkdir
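
This presumably comes from a substring membership check (something like 'x' in path) somewhere downstream, which works on a str but fails on a PosixPath. A minimal sketch of the fix in context, assuming args.output_path as in dataset.py:

from pathlib import Path

output_path = Path(args.output_path)
output_path.mkdir(parents=True, exist_ok=True)
# downstream membership checks ('x' in value) require a plain string,
# not a PosixPath, hence the conversion after mkdir
output_path = str(output_path)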

Error 2:
train_controlnet_sd3.py: error: unrecognized arguments: --enable_xformers_memory_efficient_attention
Solution 2:
Removed the argument, since the SD3 training script does not expose it.

Error 3:

Traceback (most recent call last):
  File "/data/stariq/stariq/signwriting_illustration/controlnet_huggingface/diffusers/examples/controlnet/train_controlnet_sd3.py", line 1412, in <module>
    main(args)
  File "/data/stariq/stariq/signwriting_illustration/controlnet_huggingface/diffusers/examples/controlnet/train_controlnet_sd3.py", line 989, in main
    controlnet = SD3ControlNetModel.from_transformer(transformer)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stariq/data/conda/envs/diffusers/lib/python3.11/site-packages/diffusers/models/controlnets/controlnet_sd3.py", line 251, in from_transformer
    controlnet = cls(**config)
                 ^^^^^^^^^^^^^
  File "/home/stariq/data/conda/envs/diffusers/lib/python3.11/site-packages/diffusers/configuration_utils.py", line 665, in inner_init
    init(self, *args, **init_kwargs)
TypeError: SD3ControlNetModel.__init__() got an unexpected keyword argument 'dual_attention_layers'
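
For context, the traceback shows that from_transformer forwards the transformer's config wholesale into the constructor (controlnet = cls(**config)), so every key in the SD3 transformer config - including dual_attention_layers and qk_norm - must be accepted by SD3ControlNetModel.__init__. Roughly:

# simplified sketch of what from_transformer does, per the traceback above
config = dict(transformer.config)
controlnet = SD3ControlNetModel(**config)  # TypeError if a config key is not an __init__ parameter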

Solution 3: Modified controlnet_sd3.py to accept the additional constructor parameters dual_attention_layers and qk_norm.

class SD3ControlNetModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin):
    _supports_gradient_checkpointing = True

    @register_to_config
    def __init__(
        self,
        sample_size: int = 128,
        patch_size: int = 2,
        in_channels: int = 16,
        num_layers: int = 18,
        attention_head_dim: int = 64,
        num_attention_heads: int = 18,
        joint_attention_dim: int = 4096,
        caption_projection_dim: int = 1152,
        pooled_projection_dim: int = 2048,
        out_channels: int = 16,
        pos_embed_max_size: int = 96,
        extra_conditioning_channels: int = 0,
        dual_attention_layers=(),  # added: present in the transformer config
        qk_norm=None,  # added: present in the transformer config
        **kwargs
    ):
        super().__init__()
        default_out_channels = in_channels
        self.out_channels = out_channels if out_channels is not None else default_out_channels
        self.inner_dim = num_attention_heads * attention_head_dim
        self.dual_attention_layers = dual_attention_layers
        self.qk_norm = qk_norm
        # ... rest of __init__ unchanged

Error 4:

Traceback (most recent call last):
  File "/data/stariq/stariq/signwriting_illustration/controlnet_huggingface/diffusers/examples/controlnet/train_controlnet_sd3.py", line 1412, in <module>
    main(args)
  File "/data/stariq/stariq/signwriting_illustration/controlnet_huggingface/diffusers/examples/controlnet/train_controlnet_sd3.py", line 1100, in main
    train_dataset = make_train_dataset(args, tokenizer_one, tokenizer_two, tokenizer_three, accelerator)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/stariq/stariq/signwriting_illustration/controlnet_huggingface/diffusers/examples/controlnet/train_controlnet_sd3.py", line 664, in make_train_dataset
    raise ValueError(
ValueError: `--image_column` value 'image' not found in dataset columns. Dataset columns are: _data_files, _fingerprint, _format_columns, _format_kwargs, _format_type, _output_all_columns, _split 

P.S. I did try to include only the relevant files in this branch comparison, but even after a local reset, file changes from other open PRs still show up in this PR.

@AmitMY commented Nov 19, 2024

There are multiple issues with this PR:

  1. Please do not create a PR with all features at once; I can't review the changes properly. Check out the main branch, reset to the latest commit on this repository, then branch out to a new branch from that commit. Then you can do whatever you want, and the PR will only include those changes.

  2. For "Error 3" - while it is an OK-ish intermediate solution - did you open an issue on huggingface for it? It needs to be fixed at the source, not by copying.

  3. The technical issue I can spot is that you seem to be having some sort of problem with environment variables - I don't know why, but, for example, on line 26 you replace the environment variable with a cat of a file. Then, on line 57, you still use environment variables.
    [screenshot of train.sh]
    I would suggest you don't use cat, and instead solve the problem at its source - why can't you use variables? See the sketch below.
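
For example, a minimal sketch on the Python side (HF_DATASET_DIR is a hypothetical name here):

import os

# read the value from the environment and fail fast if it is missing,
# rather than cat'ing a file into the command
train_data_dir = os.environ.get("HF_DATASET_DIR")
if not train_data_dir or not os.path.exists(train_data_dir):
    raise FileNotFoundError(f"HF_DATASET_DIR unset or missing: {train_data_dir}")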

@bigbraindump commented Nov 20, 2024

  1. I followed the steps below to create this branch 'clean-SD3' and wasn't able to determine the cause of the problem -
git checkout -b clean-SD3
git reset --hard upstream/main
git stash #for unwanted changes
git add controlnet_sd3.py # the original controlnet_sd3.py file 
git commit -m "message" # commit <7b250ff>
git push origin clean-SD3

# further changes made to dataset.py and train.sh
git add dataset.py train.sh controlnet_sd3.py
git commit -m "message" # commit <aaf3b83>
git push origin clean-SD3
  2. Done. (issue)

  3. No particular reason; it was a preference to save the key and token in files in the directory. However, even after reverting to the original environment variable usage with no cat, Error 4 persists. Below are some more debugging attempts, detailed with no success yet -

The issue seems to stem from the load_dataset() function that the SD3 training script uses.
- printed the column names generated from dataset.py to confirm correct creation
- tested alternate loading methods - load_from_disk():

DEBUG: Trying different dataset loading methods...
Method 1 - load_from_disk:
Failed: argument of type 'NoneType' is not iterable
Method 2 - load_dataset:
Failed: Dataset 'data' doesn't exist on the Hub or cannot be accessed.

- possible that the issue lies in the process_train() function of the training script; tried passing both image path strings and PIL image objects
- attempted to convert the dataset into the ImageFolder format so that a custom dataset would not be required
- also tried passing trust_remote_code=True, as suggested in the legacy method on the huggingface dataset-loading page
- testing this PR approach as per the last meeting, specified the controlnet model in the training parameters

@AmitMY commented Nov 20, 2024

  1. Not sure exactly what you did, but this branch, at the moment, is still full of other files.
  2. Great. Left a note on Slack.
  3. Alright, so now you are using environment variables -
    Note that when you call load_dataset here, you should be seeing the dataset, since you are loading it from disk. For debugging, I'd recommend printing train_data_dir to verify that it comes through correctly from the environment variable, then checking that the directory exists (os.path.exists(path)), then printing the resulting dataset. This is exactly the same code as the normal ControlNet, so it should work exactly the same.

@bigbraindump commented
  1. Thanks!
  2. Debugging -
import os
from datasets import load_dataset, load_from_disk

train_data_dir = os.getenv("HF_DATASET_DIR")  
print(f"Train data directory: {train_data_dir}")

if not os.path.exists(train_data_dir):
    raise FileNotFoundError(f"The directory {train_data_dir} does not exist.")

print("\nTrying load_from_disk:")
try:
    dataset1 = load_from_disk(train_data_dir)
    print(f"Dataset loaded with load_from_disk: {dataset1}")
except Exception as e:
    print(f"Error with load_from_disk: {str(e)}")

print("\nTrying load_dataset:")
try:
    dataset2 = load_dataset("imagefolder", data_dir=train_data_dir)
    print(f"Dataset loaded with load_dataset: {dataset2}")
except Exception as e:
    print(f"Error with load_dataset: {str(e)}")

debugging output -

+ python debug.py
Train data directory: /scratch/stariq/signwriting-illustration

Trying load_from_disk:
Dataset loaded with load_from_disk: DatasetDict({
    train: Dataset({
        features: ['conditioning_image', 'ground_truth_image', 'caption'],
        num_rows: 2601
    })
})

Trying load_dataset:
Resolving data files: 100%|██████████| 5205/5205 [00:00<00:00, 114163.40it/s]
Downloading data: 100%|██████████| 5204/5204 [00:00<00:00, 75849.32files/s]
Generating train split: 0 examples [00:00, ? examples/s]
Error with load_dataset: An error occurred while generating the dataset
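
One hypothesis for this split result: save_to_disk writes Arrow shards plus internal metadata files, and the matching reader is load_from_disk; the imagefolder builder instead expects raw image files (optionally with a metadata.jsonl) and fails on the Arrow files it finds. A minimal sketch of the pairing, reusing the HF_DATASET_DIR variable from the debug script above:

import os
from datasets import load_from_disk

# a dataset written with save_to_disk(...) must be read back with
# load_from_disk; the builders behind load_dataset treat the directory
# contents as raw data files instead
dataset = load_from_disk(os.environ["HF_DATASET_DIR"])
print(dataset["train"].column_names)  # ['conditioning_image', 'ground_truth_image', 'caption']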

@AmitMY commented Nov 21, 2024

I suggest you debug directly in this script: https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet_sd3.py#L644-L649

Specifically, where marked, print the directories and check that they exist, print the dataset, print the error, etc.

If there is an error loading the dataset but the directory is correct, make sure the dataset is generated correctly in the dataset.py script.

@bigbraindump commented
Here's the edited function (train_controlnet_sd3.py):

def make_train_dataset(args, tokenizer_one, tokenizer_two, tokenizer_three, accelerator):
    try:
        if args.train_data_dir is not None:
            print(f"Loading: {args.train_data_dir}")
            dataset = load_dataset(args.train_data_dir, cache_dir=args.cache_dir)
            print("Dataset loaded:", dataset)
            print("Train split features:", dataset["train"].features)
            print("Column names:", dataset["train"].column_names)
            
            # Verify image column
            if args.image_column not in dataset["train"].column_names:
                print(f"\nERROR: Image column '{args.image_column}' not found in columns: {dataset['train'].column_names}")
                
            return dataset["train"]
    except Exception as e:
        print(f"Error loading dataset: {e}")
        raise

    raise ValueError("No dataset provided")

Output of the edited function:

Loading: /scratch/stariq/signwriting-illustration
Dataset loaded: DatasetDict({
    train: Dataset({
        features: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split'],
        num_rows: 1
    })
})
Train split features: {'_data_files': [{'filename': Value(dtype='string', id=None)}], '_fingerprint': Value(dtype='string', id=None), '_format_columns': Value(dtype='null', id=None), '_format_kwargs': {}, '_format_type': Value(dtype='null', id=None), '_output_all_columns': Value(dtype='bool', id=None), '_split': Value(dtype='string', id=None)} 
Column names: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split']

ERROR: Image column 'image' not found in columns: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split']      
Map:   0%|                                                                                                                                      | 0/1 [00:00<?, ? examples/s]

Dataset generation check (dataset.py):

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-path", type=str, required=True)
    parser.add_argument("--output-path", type=str, required=True)
    args = parser.parse_args()

    train_path = Path(args.train_path)
    output_path = Path(args.output_path)

    output_path.mkdir(parents=True, exist_ok=True)
    output_path = str(output_path)

    dataset = SignWritingIllustrationDataset(train_path)
    dataset.download_and_prepare(output_path)
    
    # verify
    ds = dataset.as_dataset()
    print(ds)
    print("Columns:", ds['train'].column_names)
        
    print("\nFirst example:")
    first_example = ds['train'][0]
    print("Keys:", list(first_example.keys()))
    print("Caption:", first_example['caption'])
    print("Image size:", first_example['image'].size)
    print("Control image size:", first_example['control_image'].size)
        
    ds.save_to_disk(output_path)

    print(f"Skipped {dataset.get_skipped_images()} images")

Dataset generation check (dataset.py) output :(

+ python dataset.py --train-path=../../train --output-path=/scratch/stariq/signwriting-illustration
Repo card metadata block was not found. Setting CardData to empty.
DatasetDict({
    train: Dataset({
        features: ['conditioning_image', 'ground_truth_image', 'caption'],
        num_rows: 2601
    })
})
Columns: ['conditioning_image', 'ground_truth_image', 'caption']

First example:
Keys: ['conditioning_image', 'ground_truth_image', 'caption']
Caption: An illustration of a man with short hair, with orange arrows. Watermark text: signecriture.org i.
Traceback (most recent call last):
  File "/data/stariq/stariq/signwriting_illustration/controlnet_huggingface/dataset.py", line 81, in <module>
    print("Image size:", first_example['image'].size)
                         ~~~~~~~~~~~~~^^^^^^^^^
KeyError: 'image'

@AmitMY commented Nov 22, 2024

Great. You see that the columns are:

Columns: ['conditioning_image', 'ground_truth_image', 'caption']

There is no image column available, so you need to change the argument accordingly.

@bigbraindump commented Nov 22, 2024

I have also tried that, without success -

  1. Once the arguments are changed in train.sh -
--conditioning_image_column=conditioning_image \
--image_column=ground_truth_image \
--caption_column=caption
  2. The same error persists -

  File "/data/stariq/stariq/signwriting_illustration/controlnet_huggingface/diffusers/examples/controlnet/train_controlnet_sd3.py", line 667, in make_train_dataset
    raise ValueError(
ValueError: `--image_column` value 'ground_truth_image' not found in dataset columns. Dataset columns are: _data_files, _fingerprint, _format_columns, _format_kwargs, _format_type, _output_all_columns, _split

@AmitMY commented Nov 23, 2024

The dataset.py output says the ground_truth_image column exists, but the training script sees the following columns:

['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split']

Maybe the path where you create the dataset and the path you load it from are different?

Either the path is wrong, or the specific method to load the dataset is wrong. Have you tried load_from_disk instead of load_dataset?
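
Incidentally, those _data_files / _fingerprint / _format_* columns look like the fields of the state.json file that save_to_disk writes (which would also explain num_rows: 1); that would mean load_dataset is parsing the directory's internal metadata as if it were the data. A quick check - the expected file names are an assumption about the datasets on-disk layout:

import os

train_data_dir = os.environ["HF_DATASET_DIR"]  # hypothetical variable name
# a save_to_disk directory should contain dataset_dict.json plus per-split
# folders holding state.json, dataset_info.json and .arrow shards
print(os.listdir(train_data_dir))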

@bigbraindump commented
Yes, I've tried load_from_disk, and I have verified the path and env variables. Although not ideal, adding load_from_disk to the training script partially solves the data-loading issue.

  1. The data-loading issue was solved by editing the train_controlnet_sd3.py script in the section you previously highlighted, keeping the original parameters -
    if args.dataset_name is not None:
        # Downloading and loading a dataset from the hub.
        dataset = load_dataset(
            args.dataset_name,
            args.dataset_config_name,
            cache_dir=args.cache_dir,
        )
    else:
        if args.train_data_dir is not None:
            dataset = load_from_disk(
                args.train_data_dir,
            )
Available columns: ['control_image', 'image', 'caption']
Map: 100%|██████████| 2694/2694 [00:49<00:00, 54.53 examples/s]
  2. However, another error comes up in data processing, with a channel mismatch -
Steps:   0%|          | 0/337000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/data/stariq/stariq/signwriting_illustration/controlnet_huggingface/diffusers/examples/controlnet/train_controlnet_sd3.py", line 1413, in <module>
    main(args)
  ...
RuntimeError: Given groups=1, weight of size [1536, 17, 2, 2], expected input[1, 16, 64, 64] to have 17 channels, but got 16 channels instead

This is odd because the dataset creation already handles 3-channel RGB (here for resolution 512).
  3. I have also checked the channel outputs at different points in the training setup -

Steps:   0%|          | 0/337000 [00:00<?, ?it/s]11/23/2024 00:26:02 - INFO - __main__ - noisy_model_input shape: torch.Size([1, 16, 64, 64]), dtype: torch.float16
11/23/2024 00:26:02 - INFO - __main__ - Raw controlnet_image shape: torch.Size([1, 3, 512, 512]), dtype: torch.float16
11/23/2024 00:26:02 - INFO - __main__ - VAE encoded controlnet_image shape: torch.Size([1, 16, 64, 64]), dtype: torch.float16
11/23/2024 00:26:02 - INFO - __main__ - Scaled controlnet_image shape: torch.Size([1, 16, 64, 64]), dtype: torch.float16
11/23/2024 00:26:02 - INFO - __main__ - First conv layer weight shape: torch.Size([1536, 16, 2, 2])

So the channel mismatch lies in the ControlNet initialization for SD3 specifically. I'm looking into the SD3 ControlNet code from diffusers more closely at the moment.
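
If the extra_conditioning_channels field from the constructor patch above defaults to 1 when the ControlNet is built via from_transformer - an assumption about the diffusers API worth verifying against the installed version - that would account for 17 = 16 + 1. A minimal sketch of the check and a possible fix:

# check how many extra conditioning channels the ControlNet was built with
print(controlnet.config.extra_conditioning_channels)

# if the training pipeline feeds a VAE-encoded (16-channel) control latent,
# the extra channel can be disabled at construction time; the keyword name
# num_extra_conditioning_channels is an assumption about from_transformer's signature
controlnet = SD3ControlNetModel.from_transformer(transformer, num_extra_conditioning_channels=0)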

@AmitMY commented Nov 23, 2024

Alright.
Now that huggingface has addressed your issue huggingface/diffusers#9974 (which can now be closed), I recommend you open a new issue describing exactly what command you are running and what error you get, including a stack trace.

It is to be expected that the examples in their repo work directly.
