addressing issue #5, SD3 training #11

Draft · wants to merge 7 commits into main
Conversation

@bigbraindump commented Nov 19, 2024

I'm attempting to use the existing training setup with Stable Diffusion 3. Here's the history of errors encountered so far; please let me know how to proceed with Error 4 -

Error 1:
TypeError: argument of type 'PosixPath' is not iterable

Solution 1:
Added str conversion in dataset.py
output_path = str(output_path) # after mkdir
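
This presumably comes from a substring membership check (something like 'x' in path) somewhere downstream, which works on a str but fails on a PosixPath. A minimal sketch of the fix in context, assuming args.output_path as in dataset.py:

from pathlib import Path

output_path = Path(args.output_path)
output_path.mkdir(parents=True, exist_ok=True)
# downstream membership checks ('x' in value) require a plain string,
# not a PosixPath, hence the conversion after mkdir
output_path = str(output_path)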

Error 2:
train_controlnet_sd3.py: error: unrecognized arguments: --enable_xformers_memory_efficient_attention
Solution 2:
Removed the argument, since the SD3 training script does not expose it.

Error 3:

Traceback (most recent call last):
  File "/data/stariq/stariq/signwriting_illustration/controlnet_huggingface/diffusers/examples/controlnet/train_controlnet_sd3.py", line 1412, in <module>
    main(args)
  File "/data/stariq/stariq/signwriting_illustration/controlnet_huggingface/diffusers/examples/controlnet/train_controlnet_sd3.py", line 989, in main
    controlnet = SD3ControlNetModel.from_transformer(transformer)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stariq/data/conda/envs/diffusers/lib/python3.11/site-packages/diffusers/models/controlnets/controlnet_sd3.py", line 251, in from_transformer
    controlnet = cls(**config)
                 ^^^^^^^^^^^^^
  File "/home/stariq/data/conda/envs/diffusers/lib/python3.11/site-packages/diffusers/configuration_utils.py", line 665, in inner_init
    init(self, *args, **init_kwargs)
TypeError: SD3ControlNetModel.__init__() got an unexpected keyword argument 'dual_attention_layers'
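
For context, the traceback shows that from_transformer forwards the transformer's config wholesale into the constructor (controlnet = cls(**config)), so every key in the SD3 transformer config - including dual_attention_layers and qk_norm - must be accepted by SD3ControlNetModel.__init__. Roughly:

# simplified sketch of what from_transformer does, per the traceback above
config = dict(transformer.config)
controlnet = SD3ControlNetModel(**config)  # TypeError if a config key is not an __init__ parameter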

Solution 3: Modified controlnet_sd3.py to accept the additional constructor parameters dual_attention_layers and qk_norm.

class SD3ControlNetModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin):
    _supports_gradient_checkpointing = True

    @register_to_config
    def __init__(
        self,
        sample_size: int = 128,
        patch_size: int = 2,
        in_channels: int = 16,
        num_layers: int = 18,
        attention_head_dim: int = 64,
        num_attention_heads: int = 18,
        joint_attention_dim: int = 4096,
        caption_projection_dim: int = 1152,
        pooled_projection_dim: int = 2048,
        out_channels: int = 16,
        pos_embed_max_size: int = 96,
        extra_conditioning_channels: int = 0,
        dual_attention_layers=(),  # added: present in the transformer config
        qk_norm=None,  # added: present in the transformer config
        **kwargs
    ):
        super().__init__()
        default_out_channels = in_channels
        self.out_channels = out_channels if out_channels is not None else default_out_channels
        self.inner_dim = num_attention_heads * attention_head_dim
        self.dual_attention_layers = dual_attention_layers
        self.qk_norm = qk_norm
        # ... rest of __init__ unchanged

Error 4:

Traceback (most recent call last):
  File "/data/stariq/stariq/signwriting_illustration/controlnet_huggingface/diffusers/examples/controlnet/train_controlnet_sd3.py", line 1412, in <module>
    main(args)
  File "/data/stariq/stariq/signwriting_illustration/controlnet_huggingface/diffusers/examples/controlnet/train_controlnet_sd3.py", line 1100, in main
    train_dataset = make_train_dataset(args, tokenizer_one, tokenizer_two, tokenizer_three, accelerator)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/stariq/stariq/signwriting_illustration/controlnet_huggingface/diffusers/examples/controlnet/train_controlnet_sd3.py", line 664, in make_train_dataset
    raise ValueError(
ValueError: `--image_column` value 'image' not found in dataset columns. Dataset columns are: _data_files, _fingerprint, _format_columns, _format_kwargs, _format_type, _output_all_columns, _split 

P.S. I did try to include only the relevant files in this branch comparison, but even after a local reset, file changes from other open PRs still show up in this PR.

@AmitMY commented Nov 19, 2024

There are multiple issues with this PR:

  1. Please do not create a PR with all features at once; I can't review the changes properly. Check out the main branch, reset to the latest commit on this repository, then branch out to a new branch from that commit. Then you can do whatever you want, and the PR will only include those changes.

  2. For "Error 3" - while it is an OK-ish intermediate solution - did you open an issue on huggingface for it? It needs to be fixed at the source, not by copying.

  3. The technical issue I can spot is that you seem to be having some sort of problem with environment variables - I don't know why, but, for example, on line 26 you replace the environment variable with a cat of a file. Then, on line 57, you still use environment variables.
    [screenshot of train.sh]
    I would suggest you don't use cat, and instead solve the problem at its source - why can't you use variables? See the sketch below.
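
For example, a minimal sketch on the Python side (HF_DATASET_DIR is a hypothetical name here):

import os

# read the value from the environment and fail fast if it is missing,
# rather than cat'ing a file into the command
train_data_dir = os.environ.get("HF_DATASET_DIR")
if not train_data_dir or not os.path.exists(train_data_dir):
    raise FileNotFoundError(f"HF_DATASET_DIR unset or missing: {train_data_dir}")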

@bigbraindump commented Nov 20, 2024

  1. I followed the steps below to create this branch 'clean-SD3' and wasn't able to determine the cause of the problem -
git checkout -b clean-SD3
git reset --hard upstream/main
git stash #for unwanted changes
git add controlnet_sd3.py # the original controlnet_sd3.py file 
git commit -m "message" # commit <7b250ff>
git push origin clean-SD3

# further changes made to dataset.py and train.sh
git add dataset.py train.sh controlnet_sd3.py
git commit -m "message" # commit <aaf3b83>
git push origin clean-SD3
  2. Done. (issue)

  3. No particular reason; it was a preference to save the key and token in files in the directory. However, even after reverting to the original environment variable usage with no cat, Error 4 persists. Below are some more debugging attempts, detailed with no success yet -

The issue seems to stem from the load_dataset() function that the SD3 training script uses.
- printed the column names generated from dataset.py to confirm correct creation
- tested alternate loading methods - load_from_disk():

DEBUG: Trying different dataset loading methods...
Method 1 - load_from_disk:
Failed: argument of type 'NoneType' is not iterable
Method 2 - load_dataset:
Failed: Dataset 'data' doesn't exist on the Hub or cannot be accessed.

- possible that the issue lies in the process_train() function of the training script; tried passing both image path strings and PIL image objects
- attempted to convert the dataset into the ImageFolder format so that a custom dataset would not be required
- also tried passing trust_remote_code=True, as suggested in the legacy method on the huggingface dataset-loading page
- testing this PR approach as per the last meeting, specified the controlnet model in the training parameters

@AmitMY commented Nov 20, 2024

  1. Not sure exactly what you did, but this branch, at the moment, is still full of other files.
  2. Great. Left a note on Slack.
  3. Alright, so now you are using environment variables -
    Note that when you call load_dataset here, you should be seeing the dataset, since you are loading it from disk. For debugging, I'd recommend printing train_data_dir to verify that it comes through correctly from the environment variable, then checking that the directory exists (os.path.exists(path)), then printing the resulting dataset. This is exactly the same code as the normal ControlNet, so it should work exactly the same.

@bigbraindump commented
  1. Thanks!
  2. Debugging -
import os
from datasets import load_dataset, load_from_disk

train_data_dir = os.getenv("HF_DATASET_DIR")  
print(f"Train data directory: {train_data_dir}")

if not os.path.exists(train_data_dir):
    raise FileNotFoundError(f"The directory {train_data_dir} does not exist.")

print("\nTrying load_from_disk:")
try:
    dataset1 = load_from_disk(train_data_dir)
    print(f"Dataset loaded with load_from_disk: {dataset1}")
except Exception as e:
    print(f"Error with load_from_disk: {str(e)}")

print("\nTrying load_dataset:")
try:
    dataset2 = load_dataset("imagefolder", data_dir=train_data_dir)
    print(f"Dataset loaded with load_dataset: {dataset2}")
except Exception as e:
    print(f"Error with load_dataset: {str(e)}")

debugging output -

+ python debug.py
Train data directory: /scratch/stariq/signwriting-illustration

Trying load_from_disk:
Dataset loaded with load_from_disk: DatasetDict({
    train: Dataset({
        features: ['conditioning_image', 'ground_truth_image', 'caption'],
        num_rows: 2601
    })
})

Trying load_dataset:
Resolving data files: 100%|██████████| 5205/5205 [00:00<00:00, 114163.40it/s]
Downloading data: 100%|██████████| 5204/5204 [00:00<00:00, 75849.32files/s]
Generating train split: 0 examples [00:00, ? examples/s]
Error with load_dataset: An error occurred while generating the dataset
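
One hypothesis for this split result: save_to_disk writes Arrow shards plus internal metadata files, and the matching reader is load_from_disk; the imagefolder builder instead expects raw image files (optionally with a metadata.jsonl) and fails on the Arrow files it finds. A minimal sketch of the pairing, reusing the HF_DATASET_DIR variable from the debug script above:

import os
from datasets import load_from_disk

# a dataset written with save_to_disk(...) must be read back with
# load_from_disk; the builders behind load_dataset treat the directory
# contents as raw data files instead
dataset = load_from_disk(os.environ["HF_DATASET_DIR"])
print(dataset["train"].column_names)  # ['conditioning_image', 'ground_truth_image', 'caption']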

@AmitMY commented Nov 21, 2024

I suggest you debug directly in this script: https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet_sd3.py#L644-L649

Specifically, where marked, print the directories and check that they exist, print the dataset, print the error, etc.

If there is an error loading the dataset but the directory is correct, make sure the dataset is generated correctly in the dataset.py script.

@bigbraindump commented
Here's the edited function (train_controlnet_sd3.py):

def make_train_dataset(args, tokenizer_one, tokenizer_two, tokenizer_three, accelerator):
    try:
        if args.train_data_dir is not None:
            print(f"Loading: {args.train_data_dir}")
            dataset = load_dataset(args.train_data_dir, cache_dir=args.cache_dir)
            print("Dataset loaded:", dataset)
            print("Train split features:", dataset["train"].features)
            print("Column names:", dataset["train"].column_names)
            
            # Verify image column
            if args.image_column not in dataset["train"].column_names:
                print(f"\nERROR: Image column '{args.image_column}' not found in columns: {dataset['train'].column_names}")
                
            return dataset["train"]
    except Exception as e:
        print(f"Error loading dataset: {e}")
        raise

    raise ValueError("No dataset provided")

Output of the edited function:

Loading: /scratch/stariq/signwriting-illustration
Dataset loaded: DatasetDict({
    train: Dataset({
        features: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split'],
        num_rows: 1
    })
})
Train split features: {'_data_files': [{'filename': Value(dtype='string', id=None)}], '_fingerprint': Value(dtype='string', id=None), '_format_columns': Value(dtype='null', id=None), '_format_kwargs': {}, '_format_type': Value(dtype='null', id=None), '_output_all_columns': Value(dtype='bool', id=None), '_split': Value(dtype='string', id=None)} 
Column names: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split']

ERROR: Image column 'image' not found in columns: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split']      
Map:   0%|                                                                                                                                      | 0/1 [00:00<?, ? examples/s]

Dataset generation check (dataset.py):

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-path", type=str, required=True)
    parser.add_argument("--output-path", type=str, required=True)
    args = parser.parse_args()

    train_path = Path(args.train_path)
    output_path = Path(args.output_path)

    output_path.mkdir(parents=True, exist_ok=True)
    output_path = str(output_path)

    dataset = SignWritingIllustrationDataset(train_path)
    dataset.download_and_prepare(output_path)
    
    # verify
    ds = dataset.as_dataset()
    print(ds)
    print("Columns:", ds['train'].column_names)
        
    print("\nFirst example:")
    first_example = ds['train'][0]
    print("Keys:", list(first_example.keys()))
    print("Caption:", first_example['caption'])
    print("Image size:", first_example['image'].size)
    print("Control image size:", first_example['control_image'].size)
        
    ds.save_to_disk(output_path)

    print(f"Skipped {dataset.get_skipped_images()} images")

Dataset generation check (dataset.py) output :(

+ python dataset.py --train-path=../../train --output-path=/scratch/stariq/signwriting-illustration
Repo card metadata block was not found. Setting CardData to empty.
DatasetDict({
    train: Dataset({
        features: ['conditioning_image', 'ground_truth_image', 'caption'],
        num_rows: 2601
    })
})
Columns: ['conditioning_image', 'ground_truth_image', 'caption']

First example:
Keys: ['conditioning_image', 'ground_truth_image', 'caption']
Caption: An illustration of a man with short hair, with orange arrows. Watermark text: signecriture.org i.
Traceback (most recent call last):
  File "/data/stariq/stariq/signwriting_illustration/controlnet_huggingface/dataset.py", line 81, in <module>
    print("Image size:", first_example['image'].size)
                         ~~~~~~~~~~~~~^^^^^^^^^
KeyError: 'image'

@AmitMY commented Nov 22, 2024

Great. You see that the columns are:

Columns: ['conditioning_image', 'ground_truth_image', 'caption']

There is no image column available, so you need to change the argument accordingly.

@bigbraindump commented Nov 22, 2024

I have also tried that, without success -

  1. Once the arguments are changed in train.sh -
--conditioning_image_column=conditioning_image \
--image_column=ground_truth_image \
--caption_column=caption
  2. The same error persists -

  File "/data/stariq/stariq/signwriting_illustration/controlnet_huggingface/diffusers/examples/controlnet/train_controlnet_sd3.py", line 667, in make_train_dataset
    raise ValueError(
ValueError: `--image_column` value 'ground_truth_image' not found in dataset columns. Dataset columns are: _data_files, _fingerprint, _format_columns, _format_kwargs, _format_type, _output_all_columns, _split

@AmitMY commented Nov 23, 2024

The dataset.py output says the ground_truth_image column exists, but the training script sees the following columns:

['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split']

Maybe the path where you create the dataset and the path you load it from are different?

Either the path is wrong, or the specific method to load the dataset is wrong. Have you tried load_from_disk instead of load_dataset?
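
Incidentally, those _data_files / _fingerprint / _format_* columns look like the fields of the state.json file that save_to_disk writes (which would also explain num_rows: 1); that would mean load_dataset is parsing the directory's internal metadata as if it were the data. A quick check - the expected file names are an assumption about the datasets on-disk layout:

import os

train_data_dir = os.environ["HF_DATASET_DIR"]  # hypothetical variable name
# a save_to_disk directory should contain dataset_dict.json plus per-split
# folders holding state.json, dataset_info.json and .arrow shards
print(os.listdir(train_data_dir))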

@bigbraindump commented
Yes, I've tried load_from_disk, and I have verified the path and env variables. Although not ideal, adding load_from_disk to the training script partially solves the data-loading issue.

  1. The data-loading issue was solved by editing the train_controlnet_sd3.py script in the section you previously highlighted, keeping the original parameters -
    if args.dataset_name is not None:
        # Downloading and loading a dataset from the hub.
        dataset = load_dataset(
            args.dataset_name,
            args.dataset_config_name,
            cache_dir=args.cache_dir,
        )
    else:
        if args.train_data_dir is not None:
            dataset = load_from_disk(
                args.train_data_dir,
            )
Available columns: ['control_image', 'image', 'caption']
Map: 100%|██████████| 2694/2694 [00:49<00:00, 54.53 examples/s]
  2. However, another error comes up in data processing, with a channel mismatch -
Steps:   0%|          | 0/337000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/data/stariq/stariq/signwriting_illustration/controlnet_huggingface/diffusers/examples/controlnet/train_controlnet_sd3.py", line 1413, in <module>
    main(args)
  ...
RuntimeError: Given groups=1, weight of size [1536, 17, 2, 2], expected input[1, 16, 64, 64] to have 17 channels, but got 16 channels instead

This is odd because the dataset creation already handles 3-channel RGB (here for resolution 512).
  3. I have also checked the channel outputs at different points in the training setup -

Steps:   0%|          | 0/337000 [00:00<?, ?it/s]11/23/2024 00:26:02 - INFO - __main__ - noisy_model_input shape: torch.Size([1, 16, 64, 64]), dtype: torch.float16
11/23/2024 00:26:02 - INFO - __main__ - Raw controlnet_image shape: torch.Size([1, 3, 512, 512]), dtype: torch.float16
11/23/2024 00:26:02 - INFO - __main__ - VAE encoded controlnet_image shape: torch.Size([1, 16, 64, 64]), dtype: torch.float16
11/23/2024 00:26:02 - INFO - __main__ - Scaled controlnet_image shape: torch.Size([1, 16, 64, 64]), dtype: torch.float16
11/23/2024 00:26:02 - INFO - __main__ - First conv layer weight shape: torch.Size([1536, 16, 2, 2])

So the channel mismatch lies in the ControlNet initialization for SD3 specifically. I'm looking into the SD3 ControlNet code from diffusers more closely at the moment.
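
If the extra_conditioning_channels field from the constructor patch above defaults to 1 when the ControlNet is built via from_transformer - an assumption about the diffusers API worth verifying against the installed version - that would account for 17 = 16 + 1. A minimal sketch of the check and a possible fix:

# check how many extra conditioning channels the ControlNet was built with
print(controlnet.config.extra_conditioning_channels)

# if the training pipeline feeds a VAE-encoded (16-channel) control latent,
# the extra channel can be disabled at construction time; the keyword name
# num_extra_conditioning_channels is an assumption about from_transformer's signature
controlnet = SD3ControlNetModel.from_transformer(transformer, num_extra_conditioning_channels=0)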

@AmitMY commented Nov 23, 2024

Alright.
Now that huggingface has addressed your issue huggingface/diffusers#9974 (which can now be closed), I recommend you open a new issue describing exactly what command you are running and what error you get, including a stack trace.

It is to be expected that the examples in their repo work directly.
