Skip to content

Prepare for v0.1.0 release#322

Merged
a-r-r-o-w merged 10 commits into
mainfrom
docs/features
Mar 21, 2025
Merged

Prepare for v0.1.0 release#322
a-r-r-o-w merged 10 commits into
mainfrom
docs/features

Conversation

@a-r-r-o-w
Copy link
Copy Markdown
Contributor

@a-r-r-o-w a-r-r-o-w commented Mar 13, 2025

@neph1
Copy link
Copy Markdown
Contributor

neph1 commented Mar 14, 2025

I've come as far as to run a few tests with 'ptd', only a few steps. I hope to do a full run tomorrow. Here are some notes, I can convert them to issues if you'd like:

  1. precomputation will run to whatever value you set, even if you have less samples. There should probably be a "min" somewhere.

  2. When doing precomputation, both latents and conditions have the log output:
    Processing data on rank 0: 2%|▏ | 1/62 [00:01<01:27, 1.44s/it]�[A
    Previously it said:
    Precomputing conditions: 0%| | 0/62 [00:00<?, ?it/s]
    Maybe a difference between accelerate and ptd?

  3. I used gradient_accumulation = 8 but during the short test it 'felt' like it didn't accumulate anything, its/s were roughly 8 times faster.

  4. I couldn't run my .json datasets. I have done some changes and will make a PR, but I'm not entirely happy with the flow. Maybe some input will help.

  5. In initialize_dataset there's this code:

    try:
        does_repo_exist_on_hub = repo_exists(dataset_name_or_root, repo_type="dataset")
    except huggingface_hub.errors.HFValidationError:
        does_repo_exist_on_hub = False

which calls on hf.api. I haven't gone deeper than that, but does it call the hf hub actual to see if there's a dataset? In that case I find that to be a slight security issue, especially if it's a full local path being sent. I would prefer another way, or explicitly state whether it's a local or hub dataset in the dataset config.

Otherwise it seems to work well! Looking forward to a stable release, and especially the further steering options in the upcoming PR's. 🚀

@a-r-r-o-w
Copy link
Copy Markdown
Contributor Author

precomputation will run to whatever value you set, even if you have less samples. There should probably be a "min" somewhere.

Probably need some improvements around the docs for precomputation actually. The --precomputation_items flag should be configured manually for each dataset since the user knows exactly how many images/videos they're training with. Let me try to give some examples.

Firstly, any dataset provided is treated as an infinite buffer of data. For example, even if you have only 2 images in your dataset, they will be repeated as many times as are required for precomputation/training to be completed. Training is completed either when --train_steps is reached, or when --max_data_samples is reached.

For the examples below, I'm going to assume batch_size=1 training. Also, let's assume there are 20 data points.


If --enable_precomputation is not specified, no precomputation is done. All text encoders, transformer and vae are loaded onto GPU and training is done with on-the-fly data loading. Here's how it works:

  1. Load --dataset_shuffle_buffer_size data points into in-memory buffer
  2. Poll one item at a time from buffer randomly for computing text embeds and latents. Also fill the in-memory shuffle buffer with another data point
  3. Do forward and backward pass with the computed text embeds and latents
  4. Repeat from (2) as long as training is not completed

If the following options are specified:

--enable_precomputation
--precomputation_items 32
  1. Load --dataset_shuffle_buffer_size data points into in-memory buffer
  2. Poll one item at a time (with replacement similar to above) from the in-memory shuffle buffer.
  3. Load text encoders and VAE one-after-another. Precompute the prompt embeds and latents and save to disk.
  4. Repeat (2) until --precomputation_items precomputed items are available.
  5. Unload text encoders and VAE to save memory
  6. Perform training for --precomputation_items steps. For all training steps, data is loaded from the disk (using the previously saved data)
  7. Data is exhausted! To load more data and continue, repeat from (2) as long as training is not completed.

Note that in this case, we only have 20 data points available but specified precomputation items as 32. This means 12 data points will be repeated each time precomputation is run. It is highly wasteful for compute resources if not using --precomputation_once on small datasets.

If the dataset was larger, say 100k data points, then the benefit of this method is that instead of precomputing the entire dataset at once (which would use a large amount of disk space), we only sample --precomputation_items (a more sensible value would be 1024 in this case`) and perform training -- this helps achieve minimal disk memory usage while also allows running bigger datasets easily.

Currently though, the data saving/loading is blocking on the main thread, which makes training slightly slower. This can be further improved with many ideas, but I haven't got to doing it yet.


If the following options are specified:

--enable_precomputation
--precomputation_items 20
--precomputation_once
  1. Load --dataset_shuffle_buffer_size data points into in-memory buffer
  2. Poll one item at a time (with replacement similar to above) from the in-memory shuffle buffer.
  3. Load text encoders and VAE one-after-another. Precompute the prompt embeds and latents and save to disk.
  4. Repeat (2) until --precomputation_items precomputed items are available.
  5. Unload text encoders and VAE to save memory
  6. Perform training for --precomputation_items steps. For all training steps, data is loaded from the disk (using the previously saved data)
  7. Data is exhausted! But wait! As --precomputation_once was specified, we can just re-use the data already available on disk instead of precomputing again. Repeat from (6) as long as training is not completed

I believe the last explanation covers your use case. LMK if I haven't understood what you were trying to convey exactly

@a-r-r-o-w
Copy link
Copy Markdown
Contributor Author

When doing precomputation, both latents and conditions have the log output:
Processing data on rank 0: 2%|▏ | 1/62 [00:01<01:27, 1.44s/it]�[A
Previously it said:
Precomputing conditions: 0%| | 0/62 [00:00<?, ?it/s]
Maybe a difference between accelerate and ptd?

Both will follow the exact same code paths with the latest codebase (once I fix accelerate support that is). The messages were indeed changed, but I can look into updating that for better clarity

@a-r-r-o-w
Copy link
Copy Markdown
Contributor Author

I used gradient_accumulation = 8 but during the short test it 'felt' like it didn't accumulate anything, its/s were roughly 8 times faster.

I believe what you would have noticed is the progress bar moving a lot quicker, but the training take the same amount of time. This is because the training code is running asynchronously, i.e. CPU executing and queueing instructions to the GPU much faster than the GPU can actually finish processing them.

So, the CPU ends up completing 8 steps of instruction queuing and updating the progress bar very quickly. However, it would still take time for the GPU to finish its computation, so at the end of 8 steps, there will be a synchronization during which the progress bar will not update.

https://github.com/a-r-r-o-w/finetrainers/blob/b8bf0fcfc693ef135b7191d8d2c4bac985478e58/finetrainers/trainer/sft_trainer/trainer.py#L541

Now, the tricky part is why you still see it roughly 8 times faster. This is because tqdm uses a moving average to report it/s. Essentially, the following happens:

  • Quickly update and queue 8 steps of instructions for the GPU to execute and then wait for sync to happen.
  • During the instruction queueing, progress bar updates very quickly, so it looks like faster it/s
  • At the 8th step, a gradient update needs to be performed. This introduces a synchronization. The progress bar is not updating and simply waiting for GPU to finish doing its part before the CPU can further advance to doing the queueing.
  • Gradient update finishes. But oh wait! The CPU again quickly queues up 8 more steps of execution instructions updating the progress bar. But as tqdm does a moving average, it essentially "forgets" that it was waiting for a bit previously. And on and on this goes.

There are actually a lot more synchronizations that just the gradient step, but those are typically on the device-side (GPU), which don't stop the CPU from queueing instructions. However, if you were to explicitly force a CPU synchronization before the progress bar update, with say torch.cpu.synchronize, it'll make the 8x faster it/s be the same as if you were just doing normal training with gas=1

@a-r-r-o-w
Copy link
Copy Markdown
Contributor Author

I couldn't run my .json datasets. I have done some changes and will make a PR, but I'm not entirely happy with the flow. Maybe some input will help.

Could you share an example dataset for me to test with on the HF Hub? That way I can expand our dataset detection support

@a-r-r-o-w
Copy link
Copy Markdown
Contributor Author

which calls on hf.api. I haven't gone deeper than that, but does it call the hf hub actual to see if there's a dataset? In that case I find that to be a slight security issue, especially if it's a full local path being sent. I would prefer another way, or explicitly state whether it's a local or hub dataset in the dataset config.

It does call the HF Hub API to figure out if it's an existing repository or not.

Will doing the check by first checking if it's a local path be better? Additionally, can add a check to call HF Hub API only if there is one slash / in the path, since HF Hub datasets strictly have the convention of user_name/dataset_name

@neph1
Copy link
Copy Markdown
Contributor

neph1 commented Mar 15, 2025

can add a check to call HF Hub API only if there is one slash / in the path, since HF Hub datasets strictly have the convention of user_name/dataset_name

Yes, that was my thought, as well, if not putting it in the config. I think local first would be a good option, too. I think many datasets may have a root for the config file, and then a sub directory (or many) for the data

@neph1
Copy link
Copy Markdown
Contributor

neph1 commented Mar 15, 2025

Could you share an example dataset for me to test with on the HF Hub? That way I can expand our dataset detection support

I'll prepare my draft PR during the weekend, then you can decide what to do it. It utilizes the existing hiker vids and has an example .json in the assets folder.

@neph1
Copy link
Copy Markdown
Contributor

neph1 commented Mar 16, 2025

One other thing I noticed was that it seemed to make a wandb report, even if report_to was set to None. This happened when the training failed (which it did alot until I figured out the json loading). I don't think it did before the dataset changes, so it might be regression.

@a-r-r-o-w
Copy link
Copy Markdown
Contributor Author

One other thing I noticed was that it seemed to make a wandb report, even if report_to was set to None. This happened when the training failed (which it did alot until I figured out the json loading). I don't think it did before the dataset changes, so it might be regression.

Ah yes, seems to be regression. Will fix asap

@a-r-r-o-w a-r-r-o-w merged commit 5ea0457 into main Mar 21, 2025
@a-r-r-o-w a-r-r-o-w deleted the docs/features branch March 21, 2025 11:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants