Prepare for v0.1.0 release #322
I've gotten as far as running a few tests with 'ptd', only for a few steps. I hope to do a full run tomorrow. Here are some notes; I can convert them to issues if you'd like:
which calls on hf.api. I haven't gone deeper than that, but does it actually call the HF Hub to see if there's a dataset? In that case I find it to be a slight security issue, especially if a full local path is being sent. I would prefer another way, or explicitly stating in the dataset config whether it's a local or Hub dataset. Otherwise it seems to work well! Looking forward to a stable release, and especially the further steering options in the upcoming PRs. 🚀
Probably need some improvements around the docs for precomputation actually. The Firstly, any dataset provided is treated as an infinite buffer of data. For example, even if you have only 2 images in your dataset, they will be repeated as many times as are required for precomputation/training to be completed. Training is completed either when For the examples below, I'm going to assume batch_size=1 training. Also, let's assume there are 20 data points. If
If the following options are specified:
Note that in this case, we only have 20 data points available but specified precomputation items as 32. This means 12 data points will be repeated each time precomputation is run. It is highly wasteful for compute resources if not using If the dataset was larger, say 100k data points, then the benefit of this method is that instead of precomputing the entire dataset at once (which would use a large amount of disk space), we only sample Currently though, the data saving/loading is blocking on the main thread, which makes training slightly slower. This can be further improved with many ideas, but I haven't got to doing it yet. If the following options are specified:
I believe the last explanation covers your use case. LMK if I haven't understood what you were trying to convey exactly
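To make the repetition concrete, here is a minimal sketch of the "infinite buffer" behaviour described above. It is not the trainer's actual code: `sample_precompute_items` is a made-up helper, and the numbers (20 data points, 32 precomputation items) are just the ones from the example.

```python
from itertools import cycle, islice


def sample_precompute_items(dataset, num_items):
    """Treat `dataset` as an endless buffer and take `num_items` entries from it.

    Hypothetical helper for illustration only; the real data pipeline may
    shuffle or shard the data differently.
    """
    return list(islice(cycle(dataset), num_items))


data_points = [f"sample_{i}" for i in range(20)]   # 20 data points available
items = sample_precompute_items(data_points, 32)   # 32 precomputation items requested

# Entries beyond the first pass over the dataset are repeats.
repeated = len(items) - len(set(items))
print(len(items), repeated)  # prints "32 12"
```

With a 100k-sample dataset the same call would simply return the first 32 unique items, which is why only a small slice needs to sit on disk at any one time.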
Both will follow the exact same code paths with the latest codebase (once I fix accelerate support that is). The messages were indeed changed, but I can look into updating that for better clarity
I believe what you would have noticed is the progress bar moving a lot quicker, but the training taking the same amount of time. This is because the training code runs asynchronously, i.e. the CPU executes and queues instructions to the GPU much faster than the GPU can actually finish processing them. So, the CPU ends up completing 8 steps of instruction queuing and updating the progress bar very quickly. However, it still takes time for the GPU to finish its computation, so at the end of 8 steps, there will be a synchronization during which the progress bar will not update. Now, the tricky part is why you still see it roughly 8 times faster. This is because tqdm uses a moving average to report it/s. Essentially, the following happens:
There are actually a lot more synchronizations than just the gradient step, but those are typically on the device side (GPU), which doesn't stop the CPU from queueing instructions. However, if you were to explicitly force a CPU synchronization before the progress bar update, with say
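As a rough illustration of the progress-bar effect, here is a small pure-Python simulation. It is only a sketch under assumptions: the per-step timings are invented, and `simulate` uses a simplified exponential moving average over step durations rather than tqdm's exact formula (0.3 is tqdm's default `smoothing` value).

```python
def simulate(step_times, smoothing=0.3):
    """Return the smoothed it/s after each step and the true overall it/s.

    Simplified stand-in for a progress bar's rate estimate, not tqdm itself.
    """
    ema_dt = None
    shown = []
    for dt in step_times:
        # Exponential moving average of the time the CPU spent on each step.
        ema_dt = dt if ema_dt is None else smoothing * dt + (1 - smoothing) * ema_dt
        shown.append(1.0 / ema_dt)
    true_rate = len(step_times) / sum(step_times)
    return shown, true_rate


gpu_time = 1.0     # assumed GPU compute time per step, in seconds
queue_time = 0.01  # assumed CPU time needed to queue one step, in seconds
accum = 8          # steps between forced synchronizations

# Synchronous case: the CPU waits for the GPU after every step.
sync_steps = [gpu_time] * accum
# Asynchronous case: 7 steps return right after queueing, and the 8th blocks
# until the GPU has worked through everything that was queued.
async_steps = [queue_time] * (accum - 1) + [accum * gpu_time - (accum - 1) * queue_time]

sync_shown, sync_rate = simulate(sync_steps)
async_shown, async_rate = simulate(async_steps)

print("total seconds:", sum(sync_steps), sum(async_steps))
print("true it/s:", sync_rate, async_rate)
print("it/s shown just before the sync:", sync_shown[-2], async_shown[-2])
```

Both cases take the same total time and have the same true throughput; only the rate displayed between synchronizations is inflated in the asynchronous case.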
Could you share an example dataset for me to test with on the HF Hub? That way I can expand our dataset detection support |
It does call the HF Hub API to figure out if it's an existing repository or not. Would it be better to first check whether it's a local path? Additionally, I can add a check to call the HF Hub API only if there is one slash
Yes, that was my thought as well, if it isn't put in the config. I think local first would be a good option, too. I think many datasets may have a root for the config file, and then a subdirectory (or many) for the data
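A local-first check along those lines could look roughly like the sketch below. Everything here is hypothetical: `resolve_dataset_source` and `fake_hub_lookup` are illustrative names, the Hub lookup is passed in as a callable so no real API is assumed, and the extra "doesn't look like a path" guard is an addition on top of the one-slash rule discussed above.

```python
import tempfile
from pathlib import Path
from typing import Callable


def resolve_dataset_source(name: str, hub_repo_exists: Callable[[str], bool]) -> str:
    """Classify `name` as 'local', 'hub', or 'unknown', checking the filesystem first.

    Hypothetical helper: `hub_repo_exists` stands in for whatever Hub lookup
    the project uses and is only called for names shaped like "namespace/repo".
    """
    if Path(name).expanduser().exists():
        return "local"  # local paths are never sent over the network
    # Assumption: Hub repo ids contain one slash and do not start like a path.
    looks_like_repo_id = name.count("/") == 1 and not name.startswith(("/", ".", "~"))
    if looks_like_repo_id and hub_repo_exists(name):
        return "hub"
    return "unknown"


calls = []  # records every name that would have been sent to the Hub


def fake_hub_lookup(repo_id: str) -> bool:
    """Stand-in for a real Hub query; always answers True."""
    calls.append(repo_id)
    return True


with tempfile.TemporaryDirectory() as tmp:
    print(resolve_dataset_source(tmp, fake_hub_lookup))  # prints "local"
print(calls)  # prints "[]": the local path was never passed to the lookup
```

An explicit "local or hub" field in the dataset config would remove the guessing entirely; the sketch only covers the automatic-detection route.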
I'll prepare my draft PR during the weekend, then you can decide what to do with it. It uses the existing hiker vids and has an example .json in the assets folder.
One other thing I noticed was that it seemed to make a wandb report even if report_to was set to None. This happened when the training failed (which it did a lot until I figured out the json loading). I don't think it did before the dataset changes, so it might be a regression.
Ah yes, seems to be a regression. Will fix asap
TODO: