Streaming SFT support #3101
Merged
Commits (24)
- 1dd08ae (djsaunde): working
- cf316a3 (djsaunde): fixes
- 0d0a810 (djsaunde): deprecate --iterable; cleanup
- 9881efd (djsaunde): pretrain_multipack_buffer_size -> streaming_multipack_buffer_size
- 6939055 (djsaunde): improvements
- ee8f718 (djsaunde): tests
- 35cd0be (djsaunde): remove unused
- f35ca3b (djsaunde): docs, examples
- a560ce7 (djsaunde): nit
- bf0a427 (djsaunde): nit
- b5084b5 (djsaunde): add val_set_size validation
- f4e059d (djsaunde): val
- 3653f1c (djsaunde): nit
- 7231428 (djsaunde): min
- 138c03e (djsaunde): coderabbito
- 909269a (djsaunde): cleanup
- 9405ceb (djsaunde): nit
- 5b9dca9 (djsaunde): add depr warning, cleanup
- 645f10a (djsaunde): nit
- c59eede (djsaunde): fix test, fix quarto
- 4c97309 (djsaunde): fix
- 34b74e4 (djsaunde): review comments
- 528070b (djsaunde): review comments
- 4d1a47b (djsaunde): fix
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,120 @@ | ||
| --- | ||
| title: Streaming Datasets | ||
| description: How to use streaming mode for large-scale datasets and memory-efficient training | ||
| order: 10 | ||
| --- | ||
|
|
||
| Streaming enables memory-efficient training with large datasets by loading data | ||
| incrementally rather than loading the entire dataset into memory at once. | ||
|
|
||
| Use streaming when: | ||
|
|
||
| - Your dataset is too large to fit in memory (e.g. when you're doing pretraining with massive text corpora) | ||
| - You want to start training immediately without preprocessing the entire dataset | ||
|
|
||
| Streaming works with both remote and locally stored datasets! | ||
|
|
||
| ::: {.callout-note} | ||
| Streaming currently only supports a single dataset. Multi-dataset support will be added soon. | ||
| ::: | ||
|
|
||
|
|
||
| ## Configuration | ||
|
|
||
| ### Basic Streaming | ||
|
|
||
| Enable streaming mode by setting the `streaming` flag: | ||
|
|
||
| ```yaml | ||
| streaming: true | ||
| ``` | ||
|
|
||
| ### Pretraining with Streaming | ||
|
|
||
| For pretraining tasks, streaming is automatically enabled when using `pretraining_dataset`: | ||
|
|
||
| ```yaml | ||
| pretraining_dataset: | ||
|   - path: HuggingFaceFW/fineweb-edu | ||
|     type: pretrain | ||
|     text_column: text | ||
|     split: train | ||
|
|
||
| # Optionally, enable sample packing | ||
| streaming_multipack_buffer_size: 10000 | ||
| sample_packing: true | ||
| ``` | ||
|
|
||
| ### SFT with Streaming | ||
|
|
||
| For supervised fine-tuning with streaming: | ||
|
|
||
| ```yaml | ||
| streaming: true | ||
| datasets: | ||
|   - path: tatsu-lab/alpaca | ||
|     type: alpaca | ||
|     split: train | ||
|
|
||
| # Optionally, enable sample packing | ||
| streaming_multipack_buffer_size: 10000 | ||
| sample_packing: true | ||
| ``` | ||
|
|
||
| ## Configuration Options | ||
|
|
||
| ### `streaming_multipack_buffer_size` | ||
|
|
||
| Controls the buffer size for multipack streaming (default: 10,000). This determines how | ||
| many samples are buffered before packing. Larger buffers can improve packing efficiency | ||
| but use more memory. | ||
|
|
||
| ### `shuffle_merged_datasets` | ||
|
|
||
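| When enabled, shuffles the streaming dataset through a fixed-size buffer: samples are drawn at | ||
| random from the buffered window rather than from the whole dataset. This requires additional | ||
| memory for the shuffle buffer. | ||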
|
|
||
| ## Sample Packing with Streaming | ||
|
|
||
| Sample packing is supported for streaming datasets. When enabled, multiple samples are | ||
| packed into a single sequence to maximize GPU utilization: | ||
|
|
||
| ```yaml | ||
| sample_packing: true | ||
| streaming_multipack_buffer_size: 10000 | ||
|
|
||
| # For SFT: attention is automatically isolated between packed samples | ||
| # For pretraining: control with pretrain_multipack_attn | ||
| pretrain_multipack_attn: true # prevent cross-attention between packed samples | ||
| ``` | ||
|
|
||
| For more information, see our [documentation](multipack.qmd) on multipacking. | ||
|
|
||
| ## Important Considerations | ||
|
|
||
| ### Memory Usage | ||
|
|
||
| While streaming reduces memory usage compared to loading entire datasets, some memory | ||
| overhead remains: | ||
|
|
||
| - Sample packing buffers multiple samples before packing them; lower `streaming_multipack_buffer_size` to reduce this | ||
| - Shuffling requires additional memory for the shuffle buffer | ||
|
|
||
| ### Performance | ||
|
|
||
| - Streaming may have slightly higher latency than preprocessed datasets, as samples are processed on the fly | ||
| - Throughput depends on network speed when streaming from remote sources and on disk read speed when streaming from a local dataset | ||
| - Consider using `axolotl preprocess` for smaller or more frequently used datasets | ||
|
|
||
| ### Evaluation Datasets | ||
|
|
||
| Evaluation datasets are not streamed to ensure consistent evaluation metrics. They're | ||
| loaded normally even when training uses streaming. | ||
|
|
||
| ## Examples | ||
|
|
||
| See the `examples/streaming/` directory for complete configuration examples: | ||
|
|
||
| - `pretrain.yaml`: Pretraining with streaming dataset | ||
| - `sft.yaml`: Supervised fine-tuning with streaming | ||
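The buffer-based shuffling described under `shuffle_merged_datasets` can be pictured with a minimal sketch. This is not Axolotl's implementation; `buffered_shuffle` and its parameters are hypothetical names used only for illustration. It holds a fixed number of samples in memory and, for each new sample read from the stream, emits a randomly chosen buffered one, which is the usual way a stream is shuffled without loading it in full.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most `buffer_size` items."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill phase: hold items until the buffer is full.
            buffer.append(item)
            continue
        # Steady state: emit a random buffered item and reuse its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: drain the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer


# Hypothetical usage: integers stand in for dataset samples.
shuffled = list(buffered_shuffle(range(20), buffer_size=5))
```

Every sample is emitted exactly once, but a sample can only move as far as the buffer allows, so a small buffer gives a weaker shuffle in exchange for less memory.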
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| # Streaming Dataset Examples | ||
|
|
||
| This directory contains example configurations for using Axolotl's streaming dataset | ||
| functionality, which enables memory-efficient training with large datasets. | ||
|
|
||
| ## Examples | ||
|
|
||
| Run either example directly, e.g. `axolotl train examples/streaming/sft.yaml`; no | ||
| `axolotl preprocess` step is required. | ||
|
|
||
| ### Pretraining (`pretrain.yaml`) | ||
|
|
||
| Demonstrates streaming configuration for pretraining tasks using the fineweb-edu dataset | ||
| with SmolLM2-135M. | ||
|
|
||
| - Uses `pretraining_dataset` configuration for automatic streaming | ||
| - Multipack attention control to prevent cross-attention between packed sequences | ||
| - Buffer size configuration for memory management | ||
|
|
||
| ### SFT (`sft.yaml`) | ||
|
|
||
| Shows how to use streaming for supervised fine-tuning with the Alpaca dataset. | ||
|
|
||
| - Explicit `streaming: true` flag for SFT datasets | ||
| - Memory-efficient training on instruction datasets | ||
| - Evaluation datasets are currently not streamed | ||
|
|
||
| ## Key Configuration Options | ||
|
|
||
| ### `streaming` | ||
| - Enables streaming mode for standard datasets | ||
| - Automatically enabled for `pretraining_dataset` | ||
|
|
||
| ### `streaming_multipack_buffer_size` | ||
| - Controls buffer size for sample packing (default: 10,000) | ||
| - Larger values improve packing efficiency but use more memory | ||
| - Adjust based on available memory | ||
|
|
||
| ### `shuffle_merged_datasets` | ||
| - Enables shuffling of streaming datasets | ||
| - Requires additional memory for shuffle buffer | ||
|
|
||
| ### `sample_packing` | ||
| - Packs multiple samples into single sequences | ||
| - Minimizes per-step padding tokens | ||
|
|
||
| ## Performance Tips | ||
|
|
||
| - Download small / frequently-used datasets locally for better performance | ||
| - Larger buffer sizes improve packing efficiency |
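To see why larger buffer sizes tend to improve packing efficiency, here is a small sketch that packs sample lengths one buffer window at a time. It uses first-fit decreasing as a stand-in packing strategy; Axolotl's actual algorithm may differ, and `pack_stream`, `_first_fit_decreasing`, and `packing_efficiency` are hypothetical helper names.

```python
from typing import Iterable, Iterator, List


def _first_fit_decreasing(window: List[int], sequence_len: int) -> List[List[int]]:
    """Pack one buffer window into bins holding at most `sequence_len` tokens."""
    bins: List[List[int]] = []
    for length in sorted(window, reverse=True):
        for current in bins:
            if sum(current) + length <= sequence_len:
                current.append(length)
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([length])
    return bins


def pack_stream(lengths: Iterable[int], sequence_len: int, buffer_size: int) -> Iterator[List[int]]:
    """Yield packed bins, considering only `buffer_size` samples at a time."""
    window: List[int] = []
    for length in lengths:
        # Truncate over-long samples so every sample fits in one bin (sketch only).
        window.append(min(length, sequence_len))
        if len(window) == buffer_size:
            yield from _first_fit_decreasing(window, sequence_len)
            window = []
    if window:
        yield from _first_fit_decreasing(window, sequence_len)


def packing_efficiency(lengths: Iterable[int], sequence_len: int, buffer_size: int) -> float:
    """Fraction of token slots holding real tokens (assumes at least one sample)."""
    bins = list(pack_stream(lengths, sequence_len, buffer_size))
    return sum(sum(b) for b in bins) / (len(bins) * sequence_len)


lengths = [600, 600, 400, 400]
small = packing_efficiency(lengths, sequence_len=1000, buffer_size=2)
large = packing_efficiency(lengths, sequence_len=1000, buffer_size=4)
```

With a buffer of 2, the two 600-token samples arrive together and cannot share a bin, so three bins are needed. With a buffer of 4, each 600 can be paired with a 400 and two full bins suffice.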
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| base_model: HuggingFaceTB/SmolLM2-135M | ||
|
|
||
| # Streaming pretraining configuration | ||
| pretraining_dataset: | ||
|   - path: HuggingFaceFW/fineweb-edu | ||
|     name: sample-10BT | ||
|     type: pretrain | ||
|     text_column: text | ||
|     split: train | ||
|
|
||
| # Streaming-specific settings | ||
| streaming_multipack_buffer_size: 10000 | ||
| shuffle_merged_datasets: true | ||
|
|
||
| # Training configuration | ||
| max_steps: 1000 | ||
| output_dir: ./outputs/smollm2-135m-pretrain-streaming | ||
|
|
||
| # Sequence and packing settings | ||
| sequence_len: 1024 | ||
| sample_packing: true | ||
| pretrain_multipack_attn: true # Prevent cross-attention between packed sequences | ||
| flash_attention: true | ||
|
|
||
| # Batch size settings | ||
| gradient_accumulation_steps: 8 | ||
| micro_batch_size: 1 | ||
|
|
||
| # Optimizer and scheduler | ||
| optimizer: adamw_torch | ||
| lr_scheduler: cosine | ||
| learning_rate: 5e-4 | ||
| warmup_ratio: 0.1 | ||
| weight_decay: 0.01 | ||
|
|
||
| # Precision and performance | ||
| bf16: auto | ||
| tf32: true | ||
|
|
||
| # Logging and checkpointing | ||
| logging_steps: 10 | ||
| save_strategy: steps | ||
| save_steps: 250 | ||
| save_total_limit: 3 | ||
|
|
||
| # Weights & Biases (optional) | ||
| wandb_project: | ||
| wandb_entity: | ||
| wandb_watch: | ||
| wandb_name: | ||
| wandb_log_model: | ||
|
|
||
| # Special tokens | ||
| special_tokens: | ||
|   pad_token: "<|endoftext|>" | ||
|
|
||
| # save_first_step: true # uncomment this to validate checkpoint saving works with your config |
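As a rough sanity check on the batch settings in this config, the arithmetic below estimates an upper bound on how many tokens the run consumes. It assumes a single GPU (`num_gpus = 1` is not part of the config) and that the trainer scales `warmup_ratio` by `max_steps`; packed sequences that are not completely full will contain fewer real tokens.

```python
# Values copied from the pretraining config above.
sequence_len = 1024
micro_batch_size = 1
gradient_accumulation_steps = 8
max_steps = 1000
warmup_ratio = 0.1

# Assumption for this sketch: a single-GPU run.
num_gpus = 1

# Upper bound on tokens seen per optimizer step and over the whole run.
tokens_per_step = sequence_len * micro_batch_size * gradient_accumulation_steps * num_gpus
total_tokens = tokens_per_step * max_steps

# Approximate number of warmup steps implied by warmup_ratio.
warmup_steps = round(max_steps * warmup_ratio)

print(tokens_per_step, total_tokens, warmup_steps)
```

Under these assumptions, the run covers at most about 8.2 million tokens, a small fraction of the `sample-10BT` subset, so streaming avoids downloading and preprocessing data the run never reads.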
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,55 @@ | ||
| base_model: HuggingFaceTB/SmolLM2-135M | ||
|
|
||
| # Dataset configuration | ||
| datasets: | ||
|   - path: tatsu-lab/alpaca | ||
|     type: alpaca | ||
|     split: train | ||
|
|
||
| # Streaming-specific settings | ||
| streaming: true | ||
| streaming_multipack_buffer_size: 10000 | ||
| shuffle_merged_datasets: true | ||
|
|
||
| # Training configuration | ||
| max_steps: 1000 | ||
| output_dir: ./outputs/smollm2-135m-sft-streaming | ||
|
|
||
| # Sequence and packing settings | ||
| sequence_len: 1024 | ||
| sample_packing: true | ||
| flash_attention: true | ||
|
|
||
| # Batch size settings | ||
| gradient_accumulation_steps: 4 | ||
| micro_batch_size: 1 | ||
|
|
||
| # Optimizer and scheduler | ||
| optimizer: adamw_torch | ||
| lr_scheduler: cosine | ||
| learning_rate: 2e-4 | ||
| warmup_ratio: 0.1 | ||
| weight_decay: 0.0 | ||
|
|
||
| # Precision and performance | ||
| bf16: auto | ||
| tf32: true | ||
|
|
||
| # Logging and checkpointing | ||
| logging_steps: 10 | ||
| save_strategy: steps | ||
| save_steps: 100 | ||
| save_total_limit: 3 | ||
|
|
||
| # Weights & Biases (optional) | ||
| wandb_project: | ||
| wandb_entity: | ||
| wandb_watch: | ||
| wandb_name: | ||
| wandb_log_model: | ||
|
|
||
| # Special tokens | ||
| special_tokens: | ||
|   pad_token: "<|endoftext|>" | ||
|
|
||
| # save_first_step: true # uncomment this to validate checkpoint saving works with your config |
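With `sample_packing: true`, several samples share one sequence, and the docs in this PR note that attention is isolated between packed samples for SFT. The sketch below shows what that isolation means as a dense mask. It is illustrative only: `packed_attention_mask` is a hypothetical helper, and how Axolotl implements the isolation is not shown in this PR; real implementations typically avoid materializing a dense mask.

```python
from typing import List


def packed_attention_mask(sample_lengths: List[int]) -> List[List[bool]]:
    """Build a causal mask that blocks attention across packed samples."""
    # Label each token position with the index of the sample it came from.
    segment_ids = [seg for seg, n in enumerate(sample_lengths) for _ in range(n)]
    total = len(segment_ids)
    # Token i may attend to token j only if j is not in the future (causal)
    # and both tokens belong to the same original sample.
    return [
        [j <= i and segment_ids[i] == segment_ids[j] for j in range(total)]
        for i in range(total)
    ]


# Two samples of 2 and 3 tokens packed into one 5-token sequence.
mask = packed_attention_mask([2, 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed mask is block-diagonal: the first token of the second sample cannot see any token of the first sample, even though those tokens precede it in the packed sequence.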