Improve Dataset Processing Multiprocessing, Sharding, and Qwen Tokenizer Bug Fix.#2918
Conversation
… limiter on multiprocessing during tokenization, and a bug fix of qwen tokenizer
WalkthroughThe changes adjust how process counts and sharding are configured and used during dataset processing and saving. They add an optional configuration for dataset save sharding, update process count defaults and limits, and improve special token handling in tokenizers. No public API signatures were changed, but internal logic and configuration options were refined. Changes
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Codecov ReportAttention: Patch coverage is
📢 Thoughts on this report? Let us know! |
winglian
left a comment
There was a problem hiding this comment.
There changes all seem sane enough. iirc, we had the hard cap in place because we had noticed that because a lot of modern CPUs use hyperthreading, using os.count actually was worse/slower than simply capping. One thing we could consider is using os.cpu_count // 2 instead
| default=min( | ||
| int(os.environ.get("AXOLOTL_DATASET_PROCESSES", 32)), os.cpu_count() | ||
| ), # type: ignore[type-var] | ||
| default=int(os.environ.get("AXOLOTL_DATASET_PROCESSES", os.cpu_count())), # type: ignore[type-var] |
There was a problem hiding this comment.
Hm, back to this. I notice we try to catch this (due to old code) in few places.
- Here
config.py normalize_config- Within the ds
processsrc/axolotl/datasets.py - Chat ds parser
src/axolotl/core/datasets/chat.py(need to also remove limiter here)
I think we can remove all except first as it'll run first in our validator.
There was a problem hiding this comment.
I just noticed that another Issue (#2753) also discusses this issue. Tagging it here.
Btw, my fix already addresses (3) in process src/axolotl/datasets.py.
I made the modifications for (2) and (4). Pushing it now.
There was a problem hiding this comment.
@NanoCode012, I have made the all discussed fixes and pushed them along in a new commit. Please review it soon, as this feature will really help me and lot others as well.
| json_schema_extra={"description": "Index of shard to use for whole dataset"}, | ||
| ) | ||
| skip_prepare_dataset: bool | None = False | ||
| num_save_shards: int | None = Field( |
There was a problem hiding this comment.
I'm thinking whether we want ds in the name to signify it's for datasets only.
There was a problem hiding this comment.
Since this defaults to None, does dataset.save_to_disk support None if we pass it? Just to be sure we're not overriding their defaults.
There was a problem hiding this comment.
yes, I think the name num_save_ds_shards is better and to be more specific prepared_dataset_num_shards. I would go with the second, as it is more explicit as well. What's your opinion?
There was a problem hiding this comment.
@NanoCode012 , Here are some numbers around the num_shards and num_proc argument on a 24 core CPU. Thanks for the reminder, I realized, we have to set the max_shard_size=None to avoid issues later.
Saving the dataset (15/15 shards): 100%|██████████████████████████████████████████████████████████████████| 349317/349317 [00:06<00:00, 56775.48 examples/s]
| > Dataset saved in 6.16 seconds. (num_shards=None, num_proc=None, max_shard_size=None)
Saving the dataset (24/24 shards): 100%|█████████████████████████████████████████████████████████████████| 349317/349317 [00:00<00:00, 490461.43 examples/s]
| > Dataset saved in 0.74 seconds. (num_proc=24, num_shards=None, max_shard_size=None)
Saving the dataset (5/5 shards): 100%|███████████████████████████████████████████████████████████████████| 349317/349317 [00:01<00:00, 240478.69 examples/s]
| > Dataset saved in 1.47 seconds. (num_proc=24, num_shards=5, max_shard_size=None)There was a problem hiding this comment.
Thanks for the numbers, By default, how many shards would it attempt to save?
There was a problem hiding this comment.
That depends. I think the max_shard_size is default at "500Mb". So, for a very long context dataset with ~9M data points for me, it creates 1000+ shards. In the example shown before, it saves 15 by default.
| token_names = ["bos_token", "eos_token", "pad_token", "unk_token"] | ||
| for attr_name in token_names: |
There was a problem hiding this comment.
I think we may need to change the "<|endoftext|>" below to the eos_token value?
However, now that I step back, I'm wondering if we even need this section. The user should be setting it under special_tokens: instead of us handling it like this.
There was a problem hiding this comment.
Actually true, I feel the whole block can be removed as the newer Qwen models don't use it at all, and I am not happy hardcoding all the unavailable tokens as eos/eos_id. Again, older Qwen models might need this block.
There was a problem hiding this comment.
Hm, could we change this check to, hasattr(tokenizer, "eod_id"): <- sign of qwen base before the 2 for loops. Then, this section would only run for the old qwen base. We can then keep the original implementation without having to fallback to eos.
There was a problem hiding this comment.
yes, that seems like a good idea.
The best if case here could be if cfg.is_qwen_derived_model and hasattr(tokenizer, "eod_id"), and revert the code block to what we had earlier. But maybe the "<|endoftext|>" should also be changed to tokenizer.eos_token below
There was a problem hiding this comment.
But maybe the "<|endoftext|>" should also be changed to tokenizer.eos_token below
In this block, we are checking and maybe setting eos_token too (so tokenizer.eos_token may be empty). I'll say to just keep it for legacy unless another reviewer think we can just remove this.
There was a problem hiding this comment.
yes, you are absolutely correct! I have kept it as is in the latest commit, and just converted the check to if cfg.is_qwen_derived_model and hasattr(tokenizer, "eod_id"). This avoids the error if someone adds is_qwen_derived_model to their config for the newer qwen models.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/axolotl/core/datasets/chat.py (1)
48-48: Consider adding input validation for process_count parameter.While the change correctly uses the determined process count, consider adding validation to ensure
process_countis a positive integer when provided by the user to prevent potential issues with invalid inputs.def __init__( self, data: Dataset, model_transform: Union[PreTrainedTokenizer, Callable], *args, message_transform: Optional[Callable] = None, formatter=None, process_count: Optional[int] = None, keep_in_memory: Optional[bool] = False, **kwargs, ): + if process_count is not None and process_count <= 0: + raise ValueError("process_count must be a positive integer") + def map_fn(ex): # ... existing code ...
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
src/axolotl/core/datasets/chat.py(1 hunks)src/axolotl/loaders/tokenizer.py(1 hunks)src/axolotl/utils/config/__init__.py(0 hunks)src/axolotl/utils/data/shared.py(2 hunks)src/axolotl/utils/schemas/config.py(2 hunks)
💤 Files with no reviewable changes (1)
- src/axolotl/utils/config/init.py
🚧 Files skipped from review as they are similar to previous changes (3)
- src/axolotl/utils/data/shared.py
- src/axolotl/loaders/tokenizer.py
- src/axolotl/utils/schemas/config.py
🔇 Additional comments (1)
src/axolotl/core/datasets/chat.py (1)
44-44: LGTM: Process count determination improved.The change correctly removes the artificial cap on process count, allowing full utilization of available CPU cores. This aligns with the PR objective of removing hard limits on multiprocessing during dataset tokenization.
NanoCode012
left a comment
There was a problem hiding this comment.
Thanks, this looks almost good to go. Could you revert your changes to Colab too and check the comment below?
| process_count or os.cpu_count() # type: ignore[assignment] | ||
| ) | ||
| num_proc = min(32, process_or_cpu_count) | ||
| process_or_cpu_count = process_count if process_count else os.cpu_count() |
There was a problem hiding this comment.
We pass cfg.dataset_processes to process_count already.
cfg.dataset_processes would default to the os cpu count as well in the validator stage, so we don't need to have this check here. Just pass process_count to the num_proc below.
There was a problem hiding this comment.
Sure, I will make this change.
| def process(self, dataset): | ||
| features = dataset.features.keys() | ||
| num_proc = min(64, self.process_count if self.process_count else os.cpu_count()) | ||
| num_proc = self.process_count if self.process_count else os.cpu_count() |
There was a problem hiding this comment.
Sorry, I did not get this. Do you mean I just pass self.process_count instead of having an if case in line 49?
|
|
||
| # Qwen base only has single token, so we need to set the special tokens | ||
| if cfg.is_qwen_derived_model: | ||
| # the following check is for Qwen1 series models of base models |
There was a problem hiding this comment.
| # the following check is for Qwen1 series models of base models | |
| # the following check is for Qwen1 base models |
There was a problem hiding this comment.
fixed! Please take a look at the final commit.
|
@winglian and @NanoCode012 , can you please take a look at the changes and please let me know if you need any more information. Please approve it if you all is good. |
|
Let's wait for CI a bit and I'll run some manual checks on my end as well |
NanoCode012
left a comment
There was a problem hiding this comment.
Fixed lint and updated handling of default dataset_processes
| if data.get("dataset_processes") is None: | ||
| if axolotl_dataset_processes := os.environ.get("AXOLOTL_DATASET_PROCESSES"): | ||
| data["dataset_processes"] = int(axolotl_dataset_processes) | ||
| elif runpod_cpu_count := os.environ.get("RUNPOD_CPU_COUNT"): |
This pull request introduces the following improvements and fixes:
Description
Removes the hard limiter on multiprocessing during dataset tokenization, allowing full utilization of available CPU cores or user-specified process count.
Improves the saving speed of processed datasets by enabling configurable multiprocessing and sharding, making these parameters flexible and part of the configuration.
Fixes a bug in the Qwen tokenizer integration where a missing or incorrect attribute (
eod_id) caused preprocessing failures when using Qwen-derived models.Motivation and Context
Multiprocessing Limiter Removal: Previously, tokenization was limited to a maximum of 64 processes, even on machines with more cores. This change removes the min(64, ...) limiter, enabling all available or user-specified CPU cores to be used, resulting in significant speedups for large-scale, long-context datasets. (Remove Limiter on Multiprocessing during Dataset Tokenization #2914)
Configurable Sharding and Multiprocessing on Save: Saving large processed datasets was slow due to excessive sharding and non-configurable multiprocessing. By exposing num_proc and num_shards as configuration options, users can now optimize saving speed for their hardware and dataset size. (Improve Processed Dataset Saving Speed and Shards #2913)
Qwen Tokenizer Attribute Bug: The Qwen tokenizer integration tried to access a non-existent
eod_idattribute, causing failures during preprocessing. The logic has been corrected to use the appropriate token attributes, ensuring compatibility and successful preprocessing for Qwen-derived models. (Qwen Tokenizer Missing Attribute #2912)How has this been tested?
Unit and integration tests were run for dataset tokenization and saving, verifying that:
Manual testing was performed on:
Screenshots (if appropriate)
Types of changes
Social Handles (Optional)
Summary by CodeRabbit
New Features
Improvements