Improve Dataset Processing Multiprocessing, Sharding, and Qwen Tokenizer Bug Fix. by VarunGumma · Pull Request #2918 · axolotl-ai-cloud/axolotl

VarunGumma · 2025-07-14T13:01:20Z

This pull request introduces the following improvements and fixes:

Description

Removes the hard limiter on multiprocessing during dataset tokenization, allowing full utilization of available CPU cores or user-specified process count.
Improves the saving speed of processed datasets by enabling configurable multiprocessing and sharding, making these parameters flexible and part of the configuration.
Fixes a bug in the Qwen tokenizer integration where a missing or incorrect attribute (eod_id) caused preprocessing failures when using Qwen-derived models.

Motivation and Context

Multiprocessing Limiter Removal: Previously, tokenization was limited to a maximum of 64 processes, even on machines with more cores. This change removes the min(64, ...) limiter, enabling all available or user-specified CPU cores to be used, resulting in significant speedups for large-scale, long-context datasets. (Remove Limiter on Multiprocessing during Dataset Tokenization #2914)
Configurable Sharding and Multiprocessing on Save: Saving large processed datasets was slow due to excessive sharding and non-configurable multiprocessing. By exposing num_proc and num_shards as configuration options, users can now optimize saving speed for their hardware and dataset size. (Improve Processed Dataset Saving Speed and Shards #2913)
Qwen Tokenizer Attribute Bug: The Qwen tokenizer integration tried to access a non-existent eod_id attribute, causing failures during preprocessing. The logic has been corrected to use the appropriate token attributes, ensuring compatibility and successful preprocessing for Qwen-derived models. (Qwen Tokenizer Missing Attribute #2912)

How has this been tested?

Unit and integration tests were run for dataset tokenization and saving, verifying that:
- The correct number of processes is spawned during tokenization, matching the user configuration or system CPU count.
- Datasets are saved with the specified number of shards and processes, and the saved data is consistent and readable.
Manual testing was performed on:
- A machine with more than 64 CPU cores to confirm all cores are utilized during tokenization.
- Qwen SFT YAML configurations to ensure preprocessing completes without attribute errors and the correct tokens are set.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update

Summary by CodeRabbit

New Features
- Added support for specifying the number of shards when saving prepared datasets.
Improvements
- Enhanced dataset saving to allow parallel processing and sharding for improved performance.
- Updated configuration options for dataset processing to provide more flexibility in resource usage.
- Improved handling of special token IDs for certain tokenizers, ensuring more robust error messaging.
- Removed upper limit on process count for dataset processing to better utilize available CPU resources.
- Refined default process count determination to better adapt to different runtime environments.

coderabbitai · 2025-07-14T13:01:27Z

Walkthrough

The changes adjust how process counts and sharding are configured and used during dataset processing and saving. They add an optional configuration for dataset save sharding, update process count defaults and limits, and improve special token handling in tokenizers. No public API signatures were changed, but internal logic and configuration options were refined.

Changes

| File(s) | Change Summary |
| --- | --- |
| `src/axolotl/datasets.py` | Removed upper limit on process count; directly use configured `process_count` in dataset filtering and mapping. |
| `src/axolotl/core/datasets/chat.py` | Removed upper limit on process count; directly use provided `process_count` in parallel dataset mapping. |
| `src/axolotl/loaders/tokenizer.py` | Refined Qwen tokenizer special token ID assignment by checking for `eod_id` attribute before setting tokens. |
| `src/axolotl/utils/data/shared.py` | Unified `num_workers` usage; added `num_proc`, `max_shard_size`, and `num_shards` parameters to `save_to_disk`. |
| `src/axolotl/utils/schemas/config.py` | Added optional `num_dataset_shards_to_save` field; changed `dataset_processes` default to remove min cap logic. |
| `src/axolotl/utils/config/__init__.py` | Removed fallback setting of `dataset_processes` to CPU count if unset during config normalization. |

Poem

A hop and a skip, the code takes a leap,
More shards and more workers, no limits to keep!
Tokens are tidied, configs are new,
Datasets save faster—what else can we do?
With whiskers a-twitch, I code and I cheer,
These changes bring progress, and carrots are near! 🥕

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9142aac and ff012af.

📒 Files selected for processing (3)

src/axolotl/core/datasets/chat.py (1 hunks)
src/axolotl/datasets.py (1 hunks)
src/axolotl/utils/schemas/config.py (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

src/axolotl/datasets.py
src/axolotl/core/datasets/chat.py
src/axolotl/utils/schemas/config.py


codecov · 2025-07-14T13:57:03Z

Codecov Report

Attention: Patch coverage is 80.00000% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/axolotl/utils/schemas/config.py | 81.81% | 2 Missing ⚠️ |
| src/axolotl/utils/data/shared.py | 66.66% | 1 Missing ⚠️ |

winglian

These changes all seem sane enough. IIRC, we had the hard cap in place because we had noticed that, since a lot of modern CPUs use hyperthreading, using os.cpu_count() was actually slower than simply capping. One thing we could consider is using os.cpu_count() // 2 instead.
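
The halving heuristic suggested in this comment could look like the minimal sketch below. The helper name `physical_core_estimate` is hypothetical and not part of axolotl, and the sketch assumes two hardware threads per physical core, which does not hold for every CPU.

```python
import os


def physical_core_estimate(logical_cpus=None):
    """Rough guess at physical cores: half the logical CPU count, at least 1.

    Assumes two hardware threads per core (typical with hyperthreading/SMT).
    This is a heuristic, not a measurement.
    """
    if logical_cpus is None:
        # os.cpu_count() may return None when the count cannot be determined.
        logical_cpus = os.cpu_count() or 1
    return max(1, logical_cpus // 2)
```

The `max(1, ...)` guard keeps the result usable on single-CPU machines, where integer division alone would yield 0.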

NanoCode012 · 2025-07-15T04:16:31Z

-        default=min(
-            int(os.environ.get("AXOLOTL_DATASET_PROCESSES", 32)), os.cpu_count()
-        ),  # type: ignore[type-var]
+        default=int(os.environ.get("AXOLOTL_DATASET_PROCESSES", os.cpu_count())),  # type: ignore[type-var]


Hm, back to this. I notice we try to catch this (due to old code) in a few places:

1. Here, in config.py
2. normalize_config
3. Within the ds process, src/axolotl/datasets.py
4. Chat ds parser, src/axolotl/core/datasets/chat.py (need to also remove limiter here)

I think we can remove all except the first, as it'll run first in our validator.

I just noticed that another Issue (#2753) also discusses this issue. Tagging it here.
Btw, my fix already addresses (3) in process src/axolotl/datasets.py.
I made the modifications for (2) and (4). Pushing it now.

@NanoCode012, I have made all the discussed fixes and pushed them in a new commit. Please review it soon, as this feature will really help me and a lot of others as well.

NanoCode012 · 2025-07-15T04:16:56Z

        json_schema_extra={"description": "Index of shard to use for whole dataset"},
    )
    skip_prepare_dataset: bool | None = False
+    num_save_shards: int | None = Field(


I'm thinking whether we want ds in the name to signify it's for datasets only.

Since this defaults to None, does dataset.save_to_disk support None if we pass it? Just to be sure we're not overriding their defaults.

Yes, I think the name num_save_ds_shards is better, or, to be more specific, prepared_dataset_num_shards. I would go with the second, as it is more explicit. What's your opinion?

@NanoCode012, here are some numbers for the num_shards and num_proc arguments on a 24-core CPU. Thanks for the reminder: I realized we have to set max_shard_size=None to avoid issues later.

Saving the dataset (15/15 shards): 100%|██████████| 349317/349317 [00:06<00:00, 56775.48 examples/s]
Dataset saved in 6.16 seconds. (num_shards=None, num_proc=None, max_shard_size=None)

Saving the dataset (24/24 shards): 100%|██████████| 349317/349317 [00:00<00:00, 490461.43 examples/s]
Dataset saved in 0.74 seconds. (num_proc=24, num_shards=None, max_shard_size=None)

Saving the dataset (5/5 shards): 100%|██████████| 349317/349317 [00:01<00:00, 240478.69 examples/s]
Dataset saved in 1.47 seconds. (num_proc=24, num_shards=5, max_shard_size=None)

Thanks for the numbers. By default, how many shards would it attempt to save?

That depends. I think max_shard_size defaults to "500MB". So, for a very long-context dataset with ~9M data points, it creates 1000+ shards for me. In the example shown before, it saves 15 by default.
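
The three benchmarked configurations in this thread differ only in the keyword arguments handed to `save_to_disk`. A minimal sketch of how they might be assembled follows. The helper `build_save_kwargs` is hypothetical and is not the code merged in this PR. It always leaves `max_shard_size` as None, in line with the note in this thread about setting max_shard_size=None.

```python
def build_save_kwargs(num_proc=None, num_shards=None):
    """Assemble keyword arguments for `datasets.Dataset.save_to_disk`.

    Hypothetical helper for illustration only. `max_shard_size` stays None
    so that an explicit `num_shards` is never combined with a size limit.
    """
    if num_proc is not None and num_proc < 1:
        raise ValueError("num_proc must be a positive integer")
    if num_shards is not None and num_shards < 1:
        raise ValueError("num_shards must be a positive integer")
    return {
        "num_proc": num_proc,
        "num_shards": num_shards,
        "max_shard_size": None,
    }


# Intended use (requires the Hugging Face `datasets` package):
#   dataset.save_to_disk("path/to/prepared", **build_save_kwargs(num_proc=24, num_shards=5))
```

Passing None for `num_proc` and `num_shards` leaves the library's own defaults in place, which corresponds to the first benchmark run.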

NanoCode012 · 2025-07-15T04:20:30Z

        token_names = ["bos_token", "eos_token", "pad_token", "unk_token"]
        for attr_name in token_names:


I think we may need to change the "<|endoftext|>" below to the eos_token value?

However, now that I step back, I'm wondering if we even need this section. The user should be setting it under special_tokens: instead of us handling it like this.

True. I feel the whole block could be removed, since the newer Qwen models don't use it at all, and I am not happy hardcoding all the unavailable tokens as eos/eos_id. Then again, older Qwen models might need this block.

Hm, could we change this check to `hasattr(tokenizer, "eod_id")` (a sign of Qwen base) before the two for loops? Then this section would only run for the old Qwen base, and we can keep the original implementation without having to fall back to eos.

yes, that seems like a good idea.

The best condition here could be `if cfg.is_qwen_derived_model and hasattr(tokenizer, "eod_id")`, with the code block reverted to what we had earlier. But maybe the "<|endoftext|>" should also be changed to tokenizer.eos_token below.

> But maybe the "<|endoftext|>" should also be changed to tokenizer.eos_token below

In this block, we are checking and maybe setting eos_token too (so tokenizer.eos_token may be empty). I'd say just keep it for legacy unless another reviewer thinks we can remove this.

Yes, you are absolutely correct! I have kept it as is in the latest commit and only changed the check to `if cfg.is_qwen_derived_model and hasattr(tokenizer, "eod_id")`. This avoids the error if someone adds is_qwen_derived_model to their config for the newer Qwen models.
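
The guard agreed on in this thread can be sketched roughly as follows. This is an illustration under assumptions, not the merged code. The function name `patch_legacy_qwen_tokens` is hypothetical, and the `*_token_id` list is assumed by analogy with the `token_names` list shown in the diff. The tokenizers are stand-in objects rather than real `transformers` tokenizers, and the `eod_id` value is purely illustrative.

```python
from types import SimpleNamespace


def patch_legacy_qwen_tokens(tokenizer, is_qwen_derived_model):
    """Fill unset special tokens, but only for tokenizers exposing `eod_id`."""
    if not (is_qwen_derived_model and hasattr(tokenizer, "eod_id")):
        return tokenizer  # tokenizers without `eod_id` are left untouched

    # Assumed list of id attributes, mirroring `token_names` below.
    for attr_name in ["bos_token_id", "eos_token_id", "pad_token_id", "unk_token_id"]:
        if getattr(tokenizer, attr_name, None) is None:
            setattr(tokenizer, attr_name, tokenizer.eod_id)

    token_names = ["bos_token", "eos_token", "pad_token", "unk_token"]
    for attr_name in token_names:
        if getattr(tokenizer, attr_name, None) is None:
            setattr(tokenizer, attr_name, "<|endoftext|>")
    return tokenizer


# Stand-in objects for illustration; the id value is made up for the demo.
legacy = SimpleNamespace(eod_id=12345, pad_token_id=None, pad_token=None)
modern = SimpleNamespace(pad_token_id=None, pad_token=None)
patch_legacy_qwen_tokens(legacy, True)
patch_legacy_qwen_tokens(modern, True)
```

With this check, setting is_qwen_derived_model for a tokenizer that lacks `eod_id` becomes a no-op instead of raising an AttributeError.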

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

src/axolotl/core/datasets/chat.py (1)
48-48: Consider adding input validation for process_count parameter.

While the change correctly uses the determined process count, consider adding validation to ensure process_count is a positive integer when provided by the user to prevent potential issues with invalid inputs.
def __init__(
    self,
    data: Dataset,
    model_transform: Union[PreTrainedTokenizer, Callable],
    *args,
    message_transform: Optional[Callable] = None,
    formatter=None,
    process_count: Optional[int] = None,
    keep_in_memory: Optional[bool] = False,
    **kwargs,
):
+    if process_count is not None and process_count <= 0:
+        raise ValueError("process_count must be a positive integer")
+    
    def map_fn(ex):
        # ... existing code ...

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d9db53 and 871be42.

📒 Files selected for processing (5)

src/axolotl/core/datasets/chat.py (1 hunks)
src/axolotl/loaders/tokenizer.py (1 hunks)
src/axolotl/utils/config/__init__.py (0 hunks)
src/axolotl/utils/data/shared.py (2 hunks)
src/axolotl/utils/schemas/config.py (2 hunks)

💤 Files with no reviewable changes (1)

src/axolotl/utils/config/__init__.py

🚧 Files skipped from review as they are similar to previous changes (3)

src/axolotl/utils/data/shared.py
src/axolotl/loaders/tokenizer.py
src/axolotl/utils/schemas/config.py

🔇 Additional comments (1)

src/axolotl/core/datasets/chat.py (1)

44-44: LGTM: Process count determination improved.

The change correctly removes the artificial cap on process count, allowing full utilization of available CPU cores. This aligns with the PR objective of removing hard limits on multiprocessing during dataset tokenization.

NanoCode012

Thanks, this looks almost good to go. Could you revert your changes to Colab too and check the comment below?

NanoCode012 · 2025-07-16T05:45:14Z

-            process_count or os.cpu_count()  # type: ignore[assignment]
-        )
-        num_proc = min(32, process_or_cpu_count)
+        process_or_cpu_count = process_count if process_count else os.cpu_count()


We pass cfg.dataset_processes to process_count already.

axolotl/src/axolotl/utils/data/wrappers.py

Line 84 in 2c408b5

"process_count": cfg.dataset_processes,

cfg.dataset_processes would default to the os cpu count as well in the validator stage, so we don't need to have this check here. Just pass process_count to the num_proc below.

Sure, I will make this change.

NanoCode012 · 2025-07-16T05:45:25Z

    def process(self, dataset):
        features = dataset.features.keys()
-        num_proc = min(64, self.process_count if self.process_count else os.cpu_count())
+        num_proc = self.process_count if self.process_count else os.cpu_count()


Same as above.

Sorry, I did not get this. Do you mean I just pass self.process_count instead of having an if case in line 49?

NanoCode012 · 2025-07-16T05:45:46Z


    # Qwen base only has single token, so we need to set the special tokens
-    if cfg.is_qwen_derived_model:
+    # the following check is for Qwen1 series models of base models


Suggested change

-    # the following check is for Qwen1 series models of base models
+    # the following check is for Qwen1 base models

fixed! Please take a look at the final commit.

VarunGumma · 2025-07-17T02:45:27Z

@winglian and @NanoCode012, can you please take a look at the changes and let me know if you need any more information? Please approve it if all is good.

NanoCode012 · 2025-07-17T03:51:01Z

Let's wait for CI a bit and I'll run some manual checks on my end as well

NanoCode012

Fixed lint and updated handling of default dataset_processes

winglian · 2025-07-17T13:43:40Z

+        if data.get("dataset_processes") is None:
+            if axolotl_dataset_processes := os.environ.get("AXOLOTL_DATASET_PROCESSES"):
+                data["dataset_processes"] = int(axolotl_dataset_processes)
+            elif runpod_cpu_count := os.environ.get("RUNPOD_CPU_COUNT"):
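
The diff excerpt above is cut off after the RUNPOD_CPU_COUNT branch. A minimal sketch of the resolution order it suggests is given below. The function name `resolve_dataset_processes` and the `environ` parameter are hypothetical, and the final fallback to `os.cpu_count()` is an assumption based on the earlier review comment that cfg.dataset_processes defaults to the OS CPU count in the validator stage.

```python
import os


def resolve_dataset_processes(configured=None, environ=None):
    """Pick a process count: explicit config, then env vars, then CPU count."""
    environ = os.environ if environ is None else environ
    if configured is not None:
        return configured
    if value := environ.get("AXOLOTL_DATASET_PROCESSES"):
        return int(value)
    if value := environ.get("RUNPOD_CPU_COUNT"):
        return int(value)
    # Assumed final fallback; os.cpu_count() can return None, hence `or 1`.
    return os.cpu_count() or 1
```

An explicit config value wins over both environment variables, and AXOLOTL_DATASET_PROCESSES takes precedence over RUNPOD_CPU_COUNT, following the order of the branches in the excerpt.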


Added a feature to save prepared dataset in specified shards, removed…

8a091f0

… limiter on multiprocessing during tokenization, and a bug fix of qwen tokenizer

This was referenced Jul 14, 2025

Qwen Tokenizer Missing Attribute #2912

Closed

Improve Processed Dataset Saving Speed and Shards #2913

Closed

Remove Limiter on Multiprocessing during Dataset Tokenization #2914

Closed

Merge branch 'axolotl-ai-cloud:main' into main

5e45031

Merge branch 'axolotl-ai-cloud:main' into main

2820036

winglian requested a review from NanoCode012 July 15, 2025 00:19

winglian approved these changes Jul 15, 2025

NanoCode012 reviewed Jul 15, 2025

VarunGumma and others added 2 commits July 15, 2025 09:58

Merge branch 'axolotl-ai-cloud:main' into main

5d9db53

removed limiters and fixed config variable name

871be42

coderabbitai Bot reviewed Jul 15, 2025

VarunGumma requested review from NanoCode012 and winglian July 15, 2025 14:01

black lint

3b7f9da

NanoCode012 requested changes Jul 16, 2025

fixed process_count

9142aac

winglian requested a review from NanoCode012 July 16, 2025 21:04

Merge remote-tracking branch 'upstream/main'

3b35eb6

NanoCode012 added 2 commits July 17, 2025 12:40

chore: lint

941d4ba

feat: update handling of dataset_processes

ff012af

NanoCode012 approved these changes Jul 17, 2025

winglian reviewed Jul 17, 2025

winglian added the ready to merge label Jul 17, 2025

winglian merged commit 9f2bb18 into axolotl-ai-cloud:main Jul 17, 2025
15 of 16 checks passed

winglian mentioned this pull request Jul 19, 2025

limit num_proc when saving datasets to disk #2948

Merged

coderabbitai Bot mentioned this pull request Aug 15, 2025

feat:add support dataset_num_processes #3071

Closed

winglian removed the ready to merge label Aug 18, 2025

coderabbitai Bot mentioned this pull request Sep 4, 2025

feat:add support dataset_num_processes #3129

Merged
