Skip to content

Improve Dataset Processing Multiprocessing, Sharding, and Qwen Tokenizer Bug Fix.#2918

Merged
winglian merged 10 commits into
axolotl-ai-cloud:mainfrom
VarunGumma:main
Jul 17, 2025
Merged

Improve Dataset Processing Multiprocessing, Sharding, and Qwen Tokenizer Bug Fix.#2918
winglian merged 10 commits into
axolotl-ai-cloud:mainfrom
VarunGumma:main

Conversation

@VarunGumma
Copy link
Copy Markdown
Contributor

@VarunGumma VarunGumma commented Jul 14, 2025

This pull request introduces the following improvements and fixes:

Description

  • Removes the hard limiter on multiprocessing during dataset tokenization, allowing full utilization of available CPU cores or user-specified process count.

  • Improves the saving speed of processed datasets by enabling configurable multiprocessing and sharding, making these parameters flexible and part of the configuration.

  • Fixes a bug in the Qwen tokenizer integration where a missing or incorrect attribute (eod_id) caused preprocessing failures when using Qwen-derived models.

Motivation and Context

  • Multiprocessing Limiter Removal: Previously, tokenization was limited to a maximum of 64 processes, even on machines with more cores. This change removes the min(64, ...) limiter, enabling all available or user-specified CPU cores to be used, resulting in significant speedups for large-scale, long-context datasets. (Remove Limiter on Multiprocessing during Dataset Tokenization #2914)

  • Configurable Sharding and Multiprocessing on Save: Saving large processed datasets was slow due to excessive sharding and non-configurable multiprocessing. By exposing num_proc and num_shards as configuration options, users can now optimize saving speed for their hardware and dataset size. (Improve Processed Dataset Saving Speed and Shards #2913)

  • Qwen Tokenizer Attribute Bug: The Qwen tokenizer integration tried to access a non-existent eod_id attribute, causing failures during preprocessing. The logic has been corrected to use the appropriate token attributes, ensuring compatibility and successful preprocessing for Qwen-derived models. (Qwen Tokenizer Missing Attribute #2912)

How has this been tested?

  • Unit and integration tests were run for dataset tokenization and saving, verifying that:

    • The correct number of processes are spawned during tokenization, matching the user configuration or system CPU count.
    • Datasets are saved with the specified number of shards and processes, and the saved data is consistent and readable.
  • Manual testing was performed on:

    • A machine with more than 64 CPU cores to confirm all cores are utilized during tokenization.
    • Qwen SFT YAML configurations to ensure preprocessing completes without attribute errors and the correct tokens are set.

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Social Handles (Optional)

Summary by CodeRabbit

  • New Features

    • Added support for specifying the number of shards when saving prepared datasets.
  • Improvements

    • Enhanced dataset saving to allow parallel processing and sharding for improved performance.
    • Updated configuration options for dataset processing to provide more flexibility in resource usage.
    • Improved handling of special token IDs for certain tokenizers, ensuring more robust error messaging.
    • Removed upper limit on process count for dataset processing to better utilize available CPU resources.
    • Refined default process count determination to better adapt to different runtime environments.

… limiter on multiprocessing during tokenization, and a bug fix of qwen tokenizer
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jul 14, 2025

Walkthrough

The changes adjust how process counts and sharding are configured and used during dataset processing and saving. They add an optional configuration for dataset save sharding, update process count defaults and limits, and improve special token handling in tokenizers. No public API signatures were changed, but internal logic and configuration options were refined.

Changes

File(s) Change Summary
src/axolotl/datasets.py Removed upper limit on process count; directly use configured process_count in dataset filtering and mapping.
src/axolotl/core/datasets/chat.py Removed upper limit on process count; directly use provided process_count in parallel dataset mapping.
src/axolotl/loaders/tokenizer.py Refined Qwen tokenizer special token ID assignment by checking for eod_id attribute before setting tokens.
src/axolotl/utils/data/shared.py Unified num_workers usage; added num_proc, max_shard_size, and num_shards parameters to save_to_disk.
src/axolotl/utils/schemas/config.py Added optional num_dataset_shards_to_save field; changed dataset_processes default to remove min cap logic.
src/axolotl/utils/config/init.py Removed fallback setting of dataset_processes to CPU count if unset during config normalization.

Poem

A hop and a skip, the code takes a leap,
More shards and more workers, no limits to keep!
Tokens are tidied, configs are new,
Datasets save faster—what else can we do?
With whiskers a-twitch, I code and I cheer,
These changes bring progress, and carrots are near! 🥕


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9142aac and ff012af.

📒 Files selected for processing (3)
  • src/axolotl/core/datasets/chat.py (1 hunks)
  • src/axolotl/datasets.py (1 hunks)
  • src/axolotl/utils/schemas/config.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/axolotl/datasets.py
  • src/axolotl/core/datasets/chat.py
  • src/axolotl/utils/schemas/config.py
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jul 14, 2025

Codecov Report

Attention: Patch coverage is 80.00000% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/axolotl/utils/schemas/config.py 81.81% 2 Missing ⚠️
src/axolotl/utils/data/shared.py 66.66% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@winglian winglian requested a review from NanoCode012 July 15, 2025 00:19
Copy link
Copy Markdown
Collaborator

@winglian winglian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There changes all seem sane enough. iirc, we had the hard cap in place because we had noticed that because a lot of modern CPUs use hyperthreading, using os.count actually was worse/slower than simply capping. One thing we could consider is using os.cpu_count // 2 instead

Comment thread src/axolotl/utils/schemas/config.py Outdated
default=min(
int(os.environ.get("AXOLOTL_DATASET_PROCESSES", 32)), os.cpu_count()
), # type: ignore[type-var]
default=int(os.environ.get("AXOLOTL_DATASET_PROCESSES", os.cpu_count())), # type: ignore[type-var]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, back to this. I notice we try to catch this (due to old code) in few places.

  1. Here config.py
  2. normalize_config
  3. Within the ds process src/axolotl/datasets.py
  4. Chat ds parser src/axolotl/core/datasets/chat.py (need to also remove limiter here)

I think we can remove all except first as it'll run first in our validator.

Copy link
Copy Markdown
Contributor Author

@VarunGumma VarunGumma Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed that another Issue (#2753) also discusses this issue. Tagging it here.
Btw, my fix already addresses (3) in process src/axolotl/datasets.py.
I made the modifications for (2) and (4). Pushing it now.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NanoCode012, I have made the all discussed fixes and pushed them along in a new commit. Please review it soon, as this feature will really help me and lot others as well.

Comment thread src/axolotl/utils/schemas/config.py Outdated
json_schema_extra={"description": "Index of shard to use for whole dataset"},
)
skip_prepare_dataset: bool | None = False
num_save_shards: int | None = Field(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking whether we want ds in the name to signify it's for datasets only.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this defaults to None, does dataset.save_to_disk support None if we pass it? Just to be sure we're not overriding their defaults.

Copy link
Copy Markdown
Contributor Author

@VarunGumma VarunGumma Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think the name num_save_ds_shards is better and to be more specific prepared_dataset_num_shards. I would go with the second, as it is more explicit as well. What's your opinion?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NanoCode012 , Here are some numbers around the num_shards and num_proc argument on a 24 core CPU. Thanks for the reminder, I realized, we have to set the max_shard_size=None to avoid issues later.

Saving the dataset (15/15 shards): 100%|██████████████████████████████████████████████████████████████████| 349317/349317 [00:06<00:00, 56775.48 examples/s]
 | > Dataset saved in 6.16 seconds. (num_shards=None, num_proc=None, max_shard_size=None)
Saving the dataset (24/24 shards): 100%|█████████████████████████████████████████████████████████████████| 349317/349317 [00:00<00:00, 490461.43 examples/s]
 | > Dataset saved in 0.74 seconds. (num_proc=24, num_shards=None, max_shard_size=None)
Saving the dataset (5/5 shards): 100%|███████████████████████████████████████████████████████████████████| 349317/349317 [00:01<00:00, 240478.69 examples/s]
 | > Dataset saved in 1.47 seconds. (num_proc=24, num_shards=5, max_shard_size=None)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the numbers, By default, how many shards would it attempt to save?

Copy link
Copy Markdown
Contributor Author

@VarunGumma VarunGumma Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That depends. I think the max_shard_size is default at "500Mb". So, for a very long context dataset with ~9M data points for me, it creates 1000+ shards. In the example shown before, it saves 15 by default.

Comment on lines 202 to 203
token_names = ["bos_token", "eos_token", "pad_token", "unk_token"]
for attr_name in token_names:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may need to change the "<|endoftext|>" below to the eos_token value?

However, now that I step back, I'm wondering if we even need this section. The user should be setting it under special_tokens: instead of us handling it like this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually true, I feel the whole block can be removed as the newer Qwen models don't use it at all, and I am not happy hardcoding all the unavailable tokens as eos/eos_id. Again, older Qwen models might need this block.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, could we change this check to, hasattr(tokenizer, "eod_id"): <- sign of qwen base before the 2 for loops. Then, this section would only run for the old qwen base. We can then keep the original implementation without having to fallback to eos.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that seems like a good idea.

The best if case here could be if cfg.is_qwen_derived_model and hasattr(tokenizer, "eod_id"), and revert the code block to what we had earlier. But maybe the "<|endoftext|>" should also be changed to tokenizer.eos_token below

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But maybe the "<|endoftext|>" should also be changed to tokenizer.eos_token below

In this block, we are checking and maybe setting eos_token too (so tokenizer.eos_token may be empty). I'll say to just keep it for legacy unless another reviewer think we can just remove this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you are absolutely correct! I have kept it as is in the latest commit, and just converted the check to if cfg.is_qwen_derived_model and hasattr(tokenizer, "eod_id"). This avoids the error if someone adds is_qwen_derived_model to their config for the newer qwen models.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/axolotl/core/datasets/chat.py (1)

48-48: Consider adding input validation for process_count parameter.

While the change correctly uses the determined process count, consider adding validation to ensure process_count is a positive integer when provided by the user to prevent potential issues with invalid inputs.

def __init__(
    self,
    data: Dataset,
    model_transform: Union[PreTrainedTokenizer, Callable],
    *args,
    message_transform: Optional[Callable] = None,
    formatter=None,
    process_count: Optional[int] = None,
    keep_in_memory: Optional[bool] = False,
    **kwargs,
):
+    if process_count is not None and process_count <= 0:
+        raise ValueError("process_count must be a positive integer")
+    
    def map_fn(ex):
        # ... existing code ...
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d9db53 and 871be42.

📒 Files selected for processing (5)
  • src/axolotl/core/datasets/chat.py (1 hunks)
  • src/axolotl/loaders/tokenizer.py (1 hunks)
  • src/axolotl/utils/config/__init__.py (0 hunks)
  • src/axolotl/utils/data/shared.py (2 hunks)
  • src/axolotl/utils/schemas/config.py (2 hunks)
💤 Files with no reviewable changes (1)
  • src/axolotl/utils/config/init.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/axolotl/utils/data/shared.py
  • src/axolotl/loaders/tokenizer.py
  • src/axolotl/utils/schemas/config.py
🔇 Additional comments (1)
src/axolotl/core/datasets/chat.py (1)

44-44: LGTM: Process count determination improved.

The change correctly removes the artificial cap on process count, allowing full utilization of available CPU cores. This aligns with the PR objective of removing hard limits on multiprocessing during dataset tokenization.

Copy link
Copy Markdown
Collaborator

@NanoCode012 NanoCode012 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks almost good to go. Could you revert your changes to Colab too and check the comment below?

Comment thread src/axolotl/core/datasets/chat.py Outdated
process_count or os.cpu_count() # type: ignore[assignment]
)
num_proc = min(32, process_or_cpu_count)
process_or_cpu_count = process_count if process_count else os.cpu_count()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We pass cfg.dataset_processes to process_count already.

"process_count": cfg.dataset_processes,

cfg.dataset_processes would default to the os cpu count as well in the validator stage, so we don't need to have this check here. Just pass process_count to the num_proc below.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will make this change.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

Comment thread src/axolotl/datasets.py Outdated
def process(self, dataset):
features = dataset.features.keys()
num_proc = min(64, self.process_count if self.process_count else os.cpu_count())
num_proc = self.process_count if self.process_count else os.cpu_count()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I did not get this. Do you mean I just pass self.process_count instead of having an if case in line 49?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

Comment thread src/axolotl/loaders/tokenizer.py Outdated

# Qwen base only has single token, so we need to set the special tokens
if cfg.is_qwen_derived_model:
# the following check is for Qwen1 series models of base models
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# the following check is for Qwen1 series models of base models
# the following check is for Qwen1 base models

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed! Please take a look at the final commit.

@winglian winglian requested a review from NanoCode012 July 16, 2025 21:04
@VarunGumma
Copy link
Copy Markdown
Contributor Author

@winglian and @NanoCode012 , can you please take a look at the changes and please let me know if you need any more information. Please approve it if you all is good.

@NanoCode012
Copy link
Copy Markdown
Collaborator

Let's wait for CI a bit and I'll run some manual checks on my end as well

Copy link
Copy Markdown
Collaborator

@NanoCode012 NanoCode012 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed lint and updated handling of default dataset_processes

if data.get("dataset_processes") is None:
if axolotl_dataset_processes := os.environ.get("AXOLOTL_DATASET_PROCESSES"):
data["dataset_processes"] = int(axolotl_dataset_processes)
elif runpod_cpu_count := os.environ.get("RUNPOD_CPU_COUNT"):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 🙌

@winglian winglian merged commit 9f2bb18 into axolotl-ai-cloud:main Jul 17, 2025
15 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants