feat:add support dataset_num_processes by ved1beta · Pull Request #3129 · axolotl-ai-cloud/axolotl

ved1beta · 2025-09-04T15:03:09Z

Description

Refactor for #1783: deprecate dataset_processes in favor of dataset_num_proc.

Summary by CodeRabbit

New Features
- Introduced dataset_num_proc config to control dataset preprocessing parallelism.
- Added AXOLOTL_DATASET_NUM_PROC environment variable (takes precedence over other CPU-count envs).
Refactor
- Replaced uses of dataset_processes with dataset_num_proc across data processing and training paths.
- Deprecated dataset_processes; warning shown if used and error raised when both keys are set.
Documentation
- Updated debugging guidance and examples to use dataset_num_proc.
Chores
- CI images and single-GPU runner now export and pass AXOLOTL_DATASET_NUM_PROC by default.

coderabbitai · 2025-09-04T15:03:16Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Renames the dataset parallelism configuration from dataset_processes to dataset_num_proc across code, tests, and docs. Adds AXOLOTL_DATASET_NUM_PROC env handling and precedence over AXOLOTL_DATASET_PROCESSES. Updates schema validation to deprecate dataset_processes with compatibility and conflict checks.

Changes

Cohort / File(s)	Summary
Env and CI `cicd/Dockerfile.jinja`, `cicd/single_gpu.py`	Add AXOLOTL_DATASET_NUM_PROC="8" to Dockerfile and subprocess env; keep existing AXOLOTL_DATASET_PROCESSES="8".
Schema and defaults `src/axolotl/utils/schemas/config.py`	Deprecate dataset_processes; introduce validator default_dataset_num_proc with back-compat (map dataset_processes → dataset_num_proc, warn; error if both set); default via get_default_process_count.
Default process count/env precedence `src/axolotl/utils/datasets.py`	Make get_default_process_count prefer AXOLOTL_DATASET_NUM_PROC over AXOLOTL_DATASET_PROCESSES, then RUNPOD_CPU_COUNT, then os.cpu_count().
Builders `src/axolotl/core/builders/base.py`	Accept dataset_num_proc in training_args; remove aliasing from dataset_processes.
Data utilities `src/axolotl/utils/data/*` (`.../rl.py`, `.../shared.py`, `.../utils.py`, `.../wrappers.py`)	Replace uses of cfg.dataset_processes with cfg.dataset_num_proc for map/filter/save/wrapper worker counts.
Trainer `src/axolotl/utils/trainer.py`	Swap cfg.dataset_processes → cfg.dataset_num_proc in map/filter paths; update sampler/process count reference accordingly.
Docs `docs/debugging.qmd`	Update guidance and examples: dataset_processes → dataset_num_proc (YAML and CLI flag).
Tests `tests/core/test_builders.py`, `tests/e2e/patched/test_activation_checkpointing.py`, `tests/e2e/test_llama_pretrain.py`, `tests/test_datasets.py`, `tests/test_packed_dataset.py`	Rename test configs to use dataset_num_proc instead of dataset_processes; values unchanged.
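
As a rough illustration of the env-var precedence described in the table above, the lookup order could be sketched as follows. This is not the repository's code: the `env` parameter is a hypothetical addition so the function can be exercised without touching the real environment, and the final `or 1` guard is an assumption rather than confirmed behavior.

```python
import os
from typing import Mapping, Optional


def get_default_process_count(env: Optional[Mapping[str, str]] = None) -> int:
    """Return a dataset worker count using the precedence from the walkthrough.

    `env` is a hypothetical parameter for testability; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    # Checked in order: new variable, deprecated variable, RunPod CPU count.
    for name in (
        "AXOLOTL_DATASET_NUM_PROC",
        "AXOLOTL_DATASET_PROCESSES",
        "RUNPOD_CPU_COUNT",
    ):
        value = env.get(name)
        if value:
            return int(value)
    # os.cpu_count() may return None; falling back to 1 is an assumption.
    return os.cpu_count() or 1
```

For example, with both `AXOLOTL_DATASET_NUM_PROC=2` and `AXOLOTL_DATASET_PROCESSES=8` present, this sketch returns 2.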

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Config doc autogen #2718 — Also modifies AxolotlInputConfig schema/validation; overlaps in config handling logic.
Improve Dataset Processing Multiprocessing, Sharding, and Qwen Tokenizer Bug Fix. #2918 — Adjusts multiprocessing defaults and preprocessing paths; touches similar utils and schema areas.
limit num_proc when saving datasets to disk #2948 — Alters num_proc handling when saving datasets; intersects with the same data utils paths.

Suggested labels

ready to merge

Suggested reviewers

winglian
NanoCode012

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/axolotl/utils/data/shared.py (1)
412-423: Enforce num_workers ≥ 1
Refactor the fallback so it can’t return 0 or None—otherwise Dataset.from_generator and _generate_from_iterable_dataset (which does idx % num_workers) will error.

Apply:
-    num_workers = cfg.dataset_num_proc or get_default_process_count()
+    num_workers_cfg = cfg.dataset_num_proc
+    num_workers = (
+        num_workers_cfg
+        if isinstance(num_workers_cfg, int) and num_workers_cfg > 0
+        else (get_default_process_count() or 1)
+    )
Add a unit test for cfg.dataset_num_proc = 0/None with no AXOLOTL_DATASET_NUM_PROC to confirm the fallback yields 1.
src/axolotl/utils/trainer.py (1)
461-472: Fix typo: use cfg.dataset_num_proc for MultipackBatchSampler
In src/axolotl/utils/trainer.py (around line 470), replace the undefined cfg.data_num_proc with cfg.dataset_num_proc:
-                num_processes=cfg.data_num_proc,
+                num_processes=cfg.dataset_num_proc,
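
The `num_workers` fallback suggested for shared.py above can be sketched in isolation. The helper names `resolve_num_workers` and `shard_for` are hypothetical and exist only for illustration; the real code reads `cfg.dataset_num_proc` and calls `get_default_process_count()` directly.

```python
from typing import Callable, Optional


def resolve_num_workers(
    configured: Optional[int],
    default_count: Callable[[], Optional[int]],
) -> int:
    """Return a worker count that is never 0 or None."""
    if isinstance(configured, int) and configured > 0:
        return configured
    # The default may itself be None or 0, so clamp the result to at least 1.
    return max(default_count() or 1, 1)


def shard_for(idx: int, num_workers: int) -> int:
    """Assign item `idx` to a worker; num_workers == 0 would raise ZeroDivisionError."""
    return idx % num_workers
```

The clamp matters because a modulo by zero in the sharding step fails at runtime rather than at config validation time.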

🧹 Nitpick comments (7)

cicd/single_gpu.py (1)
67-70: Don’t override user-provided env; mirror old var only if absent.

Currently this forces both env vars to "8" even if the caller set different values. Respect existing env and mirror to the deprecated variable only when needed.
     sp_env = os.environ.copy()
-    sp_env["AXOLOTL_DATASET_NUM_PROC"] = "8"
-    sp_env["AXOLOTL_DATASET_PROCESSES"] = "8"
+    # Prefer caller-provided values; default to 8 if neither is set.
+    sp_env.setdefault("AXOLOTL_DATASET_NUM_PROC", os.environ.get("AXOLOTL_DATASET_NUM_PROC", "8"))
+    sp_env.setdefault(
+        "AXOLOTL_DATASET_PROCESSES",
+        os.environ.get("AXOLOTL_DATASET_PROCESSES", sp_env["AXOLOTL_DATASET_NUM_PROC"]),
+    )
cicd/Dockerfile.jinja (1)
12-13: Avoid baking a fixed default; make it configurable at build/run time.

Hardcoding "8" in the image can surprise downstream users. Expose as ARG or rely on runtime ENV, and keep the deprecated var only for compatibility.
-ENV AXOLOTL_DATASET_NUM_PROC="8"
-ENV AXOLOTL_DATASET_PROCESSES="8"
+ARG AXOLOTL_DATASET_NUM_PROC=8
+ENV AXOLOTL_DATASET_NUM_PROC="${AXOLOTL_DATASET_NUM_PROC}"
+# Back-compat: default to the same value unless overridden at runtime
+ARG AXOLOTL_DATASET_PROCESSES
+ENV AXOLOTL_DATASET_PROCESSES="${AXOLOTL_DATASET_PROCESSES:-${AXOLOTL_DATASET_NUM_PROC}}"
src/axolotl/utils/datasets.py (1)
7-13: Validate and sanitize process count; guard os.cpu_count() None.

Ensure positive integers and safe fallback to 1.
-    if axolotl_dataset_num_proc := os.environ.get("AXOLOTL_DATASET_NUM_PROC"):
-        return int(axolotl_dataset_num_proc)
+    if axolotl_dataset_num_proc := os.environ.get("AXOLOTL_DATASET_NUM_PROC"):
+        val = int(axolotl_dataset_num_proc)
+        return max(val, 1)
     if axolotl_dataset_processes := os.environ.get("AXOLOTL_DATASET_PROCESSES"):
-        return int(axolotl_dataset_processes)
+        val = int(axolotl_dataset_processes)
+        return max(val, 1)
     if runpod_cpu_count := os.environ.get("RUNPOD_CPU_COUNT"):
-        return int(runpod_cpu_count)
-    return os.cpu_count()
+        val = int(runpod_cpu_count)
+        return max(val, 1)
+    cpu = os.cpu_count() or 1
+    return max(cpu, 1)
src/axolotl/utils/data/utils.py (1)
224-228: Defensive default for num_proc (handle 0/None).

If config leaves num_proc unset or mis-set to 0, default to 1 to avoid runtime surprises.
-    if not isinstance(dataset, IterableDataset):
-        filter_map_kwargs["num_proc"] = cfg.dataset_num_proc
+    if not isinstance(dataset, IterableDataset):
+        # datasets.filter/map accept None, but enforce >=1 if provided
+        np_val = cfg.dataset_num_proc
+        if np_val is not None:
+            np_val = max(int(np_val), 1)
+        filter_map_kwargs["num_proc"] = np_val
         filter_map_kwargs["load_from_cache_file"] = not cfg.is_preprocess
docs/debugging.qmd (2)
32-33: Doc tip: mention env var override

Add that AXOLOTL_DATASET_NUM_PROC=1 also forces single-process preprocessing (it now takes precedence over AXOLOTL_DATASET_PROCESSES).

Apply near this bullet:
-    - Set `dataset_num_proc: 1` in your axolotl config or run the training command with `--dataset_num_proc=1`.
+    - Set `dataset_num_proc: 1` in your axolotl config, run with `--dataset_num_proc=1`, or set `AXOLOTL_DATASET_NUM_PROC=1`.
104-114: Mirror env var in VSCode args comment

Add a note about AXOLOTL_DATASET_NUM_PROC alongside the existing VSCode launch snippet to surface this quick-debug option:
 docs/debugging.qmd: lines 104–114
-                // with the debugging tips above.  Modify as needed.
+                // with the debugging tips above.  Modify as needed. Alternatively set AXOLOTL_DATASET_NUM_PROC=1.
tests/core/test_builders.py (1)
443-443: Remove redundant reassignment of dataset_num_proc.

All cfg fixtures copy base_cfg which already sets dataset_num_proc=4. This line is unnecessary.
-        cfg["dataset_num_proc"] = 4
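
The nitpick for cicd/single_gpu.py above (respect caller-provided values and only mirror to the deprecated variable when it is absent) can be expressed a little more directly with `dict.setdefault`. This is a minimal sketch under assumptions: `build_subprocess_env` and its `base` and `default` parameters are hypothetical names, not part of the CI script.

```python
import os
from typing import Dict, Mapping, Optional


def build_subprocess_env(
    base: Optional[Mapping[str, str]] = None,
    default: str = "8",
) -> Dict[str, str]:
    """Copy the environment and fill in dataset worker variables only if missing."""
    env = dict(os.environ if base is None else base)
    # Keep a caller-provided value; otherwise fall back to the default.
    env.setdefault("AXOLOTL_DATASET_NUM_PROC", default)
    # Mirror the new variable into the deprecated one only when it is absent.
    env.setdefault("AXOLOTL_DATASET_PROCESSES", env["AXOLOTL_DATASET_NUM_PROC"])
    return env
```

Because `setdefault` never overwrites an existing key, a caller who exports `AXOLOTL_DATASET_NUM_PROC=2` keeps that value in the subprocess.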

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c6ae5c4 and 09acac4.

📒 Files selected for processing (16)

cicd/Dockerfile.jinja (1 hunks)
cicd/single_gpu.py (1 hunks)
docs/debugging.qmd (2 hunks)
src/axolotl/core/builders/base.py (1 hunks)
src/axolotl/utils/data/rl.py (2 hunks)
src/axolotl/utils/data/shared.py (1 hunks)
src/axolotl/utils/data/utils.py (1 hunks)
src/axolotl/utils/data/wrappers.py (1 hunks)
src/axolotl/utils/datasets.py (1 hunks)
src/axolotl/utils/schemas/config.py (2 hunks)
src/axolotl/utils/trainer.py (5 hunks)
tests/core/test_builders.py (2 hunks)
tests/e2e/patched/test_activation_checkpointing.py (1 hunks)
tests/e2e/test_llama_pretrain.py (1 hunks)
tests/test_datasets.py (7 hunks)
tests/test_packed_dataset.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (7)

src/axolotl/utils/data/rl.py (1)
- tests/test_exact_deduplication.py (1): cfg (201-216)

src/axolotl/utils/data/wrappers.py (1)
- tests/test_exact_deduplication.py (1): cfg (201-216)

src/axolotl/utils/data/shared.py (2)
- tests/test_exact_deduplication.py (1): cfg (201-216)
- src/axolotl/utils/datasets.py (1): get_default_process_count (6-13)

src/axolotl/utils/data/utils.py (1)
- tests/test_exact_deduplication.py (1): cfg (201-216)

tests/core/test_builders.py (1)
- tests/test_exact_deduplication.py (1): cfg (201-216)

src/axolotl/utils/trainer.py (2)
- src/axolotl/integrations/base.py (2): cfg (339-340), cfg (343-344)
- tests/test_exact_deduplication.py (1): cfg (201-216)

src/axolotl/utils/schemas/config.py (1)
- src/axolotl/utils/datasets.py (1): get_default_process_count (6-13)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)

GitHub Check: PyTest from Source Dist (3.11, 2.8.0)
GitHub Check: PyTest (3.11, 2.6.0)
GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
GitHub Check: PyTest (3.11, 2.7.1)
GitHub Check: PyTest (3.11, 2.8.0)
GitHub Check: preview

🔇 Additional comments (10)

src/axolotl/utils/schemas/config.py (1)

222-232: Deprecation metadata reads well.

Marking dataset_processes as deprecated with clear description is good.

tests/test_packed_dataset.py (1)

58-58: LGTM on key rename.

Config now uses dataset_num_proc; matches the new canonical key.

tests/e2e/patched/test_activation_checkpointing.py (1)

72-72: LGTM on config rename in e2e.

Keeps behavior while aligning to dataset_num_proc.

Add a tiny param test to ensure:
- dataset_processes emits a deprecation warning and still works.
- dataset_num_processes (new alias) is accepted and normalized to dataset_num_proc.

tests/e2e/test_llama_pretrain.py (1)

32-32: LGTM: config key rename to dataset_num_proc

Switching to dataset_num_proc in this E2E test matches the new config and ensures single-process preprocessing for determinism.

src/axolotl/core/builders/base.py (1)

501-505: No action needed: dataset_num_proc is a valid training-argument field
It’s declared on the AxolotlTrainingMixins dataclass (so all AxolotlTrainingArguments variants accept it) and is exposed on config-driven classes—e.g. AxolotlORPOConfig lists dataset_num_proc as a parameter (huggingface.co)

src/axolotl/utils/data/rl.py (1)

116-120: LGTM: map uses dataset_num_proc

dataset.map(..., num_proc=cfg.dataset_num_proc, ...) aligns with the new key.

src/axolotl/utils/trainer.py (4)

279-281: LGTM: standardize num_proc for map/filter

Using cfg.dataset_num_proc in non-Iterable paths is consistent.

317-322: LGTM: group-by-length mapping uses dataset_num_proc

Consistent with the new key.

334-339: LGTM: PoSE train map uses dataset_num_proc

Matches the rest of the data pipeline changes.

343-348: LGTM: PoSE eval map uses dataset_num_proc

Consistent application.

coderabbitai · 2025-09-04T15:19:34Z

            split_datasets[i] = split_datasets[i].filter(
                drop_long,
-                num_proc=cfg.dataset_processes,
+                num_proc=cfg.dataset_num_proc,
                load_from_cache_file=not cfg.is_preprocess,
                desc="Dropping Long Sequences",
            )


💡 Verification agent

🧩 Analysis chain

LGTM: filter uses dataset_num_proc

Consistent use during long-seq dropping.

Quick check that the schema constrains this to >=1 to avoid HF num_proc=0 issues:

🏁 Script executed:

#!/bin/bash
rg -nP 'dataset_num_proc.*Field\(|dataset_num_proc.*Annotated|model_validator.*default_dataset_num_proc' src/axolotl/utils/schemas -C2

Length of output: 136

🏁 Script executed:

#!/bin/bash
# Search entire repository for occurrences of dataset_num_proc and inspect definitions
rg -nH 'dataset_num_proc' -C3 .

Length of output: 17484

Enforce dataset_num_proc >= 1 in the schema: The pydantic schema for dataset_num_proc currently has no minimum constraint, so users could set it to 0 and trigger HF’s num_proc=0 errors. Add ge=1 to its Field in src/axolotl/utils/schemas/config.py or include a validator in default_dataset_num_proc to ensure it’s always at least 1.

🤖 Prompt for AI Agents

In src/axolotl/utils/data/rl.py around lines 235 to 240, the filter call passes cfg.dataset_num_proc to HF dataset's num_proc which can be 0; update src/axolotl/utils/schemas/config.py to enforce dataset_num_proc >= 1 by adding ge=1 to its pydantic Field (or alternatively add a validator in default_dataset_num_proc that coerces/raises and ensures a minimum of 1), so users cannot set 0 and HF won’t receive num_proc=0.

@ved1beta, this is an edge case that I doubt anyone would hit, but should we add a minimum check in the validator?

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

coderabbitai · 2025-09-04T15:19:34Z

    dataset_kwargs: dict[str, Any] = {
-        "process_count": cfg.dataset_processes,
+        "process_count": cfg.dataset_num_proc,
        "keep_in_memory": cfg.dataset_keep_in_memory is True,
    }


💡 Verification agent

🧩 Analysis chain

Harden process_count fallback to avoid None/0.

If cfg.dataset_num_proc is unset or 0, process_count may be invalid for downstream .map/.filter. Use a safe fallback.

Apply:

+from axolotl.utils.datasets import get_default_process_count
@@
-    dataset_kwargs: dict[str, Any] = {
-        "process_count": cfg.dataset_num_proc,
+    process_count = (
+        cfg.dataset_num_proc
+        if isinstance(cfg.dataset_num_proc, int) and cfg.dataset_num_proc > 0
+        else (get_default_process_count() or 1)
+    )
+    dataset_kwargs: dict[str, Any] = {
+        "process_count": process_count,
         "keep_in_memory": cfg.dataset_keep_in_memory is True,
     }

Please confirm schema normalization guarantees a positive int for dataset_num_proc when only deprecated keys (dataset_processes or dataset_num_processes) are set.

🏁 Script executed:

#!/bin/bash
set -e
echo "=== Occurrences of dataset_num_proc ==="
rg -n "dataset_num_proc" -C3
echo
echo "=== Occurrences of deprecated key dataset_processes ==="
rg -n "dataset_processes" -C3 || true
echo
echo "=== Occurrences of deprecated key dataset_num_processes ==="
rg -n "dataset_num_processes" -C3 || true
echo
echo "=== Scanning all config files for dataset normalization ==="
fd -t f -g "*config*.py" | while read -r file; do
  echo
  echo ">> File: $file"
  rg -nP "dataset_.*process" -C2 "$file" || true
done

Length of output: 24593

Harden process_count fallback in wrappers.py
Schema normalization only maps dataset_processes → dataset_num_proc without enforcing > 0 (and dataset_num_processes isn’t recognized), so cfg.dataset_num_proc can still be None or 0. Apply:

+from axolotl.utils.datasets import get_default_process_count
@@ src/axolotl/utils/data/wrappers.py:82
-    dataset_kwargs: dict[str, Any] = {
-        "process_count": cfg.dataset_num_proc,
+    process_count = (
+        cfg.dataset_num_proc
+        if isinstance(cfg.dataset_num_proc, int) and cfg.dataset_num_proc > 0
+        else (get_default_process_count() or 1)
+    )
+    dataset_kwargs: dict[str, Any] = {
+        "process_count": process_count,
         "keep_in_memory": cfg.dataset_keep_in_memory is True,
     }

🤖 Prompt for AI Agents

In src/axolotl/utils/data/wrappers.py around lines 82 to 85, the dataset_kwargs assignment uses cfg.dataset_num_proc directly which can be None or 0; change the logic to ensure process_count is a positive integer fallbacking to 1 (e.g., compute a safe value by treating None or non-positive values as 1 and casting to int), and keep the existing keep_in_memory boolean coercion unchanged.

coderabbitai · 2025-09-04T15:19:35Z

+    def default_dataset_num_proc(cls, data):
+        if data.get("dataset_processes") is not None:
+            if data.get("dataset_num_proc") is None:
+                data["dataset_num_proc"] = data["dataset_processes"]
+                LOG.warning(
+                    "dataset_processes is deprecated and will be removed in a future version. "
+                    "Please use dataset_num_proc instead."
+                )
+            else:
+                raise ValueError(
+                    "Both dataset_processes and dataset_num_proc are set. "
+                    "Using dataset_num_proc and ignoring dataset_processes."
+                )
+            if data.get("dataset_num_proc") is None:
+                data["dataset_num_proc"] = get_default_process_count()

        return data


⚠️ Potential issue

Defaulting logic leaves dataset_num_proc unset when both fields are None; error message contradicts behavior.

If neither key is provided, dataset_num_proc stays None (no default applied).

When both are set, the ValueError message says it will ignore dataset_processes, but the code raises instead.

Proposed fix: always derive a sanitized dataset_num_proc; raise with a clear message only when both are set.

     @model_validator(mode="before")
     @classmethod
-    def default_dataset_num_proc(cls, data):
-        if data.get("dataset_processes") is not None:
-            if data.get("dataset_num_proc") is None:
-                data["dataset_num_proc"] = data["dataset_processes"]
-                LOG.warning(
-                    "dataset_processes is deprecated and will be removed in a future version. "
-                    "Please use dataset_num_proc instead."
-                )
-            else:
-                raise ValueError(
-                    "Both dataset_processes and dataset_num_proc are set. "
-                    "Using dataset_num_proc and ignoring dataset_processes."
-                )
-            if data.get("dataset_num_proc") is None:
-                data["dataset_num_proc"] = get_default_process_count()
-        return data
+    def default_dataset_num_proc(cls, data):
+        has_old = data.get("dataset_processes") is not None
+        has_new = data.get("dataset_num_proc") is not None
+        if has_old and has_new:
+            raise ValueError(
+                "Both dataset_processes and dataset_num_proc are set. Remove dataset_processes and use dataset_num_proc."
+            )
+        if not has_new:
+            if has_old:
+                data["dataset_num_proc"] = data["dataset_processes"]
+                LOG.warning(
+                    "dataset_processes is deprecated and will be removed in a future version. "
+                    "Please use dataset_num_proc instead."
+                )
+            else:
+                data["dataset_num_proc"] = get_default_process_count()
+        # sanitize
+        try:
+            data["dataset_num_proc"] = max(int(data["dataset_num_proc"]), 1)
+        except (TypeError, ValueError):
+            raise ValueError("dataset_num_proc must be a positive integer.")
+        return data
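
To make the proposed control flow easier to check, here is a standalone sketch of the same logic as a plain function over a dict, without pydantic. It mirrors the bot's proposal, not necessarily the merged code (a later review comment asks for a warning instead of a ValueError). The function name `normalize_dataset_num_proc` and the injectable `default_count` parameter are hypothetical.

```python
import logging
from typing import Any, Callable, Dict

LOG = logging.getLogger(__name__)


def normalize_dataset_num_proc(
    data: Dict[str, Any],
    default_count: Callable[[], int] = lambda: 1,
) -> Dict[str, Any]:
    """Resolve dataset_num_proc from the new key, the deprecated key, or a default."""
    has_old = data.get("dataset_processes") is not None
    has_new = data.get("dataset_num_proc") is not None
    if has_old and has_new:
        raise ValueError(
            "Both dataset_processes and dataset_num_proc are set. "
            "Remove dataset_processes and use dataset_num_proc."
        )
    if not has_new:
        if has_old:
            # Back-compat path: copy the deprecated value and warn.
            data["dataset_num_proc"] = data["dataset_processes"]
            LOG.warning("dataset_processes is deprecated; use dataset_num_proc instead.")
        else:
            data["dataset_num_proc"] = default_count()
    try:
        # Clamp to at least 1 so downstream map/filter never receive 0.
        data["dataset_num_proc"] = max(int(data["dataset_num_proc"]), 1)
    except (TypeError, ValueError) as exc:
        raise ValueError("dataset_num_proc must be a positive integer.") from exc
    return data
```

Unlike the original validator, the default is applied outside the `dataset_processes` branch, so a config that sets neither key still ends up with a value.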

🤖 Prompt for AI Agents

In src/axolotl/utils/schemas/config.py around lines 1307-1323, the defaulting logic can leave dataset_num_proc unset when both keys are None and the ValueError contradicts intended behavior when both are set; update the function so it first checks: if dataset_processes is set and dataset_num_proc is None, assign dataset_num_proc = dataset_processes and log the deprecation warning; if both dataset_processes and dataset_num_proc are set, raise a ValueError with a clear message that dataset_num_proc will be used and dataset_processes is ignored; after these checks, if dataset_num_proc is still None, set it to get_default_process_count(); finally return data.

coderabbitai · 2025-09-04T15:19:35Z

            "ddp_bucket_cap_mb": 25,
            "ddp_broadcast_buffers": False,
-            "dataset_processes": 4,
+            "dataset_num_proc": 4,


💡 Verification agent

🧩 Analysis chain

Rename to dataset_num_proc looks correct.

Matches the new canonical key; no logic change.

Run this to find any remaining uses of the deprecated key or accidental dual-specification elsewhere in the repo:

🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "Deprecated key occurrences (should be 0 outside compat paths):"
rg -nP '\bdataset_processes\b' -C2
echo
echo "Check env var usages:"
rg -nP 'AXOLOTL_DATASET_(NUM_PROC|PROCESSES)\b' -C2
echo
echo "Canonical keys in use:"
rg -nP '\bdataset_num_proc(?:esses)?\b' -C2
echo
echo "Potential spots setting both keys in same mapping (should be 0):"
rg -nPU '(?s)\{[^}]*\bdataset_processes\b[^}]*\bdataset_num_proc\b[^}]*\}' -n || true

Length of output: 17337

Remove deprecated dataset_processes references outside compatibility code

tests/test_exact_deduplication.py:213, tests/e2e/test_streaming.py:33 and devtools/dev_chat_template.yml:16 still use the "dataset_processes" key; update these to "dataset_num_proc".

Keep the deprecation handling in src/axolotl/utils/schemas/config.py (and the env-var fallbacks in src/axolotl/utils/datasets.py and CI scripts) for now

coderabbitai · 2025-09-04T15:19:35Z

+                    "dataset_num_proc": 4,
                }


💡 Verification agent

🧩 Analysis chain

LGTM on adopting dataset_num_proc across tests.

Renames are consistent and improve clarity.

There appear to be lingering uses of dataset_processes elsewhere (e.g., tests/test_exact_deduplication.py in provided snippet). Run to surface stragglers and verify alias support:

Also applies to: 183-184, 223-224, 255-256, 289-290, 373-374, 474-475

🏁 Script executed:

#!/bin/bash
# Find deprecated/new keys and envs
rg -nP -C2 '\bdataset_processes\b|\bdataset_num_processes\b|\bAXOLOTL_DATASET_PROCESSES\b' || true
# Sanity: confirm canonical key usage sites
rg -nP -C2 '\bdataset_num_proc\b' || true

Length of output: 16633

🏁 Script executed:

#!/bin/bash
sed -n '200,240p' tests/test_exact_deduplication.py
sed -n '20,80p' tests/e2e/test_streaming.py

Length of output: 3392

Replace all remaining dataset_processes occurrences with the new dataset_num_proc key
Alias support via default_dataset_num_proc and env-var fallbacks is in place. Update the following to use dataset_num_proc (and drop AXOLOTL_DATASET_PROCESSES):

tests/test_exact_deduplication.py (fixture at lines 211-214)

tests/e2e/test_streaming.py (cfg at lines 31-34)

devtools/dev_chat_template.yml (line 16)

cicd/single_gpu.py (remove AXOLOTL_DATASET_PROCESSES)

cicd/Dockerfile.jinja (remove AXOLOTL_DATASET_PROCESSES)

🤖 Prompt for AI Agents

In tests/test_datasets.py around lines 144-145, ensure the configuration uses "dataset_num_proc" (already present) and then update all remaining references of the old key "dataset_processes" to "dataset_num_proc" across the codebase: change the fixture in tests/test_exact_deduplication.py (lines ~211-214) to use dataset_num_proc, update the config in tests/e2e/test_streaming.py (lines ~31-34) to dataset_num_proc, and remove any use of the AXOLOTL_DATASET_PROCESSES env var/key from devtools/dev_chat_template.yml (line 16), cicd/single_gpu.py, and cicd/Dockerfile.jinja; keep alias/fallback logic via default_dataset_num_proc and env var fallbacks if present but do not reintroduce the AXOLOTL_DATASET_PROCESSES key.

codecov · 2025-09-04T15:23:02Z

Codecov Report

❌ Patch coverage is 50.00000% with 9 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/axolotl/utils/schemas/config.py	50.00%	5 Missing ⚠️
src/axolotl/utils/trainer.py	25.00%	3 Missing ⚠️
src/axolotl/utils/datasets.py	50.00%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

NanoCode012

Thanks!

salmanmohammadi · 2025-09-09T12:56:30Z

+                    "Please use dataset_num_proc instead."
+                )
+            else:
+                raise ValueError(


I don't think we need a ValueError here, just log a warning.

salmanmohammadi · 2025-09-09T12:57:46Z

                sequential=cfg.sample_packing_sequentially,
                drop_last=True,
-                num_processes=cfg.dataset_processes,
+                num_processes=cfg.data_num_proc,


salmanmohammadi · 2025-09-09T12:58:39Z

            "ddp_bucket_cap_mb": 25,
            "ddp_broadcast_buffers": False,
-            "dataset_processes": 4,
+            "dataset_num_proc": 4,


We should leave at least one of the tests with the old value to ensure the migration happens correctly.
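
A migration test of the kind requested here might look roughly like the sketch below. It is self-contained for illustration only: `migrate_deprecated_key` is a hypothetical stand-in for the schema validator's back-compat path, and it uses `warnings.warn` so the warning can be captured with the standard library, whereas the PR's validator logs through `LOG.warning`.

```python
import warnings
from typing import Any, Dict


def migrate_deprecated_key(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in: map dataset_processes to dataset_num_proc with a warning."""
    cfg = dict(cfg)
    if cfg.get("dataset_processes") is not None and cfg.get("dataset_num_proc") is None:
        warnings.warn(
            "dataset_processes is deprecated; use dataset_num_proc instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        cfg["dataset_num_proc"] = cfg["dataset_processes"]
    return cfg


def test_old_key_is_migrated() -> None:
    """A config that only sets the old key should still yield dataset_num_proc."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        cfg = migrate_deprecated_key({"dataset_processes": 4})
    assert cfg["dataset_num_proc"] == 4
    assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```

In the real test suite the equivalent check would go through the project's config validation entry point rather than a local helper.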

salmanmohammadi · 2025-09-09T13:00:33Z

Sorry to request changes, just a few comments and didn't want to merge until they were addressed.

salmanmohammadi · 2025-09-15T11:57:32Z

+        deprecated="Use `dataset_num_proc` instead. This parameter will be removed in a future version.",
        json_schema_extra={
            "description": (
+                "DEPRECATED: Use `dataset_num_proc` instead. "


We don't need this when we're using deprecated above.

salmanmohammadi · 2025-09-15T12:10:46Z

-            data["dataset_processes"] = get_default_process_count()
+    def default_dataset_num_proc(cls, data):
+        if data.get("dataset_processes") is not None:
+            if data.get("dataset_num_proc") is not None:


Could you check this again? The conditions for this statement are reversed. You're checking for the existence of dataset_num_proc but then overriding it with dataset_processes.

salmanmohammadi

LGTM after CI is green.

NanoCode012

Just did a re-check and noticed this.

NanoCode012 · 2025-10-10T04:02:19Z

 def get_default_process_count():
+    if axolotl_dataset_num_proc := os.environ.get("AXOLOTL_DATASET_NUM_PROC"):
+        return int(axolotl_dataset_num_proc)
    if axolotl_dataset_processes := os.environ.get("AXOLOTL_DATASET_PROCESSES"):


I missed this. Should we also log a warning (in schemas/config.py) when the deprecated AXOLOTL_DATASET_PROCESSES variable is set?
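
One possible shape for such a warning is sketched below. Whether it belongs in the schema module or next to `get_default_process_count` is left open by the thread; the function name `warn_if_deprecated_env` and its `env` parameter are hypothetical.

```python
import logging
import os
from typing import Mapping, Optional

LOG = logging.getLogger(__name__)


def warn_if_deprecated_env(env: Optional[Mapping[str, str]] = None) -> bool:
    """Log a deprecation warning if AXOLOTL_DATASET_PROCESSES is set.

    Returns True when the warning was emitted, which keeps the helper easy to test.
    """
    env = os.environ if env is None else env
    if env.get("AXOLOTL_DATASET_PROCESSES"):
        LOG.warning(
            "AXOLOTL_DATASET_PROCESSES is deprecated; set AXOLOTL_DATASET_NUM_PROC instead."
        )
        return True
    return False
```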

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

* feat:add support dataset_num_processes
* chore
* required changes
* requested chnages
* required chnages
* required changes
* required changes
* elif get_default_process_count()
* add:del data
* Update cicd/Dockerfile.jinja
  Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* Update cicd/single_gpu.py
  Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
---------
Co-authored-by: salman <salman.mohammadi@outlook.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
(cherry picked from commit cd856b4)

feat:add support dataset_num_processes

db5761e

ved1beta added 2 commits September 4, 2025 20:34

chore

09acac4

required changes

87dab3c

coderabbitai Bot reviewed Sep 4, 2025

NanoCode012 approved these changes Sep 5, 2025

salmanmohammadi suggested changes Sep 9, 2025

requested chnages

fa822df

salmanmohammadi reviewed Sep 9, 2025

Comment thread src/axolotl/utils/schemas/config.py

ved1beta and others added 3 commits September 9, 2025 23:49

required chnages

581d6f6

required changes

bfd4e1d

Merge branch 'main' into data_num_proc2

4c27d72

salmanmohammadi reviewed Sep 15, 2025

salmanmohammadi suggested changes Sep 15, 2025

ved1beta added 2 commits September 15, 2025 20:03

required changes

1d36b92

Merge branch 'data_num_proc2' of github.com:ved1beta/axolotl into dat…

eec6b8e

…a_num_proc2

salmanmohammadi reviewed Sep 15, 2025

Comment thread src/axolotl/utils/schemas/config.py

salmanmohammadi reviewed Sep 15, 2025

Comment thread src/axolotl/utils/schemas/config.py

elif get_default_process_count()

614c162

ved1beta requested a review from salmanmohammadi September 17, 2025 15:12

add:del data

88ec039

salmanmohammadi approved these changes Sep 17, 2025

NanoCode012 added the ready to merge label Sep 18, 2025

NanoCode012 requested changes Oct 10, 2025

NanoCode012 removed the ready to merge label Oct 10, 2025

ved1beta and others added 2 commits October 10, 2025 09:35

Update cicd/Dockerfile.jinja

fc9d80c

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

Update cicd/single_gpu.py

270e1d1

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

ved1beta requested a review from NanoCode012 October 12, 2025 16:17

NanoCode012 approved these changes Oct 13, 2025

NanoCode012 merged commit cd856b4 into axolotl-ai-cloud:main Oct 13, 2025
14 of 15 checks passed
