Skip to content

feat:add support dataset_num_processes#3129

Merged
NanoCode012 merged 13 commits into
axolotl-ai-cloud:mainfrom
ved1beta:data_num_proc2
Oct 13, 2025
Merged

feat:add support dataset_num_processes#3129
NanoCode012 merged 13 commits into
axolotl-ai-cloud:mainfrom
ved1beta:data_num_proc2

Conversation

@ved1beta
Copy link
Copy Markdown
Member

@ved1beta ved1beta commented Sep 4, 2025

Description

refactor for #1783 deprecate dataset_processes in preference of dataset_num_proc

Summary by CodeRabbit

  • New Features
    • Introduced dataset_num_proc config to control dataset preprocessing parallelism.
    • Added AXOLOTL_DATASET_NUM_PROC environment variable (takes precedence over other CPU-count envs).
  • Refactor
    • Replaced uses of dataset_processes with dataset_num_proc across data processing and training paths.
    • Deprecated dataset_processes; warning shown if used and error raised when both keys are set.
  • Documentation
    • Updated debugging guidance and examples to use dataset_num_proc.
  • Chores
    • CI images and single-GPU runner now export and pass AXOLOTL_DATASET_NUM_PROC by default.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Sep 4, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Renames the dataset parallelism configuration from dataset_processes to dataset_num_proc across code, tests, and docs. Adds AXOLOTL_DATASET_NUM_PROC env handling and precedence over AXOLOTL_DATASET_PROCESSES. Updates schema validation to deprecate dataset_processes with compatibility and conflict checks.

Changes

Cohort / File(s) Summary
Env and CI
cicd/Dockerfile.jinja, cicd/single_gpu.py
Add AXOLOTL_DATASET_NUM_PROC="8" to Dockerfile and subprocess env; keep existing AXOLOTL_DATASET_PROCESSES="8".
Schema and defaults
src/axolotl/utils/schemas/config.py
Deprecate dataset_processes; introduce validator default_dataset_num_proc with back-compat (map dataset_processes → dataset_num_proc, warn; error if both set); default via get_default_process_count.
Default process count/env precedence
src/axolotl/utils/datasets.py
Make get_default_process_count prefer AXOLOTL_DATASET_NUM_PROC over AXOLOTL_DATASET_PROCESSES, then RUNPOD_CPU_COUNT, then os.cpu_count().
Builders
src/axolotl/core/builders/base.py
Accept dataset_num_proc in training_args; remove aliasing from dataset_processes.
Data utilities
src/axolotl/utils/data/* (.../rl.py, .../shared.py, .../utils.py, .../wrappers.py)
Replace uses of cfg.dataset_processes with cfg.dataset_num_proc for map/filter/save/wrapper worker counts.
Trainer
src/axolotl/utils/trainer.py
Swap cfg.dataset_processes → cfg.dataset_num_proc in map/filter paths; update sampler/process count reference accordingly.
Docs
docs/debugging.qmd
Update guidance and examples: dataset_processes → dataset_num_proc (YAML and CLI flag).
Tests
tests/core/test_builders.py, tests/e2e/patched/test_activation_checkpointing.py, tests/e2e/test_llama_pretrain.py, tests/test_datasets.py, tests/test_packed_dataset.py
Rename test configs to use dataset_num_proc instead of dataset_processes; values unchanged.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

ready to merge

Suggested reviewers

  • winglian
  • NanoCode012
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/axolotl/utils/data/shared.py (1)

412-423: Enforce num_workers ≥ 1
Refactor the fallback so it can’t return 0 or None—otherwise Dataset.from_generator and _generate_from_iterable_dataset (which does idx % num_workers) will error.

Apply:

-    num_workers = cfg.dataset_num_proc or get_default_process_count()
+    num_workers_cfg = cfg.dataset_num_proc
+    num_workers = (
+        num_workers_cfg
+        if isinstance(num_workers_cfg, int) and num_workers_cfg > 0
+        else (get_default_process_count() or 1)
+    )

Add a unit test for cfg.dataset_num_proc = 0/None with no AXOLOTL_DATASET_NUM_PROC to confirm the fallback yields 1.

src/axolotl/utils/trainer.py (1)

461-472: Fix typo: use cfg.dataset_num_proc for MultipackBatchSampler
In src/axolotl/utils/trainer.py (around line 470), replace the undefined cfg.data_num_proc with cfg.dataset_num_proc:

-                num_processes=cfg.data_num_proc,
+                num_processes=cfg.dataset_num_proc,
🧹 Nitpick comments (7)
cicd/single_gpu.py (1)

67-70: Don’t override user-provided env; mirror old var only if absent.

Currently this forces both env vars to "8" even if the caller set different values. Respect existing env and mirror to the deprecated variable only when needed.

     sp_env = os.environ.copy()
-    sp_env["AXOLOTL_DATASET_NUM_PROC"] = "8"
-    sp_env["AXOLOTL_DATASET_PROCESSES"] = "8"
+    # Prefer caller-provided values; default to 8 if neither is set.
+    sp_env.setdefault("AXOLOTL_DATASET_NUM_PROC", os.environ.get("AXOLOTL_DATASET_NUM_PROC", "8"))
+    sp_env.setdefault(
+        "AXOLOTL_DATASET_PROCESSES",
+        os.environ.get("AXOLOTL_DATASET_PROCESSES", sp_env["AXOLOTL_DATASET_NUM_PROC"]),
+    )
cicd/Dockerfile.jinja (1)

12-13: Avoid baking a fixed default; make it configurable at build/run time.

Hardcoding "8" in the image can surprise downstream users. Expose as ARG or rely on runtime ENV, and keep the deprecated var only for compatibility.

-ENV AXOLOTL_DATASET_NUM_PROC="8"
-ENV AXOLOTL_DATASET_PROCESSES="8"
+ARG AXOLOTL_DATASET_NUM_PROC=8
+ENV AXOLOTL_DATASET_NUM_PROC="${AXOLOTL_DATASET_NUM_PROC}"
+# Back-compat: default to the same value unless overridden at runtime
+ARG AXOLOTL_DATASET_PROCESSES
+ENV AXOLOTL_DATASET_PROCESSES="${AXOLOTL_DATASET_PROCESSES:-${AXOLOTL_DATASET_NUM_PROC}}"
src/axolotl/utils/datasets.py (1)

7-13: Validate and sanitize process count; guard os.cpu_count() None.

Ensure positive integers and safe fallback to 1.

-    if axolotl_dataset_num_proc := os.environ.get("AXOLOTL_DATASET_NUM_PROC"):
-        return int(axolotl_dataset_num_proc)
+    if axolotl_dataset_num_proc := os.environ.get("AXOLOTL_DATASET_NUM_PROC"):
+        val = int(axolotl_dataset_num_proc)
+        return max(val, 1)
     if axolotl_dataset_processes := os.environ.get("AXOLOTL_DATASET_PROCESSES"):
-        return int(axolotl_dataset_processes)
+        val = int(axolotl_dataset_processes)
+        return max(val, 1)
     if runpod_cpu_count := os.environ.get("RUNPOD_CPU_COUNT"):
-        return int(runpod_cpu_count)
-    return os.cpu_count()
+        val = int(runpod_cpu_count)
+        return max(val, 1)
+    cpu = os.cpu_count() or 1
+    return max(cpu, 1)
src/axolotl/utils/data/utils.py (1)

224-228: Defensive default for num_proc (handle 0/None).

If config leaves num_proc unset or mis-set to 0, default to 1 to avoid runtime surprises.

-    if not isinstance(dataset, IterableDataset):
-        filter_map_kwargs["num_proc"] = cfg.dataset_num_proc
+    if not isinstance(dataset, IterableDataset):
+        # datasets.filter/map accept None, but enforce >=1 if provided
+        np_val = cfg.dataset_num_proc
+        if np_val is not None:
+            np_val = max(int(np_val), 1)
+        filter_map_kwargs["num_proc"] = np_val
         filter_map_kwargs["load_from_cache_file"] = not cfg.is_preprocess
docs/debugging.qmd (2)

32-33: Doc tip: mention env var override

Add that AXOLOTL_DATASET_NUM_PROC=1 also forces single-process preprocessing (it now takes precedence over AXOLOTL_DATASET_PROCESSES).

Apply near this bullet:

-    - Set `dataset_num_proc: 1` in your axolotl config or run the training command with `--dataset_num_proc=1`.
+    - Set `dataset_num_proc: 1` in your axolotl config, run with `--dataset_num_proc=1`, or set `AXOLOTL_DATASET_NUM_PROC=1`.

104-114: Mirror env var in VSCode args comment

Add a note about AXOLOTL_DATASET_NUM_PROC alongside the existing VSCode launch snippet to surface this quick-debug option:

 docs/debugging.qmd: lines 104–114
-                // with the debugging tips above.  Modify as needed.
+                // with the debugging tips above.  Modify as needed. Alternatively set AXOLOTL_DATASET_NUM_PROC=1.
tests/core/test_builders.py (1)

443-443: Remove redundant reassignment of dataset_num_proc.

All cfg fixtures copy base_cfg which already sets dataset_num_proc=4. This line is unnecessary.

-        cfg["dataset_num_proc"] = 4
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c6ae5c4 and 09acac4.

📒 Files selected for processing (16)
  • cicd/Dockerfile.jinja (1 hunks)
  • cicd/single_gpu.py (1 hunks)
  • docs/debugging.qmd (2 hunks)
  • src/axolotl/core/builders/base.py (1 hunks)
  • src/axolotl/utils/data/rl.py (2 hunks)
  • src/axolotl/utils/data/shared.py (1 hunks)
  • src/axolotl/utils/data/utils.py (1 hunks)
  • src/axolotl/utils/data/wrappers.py (1 hunks)
  • src/axolotl/utils/datasets.py (1 hunks)
  • src/axolotl/utils/schemas/config.py (2 hunks)
  • src/axolotl/utils/trainer.py (5 hunks)
  • tests/core/test_builders.py (2 hunks)
  • tests/e2e/patched/test_activation_checkpointing.py (1 hunks)
  • tests/e2e/test_llama_pretrain.py (1 hunks)
  • tests/test_datasets.py (7 hunks)
  • tests/test_packed_dataset.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (7)
src/axolotl/utils/data/rl.py (1)
tests/test_exact_deduplication.py (1)
  • cfg (201-216)
src/axolotl/utils/data/wrappers.py (1)
tests/test_exact_deduplication.py (1)
  • cfg (201-216)
src/axolotl/utils/data/shared.py (2)
tests/test_exact_deduplication.py (1)
  • cfg (201-216)
src/axolotl/utils/datasets.py (1)
  • get_default_process_count (6-13)
src/axolotl/utils/data/utils.py (1)
tests/test_exact_deduplication.py (1)
  • cfg (201-216)
tests/core/test_builders.py (1)
tests/test_exact_deduplication.py (1)
  • cfg (201-216)
src/axolotl/utils/trainer.py (2)
src/axolotl/integrations/base.py (2)
  • cfg (339-340)
  • cfg (343-344)
tests/test_exact_deduplication.py (1)
  • cfg (201-216)
src/axolotl/utils/schemas/config.py (1)
src/axolotl/utils/datasets.py (1)
  • get_default_process_count (6-13)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: PyTest from Source Dist (3.11, 2.8.0)
  • GitHub Check: PyTest (3.11, 2.6.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
  • GitHub Check: PyTest (3.11, 2.7.1)
  • GitHub Check: PyTest (3.11, 2.8.0)
  • GitHub Check: preview
🔇 Additional comments (10)
src/axolotl/utils/schemas/config.py (1)

222-232: Deprecation metadata reads well.

Marking dataset_processes as deprecated with clear description is good.

tests/test_packed_dataset.py (1)

58-58: LGTM on key rename.

Config now uses dataset_num_proc; matches the new canonical key.

tests/e2e/patched/test_activation_checkpointing.py (1)

72-72: LGTM on config rename in e2e.

Keeps behavior while aligning to dataset_num_proc.

Add a tiny param test to ensure:

  • dataset_processes emits a deprecation warning and still works.
  • dataset_num_processes (new alias) is accepted and normalized to dataset_num_proc.
tests/e2e/test_llama_pretrain.py (1)

32-32: LGTM: config key rename to dataset_num_proc

Switching to dataset_num_proc in this E2E test matches the new config and ensures single-process preprocessing for determinism.

src/axolotl/core/builders/base.py (1)

501-505: No action needed: dataset_num_proc is a valid training-argument field
It’s declared on the AxolotlTrainingMixins dataclass (so all AxolotlTrainingArguments variants accept it) and is exposed on config-driven classes—e.g. AxolotlORPOConfig lists dataset_num_proc as a parameter (huggingface.co)

src/axolotl/utils/data/rl.py (1)

116-120: LGTM: map uses dataset_num_proc

dataset.map(..., num_proc=cfg.dataset_num_proc, ...) aligns with the new key.

src/axolotl/utils/trainer.py (4)

279-281: LGTM: standardize num_proc for map/filter

Using cfg.dataset_num_proc in non-Iterable paths is consistent.


317-322: LGTM: group-by-length mapping uses dataset_num_proc

Consistent with the new key.


334-339: LGTM: PoSE train map uses dataset_num_proc

Matches the rest of the data pipeline changes.


343-348: LGTM: PoSE eval map uses dataset_num_proc

Consistent application.

Comment on lines 235 to 240
split_datasets[i] = split_datasets[i].filter(
drop_long,
num_proc=cfg.dataset_processes,
num_proc=cfg.dataset_num_proc,
load_from_cache_file=not cfg.is_preprocess,
desc="Dropping Long Sequences",
)
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

LGTM: filter uses dataset_num_proc

Consistent use during long-seq dropping.

Quick check that the schema constrains this to >=1 to avoid HF num_proc=0 issues:


🏁 Script executed:

#!/bin/bash
rg -nP 'dataset_num_proc.*Field\(|dataset_num_proc.*Annotated|model_validator.*default_dataset_num_proc' src/axolotl/utils/schemas -C2

Length of output: 136


🏁 Script executed:

#!/bin/bash
# Search entire repository for occurrences of dataset_num_proc and inspect definitions
rg -nH 'dataset_num_proc' -C3 .

Length of output: 17484


Enforce dataset_num_proc >= 1 in the schema: The pydantic schema for dataset_num_proc currently has no minimum constraint, so users could set it to 0 and trigger HF’s num_proc=0 errors. Add ge=1 to its Field in src/axolotl/utils/schemas/config.py or include a validator in default_dataset_num_proc to ensure it’s always at least 1.

🤖 Prompt for AI Agents
In src/axolotl/utils/data/rl.py around lines 235 to 240, the filter call passes
cfg.dataset_num_proc to HF dataset's num_proc which can be 0; update
src/axolotl/utils/schemas/config.py to enforce dataset_num_proc >= 1 by adding
ge=1 to its pydantic Field (or alternatively add a validator in
default_dataset_num_proc that coerces/raises and ensures a minimum of 1), so
users cannot set 0 and HF won’t receive num_proc=0.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ved1beta , edge case that I doubt someone would do, but should we add a minimum check in the validator?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines 82 to 85
dataset_kwargs: dict[str, Any] = {
"process_count": cfg.dataset_processes,
"process_count": cfg.dataset_num_proc,
"keep_in_memory": cfg.dataset_keep_in_memory is True,
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Harden process_count fallback to avoid None/0.

If cfg.dataset_num_proc is unset or 0, process_count may be invalid for downstream .map/.filter. Use a safe fallback.

Apply:

+from axolotl.utils.datasets import get_default_process_count
@@
-    dataset_kwargs: dict[str, Any] = {
-        "process_count": cfg.dataset_num_proc,
+    process_count = (
+        cfg.dataset_num_proc
+        if isinstance(cfg.dataset_num_proc, int) and cfg.dataset_num_proc > 0
+        else (get_default_process_count() or 1)
+    )
+    dataset_kwargs: dict[str, Any] = {
+        "process_count": process_count,
         "keep_in_memory": cfg.dataset_keep_in_memory is True,
     }

Please confirm schema normalization guarantees a positive int for dataset_num_proc when only deprecated keys (dataset_processes or dataset_num_processes) are set.


🏁 Script executed:

#!/bin/bash
set -e

echo "=== Occurrences of dataset_num_proc ==="
rg -n "dataset_num_proc" -C3

echo
echo "=== Occurrences of deprecated key dataset_processes ==="
rg -n "dataset_processes" -C3 || true

echo
echo "=== Occurrences of deprecated key dataset_num_processes ==="
rg -n "dataset_num_processes" -C3 || true

echo
echo "=== Scanning all config files for dataset normalization ==="
fd -t f -g "*config*.py" | while read -r file; do
  echo
  echo ">> File: $file"
  rg -nP "dataset_.*process" -C2 "$file" || true
done

Length of output: 24593


Harden process_count fallback in wrappers.py
Schema normalization only maps dataset_processesdataset_num_proc without enforcing > 0 (and dataset_num_processes isn’t recognized), so cfg.dataset_num_proc can still be None or 0. Apply:

+from axolotl.utils.datasets import get_default_process_count
@@ src/axolotl/utils/data/wrappers.py:82
-    dataset_kwargs: dict[str, Any] = {
-        "process_count": cfg.dataset_num_proc,
+    process_count = (
+        cfg.dataset_num_proc
+        if isinstance(cfg.dataset_num_proc, int) and cfg.dataset_num_proc > 0
+        else (get_default_process_count() or 1)
+    )
+    dataset_kwargs: dict[str, Any] = {
+        "process_count": process_count,
         "keep_in_memory": cfg.dataset_keep_in_memory is True,
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
dataset_kwargs: dict[str, Any] = {
"process_count": cfg.dataset_processes,
"process_count": cfg.dataset_num_proc,
"keep_in_memory": cfg.dataset_keep_in_memory is True,
}
from axolotl.utils.datasets import get_default_process_count
process_count = (
cfg.dataset_num_proc
if isinstance(cfg.dataset_num_proc, int) and cfg.dataset_num_proc > 0
else (get_default_process_count() or 1)
)
dataset_kwargs: dict[str, Any] = {
"process_count": process_count,
"keep_in_memory": cfg.dataset_keep_in_memory is True,
}
🤖 Prompt for AI Agents
In src/axolotl/utils/data/wrappers.py around lines 82 to 85, the dataset_kwargs
assignment uses cfg.dataset_num_proc directly which can be None or 0; change the
logic to ensure process_count is a positive integer fallbacking to 1 (e.g.,
compute a safe value by treating None or non-positive values as 1 and casting to
int), and keep the existing keep_in_memory boolean coercion unchanged.

Comment on lines +1307 to 1323
def default_dataset_num_proc(cls, data):
if data.get("dataset_processes") is not None:
if data.get("dataset_num_proc") is None:
data["dataset_num_proc"] = data["dataset_processes"]
LOG.warning(
"dataset_processes is deprecated and will be removed in a future version. "
"Please use dataset_num_proc instead."
)
else:
raise ValueError(
"Both dataset_processes and dataset_num_proc are set. "
"Using dataset_num_proc and ignoring dataset_processes."
)
if data.get("dataset_num_proc") is None:
data["dataset_num_proc"] = get_default_process_count()

return data
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Defaulting logic leaves dataset_num_proc unset when both fields are None; error message contradicts behavior.

  • If neither key is provided, dataset_num_proc stays None (no default applied).
  • When both are set, the ValueError message says it will ignore dataset_processes, but the code raises instead.

Proposed fix: always derive a sanitized dataset_num_proc; raise with a clear message only when both are set.

 @model_validator(mode="before")
 @classmethod
-    def default_dataset_num_proc(cls, data):
-        if data.get("dataset_processes") is not None:
-            if data.get("dataset_num_proc") is None:
-                data["dataset_num_proc"] = data["dataset_processes"]
-                LOG.warning(
-                    "dataset_processes is deprecated and will be removed in a future version. "
-                    "Please use dataset_num_proc instead."
-                )
-            else:
-                raise ValueError(
-                    "Both dataset_processes and dataset_num_proc are set. "
-                    "Using dataset_num_proc and ignoring dataset_processes."
-                )
-            if data.get("dataset_num_proc") is None:
-                data["dataset_num_proc"] = get_default_process_count()
-        return data
+    def default_dataset_num_proc(cls, data):
+        has_old = data.get("dataset_processes") is not None
+        has_new = data.get("dataset_num_proc") is not None
+        if has_old and has_new:
+            raise ValueError(
+                "Both dataset_processes and dataset_num_proc are set. Remove dataset_processes and use dataset_num_proc."
+            )
+        if not has_new:
+            if has_old:
+                data["dataset_num_proc"] = data["dataset_processes"]
+                LOG.warning(
+                    "dataset_processes is deprecated and will be removed in a future version. "
+                    "Please use dataset_num_proc instead."
+                )
+            else:
+                data["dataset_num_proc"] = get_default_process_count()
+        # sanitize
+        try:
+            data["dataset_num_proc"] = max(int(data["dataset_num_proc"]), 1)
+        except (TypeError, ValueError):
+            raise ValueError("dataset_num_proc must be a positive integer.")
+        return data
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def default_dataset_num_proc(cls, data):
if data.get("dataset_processes") is not None:
if data.get("dataset_num_proc") is None:
data["dataset_num_proc"] = data["dataset_processes"]
LOG.warning(
"dataset_processes is deprecated and will be removed in a future version. "
"Please use dataset_num_proc instead."
)
else:
raise ValueError(
"Both dataset_processes and dataset_num_proc are set. "
"Using dataset_num_proc and ignoring dataset_processes."
)
if data.get("dataset_num_proc") is None:
data["dataset_num_proc"] = get_default_process_count()
return data
@model_validator(mode="before")
@classmethod
def default_dataset_num_proc(cls, data):
has_old = data.get("dataset_processes") is not None
has_new = data.get("dataset_num_proc") is not None
if has_old and has_new:
raise ValueError(
"Both dataset_processes and dataset_num_proc are set. "
"Remove dataset_processes and use dataset_num_proc."
)
if not has_new:
if has_old:
data["dataset_num_proc"] = data["dataset_processes"]
LOG.warning(
"dataset_processes is deprecated and will be removed in a future version. "
"Please use dataset_num_proc instead."
)
else:
data["dataset_num_proc"] = get_default_process_count()
# sanitize
try:
data["dataset_num_proc"] = max(int(data["dataset_num_proc"]), 1)
except (TypeError, ValueError):
raise ValueError("dataset_num_proc must be a positive integer.")
return data
🤖 Prompt for AI Agents
In src/axolotl/utils/schemas/config.py around lines 1307-1323, the defaulting
logic can leave dataset_num_proc unset when both keys are None and the
ValueError contradicts intended behavior when both are set; update the function
so it first checks: if dataset_processes is set and dataset_num_proc is None,
assign dataset_num_proc = dataset_processes and log the deprecation warning; if
both dataset_processes and dataset_num_proc are set, raise a ValueError with a
clear message that dataset_num_proc will be used and dataset_processes is
ignored; after these checks, if dataset_num_proc is still None, set it to
get_default_process_count(); finally return data.

Comment thread tests/core/test_builders.py Outdated
"ddp_bucket_cap_mb": 25,
"ddp_broadcast_buffers": False,
"dataset_processes": 4,
"dataset_num_proc": 4,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Rename to dataset_num_proc looks correct.

Matches the new canonical key; no logic change.

Run this to find any remaining uses of the deprecated key or accidental dual-specification elsewhere in the repo:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Deprecated key occurrences (should be 0 outside compat paths):"
rg -nP '\bdataset_processes\b' -C2

echo
echo "Check env var usages:"
rg -nP 'AXOLOTL_DATASET_(NUM_PROC|PROCESSES)\b' -C2

echo
echo "Canonical keys in use:"
rg -nP '\bdataset_num_proc(?:esses)?\b' -C2

echo
echo "Potential spots setting both keys in same mapping (should be 0):"
rg -nPU '(?s)\{[^}]*\bdataset_processes\b[^}]*\bdataset_num_proc\b[^}]*\}' -n || true

Length of output: 17337


Remove deprecated dataset_processes references outside compatibility code

  • tests/test_exact_deduplication.py:213, tests/e2e/test_streaming.py:33 and devtools/dev_chat_template.yml:16 still use "dataset_processes":; update these to "dataset_num_proc":
  • Keep the deprecation handling in src/axolotl/utils/schemas/config.py (and the env-var fallbacks in src/axolotl/utils/datasets.py and CI scripts) for now

Comment thread tests/test_datasets.py
Comment on lines +144 to 145
"dataset_num_proc": 4,
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

LGTM on adopting dataset_num_proc across tests.

Renames are consistent and improve clarity.

There appear to be lingering uses of dataset_processes elsewhere (e.g., tests/test_exact_deduplication.py in provided snippet). Run to surface stragglers and verify alias support:

Also applies to: 183-184, 223-224, 255-256, 289-290, 373-374, 474-475


🏁 Script executed:

#!/bin/bash
# Find deprecated/new keys and envs
rg -nP -C2 '\bdataset_processes\b|\bdataset_num_processes\b|\bAXOLOTL_DATASET_PROCESSES\b' || true
# Sanity: confirm canonical key usage sites
rg -nP -C2 '\bdataset_num_proc\b' || true

Length of output: 16633


🏁 Script executed:

#!/bin/bash
sed -n '200,240p' tests/test_exact_deduplication.py
sed -n '20,80p' tests/e2e/test_streaming.py

Length of output: 3392


Replace all remaining dataset_processes occurrences with the new dataset_num_proc key
Alias support via default_dataset_num_proc and env-var fallbacks is in place. Update the following to use dataset_num_proc (and drop AXOLOTL_DATASET_PROCESSES):

  • tests/test_exact_deduplication.py (fixture at lines 211-214)
  • tests/e2e/test_streaming.py (cfg at lines 31-34)
  • devtools/dev_chat_template.yml (line 16)
  • cicd/single_gpu.py (remove AXOLOTL_DATASET_PROCESSES)
  • cicd/Dockerfile.jinja (remove AXOLOTL_DATASET_PROCESSES)
🤖 Prompt for AI Agents
In tests/test_datasets.py around lines 144-145, ensure the configuration uses
"dataset_num_proc" (already present) and then update all remaining references of
the old key "dataset_processes" to "dataset_num_proc" across the codebase:
change the fixture in tests/test_exact_deduplication.py (lines ~211-214) to use
dataset_num_proc, update the config in tests/e2e/test_streaming.py (lines
~31-34) to dataset_num_proc, and remove any use of the AXOLOTL_DATASET_PROCESSES
env var/key from devtools/dev_chat_template.yml (line 16), cicd/single_gpu.py,
and cicd/Dockerfile.jinja; keep alias/fallback logic via
default_dataset_num_proc and env var fallbacks if present but do not reintroduce
the AXOLOTL_DATASET_PROCESSES key.

@codecov
Copy link
Copy Markdown

codecov Bot commented Sep 4, 2025

Codecov Report

❌ Patch coverage is 50.00000% with 9 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/axolotl/utils/schemas/config.py 50.00% 5 Missing ⚠️
src/axolotl/utils/trainer.py 25.00% 3 Missing ⚠️
src/axolotl/utils/datasets.py 50.00% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Collaborator

@NanoCode012 NanoCode012 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment thread src/axolotl/utils/schemas/config.py Outdated
"Please use dataset_num_proc instead."
)
else:
raise ValueError(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a ValueError here, just log a warning.

Comment thread src/axolotl/utils/trainer.py Outdated
sequential=cfg.sample_packing_sequentially,
drop_last=True,
num_processes=cfg.dataset_processes,
num_processes=cfg.data_num_proc,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo?

Comment thread tests/core/test_builders.py Outdated
"ddp_bucket_cap_mb": 25,
"ddp_broadcast_buffers": False,
"dataset_processes": 4,
"dataset_num_proc": 4,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should leave at least one of the tests with the old value to ensure the migration happens correctly.

@salmanmohammadi
Copy link
Copy Markdown
Contributor

Sorry to request changes, just a few comments and didn't want to merge until they were addressed.

Comment thread src/axolotl/utils/schemas/config.py
Comment thread src/axolotl/utils/schemas/config.py Outdated
deprecated="Use `dataset_num_proc` instead. This parameter will be removed in a future version.",
json_schema_extra={
"description": (
"DEPRECATED: Use `dataset_num_proc` instead. "
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this when we're using deprecated above.

Comment thread src/axolotl/utils/schemas/config.py Outdated
data["dataset_processes"] = get_default_process_count()
def default_dataset_num_proc(cls, data):
if data.get("dataset_processes") is not None:
if data.get("dataset_num_proc") is not None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you check this again? The conditions for this statement are reversed. You're checking for the existence of dataset_num_proc but then overriding it with dataset_processes.

Comment thread src/axolotl/utils/schemas/config.py
Comment thread src/axolotl/utils/schemas/config.py
Copy link
Copy Markdown
Contributor

@salmanmohammadi salmanmohammadi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after CI is green.

Copy link
Copy Markdown
Collaborator

@NanoCode012 NanoCode012 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just did a re-check and noticed this.

Comment thread cicd/Dockerfile.jinja Outdated
Comment thread cicd/single_gpu.py Outdated
def get_default_process_count():
if axolotl_dataset_num_proc := os.environ.get("AXOLOTL_DATASET_NUM_PROC"):
return int(axolotl_dataset_num_proc)
if axolotl_dataset_processes := os.environ.get("AXOLOTL_DATASET_PROCESSES"):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this. Should this include a LOG warning if this is set, in the schema.config.py?

ved1beta and others added 2 commits October 10, 2025 09:35
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
@ved1beta ved1beta requested a review from NanoCode012 October 12, 2025 16:17
@NanoCode012 NanoCode012 merged commit cd856b4 into axolotl-ai-cloud:main Oct 13, 2025
14 of 15 checks passed
flaviusburca pushed a commit to invergent-ai/axolotl that referenced this pull request Oct 18, 2025
* feat:add support dataset_num_processes

* chore

* required changes

* requested chnages

* required chnages

* required changes

* required changes

* elif get_default_process_count()

* add:del data

* Update cicd/Dockerfile.jinja

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

* Update cicd/single_gpu.py

Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>

---------

Co-authored-by: salman <salman.mohammadi@outlook.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
(cherry picked from commit cd856b4)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants