feat:add support dataset_num_processes#3129
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the 📝 WalkthroughWalkthroughRenames the dataset parallelism configuration from dataset_processes to dataset_num_proc across code, tests, and docs. Adds AXOLOTL_DATASET_NUM_PROC env handling and precedence over AXOLOTL_DATASET_PROCESSES. Updates schema validation to deprecate dataset_processes with compatibility and conflict checks. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/axolotl/utils/data/shared.py (1)
412-423: Enforce num_workers ≥ 1
Refactor the fallback so it can’t return 0 or None—otherwiseDataset.from_generatorand_generate_from_iterable_dataset(which doesidx % num_workers) will error.Apply:
- num_workers = cfg.dataset_num_proc or get_default_process_count() + num_workers_cfg = cfg.dataset_num_proc + num_workers = ( + num_workers_cfg + if isinstance(num_workers_cfg, int) and num_workers_cfg > 0 + else (get_default_process_count() or 1) + )Add a unit test for
cfg.dataset_num_proc = 0/Nonewith noAXOLOTL_DATASET_NUM_PROCto confirm the fallback yields 1.src/axolotl/utils/trainer.py (1)
461-472: Fix typo: use cfg.dataset_num_proc for MultipackBatchSampler
In src/axolotl/utils/trainer.py (around line 470), replace the undefinedcfg.data_num_procwithcfg.dataset_num_proc:- num_processes=cfg.data_num_proc, + num_processes=cfg.dataset_num_proc,
🧹 Nitpick comments (7)
cicd/single_gpu.py (1)
67-70: Don’t override user-provided env; mirror old var only if absent.Currently this forces both env vars to "8" even if the caller set different values. Respect existing env and mirror to the deprecated variable only when needed.
sp_env = os.environ.copy() - sp_env["AXOLOTL_DATASET_NUM_PROC"] = "8" - sp_env["AXOLOTL_DATASET_PROCESSES"] = "8" + # Prefer caller-provided values; default to 8 if neither is set. + sp_env.setdefault("AXOLOTL_DATASET_NUM_PROC", os.environ.get("AXOLOTL_DATASET_NUM_PROC", "8")) + sp_env.setdefault( + "AXOLOTL_DATASET_PROCESSES", + os.environ.get("AXOLOTL_DATASET_PROCESSES", sp_env["AXOLOTL_DATASET_NUM_PROC"]), + )cicd/Dockerfile.jinja (1)
12-13: Avoid baking a fixed default; make it configurable at build/run time.Hardcoding "8" in the image can surprise downstream users. Expose as ARG or rely on runtime ENV, and keep the deprecated var only for compatibility.
-ENV AXOLOTL_DATASET_NUM_PROC="8" -ENV AXOLOTL_DATASET_PROCESSES="8" +ARG AXOLOTL_DATASET_NUM_PROC=8 +ENV AXOLOTL_DATASET_NUM_PROC="${AXOLOTL_DATASET_NUM_PROC}" +# Back-compat: default to the same value unless overridden at runtime +ARG AXOLOTL_DATASET_PROCESSES +ENV AXOLOTL_DATASET_PROCESSES="${AXOLOTL_DATASET_PROCESSES:-${AXOLOTL_DATASET_NUM_PROC}}"src/axolotl/utils/datasets.py (1)
7-13: Validate and sanitize process count; guard os.cpu_count() None.Ensure positive integers and safe fallback to 1.
- if axolotl_dataset_num_proc := os.environ.get("AXOLOTL_DATASET_NUM_PROC"): - return int(axolotl_dataset_num_proc) + if axolotl_dataset_num_proc := os.environ.get("AXOLOTL_DATASET_NUM_PROC"): + val = int(axolotl_dataset_num_proc) + return max(val, 1) if axolotl_dataset_processes := os.environ.get("AXOLOTL_DATASET_PROCESSES"): - return int(axolotl_dataset_processes) + val = int(axolotl_dataset_processes) + return max(val, 1) if runpod_cpu_count := os.environ.get("RUNPOD_CPU_COUNT"): - return int(runpod_cpu_count) - return os.cpu_count() + val = int(runpod_cpu_count) + return max(val, 1) + cpu = os.cpu_count() or 1 + return max(cpu, 1)src/axolotl/utils/data/utils.py (1)
224-228: Defensive default for num_proc (handle 0/None).If config leaves num_proc unset or mis-set to 0, default to 1 to avoid runtime surprises.
- if not isinstance(dataset, IterableDataset): - filter_map_kwargs["num_proc"] = cfg.dataset_num_proc + if not isinstance(dataset, IterableDataset): + # datasets.filter/map accept None, but enforce >=1 if provided + np_val = cfg.dataset_num_proc + if np_val is not None: + np_val = max(int(np_val), 1) + filter_map_kwargs["num_proc"] = np_val filter_map_kwargs["load_from_cache_file"] = not cfg.is_preprocessdocs/debugging.qmd (2)
32-33: Doc tip: mention env var overrideAdd that
AXOLOTL_DATASET_NUM_PROC=1also forces single-process preprocessing (it now takes precedence overAXOLOTL_DATASET_PROCESSES).Apply near this bullet:
- - Set `dataset_num_proc: 1` in your axolotl config or run the training command with `--dataset_num_proc=1`. + - Set `dataset_num_proc: 1` in your axolotl config, run with `--dataset_num_proc=1`, or set `AXOLOTL_DATASET_NUM_PROC=1`.
104-114: Mirror env var in VSCode args commentAdd a note about AXOLOTL_DATASET_NUM_PROC alongside the existing VSCode launch snippet to surface this quick-debug option:
docs/debugging.qmd: lines 104–114 - // with the debugging tips above. Modify as needed. + // with the debugging tips above. Modify as needed. Alternatively set AXOLOTL_DATASET_NUM_PROC=1.tests/core/test_builders.py (1)
443-443: Remove redundant reassignment of dataset_num_proc.All cfg fixtures copy base_cfg which already sets dataset_num_proc=4. This line is unnecessary.
- cfg["dataset_num_proc"] = 4
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (16)
cicd/Dockerfile.jinja(1 hunks)cicd/single_gpu.py(1 hunks)docs/debugging.qmd(2 hunks)src/axolotl/core/builders/base.py(1 hunks)src/axolotl/utils/data/rl.py(2 hunks)src/axolotl/utils/data/shared.py(1 hunks)src/axolotl/utils/data/utils.py(1 hunks)src/axolotl/utils/data/wrappers.py(1 hunks)src/axolotl/utils/datasets.py(1 hunks)src/axolotl/utils/schemas/config.py(2 hunks)src/axolotl/utils/trainer.py(5 hunks)tests/core/test_builders.py(2 hunks)tests/e2e/patched/test_activation_checkpointing.py(1 hunks)tests/e2e/test_llama_pretrain.py(1 hunks)tests/test_datasets.py(7 hunks)tests/test_packed_dataset.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (7)
src/axolotl/utils/data/rl.py (1)
tests/test_exact_deduplication.py (1)
cfg(201-216)
src/axolotl/utils/data/wrappers.py (1)
tests/test_exact_deduplication.py (1)
cfg(201-216)
src/axolotl/utils/data/shared.py (2)
tests/test_exact_deduplication.py (1)
cfg(201-216)src/axolotl/utils/datasets.py (1)
get_default_process_count(6-13)
src/axolotl/utils/data/utils.py (1)
tests/test_exact_deduplication.py (1)
cfg(201-216)
tests/core/test_builders.py (1)
tests/test_exact_deduplication.py (1)
cfg(201-216)
src/axolotl/utils/trainer.py (2)
src/axolotl/integrations/base.py (2)
cfg(339-340)cfg(343-344)tests/test_exact_deduplication.py (1)
cfg(201-216)
src/axolotl/utils/schemas/config.py (1)
src/axolotl/utils/datasets.py (1)
get_default_process_count(6-13)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: PyTest from Source Dist (3.11, 2.8.0)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
- GitHub Check: PyTest (3.11, 2.7.1)
- GitHub Check: PyTest (3.11, 2.8.0)
- GitHub Check: preview
🔇 Additional comments (10)
src/axolotl/utils/schemas/config.py (1)
222-232: Deprecation metadata reads well.Marking dataset_processes as deprecated with clear description is good.
tests/test_packed_dataset.py (1)
58-58: LGTM on key rename.Config now uses dataset_num_proc; matches the new canonical key.
tests/e2e/patched/test_activation_checkpointing.py (1)
72-72: LGTM on config rename in e2e.Keeps behavior while aligning to dataset_num_proc.
Add a tiny param test to ensure:
- dataset_processes emits a deprecation warning and still works.
- dataset_num_processes (new alias) is accepted and normalized to dataset_num_proc.
tests/e2e/test_llama_pretrain.py (1)
32-32: LGTM: config key rename to dataset_num_procSwitching to
dataset_num_procin this E2E test matches the new config and ensures single-process preprocessing for determinism.src/axolotl/core/builders/base.py (1)
501-505: No action needed:dataset_num_procis a valid training-argument field
It’s declared on theAxolotlTrainingMixinsdataclass (so allAxolotlTrainingArgumentsvariants accept it) and is exposed on config-driven classes—e.g.AxolotlORPOConfiglistsdataset_num_procas a parameter (huggingface.co)src/axolotl/utils/data/rl.py (1)
116-120: LGTM: map uses dataset_num_proc
dataset.map(..., num_proc=cfg.dataset_num_proc, ...)aligns with the new key.src/axolotl/utils/trainer.py (4)
279-281: LGTM: standardize num_proc for map/filterUsing
cfg.dataset_num_procin non-Iterable paths is consistent.
317-322: LGTM: group-by-length mapping uses dataset_num_procConsistent with the new key.
334-339: LGTM: PoSE train map uses dataset_num_procMatches the rest of the data pipeline changes.
343-348: LGTM: PoSE eval map uses dataset_num_procConsistent application.
| split_datasets[i] = split_datasets[i].filter( | ||
| drop_long, | ||
| num_proc=cfg.dataset_processes, | ||
| num_proc=cfg.dataset_num_proc, | ||
| load_from_cache_file=not cfg.is_preprocess, | ||
| desc="Dropping Long Sequences", | ||
| ) |
There was a problem hiding this comment.
💡 Verification agent
🧩 Analysis chain
LGTM: filter uses dataset_num_proc
Consistent use during long-seq dropping.
Quick check that the schema constrains this to >=1 to avoid HF num_proc=0 issues:
🏁 Script executed:
#!/bin/bash
rg -nP 'dataset_num_proc.*Field\(|dataset_num_proc.*Annotated|model_validator.*default_dataset_num_proc' src/axolotl/utils/schemas -C2Length of output: 136
🏁 Script executed:
#!/bin/bash
# Search entire repository for occurrences of dataset_num_proc and inspect definitions
rg -nH 'dataset_num_proc' -C3 .Length of output: 17484
Enforce dataset_num_proc >= 1 in the schema: The pydantic schema for dataset_num_proc currently has no minimum constraint, so users could set it to 0 and trigger HF’s num_proc=0 errors. Add ge=1 to its Field in src/axolotl/utils/schemas/config.py or include a validator in default_dataset_num_proc to ensure it’s always at least 1.
🤖 Prompt for AI Agents
In src/axolotl/utils/data/rl.py around lines 235 to 240, the filter call passes
cfg.dataset_num_proc to HF dataset's num_proc which can be 0; update
src/axolotl/utils/schemas/config.py to enforce dataset_num_proc >= 1 by adding
ge=1 to its pydantic Field (or alternatively add a validator in
default_dataset_num_proc that coerces/raises and ensures a minimum of 1), so
users cannot set 0 and HF won’t receive num_proc=0.
There was a problem hiding this comment.
@ved1beta , edge case that I doubt someone would do, but should we add a minimum check in the validator?
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
| dataset_kwargs: dict[str, Any] = { | ||
| "process_count": cfg.dataset_processes, | ||
| "process_count": cfg.dataset_num_proc, | ||
| "keep_in_memory": cfg.dataset_keep_in_memory is True, | ||
| } |
There was a problem hiding this comment.
💡 Verification agent
🧩 Analysis chain
Harden process_count fallback to avoid None/0.
If cfg.dataset_num_proc is unset or 0, process_count may be invalid for downstream .map/.filter. Use a safe fallback.
Apply:
+from axolotl.utils.datasets import get_default_process_count
@@
- dataset_kwargs: dict[str, Any] = {
- "process_count": cfg.dataset_num_proc,
+ process_count = (
+ cfg.dataset_num_proc
+ if isinstance(cfg.dataset_num_proc, int) and cfg.dataset_num_proc > 0
+ else (get_default_process_count() or 1)
+ )
+ dataset_kwargs: dict[str, Any] = {
+ "process_count": process_count,
"keep_in_memory": cfg.dataset_keep_in_memory is True,
}Please confirm schema normalization guarantees a positive int for dataset_num_proc when only deprecated keys (dataset_processes or dataset_num_processes) are set.
🏁 Script executed:
#!/bin/bash
set -e
echo "=== Occurrences of dataset_num_proc ==="
rg -n "dataset_num_proc" -C3
echo
echo "=== Occurrences of deprecated key dataset_processes ==="
rg -n "dataset_processes" -C3 || true
echo
echo "=== Occurrences of deprecated key dataset_num_processes ==="
rg -n "dataset_num_processes" -C3 || true
echo
echo "=== Scanning all config files for dataset normalization ==="
fd -t f -g "*config*.py" | while read -r file; do
echo
echo ">> File: $file"
rg -nP "dataset_.*process" -C2 "$file" || true
doneLength of output: 24593
Harden process_count fallback in wrappers.py
Schema normalization only maps dataset_processes → dataset_num_proc without enforcing > 0 (and dataset_num_processes isn’t recognized), so cfg.dataset_num_proc can still be None or 0. Apply:
+from axolotl.utils.datasets import get_default_process_count
@@ src/axolotl/utils/data/wrappers.py:82
- dataset_kwargs: dict[str, Any] = {
- "process_count": cfg.dataset_num_proc,
+ process_count = (
+ cfg.dataset_num_proc
+ if isinstance(cfg.dataset_num_proc, int) and cfg.dataset_num_proc > 0
+ else (get_default_process_count() or 1)
+ )
+ dataset_kwargs: dict[str, Any] = {
+ "process_count": process_count,
"keep_in_memory": cfg.dataset_keep_in_memory is True,
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| dataset_kwargs: dict[str, Any] = { | |
| "process_count": cfg.dataset_processes, | |
| "process_count": cfg.dataset_num_proc, | |
| "keep_in_memory": cfg.dataset_keep_in_memory is True, | |
| } | |
| from axolotl.utils.datasets import get_default_process_count | |
| process_count = ( | |
| cfg.dataset_num_proc | |
| if isinstance(cfg.dataset_num_proc, int) and cfg.dataset_num_proc > 0 | |
| else (get_default_process_count() or 1) | |
| ) | |
| dataset_kwargs: dict[str, Any] = { | |
| "process_count": process_count, | |
| "keep_in_memory": cfg.dataset_keep_in_memory is True, | |
| } |
🤖 Prompt for AI Agents
In src/axolotl/utils/data/wrappers.py around lines 82 to 85, the dataset_kwargs
assignment uses cfg.dataset_num_proc directly which can be None or 0; change the
logic to ensure process_count is a positive integer fallbacking to 1 (e.g.,
compute a safe value by treating None or non-positive values as 1 and casting to
int), and keep the existing keep_in_memory boolean coercion unchanged.
| def default_dataset_num_proc(cls, data): | ||
| if data.get("dataset_processes") is not None: | ||
| if data.get("dataset_num_proc") is None: | ||
| data["dataset_num_proc"] = data["dataset_processes"] | ||
| LOG.warning( | ||
| "dataset_processes is deprecated and will be removed in a future version. " | ||
| "Please use dataset_num_proc instead." | ||
| ) | ||
| else: | ||
| raise ValueError( | ||
| "Both dataset_processes and dataset_num_proc are set. " | ||
| "Using dataset_num_proc and ignoring dataset_processes." | ||
| ) | ||
| if data.get("dataset_num_proc") is None: | ||
| data["dataset_num_proc"] = get_default_process_count() | ||
|
|
||
| return data |
There was a problem hiding this comment.
Defaulting logic leaves dataset_num_proc unset when both fields are None; error message contradicts behavior.
- If neither key is provided, dataset_num_proc stays None (no default applied).
- When both are set, the ValueError message says it will ignore dataset_processes, but the code raises instead.
Proposed fix: always derive a sanitized dataset_num_proc; raise with a clear message only when both are set.
@model_validator(mode="before")
@classmethod
- def default_dataset_num_proc(cls, data):
- if data.get("dataset_processes") is not None:
- if data.get("dataset_num_proc") is None:
- data["dataset_num_proc"] = data["dataset_processes"]
- LOG.warning(
- "dataset_processes is deprecated and will be removed in a future version. "
- "Please use dataset_num_proc instead."
- )
- else:
- raise ValueError(
- "Both dataset_processes and dataset_num_proc are set. "
- "Using dataset_num_proc and ignoring dataset_processes."
- )
- if data.get("dataset_num_proc") is None:
- data["dataset_num_proc"] = get_default_process_count()
- return data
+ def default_dataset_num_proc(cls, data):
+ has_old = data.get("dataset_processes") is not None
+ has_new = data.get("dataset_num_proc") is not None
+ if has_old and has_new:
+ raise ValueError(
+ "Both dataset_processes and dataset_num_proc are set. Remove dataset_processes and use dataset_num_proc."
+ )
+ if not has_new:
+ if has_old:
+ data["dataset_num_proc"] = data["dataset_processes"]
+ LOG.warning(
+ "dataset_processes is deprecated and will be removed in a future version. "
+ "Please use dataset_num_proc instead."
+ )
+ else:
+ data["dataset_num_proc"] = get_default_process_count()
+ # sanitize
+ try:
+ data["dataset_num_proc"] = max(int(data["dataset_num_proc"]), 1)
+ except (TypeError, ValueError):
+ raise ValueError("dataset_num_proc must be a positive integer.")
+ return data📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def default_dataset_num_proc(cls, data): | |
| if data.get("dataset_processes") is not None: | |
| if data.get("dataset_num_proc") is None: | |
| data["dataset_num_proc"] = data["dataset_processes"] | |
| LOG.warning( | |
| "dataset_processes is deprecated and will be removed in a future version. " | |
| "Please use dataset_num_proc instead." | |
| ) | |
| else: | |
| raise ValueError( | |
| "Both dataset_processes and dataset_num_proc are set. " | |
| "Using dataset_num_proc and ignoring dataset_processes." | |
| ) | |
| if data.get("dataset_num_proc") is None: | |
| data["dataset_num_proc"] = get_default_process_count() | |
| return data | |
| @model_validator(mode="before") | |
| @classmethod | |
| def default_dataset_num_proc(cls, data): | |
| has_old = data.get("dataset_processes") is not None | |
| has_new = data.get("dataset_num_proc") is not None | |
| if has_old and has_new: | |
| raise ValueError( | |
| "Both dataset_processes and dataset_num_proc are set. " | |
| "Remove dataset_processes and use dataset_num_proc." | |
| ) | |
| if not has_new: | |
| if has_old: | |
| data["dataset_num_proc"] = data["dataset_processes"] | |
| LOG.warning( | |
| "dataset_processes is deprecated and will be removed in a future version. " | |
| "Please use dataset_num_proc instead." | |
| ) | |
| else: | |
| data["dataset_num_proc"] = get_default_process_count() | |
| # sanitize | |
| try: | |
| data["dataset_num_proc"] = max(int(data["dataset_num_proc"]), 1) | |
| except (TypeError, ValueError): | |
| raise ValueError("dataset_num_proc must be a positive integer.") | |
| return data |
🤖 Prompt for AI Agents
In src/axolotl/utils/schemas/config.py around lines 1307-1323, the defaulting
logic can leave dataset_num_proc unset when both keys are None and the
ValueError contradicts intended behavior when both are set; update the function
so it first checks: if dataset_processes is set and dataset_num_proc is None,
assign dataset_num_proc = dataset_processes and log the deprecation warning; if
both dataset_processes and dataset_num_proc are set, raise a ValueError with a
clear message that dataset_num_proc will be used and dataset_processes is
ignored; after these checks, if dataset_num_proc is still None, set it to
get_default_process_count(); finally return data.
| "ddp_bucket_cap_mb": 25, | ||
| "ddp_broadcast_buffers": False, | ||
| "dataset_processes": 4, | ||
| "dataset_num_proc": 4, |
There was a problem hiding this comment.
💡 Verification agent
🧩 Analysis chain
Rename to dataset_num_proc looks correct.
Matches the new canonical key; no logic change.
Run this to find any remaining uses of the deprecated key or accidental dual-specification elsewhere in the repo:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Deprecated key occurrences (should be 0 outside compat paths):"
rg -nP '\bdataset_processes\b' -C2
echo
echo "Check env var usages:"
rg -nP 'AXOLOTL_DATASET_(NUM_PROC|PROCESSES)\b' -C2
echo
echo "Canonical keys in use:"
rg -nP '\bdataset_num_proc(?:esses)?\b' -C2
echo
echo "Potential spots setting both keys in same mapping (should be 0):"
rg -nPU '(?s)\{[^}]*\bdataset_processes\b[^}]*\bdataset_num_proc\b[^}]*\}' -n || trueLength of output: 17337
Remove deprecated dataset_processes references outside compatibility code
- tests/test_exact_deduplication.py:213, tests/e2e/test_streaming.py:33 and devtools/dev_chat_template.yml:16 still use
"dataset_processes":; update these to"dataset_num_proc": - Keep the deprecation handling in
src/axolotl/utils/schemas/config.py(and the env-var fallbacks insrc/axolotl/utils/datasets.pyand CI scripts) for now
| "dataset_num_proc": 4, | ||
| } |
There was a problem hiding this comment.
💡 Verification agent
🧩 Analysis chain
LGTM on adopting dataset_num_proc across tests.
Renames are consistent and improve clarity.
There appear to be lingering uses of dataset_processes elsewhere (e.g., tests/test_exact_deduplication.py in provided snippet). Run to surface stragglers and verify alias support:
Also applies to: 183-184, 223-224, 255-256, 289-290, 373-374, 474-475
🏁 Script executed:
#!/bin/bash
# Find deprecated/new keys and envs
rg -nP -C2 '\bdataset_processes\b|\bdataset_num_processes\b|\bAXOLOTL_DATASET_PROCESSES\b' || true
# Sanity: confirm canonical key usage sites
rg -nP -C2 '\bdataset_num_proc\b' || trueLength of output: 16633
🏁 Script executed:
#!/bin/bash
sed -n '200,240p' tests/test_exact_deduplication.py
sed -n '20,80p' tests/e2e/test_streaming.pyLength of output: 3392
Replace all remaining dataset_processes occurrences with the new dataset_num_proc key
Alias support via default_dataset_num_proc and env-var fallbacks is in place. Update the following to use dataset_num_proc (and drop AXOLOTL_DATASET_PROCESSES):
- tests/test_exact_deduplication.py (fixture at lines 211-214)
- tests/e2e/test_streaming.py (cfg at lines 31-34)
- devtools/dev_chat_template.yml (line 16)
- cicd/single_gpu.py (remove
AXOLOTL_DATASET_PROCESSES) - cicd/Dockerfile.jinja (remove
AXOLOTL_DATASET_PROCESSES)
🤖 Prompt for AI Agents
In tests/test_datasets.py around lines 144-145, ensure the configuration uses
"dataset_num_proc" (already present) and then update all remaining references of
the old key "dataset_processes" to "dataset_num_proc" across the codebase:
change the fixture in tests/test_exact_deduplication.py (lines ~211-214) to use
dataset_num_proc, update the config in tests/e2e/test_streaming.py (lines
~31-34) to dataset_num_proc, and remove any use of the AXOLOTL_DATASET_PROCESSES
env var/key from devtools/dev_chat_template.yml (line 16), cicd/single_gpu.py,
and cicd/Dockerfile.jinja; keep alias/fallback logic via
default_dataset_num_proc and env var fallbacks if present but do not reintroduce
the AXOLOTL_DATASET_PROCESSES key.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
| "Please use dataset_num_proc instead." | ||
| ) | ||
| else: | ||
| raise ValueError( |
There was a problem hiding this comment.
I don't think we need a ValueError here, just log a warning.
| sequential=cfg.sample_packing_sequentially, | ||
| drop_last=True, | ||
| num_processes=cfg.dataset_processes, | ||
| num_processes=cfg.data_num_proc, |
| "ddp_bucket_cap_mb": 25, | ||
| "ddp_broadcast_buffers": False, | ||
| "dataset_processes": 4, | ||
| "dataset_num_proc": 4, |
There was a problem hiding this comment.
We should leave at least one of the tests with the old value to ensure the migration happens correctly.
|
Sorry to request changes, just a few comments and didn't want to merge until they were addressed. |
| deprecated="Use `dataset_num_proc` instead. This parameter will be removed in a future version.", | ||
| json_schema_extra={ | ||
| "description": ( | ||
| "DEPRECATED: Use `dataset_num_proc` instead. " |
There was a problem hiding this comment.
We don't need this when we're using deprecated above.
| data["dataset_processes"] = get_default_process_count() | ||
| def default_dataset_num_proc(cls, data): | ||
| if data.get("dataset_processes") is not None: | ||
| if data.get("dataset_num_proc") is not None: |
There was a problem hiding this comment.
Could you check this again? The conditions for this statement are reversed. You're checking for the existence of dataset_num_proc but then overriding it with dataset_processes.
salmanmohammadi
left a comment
There was a problem hiding this comment.
LGTM after CI is green.
NanoCode012
left a comment
There was a problem hiding this comment.
Just did a re-check and noticed this.
| def get_default_process_count(): | ||
| if axolotl_dataset_num_proc := os.environ.get("AXOLOTL_DATASET_NUM_PROC"): | ||
| return int(axolotl_dataset_num_proc) | ||
| if axolotl_dataset_processes := os.environ.get("AXOLOTL_DATASET_PROCESSES"): |
There was a problem hiding this comment.
I missed this. Should this include a LOG warning if this is set, in the schema.config.py?
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* feat:add support dataset_num_processes * chore * required changes * requested chnages * required chnages * required changes * required changes * elif get_default_process_count() * add:del data * Update cicd/Dockerfile.jinja Co-authored-by: NanoCode012 <kevinvong@rocketmail.com> * Update cicd/single_gpu.py Co-authored-by: NanoCode012 <kevinvong@rocketmail.com> --------- Co-authored-by: salman <salman.mohammadi@outlook.com> Co-authored-by: NanoCode012 <kevinvong@rocketmail.com> (cherry picked from commit cd856b4)
Description
refactor for #1783 deprecate dataset_processes in preference of dataset_num_proc
Summary by CodeRabbit