Data loader refactor by djsaunde · Pull Request #2707 · axolotl-ai-cloud/axolotl

djsaunde · 2025-05-22T18:43:16Z

Description

Data loading refactor, with an emphasis on sft.py, rl.py, and related modules.

Motivation and Context

The current state of data loading involves a lot of misdirection, undocumented code, missing typing, etc. This refactor aims to clean things up to improve readability and extensibility.

Also closes #2684 via filelock implementation (credit to @casper-hansen for reference code).
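For readers unfamiliar with the pattern, here is a minimal stdlib sketch of file-based locking around dataset preparation. It is not the code from this PR: `exclusive_lock`, `load_or_prepare`, the `dataset.lock` / `dataset.ready` file names, and the `prepare` callback are all hypothetical names chosen for illustration, and it assumes that atomic file creation on the local filesystem is an acceptable lock.

```python
import os
import tempfile
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def exclusive_lock(lock_path: Path, poll_interval: float = 0.05, timeout: float = 30.0):
    """Hold an exclusive lock by atomically creating `lock_path` (hypothetical helper)."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists,
            # so only one process at a time can create it.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll_interval)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


def load_or_prepare(cache_dir: Path, prepare) -> str:
    """Run `prepare` once; later callers read the cached result instead."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    ready = cache_dir / "dataset.ready"  # hypothetical marker/cache file
    with exclusive_lock(cache_dir / "dataset.lock"):
        if not ready.exists():
            ready.write_text(prepare())
        return ready.read_text()


# Usage: the second call finds the cached result and skips preparation.
calls = []


def fake_prepare() -> str:
    calls.append(1)
    return "tokenized"


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_prepare(Path(tmp), fake_prepare)
    second = load_or_prepare(Path(tmp), fake_prepare)
```

One caveat of this simplified approach: a process that crashes while holding the lock leaves the lock file behind, so a production implementation would need stale-lock handling or an OS-level lock.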

How has this been tested?

TODO

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

New Features
- Introduced a modular system for dataset wrapping and tokenization supporting diverse dataset types and prompt styles.
- Added generalized dataset preparation workflows for supervised fine-tuning and reinforcement learning with distributed synchronization and caching.
- Implemented a file-based locking mechanism to coordinate dataset loading and preparation across concurrent processes.
Improvements
- Enhanced dataset loading from local, cloud, remote, and URL sources with improved error handling and modular design.
- Streamlined deduplication and sequence filtering for better clarity and reliability.
- Improved code clarity, documentation, and consistent use of modern Python type hints.
- Simplified dataset preparation logic with better modularity and distributed coordination.
- Refined retry strategies and hashing utilities with clearer documentation.
- Made configuration validation safer to prevent attribute errors.
- Added explicit public APIs and standardized dataset cache paths.
- Updated dataset wrapper selection with extensible handlers for various dataset categories.
- Simplified and clarified dataset loading and preparation function signatures and internal logic.
- Removed redundant CLI argument dependencies in dataset loading calls across tests.
Bug Fixes
- Corrected subprocess termination check logic to properly detect process exit.
- Fixed potential attribute errors in configuration validation by using safe attribute access.
Tests
- Updated tests to reflect new dataset preparation and deduplication APIs.
- Added comprehensive tests for the file-based locking mechanism ensuring safe concurrent dataset loading.
- Adjusted tests to remove CLI argument dependencies in dataset loading calls.
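
The summary above mentions deduplication and hashing utilities without showing them. As a rough illustration only, hash-based deduplication of dataset rows can be sketched as follows; `row_digest` and `deduplicate` are hypothetical names, and the choice of SHA-256 over a canonical JSON encoding is an assumption, not the approach taken in this PR.

```python
import hashlib
import json


def row_digest(row: dict) -> str:
    """Hash a row's content; sort_keys makes the digest independent of key order."""
    canonical = json.dumps(row, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(rows: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct row, preserving input order."""
    seen: set[str] = set()
    unique = []
    for row in rows:
        digest = row_digest(row)
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique


rows = [
    {"instruction": "add", "output": "2"},
    {"output": "2", "instruction": "add"},  # same content, different key order
    {"instruction": "sub", "output": "0"},
]
unique_rows = deduplicate(rows)
```

Storing digests rather than whole rows keeps the `seen` set small, which matters when rows hold long text fields.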

djsaunde merged 36 commits into main from data-load-refactor


4 participants
