
Data loader refactor #2707

Merged
djsaunde merged 36 commits into main from data-load-refactor on Jun 10, 2025



djsaunde commented May 22, 2025

Collaborator

Description

Data loading refactor, with an emphasis on sft.py, rl.py, and related modules.

Motivation and Context

The current data loading code involves a lot of indirection, undocumented code, and missing type hints. This refactor aims to clean things up to improve readability and extensibility.

Also closes #2684 via a filelock implementation (credit to @casper-hansen for the reference code).
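For readers unfamiliar with the pattern, here is a minimal sketch of file-lock coordination for dataset preparation. It is not the code from this PR: `exclusive_lock`, `load_or_prepare`, the `prepare.lock` and `dataset.txt` file names, and the `prepare` callback are all hypothetical, and the sketch uses only the standard library rather than whatever locking package the PR actually relies on. The idea is that the first process to take the lock does the expensive work, and later processes find the cached result instead of repeating it.

```python
import os
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def exclusive_lock(lock_path: Path, timeout: float = 60.0, poll: float = 0.05):
    """Hold an exclusive lock by atomically creating `lock_path` (hypothetical helper)."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield
    finally:
        # Release the lock so that waiting processes can proceed.
        os.close(fd)
        os.unlink(lock_path)


def load_or_prepare(cache_dir: Path, prepare) -> str:
    """Run `prepare` at most once per cache directory; later callers reuse the result."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    result_path = cache_dir / "dataset.txt"  # assumed cache file name
    with exclusive_lock(cache_dir / "prepare.lock"):
        if not result_path.exists():
            result_path.write_text(prepare())
    return result_path.read_text()
```

A production version would also need to handle stale locks left behind by a crashed process, which this sketch does not attempt.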

How has this been tested?

TODO

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

  • New Features

    • Introduced a modular system for dataset wrapping and tokenization supporting diverse dataset types and prompt styles.
    • Added generalized dataset preparation workflows for supervised fine-tuning and reinforcement learning with distributed synchronization and caching.
    • Implemented a file-based locking mechanism to coordinate dataset loading and preparation across concurrent processes.
  • Improvements

    • Enhanced dataset loading from local, cloud, remote, and URL sources with improved error handling and modular design.
    • Streamlined deduplication and sequence filtering for better clarity and reliability.
    • Improved code clarity, documentation, and consistent use of modern Python type hints.
    • Simplified dataset preparation logic with better modularity and distributed coordination.
    • Refined retry strategies and hashing utilities with clearer documentation.
    • Safer configuration validation preventing attribute errors.
    • Added explicit public APIs and standardized dataset cache paths.
    • Updated dataset wrapper selection with extensible handlers for various dataset categories.
    • Simplified and clarified dataset loading and preparation function signatures and internal logic.
    • Removed redundant CLI argument dependencies in dataset loading calls across tests.
  • Bug Fixes

    • Corrected subprocess termination check logic to properly detect process exit.
    • Fixed potential attribute errors in configuration validation by using safe attribute access.
  • Tests

    • Updated tests to reflect new dataset preparation and deduplication APIs.
    • Added comprehensive tests for the file-based locking mechanism ensuring safe concurrent dataset loading.
    • Adjusted tests to remove CLI argument dependencies in dataset loading calls.
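As an illustration of the deduplication and sequence filtering mentioned in the summary above, the following is a rough sketch only. The function names `row_hash`, `deduplicate`, and `drop_long_sequences`, the `input_ids` field, and the choice of SHA-256 over JSON-serialized rows are assumptions made for this example and do not describe the PR's actual API.

```python
import hashlib
import json


def row_hash(row: dict) -> str:
    """Stable SHA-256 digest of a row, independent of key order (hypothetical helper)."""
    payload = json.dumps(row, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(rows: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct row, preserving order."""
    seen: set[str] = set()
    unique = []
    for row in rows:
        digest = row_hash(row)
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique


def drop_long_sequences(rows: list[dict], max_len: int) -> list[dict]:
    """Drop rows whose `input_ids` (assumed field name) exceed `max_len` tokens."""
    return [row for row in rows if len(row["input_ids"]) <= max_len]
```

Hashing a canonical serialization means two rows with the same fields in a different key order count as duplicates, which is one reasonable definition among several.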



Development

Successfully merging this pull request may close these issues.

Distributed Timeout during Dataset Tokenization

4 participants