cache indices support #93

saforem2 · 2025-08-25T14:50:10Z

Copilot Summary

This pull request refactors the dataset index building logic in megatron/data/gpt_dataset.py to improve modularity and add disk caching for index arrays. The changes introduce new internal helper functions, enhance logging, and optimize distributed loading of index files. The most important changes are grouped below:

Dataset Index Building Refactor

Extracted index building logic into dedicated helper functions: _build_indices_blended, _build_indices_concat, and a new _build_indices dispatcher for improved code clarity and maintainability.

Disk Caching and Distributed Loading

Added a _cache_indices function that computes a hash of the dataset description, saves index arrays to disk, and loads them efficiently across distributed processes, reducing redundant computation and improving startup time.
Implemented robust error handling and logging for cache directory creation and file operations, with warnings for cache misses and access issues.

Logging and Messaging Improvements

Improved log messages for dataset construction, including more descriptive output for corpus datasets and timing information for index file operations.
Updated a log message to clarify that the dataset type is "CorpusDataset" instead of "ConcatDataset" for better accuracy.

zhenghh04 and others added 3 commits August 16, 2025 01:26

cache indices support

b691201

Merge branch 'saforem2/fix-formatting' into feature/cache_indices

1a3653d

Merge branch 'saforem2/fix-formatting' into feature/cache_indices

3d83690

saforem2 merged commit ec58e99 into saforem2/fix-formatting Sep 3, 2025
2 checks passed
