@saforem2
Member

Copilot Summary

This pull request refactors the dataset index building logic in megatron/data/gpt_dataset.py to improve modularity and add disk caching for index arrays. The changes introduce new internal helper functions, enhance logging, and optimize distributed loading of index files. The most important changes are grouped below:

Dataset Index Building Refactor

  • Extracted the index-building logic into dedicated helper functions: _build_indices_blended, _build_indices_concat, and a new _build_indices dispatcher that selects between them, for improved code clarity and maintainability.

Disk Caching and Distributed Loading

  • Added a _cache_indices function that computes a hash of the dataset description, saves index arrays to disk, and loads them efficiently across distributed processes, reducing redundant computation and improving startup time.
  • Implemented robust error handling and logging for cache directory creation and file operations, with warnings for cache misses and access issues.

Logging and Messaging Improvements

  • Improved log messages for dataset construction, including more descriptive output for corpus datasets and timing information for index file operations.
  • Updated a log message to clarify that the dataset type is "CorpusDataset" instead of "ConcatDataset" for better accuracy.

@saforem2 saforem2 merged commit ec58e99 into saforem2/fix-formatting Sep 3, 2025
2 checks passed