Skip to content

Conversation

@martinfoell
Copy link
Contributor

This Pull request:

Introduces a new shuffling strategy for creating training batches, ensuring that each batch consists of data from different parts of the RDataFrame. Each chunk loaded into memory, which is used to create batches, now contains blocks of data drawn from different parts of the dataframe.

@github-actions
Copy link

github-actions bot commented Oct 9, 2025

Test Results

    21 files      21 suites   3d 16h 14m 56s ⏱️
 3 692 tests  3 691 ✅ 0 💤 1 ❌
75 650 runs  75 648 ✅ 0 💤 2 ❌

For more details on these failures, see this check.

Results for commit 4aa4641.

♻️ This comment has been updated with latest results.

Copy link
Member

@vepadulano vepadulano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this great work! This is a first iteration of comments from my side.

@martinfoell
Copy link
Contributor Author

Thank you for the review @vepadulano! I have implemented the changes that you suggested and left some comments where it was unclear what the code was doing.

This commit introduces a new shuffling strategy for creating batches of data used in ML training loaded by RDataFrame lazily. Shuffling of the data is important in ML to avoid bias towards any particular class, label, type used as target of a classification task in the training. Lazy loading of data into batches allows one to manage datasets that do not fit in memory.

The shuffling strategy is implemented as follows. There are three important components: chunk, block and batch. A chunk is the largest piece of data that is loaded into memory at a time. A block is a logical range of consecutive entries, multiple blocks make up a chunk. The batch is the container that will be expected by the model. The sizes of the chunks, blocks and batches are set by the user to specify how shuffled the data should be. Given the total number of entries in the dataset, the chunk size and the block size we compute how many blocks will make up a chunk. The logic of creating the chunks from the blocks is done in the class RChunkConstructor where the chunks are defined by taking blocks from random parts of the dataset. Once the chunks are defined by the indices they are loaded into memory one at a time with the class RChunkLoader where the entries of the chunk are further shuffled. Finally, the chunk is further split into batches with the class RBatchLoader.

In the class RBatchGenerator the steps described above are connected, creating training and validation batches from the splitting defined by the user.
@martinfoell martinfoell force-pushed the rbatchgenerator-new-shuffling branch from 57396ba to 4aa4641 Compare October 21, 2025 13:47
Copy link
Member

@vepadulano vepadulano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Congratulations @martinfoell great work!

@martinfoell martinfoell merged commit 2f055b0 into root-project:master Oct 21, 2025
21 of 27 checks passed
guitargeek added a commit to guitargeek/root that referenced this pull request Oct 22, 2025
Follow up on root-project#20071.

Fix the following compiler warnings:
```txt
/home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:46:16: warning: private field 'fChunkSize' is not used [-Wunused-private-field]
   46 |    std::size_t fChunkSize;
      |                ^
/home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:49:16: warning: private field 'fMaxBatches' is not used [-Wunused-private-field]
   49 |    std::size_t fMaxBatches;
      |                ^
/home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:50:16: warning: private field 'fTrainingRemainderRow' is not used [-Wunused-private-field]
   50 |    std::size_t fTrainingRemainderRow = 0;
      |                ^
/home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:51:16: warning: private field 'fValidationRemainderRow' is not used [-Wunused-private-field]
   51 |    std::size_t fValidationRemainderRow = 0;
      |                ^
```

Also, use inline namespaces for code brevity and remove some comments
that trigger clang-format to not run on the doc string, although it only
suggests harmless line breaks that don't change how the rendered doxygen
looks like.
guitargeek added a commit to guitargeek/root that referenced this pull request Oct 22, 2025
Follow up on root-project#20071.

Fix the following compiler warnings:
```txt
/home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:46:16: warning: private field 'fChunkSize' is not used [-Wunused-private-field]
   46 |    std::size_t fChunkSize;
      |                ^
/home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:49:16: warning: private field 'fMaxBatches' is not used [-Wunused-private-field]
   49 |    std::size_t fMaxBatches;
      |                ^
/home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:50:16: warning: private field 'fTrainingRemainderRow' is not used [-Wunused-private-field]
   50 |    std::size_t fTrainingRemainderRow = 0;
      |                ^
/home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:51:16: warning: private field 'fValidationRemainderRow' is not used [-Wunused-private-field]
   51 |    std::size_t fValidationRemainderRow = 0;
      |                ^
```

Also, use inline namespaces for code brevity and remove some comments
that trigger clang-format to not run on the doc string, although it only
suggests harmless line breaks that don't change how the rendered doxygen
looks like.
guitargeek added a commit that referenced this pull request Oct 22, 2025
Follow up on #20071.

Fix the following compiler warnings:
```txt
/home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:46:16: warning: private field 'fChunkSize' is not used [-Wunused-private-field]
   46 |    std::size_t fChunkSize;
      |                ^
/home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:49:16: warning: private field 'fMaxBatches' is not used [-Wunused-private-field]
   49 |    std::size_t fMaxBatches;
      |                ^
/home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:50:16: warning: private field 'fTrainingRemainderRow' is not used [-Wunused-private-field]
   50 |    std::size_t fTrainingRemainderRow = 0;
      |                ^
/home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:51:16: warning: private field 'fValidationRemainderRow' is not used [-Wunused-private-field]
   51 |    std::size_t fValidationRemainderRow = 0;
      |                ^
```

Also, use inline namespaces for code brevity and remove some comments
that trigger clang-format to not run on the doc string, although it only
suggests harmless line breaks that don't change how the rendered doxygen
looks like.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants