[tmva] Implementation of a new shuffling strategy in RBatchGenerator #20071

martinfoell · 2025-10-09T12:23:26Z

This Pull request:

Introduces a new shuffling strategy for creating training batches, ensuring that each batch consists of data from different parts of the RDataFrame. Each chunk loaded into memory, which is used to create batches, now contains blocks of data drawn from different parts of the dataframe.

github-actions · 2025-10-09T18:45:08Z

Test Results

21 files 21 suites 3d 16h 14m 56s ⏱️
3 692 tests 3 691 ✅ 0 💤 1 ❌
75 650 runs 75 648 ✅ 0 💤 2 ❌

For more details on these failures, see this check.

Results for commit 4aa4641.

♻️ This comment has been updated with latest results.

vepadulano

Thanks a lot for this great work! This is a first iteration of comments from my side.

bindings/pyroot/pythonizations/python/ROOT/_pythonization/_tmva/_batchgenerator.py

bindings/pyroot/pythonizations/test/rbatchgenerator_completeness.py

tmva/tmva/inc/TMVA/BatchGenerator/RBatchGenerator.hxx

tmva/tmva/inc/TMVA/BatchGenerator/RChunkConstructor.hxx

martinfoell · 2025-10-17T14:33:37Z

Thank you for the review @vepadulano! I have implemented the changes that you suggested and left some comments where it was unclear what the code was doing.

This commit introduces a new shuffling strategy for creating batches of data used in ML training loaded by RDataFrame lazily. Shuffling of the data is important in ML to avoid bias towards any particular class, label, type used as target of a classification task in the training. Lazy loading of data into batches allows one to manage datasets that do not fit in memory. The shuffling strategy is implemented as follows. There are three important components: chunk, block and batch. A chunk is the largest piece of data that is loaded into memory at a time. A block is a logical range of consecutive entries, multiple blocks make up a chunk. The batch is the container that will be expected by the model. The sizes of the chunks, blocks and batches are set by the user to specify how shuffled the data should be. Given the total number of entries in the dataset, the chunk size and the block size we compute how many blocks will make up a chunk. The logic of creating the chunks from the blocks is done in the class RChunkConstructor where the chunks are defined by taking blocks from random parts of the dataset. Once the chunks are defined by the indices they are loaded into memory one at a time with the class RChunkLoader where the entries of the chunk are further shuffled. Finally, the chunk is further split into batches with the class RBatchLoader. In the class RBatchGenerator the steps described above are connected, creating training and validation batches from the splitting defined by the user.

vepadulano

Congratulations @martinfoell great work!

Follow up on root-project#20071. Fix the following compiler warnings: ```txt /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:46:16: warning: private field 'fChunkSize' is not used [-Wunused-private-field] 46 | std::size_t fChunkSize; | ^ /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:49:16: warning: private field 'fMaxBatches' is not used [-Wunused-private-field] 49 | std::size_t fMaxBatches; | ^ /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:50:16: warning: private field 'fTrainingRemainderRow' is not used [-Wunused-private-field] 50 | std::size_t fTrainingRemainderRow = 0; | ^ /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:51:16: warning: private field 'fValidationRemainderRow' is not used [-Wunused-private-field] 51 | std::size_t fValidationRemainderRow = 0; | ^ ``` Also, use inline namespaces for code brevity and remove some comments that trigger clang-format to not run on the doc string, although it only suggests harmless line breaks that don't change how the rendered doxygen looks like.

Follow up on #20071. Fix the following compiler warnings: ```txt /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:46:16: warning: private field 'fChunkSize' is not used [-Wunused-private-field] 46 | std::size_t fChunkSize; | ^ /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:49:16: warning: private field 'fMaxBatches' is not used [-Wunused-private-field] 49 | std::size_t fMaxBatches; | ^ /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:50:16: warning: private field 'fTrainingRemainderRow' is not used [-Wunused-private-field] 50 | std::size_t fTrainingRemainderRow = 0; | ^ /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:51:16: warning: private field 'fValidationRemainderRow' is not used [-Wunused-private-field] 51 | std::size_t fValidationRemainderRow = 0; | ^ ``` Also, use inline namespaces for code brevity and remove some comments that trigger clang-format to not run on the doc string, although it only suggests harmless line breaks that don't change how the rendered doxygen looks like.

martinfoell self-assigned this Oct 9, 2025

martinfoell requested review from bellenot, couet, dpiparo, lmoneta and vepadulano as code owners October 9, 2025 12:23

vepadulano requested changes Oct 17, 2025

View reviewed changes

martinfoell force-pushed the rbatchgenerator-new-shuffling branch from 57396ba to 4aa4641 Compare October 21, 2025 13:47

vepadulano approved these changes Oct 21, 2025

View reviewed changes

martinfoell merged commit 2f055b0 into root-project:master Oct 21, 2025
21 of 27 checks passed

guitargeek mentioned this pull request Oct 22, 2025

[tmva] Fix Clang 21 compiler warnings in RBatchLoader #20175

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[tmva] Implementation of a new shuffling strategy in RBatchGenerator #20071

[tmva] Implementation of a new shuffling strategy in RBatchGenerator #20071

Uh oh!

martinfoell commented Oct 9, 2025

Uh oh!

github-actions bot commented Oct 9, 2025 •

edited

Loading

Uh oh!

vepadulano left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

martinfoell commented Oct 17, 2025

Uh oh!

vepadulano left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[tmva] Implementation of a new shuffling strategy in RBatchGenerator #20071

[tmva] Implementation of a new shuffling strategy in RBatchGenerator #20071

Uh oh!

Conversation

martinfoell commented Oct 9, 2025

This Pull request:

Uh oh!

github-actions bot commented Oct 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results

Uh oh!

vepadulano left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

martinfoell commented Oct 17, 2025

Uh oh!

vepadulano left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions bot commented Oct 9, 2025 •

edited

Loading