- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.4k
[tmva] Implementation of a new shuffling strategy in RBatchGenerator #20071
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tmva] Implementation of a new shuffling strategy in RBatchGenerator #20071
Conversation
| Test Results    21 files      21 suites   3d 16h 14m 56s ⏱️ For more details on these failures, see this check. Results for commit 4aa4641. ♻️ This comment has been updated with latest results. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for this great work! This is a first iteration of comments from my side.
        
          
                bindings/pyroot/pythonizations/python/ROOT/_pythonization/_tmva/_batchgenerator.py
          
            Show resolved
            Hide resolved
        
              
          
                bindings/pyroot/pythonizations/python/ROOT/_pythonization/_tmva/_batchgenerator.py
          
            Show resolved
            Hide resolved
        
              
          
                bindings/pyroot/pythonizations/python/ROOT/_pythonization/_tmva/_batchgenerator.py
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                bindings/pyroot/pythonizations/python/ROOT/_pythonization/_tmva/_batchgenerator.py
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                bindings/pyroot/pythonizations/test/rbatchgenerator_completeness.py
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | Thank you for the review @vepadulano! I have implemented the changes that you suggested and left some comments where it was unclear what the code was doing. | 
This commit introduces a new shuffling strategy for creating batches of data used in ML training loaded by RDataFrame lazily. Shuffling of the data is important in ML to avoid bias towards any particular class, label, type used as target of a classification task in the training. Lazy loading of data into batches allows one to manage datasets that do not fit in memory. The shuffling strategy is implemented as follows. There are three important components: chunk, block and batch. A chunk is the largest piece of data that is loaded into memory at a time. A block is a logical range of consecutive entries, multiple blocks make up a chunk. The batch is the container that will be expected by the model. The sizes of the chunks, blocks and batches are set by the user to specify how shuffled the data should be. Given the total number of entries in the dataset, the chunk size and the block size we compute how many blocks will make up a chunk. The logic of creating the chunks from the blocks is done in the class RChunkConstructor where the chunks are defined by taking blocks from random parts of the dataset. Once the chunks are defined by the indices they are loaded into memory one at a time with the class RChunkLoader where the entries of the chunk are further shuffled. Finally, the chunk is further split into batches with the class RBatchLoader. In the class RBatchGenerator the steps described above are connected, creating training and validation batches from the splitting defined by the user.
57396ba    to
    4aa4641      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Congratulations @martinfoell great work!
Follow up on root-project#20071. Fix the following compiler warnings: ```txt /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:46:16: warning: private field 'fChunkSize' is not used [-Wunused-private-field] 46 | std::size_t fChunkSize; | ^ /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:49:16: warning: private field 'fMaxBatches' is not used [-Wunused-private-field] 49 | std::size_t fMaxBatches; | ^ /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:50:16: warning: private field 'fTrainingRemainderRow' is not used [-Wunused-private-field] 50 | std::size_t fTrainingRemainderRow = 0; | ^ /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:51:16: warning: private field 'fValidationRemainderRow' is not used [-Wunused-private-field] 51 | std::size_t fValidationRemainderRow = 0; | ^ ``` Also, use inline namespaces for code brevity and remove some comments that trigger clang-format to not run on the doc string, although it only suggests harmless line breaks that don't change how the rendered doxygen looks like.
Follow up on root-project#20071. Fix the following compiler warnings: ```txt /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:46:16: warning: private field 'fChunkSize' is not used [-Wunused-private-field] 46 | std::size_t fChunkSize; | ^ /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:49:16: warning: private field 'fMaxBatches' is not used [-Wunused-private-field] 49 | std::size_t fMaxBatches; | ^ /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:50:16: warning: private field 'fTrainingRemainderRow' is not used [-Wunused-private-field] 50 | std::size_t fTrainingRemainderRow = 0; | ^ /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:51:16: warning: private field 'fValidationRemainderRow' is not used [-Wunused-private-field] 51 | std::size_t fValidationRemainderRow = 0; | ^ ``` Also, use inline namespaces for code brevity and remove some comments that trigger clang-format to not run on the doc string, although it only suggests harmless line breaks that don't change how the rendered doxygen looks like.
Follow up on #20071. Fix the following compiler warnings: ```txt /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:46:16: warning: private field 'fChunkSize' is not used [-Wunused-private-field] 46 | std::size_t fChunkSize; | ^ /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:49:16: warning: private field 'fMaxBatches' is not used [-Wunused-private-field] 49 | std::size_t fMaxBatches; | ^ /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:50:16: warning: private field 'fTrainingRemainderRow' is not used [-Wunused-private-field] 50 | std::size_t fTrainingRemainderRow = 0; | ^ /home/rembserj/code/root/root_src/tmva/tmva/inc/TMVA/BatchGenerator/RBatchLoader.hxx:51:16: warning: private field 'fValidationRemainderRow' is not used [-Wunused-private-field] 51 | std::size_t fValidationRemainderRow = 0; | ^ ``` Also, use inline namespaces for code brevity and remove some comments that trigger clang-format to not run on the doc string, although it only suggests harmless line breaks that don't change how the rendered doxygen looks like.
This Pull request:
Introduces a new shuffling strategy for creating training batches, ensuring that each batch consists of data from different parts of the RDataFrame. Each chunk loaded into memory, which is used to create batches, now contains blocks of data drawn from different parts of the dataframe.