Change the _maxCalibrationExamples default on CalibratorUtils#5415
Change the _maxCalibrationExamples default on CalibratorUtils#5415antoniovs1029 merged 2 commits intodotnet:masterfrom
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5415 +/- ##
==========================================
- Coverage 74.08% 74.02% -0.07%
==========================================
Files 1019 1019
Lines 190355 190363 +8
Branches 20469 20469
==========================================
- Hits 141033 140914 -119
- Misses 43791 43905 +114
- Partials 5531 5544 +13
Flags with carried forward coverage won't be shown. Click here to find out more.
|
| // maximum number of rows passed to the calibrator. | ||
| private const int _maxCalibrationExamples = 1000000; | ||
| // if 0, we'll actually look through the whole dataset to | ||
| // when training the calibrator |
There was a problem hiding this comment.
Earlier, you had explained to me that if this value is zero, we would look through the whole dataset, but still only use a million rows (randomly selected) for calibration. Is that correct?
Can you please clarify the exact behavior in comments?
There was a problem hiding this comment.
It is correct for PlattCalibrator, but depending of the calibrator the behavior is different. If this is 0, the only thing that does happen for all the calibrators is that we'll look through all the dataset. I explained this further on the other comment I left, so I don't think it's necessary to clarify it more in here.
| if (maxRows > 0 && ++num >= maxRows) | ||
| // If maxRows was 0, we'll process all of the rows in the dataset | ||
| // Notice that depending of the calibrator, "processing" means | ||
| // only using N random rows of the ones that where processed |
There was a problem hiding this comment.
nit, typo: "depending on the calibrator"
* Update to Onnxruntime 1.5.1 (#5406) * Added variables to tests to control Gpu settings * Added dependency to prerelease * Updated to 1.5.1 * Remove prerelease feed * Nit on GPU variables * Change the _maxCalibrationExamples default on CalibratorUtils (#5415) * Change the _maxCalibrationExamples default * Improving comments * Fix perf regression in ShuffleRows (#5417) RowShufflingTransformer is using ChannelReader incorrectly. It needs to block waiting for items to read and was Thread.Sleeping in order to wait, but not spin the current core. This caused a major perf regression. The fix is to block synchronously correctly - by calling AsTask() on the ValueTask that is returned from the ChannelReader and block on the Task. Fix #5416 Co-authored-by: Antonio Velázquez <38739674+antoniovs1029@users.noreply.github.com> Co-authored-by: Eric Erhardt <eric.erhardt@microsoft.com>
* Update to Onnxruntime 1.5.1 (dotnet#5406) * Added variables to tests to control Gpu settings * Added dependency to prerelease * Updated to 1.5.1 * Remove prerelease feed * Nit on GPU variables * Change the _maxCalibrationExamples default on CalibratorUtils (dotnet#5415) * Change the _maxCalibrationExamples default * Improving comments * Fix perf regression in ShuffleRows (dotnet#5417) RowShufflingTransformer is using ChannelReader incorrectly. It needs to block waiting for items to read and was Thread.Sleeping in order to wait, but not spin the current core. This caused a major perf regression. The fix is to block synchronously correctly - by calling AsTask() on the ValueTask that is returned from the ChannelReader and block on the Task. Fix dotnet#5416 Co-authored-by: Antonio Velázquez <38739674+antoniovs1029@users.noreply.github.com> Co-authored-by: Eric Erhardt <eric.erhardt@microsoft.com>
* Update to Onnxruntime 1.5.1 (#5406) * Added variables to tests to control Gpu settings * Added dependency to prerelease * Updated to 1.5.1 * Remove prerelease feed * Nit on GPU variables * Change the _maxCalibrationExamples default on CalibratorUtils (#5415) * Change the _maxCalibrationExamples default * Improving comments * Fix perf regression in ShuffleRows (#5417) RowShufflingTransformer is using ChannelReader incorrectly. It needs to block waiting for items to read and was Thread.Sleeping in order to wait, but not spin the current core. This caused a major perf regression. The fix is to block synchronously correctly - by calling AsTask() on the ValueTask that is returned from the ChannelReader and block on the Task. Fix #5416 Co-authored-by: Antonio Velázquez <38739674+antoniovs1029@users.noreply.github.com> Co-authored-by: Eric Erhardt <eric.erhardt@microsoft.com>
As reported offline, ML.NET yielded different results than TLC when training a PlattCalibrator with the same dataset.
Upon further investigation, it turns out that it only happened on datasets over 1 million rows, and the reason was that when porting the CalibratorUtils class from TLC, a "_maxCalibrationExamples = 1000000" default parameter was added.
Upon reading through the code (in particular CalibratorTrainingBase's ProcessingTrainingExample) it turns out that on TLC
TrainCalibratorwas called withmaxRows = 0, and this made that when training the PlattCalibrator, all the dataset was seen, but only 1M rows where selected randomly to be added to the DataStore. In contrast, on ML.NET that same method was called withmaxRows = 1M, and this made that only the first 1M rows were added to the DataStore (instead of randomly selecting them from the complete dataset). This caused bias and undesired results.