Converted potentially large variables to type long #5041
mstfbl merged 4 commits into dotnet:master
| // *** Binary format *** | ||
| // int: _labelCount | ||
| // int[_labelCount]: _labelHistogram | ||
| // int: _labelCount (read during reading of _labelHistogram in ReadLongArray()) |
I added this comment because _labelCount is not read explicitly here; it is read inside ReadLongArray(), as shown below:
machinelearning/src/Microsoft.ML.Core/Utilities/Stream.cs
Lines 681 to 688 in 214926f
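To illustrate why the count does not need its own read call, here is a minimal sketch of a count-prefixed array format, assuming the layout described in the comment above (an int count followed by the elements). The class and method names (CountPrefixedArraySketch, WriteCounted, ReadCounted, RoundTrip) are hypothetical and are not the helpers in Stream.cs.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical sketch, not the ML.NET implementation.
public static class CountPrefixedArraySketch
{
    // Writes the element count first, then each 64-bit element.
    public static void WriteCounted(BinaryWriter writer, long[] values)
    {
        writer.Write(values.Length);
        foreach (long value in values)
            writer.Write(value);
    }

    // Reads the count as part of reading the array, so the caller
    // never reads the count explicitly.
    public static long[] ReadCounted(BinaryReader reader)
    {
        int count = reader.ReadInt32();
        var values = new long[count];
        for (int i = 0; i < count; i++)
            values[i] = reader.ReadInt64();
        return values;
    }

    // Writes to an in-memory stream and reads the data back.
    public static long[] RoundTrip(long[] values)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new BinaryWriter(stream, Encoding.UTF8, leaveOpen: true))
                WriteCounted(writer, values);
            stream.Position = 0;
            using (var reader = new BinaryReader(stream))
                return ReadCounted(reader);
        }
    }
}
```

Calling CountPrefixedArraySketch.RoundTrip(new long[] { 1L, 3000000000L }) returns an array with the same two values, including the one that would not fit in an int.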
| // int[_labelCount]: _absentFeaturesLogProb | ||
| ctx.Writer.WriteIntArray(_labelHistogram.AsSpan(0, _labelCount)); | ||
| ctx.Writer.Write(_labelCount); | ||
| ctx.Writer.WriteLongStream(_labelHistogram); |
.AsSpan() cannot be used here because 'System.Span' cannot be converted to 'System.Collections.Generic.IEnumerable'. The only other way around this is using Array.Copy(), but that introduces needless array copying. I also don't see a case where only part of _labelHistogram is serialized. As such, _labelHistogram is always serialized whole here. #Resolved
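For readers unfamiliar with the constraint, the sketch below shows why a span cannot be passed where an IEnumerable is expected. It also shows one copy-free way to enumerate a prefix of an array, offered only as an illustration and not as what this PR does. The class and method names (PrefixEnumerationSketch, Sum, SumPrefix) are hypothetical; Sum stands in for any consumer that accepts only IEnumerable<long>.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch, not code from the PR.
public static class PrefixEnumerationSketch
{
    // Stand-in for a consumer that only accepts IEnumerable<long>.
    public static long Sum(IEnumerable<long> values)
    {
        long total = 0;
        foreach (long value in values)
            total += value;
        return total;
    }

    public static long SumPrefix(long[] histogram, int count)
    {
        // Sum(histogram.AsSpan(0, count)) does not compile:
        // Span<long> is a ref struct and does not implement IEnumerable<long>.
        // ArraySegment<long> wraps the same array without copying its elements
        // and does implement IEnumerable<long>.
        return Sum(new ArraySegment<long>(histogram, 0, count));
    }
}
```

When the whole array is always serialized, as argued above, passing the array itself is simpler and no wrapper is needed.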
| { | ||
| if (_labelHistogram[i] > 0) | ||
| ctx.Writer.WriteIntsNoCount(_featureHistogram[i].AsSpan(0, _featureCount)); | ||
| ctx.Writer.WriteLongStream(_featureHistogram[i]); |
Same as above.
.AsSpan() cannot be used here because 'System.Span' cannot be converted to 'System.Collections.Generic.IEnumerable'. The only other way around this is using Array.Copy(), but that introduces needless array copying. I also don't see a case where only part of _labelHistogram is serialized. As such, _labelHistogram is always serialized whole here. #Resolved
| else | ||
| { | ||
| _labelHistogram = Array.ConvertAll(ctx.Reader.ReadIntArray() ?? new int[0], x => (long)x); | ||
| } |
If ReadIntArray returns null, it likely means the file is bad. Should you be throwing an error in this case? The old behavior seems wrong. #Resolved
Hey Harish, ctx.Reader.ReadIntArray(int size) can return null when the size of the array being loaded is 0. Here's the source code:
machinelearning/src/Microsoft.ML.Core/Utilities/Stream.cs
Lines 605 to 641 in 7628d6c
| _featureHistogram[iLabel] = ctx.Reader.ReadLongArray(_featureCount); | ||
| else | ||
| _featureHistogram[iLabel] = Array.ConvertAll(ctx.Reader.ReadIntArray(_featureCount) ?? new int[0], x => (long)x); | ||
| for (int iFeature = 0; iFeature < _featureCount; iFeature += 1) |
Same comment as above #Resolved
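The backward-compatible read discussed in this thread can be sketched as follows, assuming (as stated in the reply above) that the legacy int reader returns null for a zero-length array. All names here (HistogramLoadSketch, ReadIntsOrNull, ReadLongs, LoadHistogram, storedAsLong) are hypothetical; storedAsLong stands in for whatever loader version check the model uses.

```csharp
using System;
using System.IO;

// Hypothetical sketch, not the ML.NET implementation.
public static class HistogramLoadSketch
{
    // Stand-in for the legacy reader: returns null when size is 0.
    public static int[] ReadIntsOrNull(BinaryReader reader, int size)
    {
        if (size == 0)
            return null;
        var values = new int[size];
        for (int i = 0; i < size; i++)
            values[i] = reader.ReadInt32();
        return values;
    }

    // Reader for models saved with 64-bit counts.
    public static long[] ReadLongs(BinaryReader reader, int size)
    {
        var values = new long[size];
        for (int i = 0; i < size; i++)
            values[i] = reader.ReadInt64();
        return values;
    }

    public static long[] LoadHistogram(BinaryReader reader, int size, bool storedAsLong)
    {
        if (storedAsLong)
            return ReadLongs(reader, size);

        // Older models stored int counts: treat null as an empty array,
        // then widen each element to long.
        int[] legacy = ReadIntsOrNull(reader, size) ?? Array.Empty<int>();
        return Array.ConvertAll(legacy, x => (long)x);
    }
}
```

The null-coalescing step matters because Array.ConvertAll throws an ArgumentNullException when given a null array.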
Fixes #3228
As explained in Issue #3228, very large datasets with more than about 2.14 billion rows can cause overflow when, say, the sum of the labels is computed and stored as an int. This PR converts the arrays and matrices that store labels and features in their respective histograms from type int to type long. In addition, this PR updates the version of NaiveBayesMulticlassTrainer's Loader to preserve backward compatibility.
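The overflow described above can be reproduced with a single increment at the int limit. This is a minimal sketch with hypothetical names (HistogramOverflowSketch, IncrementInt, IncrementLong); it models one histogram counter, not the trainer itself.

```csharp
using System;

// Hypothetical sketch, not code from the PR.
public static class HistogramOverflowSketch
{
    // A 32-bit counter wraps around once it passes int.MaxValue (2,147,483,647).
    // unchecked makes the wraparound explicit regardless of compiler settings.
    public static int IncrementInt(int count)
    {
        unchecked
        {
            return count + 1;
        }
    }

    // A 64-bit counter keeps counting past the 32-bit limit.
    public static long IncrementLong(long count)
    {
        return count + 1;
    }
}
```

HistogramOverflowSketch.IncrementInt(int.MaxValue) returns int.MinValue (-2147483648), while HistogramOverflowSketch.IncrementLong(int.MaxValue) returns 2147483648.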