Skip to content

Converted potentially large variables to type long#5041

Merged
mstfbl merged 4 commits intodotnet:masterfrom
mstfbl:Issue-3228
Apr 21, 2020
Merged

Converted potentially large variables to type long#5041
mstfbl merged 4 commits intodotnet:masterfrom
mstfbl:Issue-3228

Conversation

@mstfbl
Copy link
Contributor

@mstfbl mstfbl commented Apr 20, 2020

Fixes #3228

As explained in Issue #3228, very large datasets more than 2.14 billion rows of data can cause overflow when, say, the sum of these labels are obtained, and if these are stored as ints. This PR converts arrays and matrices for storing labels and features in their respective histograms from type int to type long. In addition, this PR updates the version of NaiveBayesMulticlassTrainer's Loader to preserve backwards compatibility.

@mstfbl mstfbl changed the title Converted potentially large values to type long Converted potentially large variables to type long Apr 20, 2020
// *** Binary format ***
// int: _labelCount
// int[_labelCount]: _labelHistogram
// int: _labelCount (read during reading of _labelHistogram in ReadLongArray())
Copy link
Contributor Author

@mstfbl mstfbl Apr 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this comment as we are not explicitly reading _labelCount here, but in ReadLongArray() as shown below:

public static long[] ReadLongArray(this BinaryReader reader)
{
Contracts.AssertValue(reader);
int size = reader.ReadInt32();
Contracts.CheckDecode(size >= 0);
return ReadLongArray(reader, size);
}
#Resolved

// int[_labelCount]: _absentFeaturesLogProb
ctx.Writer.WriteIntArray(_labelHistogram.AsSpan(0, _labelCount));
ctx.Writer.Write(_labelCount);
ctx.Writer.WriteLongStream(_labelHistogram);
Copy link
Contributor Author

@mstfbl mstfbl Apr 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.AsSpan() is not implemented for converting from 'System.Span' to 'System.Collections.Generic.IEnumerable. The only other way around this is using Array.Copy(), but that introduces needless array copying. I also don't see a case where not all of _labelHistogram is not serialized. As such, _labelHistogram is always serialized whole here. #Resolved

{
if (_labelHistogram[i] > 0)
ctx.Writer.WriteIntsNoCount(_featureHistogram[i].AsSpan(0, _featureCount));
ctx.Writer.WriteLongStream(_featureHistogram[i]);
Copy link
Contributor Author

@mstfbl mstfbl Apr 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sames as above.

.AsSpan() is not implemented for converting from 'System.Span' to 'System.Collections.Generic.IEnumerable. The only other way around this is using Array.Copy(), but that introduces needless array copying. I also don't see a case where not all of _labelHistogram is not serialized. As such, _labelHistogram is always serialized whole here. #Resolved

@mstfbl mstfbl marked this pull request as ready for review April 21, 2020 01:39
@mstfbl mstfbl requested a review from a team as a code owner April 21, 2020 01:39
else
{
_labelHistogram = Array.ConvertAll(ctx.Reader.ReadIntArray() ?? new int[0], x => (long)x);
}
Copy link
Contributor

@harishsk harishsk Apr 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ReadIntArray returns null, it likely means the file is bad. Should you be throwing an error in this case? The old behavior seems wrong. #Resolved

Copy link
Contributor Author

@mstfbl mstfbl Apr 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Harish, the array being read from ctx.Reader.ReadIntArray(int size) can return null if the size of the array being loaded is 0. Here's the source code:

public static int[] ReadIntArray(this BinaryReader reader)
{
Contracts.AssertValue(reader);
int size = reader.ReadInt32();
Contracts.CheckDecode(size >= 0);
return ReadIntArray(reader, size);
}
public static int[] ReadIntArray(this BinaryReader reader, int size)
{
Contracts.AssertValue(reader);
Contracts.Assert(size >= 0);
if (size == 0)
return null;
var values = new int[size];
long bufferSizeInBytes = (long)size * sizeof(int);
if (bufferSizeInBytes < _bulkReadThresholdInBytes)
{
for (int i = 0; i < size; i++)
values[i] = reader.ReadInt32();
}
else
{
unsafe
{
fixed (void* dst = values)
{
ReadBytes(reader, dst, bufferSizeInBytes, bufferSizeInBytes);
}
}
}
return values;
}
#Resolved

_featureHistogram[iLabel] = ctx.Reader.ReadLongArray(_featureCount);
else
_featureHistogram[iLabel] = Array.ConvertAll(ctx.Reader.ReadIntArray(_featureCount) ?? new int[0], x => (long)x);
for (int iFeature = 0; iFeature < _featureCount; iFeature += 1)
Copy link
Contributor

@harishsk harishsk Apr 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above #Resolved

Copy link
Contributor

@harishsk harishsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@mstfbl mstfbl merged commit 0a980fc into dotnet:master Apr 21, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Overflow in MultiClassNaiveBayes

2 participants