Merged
44 changes: 32 additions & 12 deletions src/Microsoft.ML.FastTree/FastTree.cs
@@ -1012,7 +1012,7 @@ private static IEnumerable<KeyValuePair<int, int>> NonZeroBinnedValuesForSparse(
}

private FeatureFlockBase CreateOneHotFlock(IChannel ch,
- List<int> features, int[] binnedValues, int[] lastOn, ValuesList[] instanceList,
+ List<int> features, int[] binnedValues, int[] lastOn, Dictionary<int, ValuesList> instanceList,

Contributor

Nice fix! Since FastTree is one of the primary trainers in ML.NET, I'd recommend testing the speed and memory on a variety of datasets.

@justinormont justinormont Jun 18, 2020

Contributor

I see that you're reporting speed/memory on the FastTree ranking Test benchmark (Test_Ranking_MSLRWeb10K_RawNumericFeatures_FastTreeRanking), not on TrainTest. For MAML, Test only runs the prediction step.

Running the training benchmark (TrainTest), TrainTest_Ranking_MSLRWeb10K_RawNumericFeatures_FastTreeRanking, might be more telling.

The MSLR dataset is purely numeric, with no text columns, so it shouldn't be affected by the ngram length change.

Adding a FastTree text benchmark

You could add FastTree to the text benchmark in (Benchmarks/Text/MultiClassClassification.cs).

Code for adding a FastTree benchmark on WikiDetox would be:

        [Benchmark]
        public void CV_Multiclass_WikiDetox_BigramsAndTrichar_OVAFastTree()
        {
            string cmd = @"CV k=5 data=" + _dataPathWiki +
                        " loader=TextLoader{quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=comment:TX:2 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 col=year:R4:3 header=+}" +
                        " xf=Convert{col=logged_in type=R4}" +
                        " xf=CategoricalTransform{col=ns}" +
                        " xf=TextTransform{col=FeaturesText:comment wordExtractor=NGramExtractorTransform{ngram=2}}" +
                        " xf=Concat{col=Features:FeaturesText,logged_in,ns}" +
                        " tr=OVA{p=FastTree}";

            var environment = EnvironmentFactory.CreateClassificationEnvironment<TextLoader, OneHotEncodingTransformer, FastTreeTrainer, LinearBinaryModelParameters>();
            cmd.ExecuteMamlCommand(environment);
        }

Heuristic

After testing various datasets, you may find that the dictionary is beneficial for small (or large) sparse datasets. If so, we could use a heuristic such as useDictionary = sparseness > 0.95 && rowsTimesSlots < 1E6;
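A minimal sketch of how a heuristic like this might be expressed, assuming the number of non-zero entries, the row count, and the slot count are already known. The class and method names (StorageHeuristic, Sparseness, ShouldUseDictionary) are hypothetical and not part of FastTree, and the 0.95 and 1E6 thresholds are just the values floated above; they would need tuning against real benchmarks.

```csharp
using System;

// Hypothetical helper for illustration only; not part of ML.NET.
public static class StorageHeuristic
{
    // Fraction of (row, slot) cells that hold no explicit non-zero value.
    public static double Sparseness(long nonZeroCount, long rowCount, long slotCount)
    {
        if (rowCount <= 0 || slotCount <= 0)
            throw new ArgumentException("rowCount and slotCount must be positive.");
        if (nonZeroCount < 0)
            throw new ArgumentException("nonZeroCount must not be negative.");

        double totalCells = (double)rowCount * slotCount;
        return 1.0 - nonZeroCount / totalCells;
    }

    // Mirrors the proposed rule: use the dictionary only for data that is
    // both very sparse and small. Both thresholds are assumptions.
    public static bool ShouldUseDictionary(double sparseness, double rowsTimesSlots)
    {
        return sparseness > 0.95 && rowsTimesSlots < 1E6;
    }
}
```

A caller would compute the sparseness once and pass it in together with rowCount * slotCount. Keeping the two thresholds in a single method makes them easy to adjust once benchmark numbers are available.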

Contributor Author

Do you have any example datasets I can use, or can you point me to where I can find some? I am not sure where I would find datasets for something like this.

Contributor Author

I was talking with Harish this morning about the sparseness and about choosing between a list and a dictionary. We are going to discuss it again, but I couldn't find a way to know ahead of time whether the data would be sparse or dense without the user telling us.

The test FastForestBinaryClassificationTestSummary uses a small dataset. It has about 1300 unique words and a total word count of about 9000. The current FastTree trainer multiplies those two values together (which is about 11.5 million) and allocates an array of that size right off the bat, which is where the huge memory usage comes from. It ends up using a very small fraction of that amount during training, but I couldn't figure out how to tell ahead of time how much it would actually use.
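The difference between allocating storage for every feature up front and allocating it only when a value arrives can be sketched as follows. This is an illustration under stated assumptions, not the FastTree code: List<double> stands in for the internal ValuesList type, and LazyFeatureStore and its methods are hypothetical names.

```csharp
using System.Collections.Generic;

// Hypothetical illustration; List<double> stands in for FastTree's ValuesList.
public static class LazyFeatureStore
{
    // Eager: one list per feature is created immediately,
    // even for features that never receive a value.
    public static List<double>[] CreateEager(int numFeatures)
    {
        var store = new List<double>[numFeatures];
        for (int i = 0; i < store.Length; i++)
            store[i] = new List<double>();
        return store;
    }

    // Lazy: a feature's list is created the first time a value is added for it,
    // so features that are never seen cost nothing.
    public static void Add(Dictionary<int, List<double>> store, int feature, double value)
    {
        if (!store.TryGetValue(feature, out List<double> values))
        {
            values = new List<double>();
            store[feature] = values;
        }
        values.Add(value);
    }
}
```

With the lazy variant, readers of the store have to handle a missing key, for example by treating it as an empty list; that is the cost of not allocating up front.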

Contributor

The benchmark I suggested adding uses WikiDetox, which is a 70MB text dataset.

See my recommended code for CV_Multiclass_WikiDetox_BigramsAndTrichar_OVAFastTree, above.

You should be able to just drop that code into (Benchmarks/Text/MultiClassClassification.cs).

> The test FastForestBinaryClassificationTestSummary uses a small dataset. It has about 1300 unique words and total word count is about 9000.

Tests run on micro-datasets to check whether something has changed. The benchmarks are meant to use real-world datasets, like MSLR and WikiDetox.

Would it be possible to measure the sparsity at runtime, either on a subsample before you choose between the array and the dictionary, or after N% of the data has passed?
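One way the subsample idea could look, as a rough sketch: count the explicit non-zero entries in the first few rows and divide by the number of cells seen. SparsityProbe and EstimateSparseness are hypothetical names, and the row representation (a list of slot/value pairs) is an assumption made for this example rather than the cursor API that FastTree uses.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration; not part of ML.NET.
public static class SparsityProbe
{
    // Estimates sparseness from at most sampleRows leading rows.
    // Each row lists its explicit (slot, value) entries; slotCount is the full feature width.
    public static double EstimateSparseness(
        IEnumerable<IReadOnlyList<KeyValuePair<int, double>>> rows,
        int slotCount,
        int sampleRows)
    {
        if (slotCount <= 0)
            throw new ArgumentException("slotCount must be positive.");
        if (sampleRows <= 0)
            throw new ArgumentException("sampleRows must be positive.");

        long seen = 0;
        long nonZero = 0;
        foreach (var row in rows)
        {
            if (seen == sampleRows)
                break;
            foreach (var entry in row)
            {
                // Explicitly stored zeros do not count as occupied cells.
                if (entry.Value != 0)
                    nonZero++;
            }
            seen++;
        }

        // With no rows there is nothing stored, so report fully sparse.
        if (seen == 0)
            return 1.0;
        return 1.0 - (double)nonZero / ((double)seen * slotCount);
    }
}
```

The leading rows are only a fair sample if the data is not sorted in a way that correlates with sparsity, so a real implementation might prefer a random subsample.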

Contributor Author

OK, after all the investigation, testing, and syncing you and I did, I was only ever able to reproduce this issue when the feature column had been one-hot encoded. Per our offline discussion, this should never happen (and when it does, an error is usually thrown for trying to make an array that is too large), so the test case appears to be wrong and my changes are not necessary. Even when the input data was sparse, the issue did not reproduce.

Due to this, I am going to revert my changes to FastTree and just fix the test instead.

ref int[] forwardIndexerWork, ref VBuffer<double> temp, bool categorical)
{
Contracts.AssertValue(ch);
@@ -1732,7 +1732,7 @@ private sealed class MemImpl : DataConverter
private readonly RoleMappedData _data;

// instanceList[feature] is the vector of values for the given feature
- private readonly ValuesList[] _instanceList;
+ private readonly Dictionary<int, ValuesList> _instanceList;

private readonly List<short> _targetsList;
private readonly List<double> _actualTargets;
@@ -1753,10 +1753,12 @@ private MemImpl(RoleMappedData data, IHost host, double[][] binUpperBounds, floa
: base(data, host, binUpperBounds, maxLabel, kind, categoricalFeatureIndices, categoricalSplit)
{
_data = data;
- // Array of List<double> objects for each feature, containing values for that feature over all rows
- _instanceList = new ValuesList[NumFeatures];
- for (int i = 0; i < _instanceList.Length; i++)
- _instanceList[i] = new ValuesList();
+
+ // Dictionary<int, List<double>> objects for each feature, containing values for that feature over all rows.
+ // We use a dictionary so we only allocate memory for a feature when its needed. This helps greatly with memory
+ // and cpu performance when the features are sparse.
+ _instanceList = new Dictionary<int, ValuesList>();
+
// Labels.
_targetsList = new List<short>();
_actualTargets = new List<double>();
@@ -1864,7 +1866,12 @@ private void MakeBoundariesAndCheckLabels(out long missingInstances, out long to
}

foreach (var kvp in cursor.Features.Items())
- _instanceList[kvp.Key].Add(index, kvp.Value);
+ {
+ if (!_instanceList.TryGetValue(kvp.Key, out ValuesList value))
+ value = new ValuesList();
+ value.Add(index, kvp.Value);
+ _instanceList[kvp.Key] = value;
+ }

_actualTargets.Add(cursor.Label);
if (_weights != null)
@@ -1904,13 +1911,19 @@ private void InitializeBins(int maxBins, IParallelTraining parallelTraining)
int iFeature = 0;
pch.SetHeader(new ProgressHeader("features"), e => e.SetProgress(0, iFeature, NumFeatures));
List<int> trivialFeatures = new List<int>();

+ // Use for when we dont have a value at the index. This saves memory by only allocating it once.
+ var tempValue = new ValuesList();
for (iFeature = 0; iFeature < NumFeatures; iFeature++)
{
Host.CheckAlive();
if (!localConstructBinFeatures[iFeature])
continue;
// The following strange call will actually sparsify.
- _instanceList[iFeature].CopyTo(len, ref temp);
+ if (!_instanceList.TryGetValue(iFeature, out ValuesList value))
+ value = tempValue;
+ value.CopyTo(len, ref temp);
+
// REVIEW: In principle we could also put the min docs per leaf information
// into here, and collapse bins somehow as we determine the bins, so that "trivial"
// bins on the head or tail of the bin distribution are never actually considered.
@@ -2040,7 +2053,9 @@ private IEnumerable<FeatureFlockBase> CreateFlocks(IChannel ch, IProgressChannel
? NumFeatures
: FeatureMap[iFeature + flock.Count];
for (int i = min; i < lim; ++i)
- _instanceList[i] = null;
+ if(_instanceList.TryGetValue(i, out ValuesList value))
+ _instanceList[i] = null;
+
iFeature += flock.Count;
yield return flock;
}
@@ -2608,7 +2623,7 @@ public IEnumerable<KeyValuePair<int, int>> Binned(double[] binUpperBounds, int l
public sealed class ForwardIndexer
{
// All of the _values list. We are only addressing _min through _lim.
- private readonly ValuesList[] _values;
+ private readonly Dictionary<int, ValuesList> _values;
// Parallel to the subsequence of _values in min to lim, indicates the index where
// we should start to look for the next value, if the corresponding value list in
// _values is sparse. If the corresponding value list is dense the entry at this
@@ -2680,12 +2695,17 @@ public sealed class ForwardIndexer
/// <param name="features">The array of feature indices this will index</param>
/// <param name="workArray">A possibly shared working array, once used by this forward
/// indexer it should not be used in any previously created forward indexer</param>
- public ForwardIndexer(ValuesList[] values, int[] features, ref int[] workArray)
+ public ForwardIndexer(Dictionary<int, ValuesList> values, int[] features, ref int[] workArray)
{
Contracts.AssertValue(values);
Contracts.AssertValueOrNull(workArray);
Contracts.AssertValue(features);
- Contracts.Assert(Utils.IsIncreasing(0, features, values.Length));
+ // REVIEW: Currently we have Int32.MaxValue, but it used to be the length of the feature array.
+ // Now that we are using a sparse representation with the dictionary we don't have that length here anymore.
+ // Is this min/max comparison useful here? Or is Int32.MaxValue ok? If not we can pass the feature length to this method
+ // so it has access to it. All tests pass using Int32.MaxValue, so I am not sure what this is really testing, or if the
+ // only thing that was really needed was the increasing check, but not the bounds check.
+ Contracts.Assert(Utils.IsIncreasing(0, features, Int32.MaxValue));
Contracts.Assert(features.All(i => values[i] != null));
_values = values;
_featureIndices = features;
@@ -79,7 +79,7 @@ public class Options

public Options()
{
- NgramLength = 1;
+ NgramLength = 2;
SkipLength = NgramExtractingEstimator.Defaults.SkipLength;
UseAllLengths = NgramExtractingEstimator.Defaults.UseAllLengths;
MaximumNgramsCount = new int[] { NgramExtractingEstimator.Defaults.MaximumNgramsCount };
1 change: 1 addition & 0 deletions test/BaselineOutput/Common/EntryPoints/core_manifest.json
@@ -23696,6 +23696,7 @@
"Default": {
"Name": "NGram",
"Settings": {
+ "NgramLength": 2,
"MaxNumTerms": [
10000000
]
14 changes: 7 additions & 7 deletions test/BaselineOutput/Common/Text/featurized.tsv

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions test/Microsoft.ML.Tests/TrainerEstimators/LbfgsTests.cs
@@ -101,14 +101,14 @@ public void TestLRWithStats()

Assert.NotNull(biasStats);

- CompareNumbersWithTolerance(biasStats.StandardError, 0.25, digitsOfPrecision: 2);
- CompareNumbersWithTolerance(biasStats.ZScore, 7.97, digitsOfPrecision: 2);
+ CompareNumbersWithTolerance(biasStats.StandardError, 0.24, digitsOfPrecision: 2);
+ CompareNumbersWithTolerance(biasStats.ZScore, 8.32, digitsOfPrecision: 2);

var scoredData = transformer.Transform(dataView);

var coefficients = stats.GetWeightsCoefficientStatistics(100);

- Assert.Equal(18, coefficients.Length);
+ Assert.Equal(17, coefficients.Length);

foreach (var coefficient in coefficients)
Assert.True(coefficient.StandardError < 1.0);
6 changes: 3 additions & 3 deletions test/Microsoft.ML.Tests/Transformers/TextFeaturizerTests.cs
@@ -221,14 +221,14 @@ public void TextFeaturizerWithL2NormTest()
var prediction = engine.Predict(data[0]);
Assert.Equal(data[0].A, string.Join(" ", prediction.OutputTokens));
var exp1 = 0.333333343f;
- var exp2 = 0.707106769f;
- var expected = new float[] { exp1, exp1, exp1, exp1, exp1, exp1, exp1, exp1, exp1, exp2, exp2 };
+ var exp2 = 0.577350259f;
+ var expected = new float[] { exp1, exp1, exp1, exp1, exp1, exp1, exp1, exp1, exp1, exp2, exp2, exp2 };
Assert.Equal(expected, prediction.Features);

prediction = engine.Predict(data[1]);
exp1 = 0.4472136f;
Assert.Equal(data[1].A, string.Join(" ", prediction.OutputTokens));
- expected = new float[] { exp1, 0.0f, 0.0f, 0.0f, 0.0f, exp1, exp1, exp1, exp1, 0.0f, 1.0f };
+ expected = new float[] { exp1, 0.0f, 0.0f, 0.0f, 0.0f, exp1, exp1, exp1, exp1, 0.0f, 0.0f, 1.0f };
Assert.Equal(expected, prediction.Features);
}
