Changed default NGram length from 1 to 2. #5248
michaelgsharp merged 3 commits into dotnet:master from michaelgsharp:fast-tree-memory-fix on Jun 26, 2020
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -23696,6 +23696,7 @@ |
| | | "Default": { |
| | | "Name": "NGram", |
| | | "Settings": { |
| | | "NgramLength": 2, |
| | | "MaxNumTerms": [ |
| | | 10000000 |
| | | ] |
Nice fix! Since FastTree is one of the primary trainers in ML.NET, I'd recommend testing the speed and memory on a variety of datasets.
I see that you're reporting speed/memory on the FastTree ranking `Test` (not `TrainTest`) benchmark, `Test_Ranking_MSLRWeb10K_RawNumericFeatures_FastTreeRanking`. For MAML, `Test` only runs the prediction step. Running the training (`TrainTest`) benchmark, `TrainTest_Ranking_MSLRWeb10K_RawNumericFeatures_FastTreeRanking`, might be more telling.
The MSLR dataset is purely numeric, with no text columns, so it shouldn't be affected by the ngram length change.
Adding a FastTree text benchmark
You could add FastTree to the text benchmark in (Benchmarks/Text/MultiClassClassification.cs).
Code for adding a FastTree benchmark on WikiDetox would be:
Heuristic
After testing various datasets, you may find that the dictionary is beneficial for small (or large) sparse datasets. If so, we could use a heuristic such as:
`useDictionary = sparseness > 0.95 && rowsTimesSlots < 1E6;`
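A minimal sketch of how the suggested heuristic could be evaluated, written in Python for brevity rather than as ML.NET code. The two thresholds come from the comment above; the function name, the list-of-rows input, and the way `sparseness` (fraction of zero slots) and `rowsTimesSlots` (rows multiplied by slots per row) are computed are assumptions for illustration only.

```python
def use_dictionary(rows):
    """Hypothetical helper: pick dictionary storage over a dense array.

    `rows` is a list of equal-length feature vectors (an assumed input shape).
    """
    n_rows = len(rows)
    n_slots = len(rows[0]) if rows else 0
    rows_times_slots = n_rows * n_slots
    if rows_times_slots == 0:
        return False  # nothing to store, so the choice does not matter

    # Assumed definition: sparseness is the fraction of slots equal to zero.
    zero_count = sum(1 for row in rows for value in row if value == 0)
    sparseness = zero_count / rows_times_slots

    # Thresholds taken from the heuristic proposed in the comment.
    return sparseness > 0.95 and rows_times_slots < 1e6
```

Both conditions must hold: a dataset that is sparse but has a million or more row-slot cells would still fall back to the dense array under this rule.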
Do you have any example datasets I can use, or can you point me to where I can find some? I am not sure where I would find datasets for something like this.
I was talking with Harish this morning about the sparseness and choosing between list/dictionary. We are going to discuss it again, but I couldn't find a way to know ahead of time whether the data would be sparse or dense without the user letting us know.
The test `FastForestBinaryClassificationTestSummary` uses a small dataset. It has about 1300 unique words, and the total word count is about 9000. The current FastTree trainer multiplies those two values together (about 11.5 million) and allocates an array of that size right off the bat, which is where the huge memory usage comes from. It ends up using a very small fraction of that amount during training, but I couldn't figure out how to tell ahead of time how much it would actually use.
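To make the gap between reserved and used cells concrete, here is a rough Python sketch (not ML.NET code). It assumes a whitespace-tokenized bag-of-words representation; the function names and the toy documents are hypothetical, and the real trainer's layout may differ.

```python
def dense_cell_count(unique_words, total_words):
    """Cells reserved if an array of unique_words x total_words is allocated up front."""
    return unique_words * total_words


def storage_summary(docs):
    """Return (dense cells reserved, entries actually needed) for toy documents."""
    tokens = [doc.split() for doc in docs]
    unique_words = len({word for doc in tokens for word in doc})
    total_words = sum(len(doc) for doc in tokens)
    # A dictionary-backed store only needs one entry per distinct word per document.
    used_entries = sum(len(set(doc)) for doc in tokens)
    return dense_cell_count(unique_words, total_words), used_entries


reserved, used = storage_summary(["the cat sat", "the dog sat down"])
print(reserved, used)
```

With the rounded figures quoted above, `dense_cell_count(1300, 9000)` is 11,700,000, roughly in line with the "about 11.5 million" mentioned in the comment.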
The benchmark I suggested adding uses WikiDetox, which is a 70MB text dataset.
See my recommended code for `CV_Multiclass_WikiDetox_BigramsAndTrichar_OVAFastTree`, above. You should be able to just drop that code into (Benchmarks/Text/MultiClassClassification.cs).
Tests are on micro-datasets to test if something has changed. The benchmarks are meant to be real-world datasets, like MSLR and WikiDetox.
Would it be possible to measure the sparsity at runtime? Either before you use the array/dictionary on a subsample, or after N% of the data has passed.
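One way the subsample measurement asked about here could look, as a rough Python sketch rather than ML.NET code. `estimate_sparsity` and its parameters are hypothetical names, and treating a slot as "empty" when it equals zero is an assumption.

```python
from itertools import islice


def estimate_sparsity(rows, sample_size):
    """Fraction of zero-valued slots among the first `sample_size` rows.

    `rows` can be any iterable of feature vectors; only the sampled prefix is read.
    """
    zeros = 0
    total = 0
    for row in islice(rows, sample_size):
        total += len(row)
        zeros += sum(1 for value in row if value == 0)
    if total == 0:
        raise ValueError("no slots were sampled")
    return zeros / total


sample = [[0, 0, 0, 1], [0, 0, 0, 0], [1, 1, 1, 1]]
print(estimate_sparsity(sample, 2))
```

An estimate taken from the first rows is only representative if the data is not ordered in a way that correlates with sparsity, so a random subsample may be the safer choice when that is in doubt.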
Ok, after all the investigation, testing, and syncing you and I did, I was only ever able to reproduce this issue when the feature column had been one-hot encoded. Per our offline discussion, this should never happen (and when it does, an error is usually thrown for trying to make an array that is too large), so the test case appears to be wrong, and my changes are not necessary. Even when the input data was sparse, this situation did not repro.
Due to this, I am going to revert my changes to FastTree and just fix the test instead.