Add support for combining hashes in vector columns to HashingTransformer#4828
Add support for combining hashes in vector columns to HashingTransformer#4828yaeldekel merged 13 commits intodotnet:masterfrom
Conversation
|
Question: some of the input types for hashing have missing values (float, double and key type). The hashing transformer maps missing values in the input, to 0, the missing value of key type. What should we do in the combine case? |
| columnOptions = new HashingEstimator.ColumnOptions(samplingKeyColumn, origStratCol, 30, (uint)seed.Value, combine: true); | ||
| else if (((ISeededEnvironment)env).Seed.HasValue) | ||
| columnOptions = new HashingEstimator.ColumnOptionsInternal(samplingKeyColumn, origStratCol, 30, (uint)((ISeededEnvironment)env).Seed.Value); | ||
| columnOptions = new HashingEstimator.ColumnOptions(samplingKeyColumn, origStratCol, 30, (uint)((ISeededEnvironment)env).Seed.Value, combine: true); |
There was a problem hiding this comment.
You can merge these conditions in one by defining var localSeed = seed.HasValue ? seed.Value : (ISeededEnvironment)env).Seed.HasValue ? (ISeededEnvironment)env).Seed.Value : null;
if(localSeed.HasValue)
...
else
columnOptions = new HashingEstimator.ColumnOptions(samplingKeyColumn, origStratCol, 30, combine: true); #Resolved
There was a problem hiding this comment.
I don't think we want HashingEstimator listening to the global seed. The hashing is a different meaning of "seed"; this isn't a PRNG seed.
See #4752 (comment)
/cc @najeeb-kazmi
| modelSignature: "HASHTRNS", | ||
| // verWrittenCur: 0x00010001, // Initial | ||
| verWrittenCur: 0x00010002, // Invert hash key values, hash fix | ||
| //verWrittenCur: 0x00010002, // Invert hash key values, hash fix |
There was a problem hiding this comment.
v [](start = 18, length = 1)
v [](start = 18, length = 1)
space #Resolved
bfd1524 to
f62a846
Compare
| input = new TypeConvertingTransformer(Host, origStratCol, DataKind.Int64, origStratCol).Transform(input); | ||
|
|
||
| output = new HashingEstimator(Host, stratificationColumn, origStratCol, 30).Fit(input).Transform(input); | ||
| } |
There was a problem hiding this comment.
I realize the previous code too used 30 bits. But the default seems to be 31 bits. Any reason why we have 30 here? If there is a specific reason for 30 bits, can you please add a note to that in the code?
There was a problem hiding this comment.
I am actually not sure why 30 was chosen. If you think it's important I can try changing it back to 31.
In reply to: 386729669 [](ancestors = 386729669)
There was a problem hiding this comment.
I have no reason to say it should change to 31 either :-). But since it is is used in a couple of places, can you please make it a named constant?
In reply to: 425102125 [](ancestors = 425102125,386729669)
| var columnOptions = | ||
| localSeed.HasValue ? | ||
| new HashingEstimator.ColumnOptions(samplingKeyColumn, origStratCol, 30, (uint)localSeed.Value, combine: true) : | ||
| new HashingEstimator.ColumnOptions(samplingKeyColumn, origStratCol, 30, combine: true); |
There was a problem hiding this comment.
Similar comment as before. If the 30bits hashing is standard, can we define it as some global constant somewhere?
| return ComposeGetterOne(input, iinfo, srcCol, srcType); | ||
| if (_columns[iinfo].Combine) | ||
| return ComposeGetterCombined(input, iinfo, srcCol, vectorType); | ||
| return ComposeGetterVec(input, iinfo, srcCol, vectorType); |
There was a problem hiding this comment.
We need to figure out the onnx export mechanism for this. #Resolved
There was a problem hiding this comment.
…y of HashingEstimator.
| verReadableCur: 0x00010003, | ||
| //verWrittenCur: 0x00010003, // Uses MurmurHash3_x86_32 | ||
| verWrittenCur: 0x00010004, // Add Combine option | ||
| verReadableCur: 0x00010004, |
There was a problem hiding this comment.
We still haven't released a version with MurmurHash3. Do we need to increment this if we make sure to include it in the v1.5 release?
There was a problem hiding this comment.
Technically users may create and save models containing this transformer once the PR that changed the version to 0x00010003 was merged. This most likely didn't happen, but to be on the safe side I think it's better to increment the version again.
In reply to: 424565803 [](ancestors = 424565803)
| private readonly VectorDataViewType[] _kvTypes; | ||
| private readonly bool _isVersion1; | ||
| private readonly bool _useOldHashing; | ||
|
|
There was a problem hiding this comment.
suggestion: Is it possible to come up with a better name? The only one I can think of is "_useOnnxCompatibleHashing" which would invert the meaning of the current name. Please use a a more descriptive name if possible.
There was a problem hiding this comment.
I would like to keep it a variable that would have the value "false" by default, that's why I didn't want to change it to _useOnnxCompatibleHashing. Is this name that I changed it to acceptable?
In reply to: 424567216 [](ancestors = 424567216)
| if (srcType is KeyDataViewType) | ||
| return false; | ||
| if (_parent._columns[iinfo].Combine) | ||
| return false; |
There was a problem hiding this comment.
@lynx1820 Any thoughts why we are not supporting key types? #Resolved
There was a problem hiding this comment.
Thanks for the remainder! I'll work on adding support for keytypes. Zero value keys don't get mapped to a hash but map to zero, a behavior not currently present. #Resolved
There was a problem hiding this comment.
As @lynx1820 mentioned, in key types 0 is mapped to 0 instead of to its hash value, so it would be more complicated to export to ONNX.
In reply to: 424584542 [](ancestors = 424584542)
Codecov Report
@@ Coverage Diff @@
## master #4828 +/- ##
==========================================
- Coverage 75.71% 73.54% -2.17%
==========================================
Files 993 992 -1
Lines 179035 179484 +449
Branches 19356 19296 -60
==========================================
- Hits 135549 131997 -3552
- Misses 38262 42305 +4043
+ Partials 5224 5182 -42
|
Codecov Report
@@ Coverage Diff @@
## master #4828 +/- ##
==========================================
- Coverage 75.71% 75.60% -0.11%
==========================================
Files 993 995 +2
Lines 179035 179664 +629
Branches 19356 19314 -42
==========================================
+ Hits 135549 135834 +285
- Misses 38262 38565 +303
- Partials 5224 5265 +41
|
|
That sounds good to me. In reply to: 586994995 [](ancestors = 586994995) |
I am making this change so that
CountTargetEncodingEstimator(see PR #4514 ) can start usingHashingEstimatorinstead ofHashJoiningTransform(see this comment in the above PR).In addition to enabling
CountTargetEncodingEstimatorto use it, this will fix a bug when splitting datasets in the CV command and in the train-test/CV APIs. If a stratification column that is a vector type is specified, currently an exception will be thrown because theRangeFilterapplied after hashing cannot handle vector columns.