IGNITE-13713 [ML]: Add target encoding preprocessor #8466
52fe6ae to 97b633f
This PR makes me happy, but it raises a few questions:
- The use case with GBT looks strange to me.
- A dataset of 32,000 rows also looks too large for the available resources. Could we use another dataset instead of the proposed one? Titanic, or something else with 100-1000 rows.
Let's discuss it here, in this PR.
strEncoderPreprocessor
);

Preprocessor<Integer, Object[]> lbEncoderPreprocessor = new EncoderTrainer<Integer, Object[]>()
Sorry, but I didn't understand this pipeline. Why are these 3 encoders combined here? Can they work only in this combination?
In my opinion, the user has a choice of what to do with Strings, but should choose one method (not a chain of methods).
Please share your vision here.
I think we want to use EncoderType.TARGET_ENCODER for only a few columns (maybe only one). In this example I use EncoderType.STRING_ENCODER as a general-purpose encoder and EncoderType.TARGET_ENCODER for a special one.
});

double[][] postProcessedData = new double[][] {
    {1.0, 0.1, 1.0},
Could you please explain the numbers in the last column: why are they 1.0 and 2.0, and not 0.33 and 0.66?
Described each case
 * encodedValue = globalTargetMean * (1 - alpha) + categoryTargetMean * alpha
 * if categorySize == 1 then use globalTargetMean
 *
 * min_samples_leaf - minimum samples to take category average into account.
It looks like min_samples_leaf is not used in this class.
Yes, but it is used to evaluate TargetEncodingMeta, so I wanted to mention it in the encoder class.
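For readers following this thread, the blending rule quoted from the Javadoc can be sketched as a small standalone helper. This is only an illustration under stated assumptions: the class and method names are hypothetical and not part of Ignite, and alpha is taken as an input because the excerpt does not show how it is derived from min_samples_leaf.

```java
/** Hypothetical illustration of the quoted formula; not the PR's implementation. */
final class TargetEncodingSketch {
    /**
     * Blends the global target mean with the category target mean.
     * alpha is assumed to lie in [0, 1]; how it is computed is not shown in the excerpt.
     */
    static double encode(double globalTargetMean, double categoryTargetMean,
                         long categorySize, double alpha) {
        if (alpha < 0.0 || alpha > 1.0)
            throw new IllegalArgumentException("alpha must be in [0, 1]: " + alpha);

        // Rule from the Javadoc: a category seen only once falls back to the global mean.
        if (categorySize == 1)
            return globalTargetMean;

        return globalTargetMean * (1 - alpha) + categoryTargetMean * alpha;
    }

    public static void main(String[] args) {
        // Global mean 0.5, category mean 1.0, four samples, alpha 0.75.
        System.out.println(encode(0.5, 1.0, 4, 0.75));
        // Single-sample category: the category mean is ignored.
        System.out.println(encode(0.5, 1.0, 1, 0.75));
    }
}
```

With alpha close to 1 the encoded value follows the category mean, and with alpha close to 0 it collapses to the global mean, which is why a minimum sample threshold matters for small categories.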
int finalI = i;

targetEncodingMetas[i] = new TargetEncodingMeta(
    targetCounters[i].getTargetSum() / targetCounters[i].getTargetCount(),
I suggest extracting the constructor arguments into separate variables for readability.
Extracted to a new method.
.collect(Collectors.toMap(
    Map.Entry::getKey,
    value -> {
        double prior = targetCounters[finalI].getTargetSum() /
Also, this lambda should be encapsulated and commented separately.
The prior evaluation is extracted, but the lambda still exists.
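As a rough illustration of what the reviewer asks for, the prior (target sum divided by target count) and the per-category means can be computed in small named methods instead of an inline lambda. Everything below is a hypothetical sketch: the class, the `Counter` type, and the method names are assumptions for illustration and do not mirror the PR's actual `targetCounters` code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of prior and per-category mean computation; not the PR's code. */
final class TargetMeansSketch {
    /** Running sum and count of target values. */
    static final class Counter {
        private double sum;
        private long cnt;

        void add(double target) {
            sum += target;
            cnt++;
        }

        double mean() {
            return sum / cnt;
        }
    }

    /** Global target mean: target sum divided by target count. */
    static double prior(double[] targets) {
        Counter global = new Counter();
        for (double t : targets)
            global.add(t);
        return global.mean();
    }

    /** Mean target value per category, in first-seen order. */
    static Map<String, Double> categoryMeans(String[] categories, double[] targets) {
        if (categories.length != targets.length)
            throw new IllegalArgumentException("categories and targets must have equal length");

        Map<String, Counter> counters = new LinkedHashMap<>();
        for (int i = 0; i < categories.length; i++)
            counters.computeIfAbsent(categories[i], k -> new Counter()).add(targets[i]);

        Map<String, Double> means = new LinkedHashMap<>();
        counters.forEach((category, counter) -> means.put(category, counter.mean()));
        return means;
    }

    public static void main(String[] args) {
        String[] categories = {"a", "a", "b"};
        double[] targets = {1.0, 0.0, 1.0};
        System.out.println(prior(targets));
        System.out.println(categoryMeans(categories, targets));
    }
}
```

Keeping the two computations in separate methods gives each a place for its own comment, which is the readability point raised in the thread.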
);
}
}

Remove the blank line.
Done
}
else if (featureVal instanceof String)
    strVal = (String)featureVal;
else if (featureVal instanceof Double)
Maybe add type conversion to Double from other Number types (and Boolean).
Added Number & Boolean.
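One way the suggested conversion could look is sketched below. It is a hypothetical helper, not the code that was merged: the class and method names are made up, and mapping `true` to 1.0 and `false` to 0.0 is an assumption, since the thread does not say how Boolean values are mapped.

```java
/** Hypothetical helper showing Number and Boolean conversion; not the PR's code. */
final class FeatureValueSketch {
    /** Converts a raw feature value to a double, or throws for unsupported types. */
    static double toDouble(Object featureVal) {
        // Integer, Long, Float, Short, Byte and Double all extend Number.
        if (featureVal instanceof Number)
            return ((Number) featureVal).doubleValue();

        // Assumption: true maps to 1.0 and false to 0.0.
        if (featureVal instanceof Boolean)
            return ((Boolean) featureVal) ? 1.0 : 0.0;

        String type = featureVal == null ? "null" : featureVal.getClass().getName();
        throw new IllegalArgumentException("Unsupported feature type: " + type);
    }

    public static void main(String[] args) {
        System.out.println(toDouble(3));
        System.out.println(toDouble(7L));
        System.out.println(toDouble(Boolean.TRUE));
    }
}
```

Checking `Number` once covers every boxed numeric type, so the `instanceof Double` branch does not need a sibling branch per type.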
97b633f to 4103d2c
4103d2c to f53d2da
* IGNITE-13672 [ML]: Add initial JSON export/import support for all models (#8521)
  * [IGNITE-13672] Initial solution
  * [IGNITE-13672] Added an example
  * [IGNITE-13672] Added a draft solution
  * [IGNITE-13672] Updated JSON model
  * [IGNITE-13672] Updated JSON model
  * [IGNITE-13672] Removed GMM support
  * [IGNITE-13672] Fixed blank lines
  * [IGNITE-13672] Fixed licenses
  * [IGNITE-13672] Fixed whitespaces
  * [IGNITE-13672] Fixed whitespaces
  * [IGNITE-13672] Fixed whitespaces
  * [IGNITE-13672] Fixed examples
  * [IGNITE-13672] Fixed examples
  * [IGNITE-13672] Fixed test
* IGNITE-13388 Fix apache-ignite deb package dependency on JVM package - Fixes #8191.
  Signed-off-by: Ilya Kasnacheev
* IGNITE-13770 Fix NPE in Ignite.dataRegionMetrics with empty persistent region - Fixes #8506.
  Signed-off-by: Ilya Kasnacheev
* IGNITE-13640 Added runtime dependencies to opencensus module. Fixes #8406
  Signed-off-by: Slava Koptilin
* IGNITE-13520 Skip generating encryption keys on the client node. (#8317)
* IGNITE-13496 Java thin: make async API non-blocking with GridNioServer
  Refactor Java Thin Client to use GridNioServer in client mode:
  * Client threads are never blocked
  * Single worker thread is shared across all connections within `IgniteClient`
  Benchmark results (i7-9700K, Ubuntu 20.04.1, JDK 1.8.0_275):
  Before
  Benchmark Mode Cnt Score Error Units
  JmhThinClientCacheBenchmark.get thrpt 10 65916.805 ± 2118.954 ops/s
  JmhThinClientCacheBenchmark.put thrpt 10 62304.444 ± 2521.371 ops/s
  After
  Benchmark Mode Cnt Score Error Units
  JmhThinClientCacheBenchmark.get thrpt 10 92501.557 ± 1380.384 ops/s
  JmhThinClientCacheBenchmark.put thrpt 10 82907.446 ± 7572.537 ops/s
* IGNITE-13793: Implement SQLRowCount for SELECT
  This closes #8525
* [IGNITE-13803] Fixed Scalar test failed due to incorrect Jackson dependency (#8529)
  * [IGNITE-13803] Changed dependency
  * [IGNITE-13803] Exclude dependency
* IGNITE-13190 Native Persistence Defragmentation core functionality - Fixes #7984.
  Signed-off-by: Sergey Chugunov
* IGNITE-13742 INACTIVE mode is forced on nodes in Maintenance Mode - Fixes #8524.
  Signed-off-by: Sergey Chugunov
* IGNITE-13807 [MINOR] Fix error message in tests. (#8530)
* IGNITE-13795 Added escaping of node consistent id in diagnostic pagelock dump file name. - Fixes #8526.
  Signed-off-by: Sergey Chugunov
* IGNITE-13776 BPlus tree lock retries limit reached with sqlOnHeapCacheEnabled (#8514)
* IGNITE-13802 Added missing "setCandidatePageCount" in "GridCacheOffheapManager.addPartitions" - Fixes #8527.
  Signed-off-by: Sergey Chugunov
* IGNITE-10655 .NET: Add IgniteConfiguration.JavaPeerClassLoadingEnabled
* IGNITE-13633 Fixed ServiceDescriptor#serviceClass failure in case of service deployed through UriDeploymentSpi (#8431)
* IGNITE-13808 Failure handling disabled for index validation. - Fixes #8535.
  Signed-off-by: Sergey Chugunov
* IGNITE-13697 Schedule and cancel control utility commands for defragmentation feature - Fixes #8449.
  Signed-off-by: Sergey Chugunov
* IGNITE-13811 Fixed bug with removing wrong key from pingMap in ServerImpl. - Fixes #8539.
  Signed-off-by: Sergey Chugunov
* IGNITE-13813 Fixed assertion in page snapshot apply method. - Fixes #8541.
  Signed-off-by: Sergey Chugunov
* IGNITE-13812 Fixed possible ClassCastException on checkpoint start with disabled WAL. - Fixes #8540.
  Signed-off-by: Sergey Chugunov
* IGNITE-8884 .NET: Fix async key-val operations - use WriteObjectDetached
  Fix async cache operations when key and value objects reference each other or have references to the same object. Async key-val operations used `WriteObject` instead `WriteObjectDetached`, so references to the same inner object were shared in the binary stream (referenced object is written once). However, cache stores key and val binary objects separately, so the reference to the inner object gets broken. `WriteObjectDetached` disables reference sharing and writes both object independently.
* IGNITE-13320 Cache encryption key rotation CLI management - Fixes #8242.
  Signed-off-by: Aleksey Plekhanov
* IGNITE-13825: Fix precision and scale for columns in SQL result set
  This closes #8551
* IGNITE-10075 .NET Avoid binary configurations of Ignite Java service params (#8509)
* IGNITE-13827 Java thin client: Fixed hang on ComputeTask returning unregistered type - Fixes #8552.
  Signed-off-by: Aleksey Plekhanov
* IGNITE-13709 Control.sh API - status command for defragmentation feature - Fixes #8548.
  Signed-off-by: Sergey Chugunov
* IGNITE-13775 checkpointRWLock wrapper refactoring - Fixes #8516.
  Signed-off-by: Ilya Kasnacheev
* IGNITE-13814 restorePartitionStates moved to sys pool instead of striped pool. - Fixes #8542.
  Signed-off-by: Sergey Chugunov
* IGNITE-13801: Fix Ab Initio related ODBC issues
  This closes #8528
* IGNITE-13713 Add target encoding preprocessor (#8466)
* IGNITE-13714 Add catboost inference integration (#8489)
* IGNITE-13353 Got rid of unnecessary rebalance on starting new cache.
  Signed-off-by: Slava Koptilin
* IGNITE-13823 WAL iterator WRITE permission requirement removed. - Fixes #8549.
  Signed-off-by: Sergey Chugunov
* IGNITE-13450 [MINOR] Added missed javadoc for EVT_CACHE_QUERY_EXECUTED event.
* IGNITE-13786 Add defragmentation-specific B+Tree optimizations - Fixes #8560.
  Signed-off-by: Alexey Goncharuk
* IGNITE-13826 .NET: Add RendezvousAffinityFunction.BackupFilter
  Add RendezvousAffinityFunction.BackupFilter with a single predefined implementation that delegates to Java: ClusterNodeAttributeAffinityBackupFilter.
* IGNITE-13833 More versions added to PersistenceBasicCompatibilityTest - Fixes #8562.
  Signed-off-by: Ilya Kasnacheev
* IGNITE-13832 Proper handling of interrupted exceptions in disco-notifier-worker. - Fixes #8561.
  Signed-off-by: Sergey Chugunov
* IGNITE-13101 Metastore should complete all write futures during stop and prohibit creating new ones - Fixes #8554.
  Signed-off-by: Sergey Chugunov
* IGNITE-13815 Remove ability to delete segments from the middle of WAL archive - Fixes #8545.
  Signed-off-by: Ilya Kasnacheev
* IGNITE-12892 WAL archive size configuration made more clear - Fixes #8550.
  Signed-off-by: Sergey Chugunov
* IGNITE-13838 IgniteSqlSplitterSelfTest fixes various tests - Fixes #8565.
  Signed-off-by: Ilya Kasnacheev
* ignite docs: fixing a broken documentation link
* ignite docs: updated the index page with quick links to the APIs and examples
* ignite docs: fixed broken links and updated the C++ API header
* IGNITE-12666 Provide cluster performance profiling tool (#7693)
* ignite docs: fixed case of GitHub
* IGNITE-13743 JMX API for Defragmentation monitoring and management - Fixes #8496.
  Signed-off-by: Sergey Chugunov
* IGNITE-13848 Fixed incorrect updating of SegmentReservationStorage#minReserveIdx when truncating WAL segments. Fixes #8573
  Signed-off-by: Slava Koptilin
* IGNITE-13847 GridEncryptionManager#onWalSegmentRemoved should be invoked async - Fixes #8576.
  Signed-off-by: Ilya Kasnacheev
* IGNITE-13876 Updated documentation for 2.9.1 release (#8592)
* IGNITE-13865 Support DateTime as a key or value in .NET and Java (#8580)
* IGNITE-13880 Fix PageMemoryTracker related flaky tests - Fixes #8597.
  Signed-off-by: Aleksey Plekhanov
* IGNITE-13766 API for network connectivity check - Fixes #8500.
  Signed-off-by: Sergey Chugunov
* IGNITE-13864 Fixed an issue where acknowledge on a stale latch could lead to assertion error. Fixes #8579
  Signed-off-by: Slava Koptilin
* IGNITE-13869 Added additional logging for a query mapping. Fixes #8585
  Signed-off-by: Slava Koptilin
* IGNITE-13867 Fixed an issue related to erroneous sending TTL update requests. Fixes #8583
  Signed-off-by: Slava Koptilin
* IGNITE-13870 Removed obsolete GridCacheAdapter#validateCacheKey. Fixes #8586
  Signed-off-by: Slava Koptilin
* IGNITE-13720 Parallelism for defragmentation added. - Fixes #8574.
  Signed-off-by: Sergey Chugunov
* IGNITE-13831 Move WAL archive cleanup from checkpoint to rollover - Fixes #8563.
  Signed-off-by: Ilya Kasnacheev
* IGNITE-13866 validate_indexes command is interrupted if connection to initiator is broken. Fixes #8593
  Signed-off-by: Slava Koptilin
* IGNITE-13868 Added additional tests related to simultaneously created caches. Fixes #8584
  Signed-off-by: Slava Koptilin
* IGNITE-13896 Fix javadoc build failure - Fixes #8601.
  Signed-off-by: Aleksey Plekhanov
* IGNITE-12824 .NET: Add BinaryConfiguration.TimestampConverter (#8568)
  Co-authored-by: Pavel Tupitsyn
* IGNITE-13900: Fix C++ Affinity tests (#8605)
* IGNITE-13708 Add thin client support for Spring Transactions - Fixes #8556.
  Signed-off-by: Aleksey Plekhanov
* IGNITE-13910 Missing segment is not released - Fixes #8612.
  Signed-off-by: Sergey Chugunov
* IGNITE-13908: ODBC nullability info for columns
  This closes #8610
* IGNITE-13507 Fix NullPointerException on tx recovery - Fixes #8547.
  Signed-off-by: Ilya Kasnacheev
* IGNITE-13734 .NET: Register service return type on method invocation (#8602)
* IGNITE-13856 Linear performance for DirectByteBufferStreamImplV2.writeString - Fixes #8577.
  Signed-off-by: Ilya Kasnacheev
* IGNITE-13555 Java thin: add IPv6 address support
  - Change HostAndPortRange.parse method to support addresses like [IPv6_host]:port1..port2, because previous implementation didn't recognized IPv6.
  - Add tests for HostAndPortRange.parse method for both IPv4 and IPv6 hosts.
* IGNITE-13680 Improve OS suggestions for Linux - Fixes #8503.
  Signed-off-by: Ilya Kasnacheev
* IGNITE-11406 Fix NullPointerException on client start - Fixes #8604.
  Signed-off-by: Ilya Kasnacheev

Co-authored-by: Alexey Zinoviev
Co-authored-by: Peter Ivanov
Co-authored-by: Ilya Kasnacheev
Co-authored-by: Alexander Lapin
Co-authored-by: Pavel Pereslegin
Co-authored-by: Pavel Tupitsyn
Co-authored-by: Igor Sapego
Co-authored-by: ibessonov
Co-authored-by: korlov42
Co-authored-by: Aleksandr Shapkin
Co-authored-by: Aleksey Plekhanov
Co-authored-by: Nikolay
Co-authored-by: zstan
Co-authored-by: Mark Andreev
Co-authored-by: sergeyuttsel
Co-authored-by: Slava Koptilin
Co-authored-by: Kirill Tkalenko
Co-authored-by: Semyon Danilov
Co-authored-by: Nikita Safonov
Co-authored-by: Denis Magda
Co-authored-by: Nikita Amelchev
Co-authored-by: ymolochkov
Co-authored-by: vd_pyatkov
Co-authored-by: Anton Kalashnikov
Co-authored-by: Mikhail Petrov
Co-authored-by: pvinokurov
Co-authored-by: Ilya Kazakov
Co-authored-by: Varvara Kozhukhova
Co-authored-by: shubin
Issue: https://issues.apache.org/jira/browse/IGNITE-13713