Add LabelEncoder CUDA execution provider for numeric types #28045
Implements LabelEncoder for the CUDA execution provider supporting numeric types (int64, float, double). Uses sorted arrays and binary search on GPU for efficient O(log n) per-element lookup. Supports: - Opset 2-3: int64↔float, int64↔int64, float↔float - Opset 4+: above plus double↔double, double↔int64, int64↔double String types remain CPU-only as they cannot run on GPU. Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/d17c0a15-3bf2-4ac4-bc57-255876153271 Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Add tests for:
- Float NaN keys to int64 values (opset 4)
- Float NaN keys to float values (opset 4)
- Double NaN keys to int64 values (opset 4)
- Int64 to double conversion (opset 4)
- Double to double conversion (opset 4)

These tests exercise the CUDA binary search with NaN handling and double type support.

Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/d17c0a15-3bf2-4ac4-bc57-255876153271
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
… test intent

Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/d17c0a15-3bf2-4ac4-bc57-255876153271
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
tianleiwu
left a comment
Review Summary
This PR adds the first-ever ML domain (ai.onnx.ml) kernel registration for the CUDA EP, implementing LabelEncoder for opset 2-3 and 4+ with numeric types only (int64, float, double). The implementation uses a GPU-appropriate sorted-array + binary search approach instead of hash maps, with proper NaN handling. The code is well-structured, closely mirrors CPU patterns, and all CI checks pass.
Overall: Implementation is correct and well-structured. The concerns below are minor (style/convention adherence and test coverage) and do not affect correctness or safety.
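The sorted-array + binary-search strategy the review describes can be sketched on the host as follows. This is an illustrative CPU-side analogue, not the PR's actual code; the function and parameter names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Host-side sketch of the lookup strategy: keys are sorted once up front,
// so each element lookup is an O(log n) binary search and needs no hash
// map (which GPUs handle poorly). Identifiers are illustrative.
int64_t Lookup(const std::vector<float>& sorted_keys,
               const std::vector<int64_t>& values,
               float key, int64_t default_value) {
  auto it = std::lower_bound(sorted_keys.begin(), sorted_keys.end(), key);
  if (it != sorted_keys.end() && *it == key) {
    return values[static_cast<size_t>(it - sorted_keys.begin())];
  }
  return default_value;  // key not in the map
}
```

On the GPU the same per-element logic runs once per thread, with the sorted key and value arrays resident in device memory.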
Findings Summary
| # | Severity | Component | Issue |
|---|---|---|---|
| 1 | Nitpick | label_encoder.cc | Unused `#include <filesystem>` |
| 2 | Suggestion | label_encoder.cc `CopyToGpu` | Size arithmetic should use SafeInt per project conventions |
| 3 | Suggestion | label_encoder_test.cc | No empty-input-tensor edge case test |
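Finding 2 concerns expressions like `num_keys * sizeof(T)` passed to `cudaMemcpy`: a plain multiply silently wraps on overflow, while ONNX Runtime's SafeInt-based convention throws instead. The hand-rolled check below illustrates the same idea; it is not the SafeInt API itself.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>

// Overflow-checked byte-size computation. Mirrors what SafeInt<size_t>
// provides in ONNX Runtime: detect the wrap before it happens and fail
// loudly rather than allocating or copying a truncated size.
size_t CheckedByteSize(size_t num_elements, size_t element_size) {
  if (element_size != 0 &&
      num_elements > std::numeric_limits<size_t>::max() / element_size) {
    throw std::overflow_error("byte size overflows size_t");
  }
  return num_elements * element_size;
}
```

With the project's helper, the call site would read roughly `SafeInt<size_t>(num_keys_) * sizeof(T)`.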
…y-tensor tests
- Remove unused `#include <filesystem>`
- Use `SafeInt<size_t>` for cudaMemcpy size arithmetic in `CopyToGpu`
- Add empty input tensor tests for opset 2 and opset 4 LabelEncoder
Pull request overview
Adds a CUDA Execution Provider implementation of ai.onnx.ml.LabelEncoder for numeric key/value types, aiming to avoid CPU round-trips for large label maps.
Changes:
- Introduces CUDA kernel + host-side op implementations for LabelEncoder (opset 2–3 and 4+ numeric variants).
- Registers the new ML-domain CUDA kernels and adds the ML domain constant for shared-library/provider API usage.
- Expands existing LabelEncoder tests (NaN handling, double variants, empty-input) and updates CUDA operator kernel documentation.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/core/providers/cuda/ml/label_encoder_impl.h | Declares CUDA kernel launcher for sorted-key binary-search LabelEncoder. |
| onnxruntime/core/providers/cuda/ml/label_encoder_impl.cu | Implements per-element binary search kernel with NaN short-circuit for float/double keys. |
| onnxruntime/core/providers/cuda/ml/label_encoder.h | Declares CUDA LabelEncoder kernel classes for opset 2–3 and opset 4+. |
| onnxruntime/core/providers/cuda/ml/label_encoder.cc | Implements attribute loading (list/tensor), key sorting, GPU copy, and kernel registrations for supported type pairs. |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Adds kernel-class forward decls + registration routine for ML-domain LabelEncoder variants and wires registration into registry init. |
| onnxruntime/core/providers/shared_library/provider_api.h | Adds missing kMLDomain constant for shared-library/provider builds. |
| onnxruntime/test/providers/cpu/ml/label_encoder_test.cc | Adds numeric NaN-key tests, double type-combination tests, and empty-input tests (intended to exercise CUDA EP when available). |
| docs/OperatorKernels.md | Documents CUDA provider support for ai.onnx.ml::LabelEncoder. |
…tics SortKeysValues() now uses std::stable_sort and deduplicates sorted keys, keeping only the first occurrence of each key (including NaN). This matches the CPU LabelEncoder's map_.emplace() first-occurrence-wins behavior. num_keys_ is now set after SortKeysValues since dedup may shrink the arrays.
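The stable-sort-then-dedup semantics described in that commit can be sketched as below. This is an illustrative standalone version, not the PR's `SortKeysValues()` itself; the NaN-last comparator matches the description of NaN keys being placed at the end of the sorted array.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// Sketch of first-occurrence-wins dedup: stable-sort key/value pairs by
// key (NaN ordered last), then keep only the first pair for each duplicate
// key. This mirrors the CPU LabelEncoder's map_.emplace() behavior, where
// a later duplicate key never overwrites an earlier value.
void SortAndDedup(std::vector<std::pair<float, int64_t>>& pairs) {
  std::stable_sort(pairs.begin(), pairs.end(),
                   [](const auto& a, const auto& b) {
                     if (std::isnan(a.first)) return false;  // NaN sorts last
                     if (std::isnan(b.first)) return true;
                     return a.first < b.first;
                   });
  auto last = std::unique(pairs.begin(), pairs.end(),
                          [](const auto& a, const auto& b) {
                            // Two NaN keys count as equal, so only the
                            // first NaN occurrence survives.
                            if (std::isnan(a.first) && std::isnan(b.first)) return true;
                            return a.first == b.first;
                          });
  pairs.erase(last, pairs.end());
}
```

Because `std::stable_sort` preserves the original order among equal keys and `std::unique` keeps the first element of each run, the surviving value for any duplicated key is always the one that appeared first in the attribute list.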
/azp run Win_TRT_Minimal_CUDA_Test_CI, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 2 pipeline(s).

/azp run Win_TRT_Minimal_CUDA_Test_CI, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 2 pipeline(s).
Add the CUDA LabelEncoder sources to the minimal provider build so Windows TRT minimal CI links the new kernel registrations. Also factor the non-plugin TensorProto construction into a helper so the shared-provider path only needs one conditional block.
Move the default_tensor read path in LabelEncoder behind a helper so the plugin and shared-provider TensorProto handling stays in one place. This keeps GetDefaultValue focused on fallback selection instead of attribute transport details.
Description

Implements `ai.onnx.ml.LabelEncoder` on the CUDA execution provider for numeric key/value types using sorted arrays + binary search (O(log n) per element).

New files (`onnxruntime/core/providers/cuda/ml/`):
- `label_encoder_impl.cu`/`.h`: CUDA kernel: per-thread binary search on sorted keys, NaN-aware for float/double
- `label_encoder.cc`/`.h`: Host-side op classes (`CudaLabelEncoder` for opset 2-3, `CudaLabelEncoder_4` for opset 4+). Constructor sorts keys, copies to GPU; `ComputeInternal` launches kernel.

Modified files:
- `cuda_execution_provider.cc`: Register 11 kernel variants (4 versioned opset 2-3, 7 opset 4+)
- `provider_api.h`: Add missing `kMLDomain` constant (first ML-domain op on CUDA EP)
- `docs/OperatorKernels.md`: Add `ai.onnx.ml` section to CUDA provider table

Supported type combinations:
- `int64↔float`, `int64↔int64`, `float↔float`
- `double↔double`, `double↔int64`, `int64↔double`

String types remain CPU-only. NaN keys are placed at the end of the sorted array and short-circuited before binary search.
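The per-element lookup with NaN short-circuiting can be mirrored on the host as follows. The real code is a CUDA `__global__` kernel executing this logic once per thread; the sketch below uses hypothetical names and is not the PR's actual kernel.

```cpp
#include <cmath>
#include <cstdint>

// Host-side mirror of the per-thread kernel body: a NaN input is resolved
// before the search (NaN != NaN, so it can never be found by comparison),
// then a lower-bound binary search over the finite keys resolves the value
// or falls back to the default. The NaN key, if present, is the last
// element of the sorted key array.
int64_t EncodeOne(const float* sorted_keys, const int64_t* values, int num_keys,
                  bool has_nan_key, int64_t nan_value,
                  float x, int64_t default_value) {
  if (std::isnan(x)) {
    return has_nan_key ? nan_value : default_value;  // short-circuit
  }
  int num_finite = has_nan_key ? num_keys - 1 : num_keys;
  int lo = 0, hi = num_finite;
  while (lo < hi) {  // lower_bound over the finite prefix
    int mid = lo + (hi - lo) / 2;
    if (sorted_keys[mid] < x) lo = mid + 1; else hi = mid;
  }
  if (lo < num_finite && sorted_keys[lo] == x) return values[lo];
  return default_value;
}
```

Keeping the NaN check before the search is what makes the binary search well-defined: every comparison in the loop then involves only totally ordered finite keys.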
Tests: 5 new test cases covering NaN-key-to-numeric-value mappings and double type combinations. Existing numeric tests (`FloatToInt64Opset2`, `Int64ToFloatOpset2`, etc.) will automatically run on CUDA via `OpTester::Run()`.

Motivation and Context
Models with large LabelEncoder nodes (>100k entries) force a CPU round-trip when all other nodes run on GPU. This adds the CUDA implementation to eliminate that data transfer bottleneck.