TokenizingByCharacters export to Onnx #4805

Merged
kere-nel merged 6 commits into dotnet:master from kere-nel:onnx_tokenizing
Feb 7, 2020
kere-nel commented Feb 6, 2020

  • Adds ONNX export for the transformer that tokenizes by character and returns the characters (as uint16).
  • Since there is no comparable ONNX operator, a label encoder is used to map each string token to its corresponding character value. Unfortunately, this makes the model much larger, since 65535 values have to be saved as the label encoder's mapping table.
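
The label-encoder approach described above can be pictured with a small sketch. This is an illustration only, not the ML.NET exporter code: the function names `build_char_table` and `tokenize_by_characters` are hypothetical, and the exact range of codes stored in the exported model is an assumption. It shows why the mapping table grows with the number of possible character values rather than with the input text.

```python
# Hypothetical illustration of a string-to-code lookup table.
# Not the ML.NET implementation; names and the code range are assumptions.

def build_char_table(max_code=0xFFFF):
    """Map each single-character string to its numeric code, 0..max_code."""
    return {chr(code): code for code in range(max_code + 1)}


def tokenize_by_characters(text, table):
    """Split text into characters and look each one up in the table.

    Characters missing from the table raise KeyError in this sketch.
    """
    return [table[ch] for ch in text]


table = build_char_table()
codes = tokenize_by_characters("onnx", table)
print(codes)        # [111, 110, 110, 120]
print(len(table))   # one entry per code in 0..max_code
```

Every entry exists whether or not the character ever appears in the input, which is the size cost the description mentions.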

kere-nel requested a review from a team as a code owner on February 6, 2020 23:23
kere-nel requested review from ganik and harishsk on February 6, 2020 23:24
harishsk commented Feb 6, 2020

/// | Exportable to ONNX | No |

Please update this line whenever you add new ONNX export support.


Refers to: src/Microsoft.ML.Transforms/Text/TokenizingByCharacters.cs:610 in 0532330.

kere-nel merged commit daaea53 into dotnet:master on Feb 7, 2020
ghost locked as resolved and limited conversation to collaborators on Mar 19, 2022