Add out of the box emoji handling #301
Comments
@tooaverage yes that would be an awesome contribution 🎆 😃 |
Weo @tmbo What is the best/leanest way to integrate this? |
one pretty lean solution would be to replace the emoji with words, for example taking the short name or keywords http://unicode.org/emoji/charts/full-emoji-list.html |
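A minimal sketch of the "replace the emoji with words" idea, using only the Python standard library. It uses the Unicode character name (via `unicodedata.name`) as a stand-in for the short name from the chart linked above; the two are not always identical. The function name `emoji_to_words` and the choice to treat every character in Unicode category `So` as an emoji are assumptions made for illustration, not part of any existing pipeline.

```python
import unicodedata


def emoji_to_words(text: str) -> str:
    """Replace symbol characters (Unicode category 'So') with their names.

    Assumption: category 'So' is used as a rough emoji test. It also
    catches non-emoji symbols such as the copyright sign, and it does not
    handle modifiers or multi-codepoint sequences specially.
    """
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            # e.g. "BEER MUG" -> " beer_mug " so it becomes a single token
            pieces.append(f" {name.lower().replace(' ', '_')} " if name else " ")
        else:
            pieces.append(ch)
    # Collapse the extra whitespace introduced around replaced characters
    return " ".join("".join(pieces).split())


print(emoji_to_words("I want a 🍺"))
```

Running this before tokenization would let a word-based featurizer see `beer_mug` as an ordinary token, at the cost of losing any emoji the name lookup does not cover.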
Hello @tooaverage, any update on this issue? I trained the model using both unicode (\U0001f37a) and the actual emoji (🍺) within the training data, and set synonyms to map to my named entity. No success (tried just in case). |
This project seems to have built a dictionary of vectors for emoji - is this useful? https://raw.githubusercontent.com/uclmr/emoji2vec/master/pre-trained/ Is there a way I can assign these vectors to emoji tokens if they're found? |
spacymoji would be helpful here: https://pypi.python.org/pypi/spacymoji/1.0.0 |
Any progress on this? 😊 |
We have not been working on this, but it would be a great contribution 😉 |
Might be a little off-topic, but how are emojis currently handled by the NLU? I would like to include emojis in our current chatbot's training data. Are they currently simply ignored while predicting the intent? |
OK, I think that depends on the intent classification component used:
|
@tmbo In the |
I tried with tensorflow_embedding but no luck so far, but I was trying with actual emojis. Have to try with codes/synonyms |
I think the tensorflow embedding policy also ignores them, because it only looks at words with a certain amount of characters. You can change that in the |
Yes, it depends whether a python string that stores emoji is falling under |
Hey @Ghostvv for now I am using token_pattern as "(?u)\b\w+\b" for words. However, I also want to add a capability to recognize emoji, what should be the token_pattern then? |
@kirtisynap19 in this case the token pattern should be a regex that also picks up emojis |
@Ghostvv thanks for your response. And what goes in the training data? Is it '👍' or 'U+1F44D' or 'u"\U0001F44D"' or "\uD83D\uDC4D"? Token pattern: (?u)(\b\w+\b|(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])) |
@kirtisynap19 I'm not sure, try both. You can print the vocabulary of CountVectorizer to check |
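One way to check a candidate token pattern before training is to run it through `re.findall` directly. The sketch below rests on a few assumptions: Python 3, where a string holds full code points, so surrogate-pair alternatives such as `\ud83d[\ud000-\udfff]` would not match an emoji and a code point range is used instead; the ranges shown are an approximate emoji set chosen for illustration, not a complete list; and `tokenize` is a hypothetical helper. The pattern has no capturing group, because a capturing group changes what `findall` returns.

```python
import re

# Words, or a single character from an approximate emoji range (assumption:
# these two ranges cover the emoji of interest; extend them as needed).
TOKEN_PATTERN = r"(?u)\b\w+\b|[\U0001F300-\U0001FAFF\u2600-\u27BF]"


def tokenize(text: str) -> list:
    """Return word tokens and individual emoji found in the text."""
    return re.findall(TOKEN_PATTERN, text)


# In Python 3 the literal emoji and its \U escape are the same string,
# so either spelling in the training data yields the same token.
print("\U0001F44D" == "👍")
print(tokenize("great job 👍"))
```

If the tokens look right, the same pattern string could be tried as the `token_pattern` setting, and the vectorizer's vocabulary printed to confirm the emoji actually ended up in it.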
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Hi, is this issue available? @lucasdutraf, @Henrike100, and I would like to work on it! :) |
@mbslet Are you working on this change? If not, I would like to pick it up! |
Has there been any progress on this? Or should I just do rule-based intent classification for this? :) |
We've used pretrained emoji embeddings (emoji2vec) and wrote a custom featurizer. After that, you just add it to your pipeline and provide relevant examples (as you do for text). Note that it's better to preprocess the text before featurizing it: if you have something like "Hello there 😈😈😈", where text and emoji are mixed, the final classification result is not always good. So what we do, for example, is strip emojis when they appear together with text and classify only the text; if there is no text and only emojis, we classify the emoji. This gives consistent results. Overall it may sound hard, but it's not; it's just a cycle of trial and error until you get the version that works best for you. |
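The preprocessing rule described above (drop emoji when text is present, keep them when the message is emoji only) could be sketched roughly as follows. This is not the commenter's actual code: the function name `route_message`, the returned labels, and the approximate emoji ranges are all assumptions for illustration, and the emoji2vec featurizer itself is not shown.

```python
import re

# Approximate emoji ranges (assumption, not an exhaustive definition)
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def route_message(message: str) -> tuple:
    """Decide what to classify: the text alone, or the emoji alone.

    Returns a (kind, payload) pair where kind is "text", "emoji" or "empty".
    """
    # Remove emoji and normalise whitespace to see whether any text remains
    text_only = " ".join(EMOJI_RE.sub(" ", message).split())
    if re.search(r"\w", text_only):
        return ("text", text_only)
    emojis = EMOJI_RE.findall(message)
    if emojis:
        return ("emoji", "".join(emojis))
    return ("empty", "")


print(route_message("Hello there 😈😈😈"))
print(route_message("😈😈"))
```

The "text" payload would go to the regular text pipeline, while the "emoji" payload would go to whatever emoji featurizer is in use.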
➤ Maxime Verger commented: 💡 Heads up! We're moving issues to Jira: https://rasa-open-source.atlassian.net/browse/OSS. From now on, this Jira board is the place where you can browse (without an account) and create issues (you'll need a free Jira account for that). This GitHub issue has already been migrated to Jira and will be closed on January 9th, 2023. Do not forget to subscribe to the corresponding Jira issue! ➡️ More information in the forum: https://forum.rasa.com/t/migration-of-rasa-oss-issues-to-jira/56569. |
Please explain it with an example. That will be helpful. Thanks @giorgobiani |
I think it'd be helpful to have basic emoji handling, and eventually understanding of positive 😃 / negative 😡 / neutral 🐩 emoji.
Can assign to me 😇