Add out of the box emoji handling #301

Closed
tooaverage opened this issue Apr 23, 2017 · 25 comments
Labels
area:rasa-oss 🎡 Anything related to the open source Rasa framework area:rasa-oss/ml/nlu-components Issues focused around rasa's NLU components help wanted type:discussion 👨‍👧‍👦 Early stage of an idea or validation of thoughts. Should NOT be closed by PR. type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR

Comments

@tooaverage

tooaverage commented Apr 23, 2017

I think it'd be helpful to have basic emoji handling, and eventually understanding of positive 😃 / negative 😡 / neutral 🐩 emoji.

Can assign to me 😇

@tmbo
Member

tmbo commented Apr 23, 2017

@tooaverage yes that would be an awesome contribution 🎆 😃

@tmbo tmbo added the type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR label Apr 23, 2017
@tooaverage
Author

Weo @tmbo What is the best/leanest way to integrate this?

@amn41
Contributor

amn41 commented Apr 23, 2017

one pretty lean solution would be to replace the emoji with words, for example taking the short name or keywords http://unicode.org/emoji/charts/full-emoji-list.html
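
A minimal sketch of the "replace the emoji with words" idea, using only Python's standard library. It assumes the formal Unicode character name is an acceptable stand-in for the short name on the chart (the two can differ), and it treats every character in Unicode category `So` as an emoji, which is a rough approximation: it also catches symbols such as ©, and it does not handle ZWJ sequences or skin-tone modifiers. The function name `emoji_to_words` is made up for illustration.

```python
import unicodedata


def emoji_to_words(text: str) -> str:
    """Replace symbol characters (Unicode category 'So') with their lowercase Unicode names."""
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            # Fall back to an empty name if the character is unnamed in this Unicode version.
            name = unicodedata.name(ch, "")
            pieces.append(f" {name.lower()} " if name else " ")
        else:
            pieces.append(ch)
    # Collapse the extra whitespace introduced around the replacements.
    return " ".join("".join(pieces).split())


print(emoji_to_words("I could use a 🍺"))  # I could use a beer mug
```

The resulting plain words then go through whatever tokenizer and featurizer the pipeline already uses, so no emoji-specific component is needed.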

@tmbo tmbo closed this as completed Apr 23, 2017
@tmbo tmbo reopened this Apr 23, 2017
@gildastone

Hello @tooaverage, any update on this issue?

I trained the model using both unicode (\U0001f37a) and the actual emoji (🍺) within the training data, and set synonyms to map to my named entity. No success (tried just in case).

@nicksahler

nicksahler commented May 22, 2017

This project seems to have built a dictionary of vectors for emoji -

is this useful?

https://raw.githubusercontent.com/uclmr/emoji2vec/master/pre-trained/
(the paper this belongs to is also great)

Is there a way I can assign these vectors to emoji tokens if they're found?
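
One way this could look, as a sketch only: keep a lookup table from emoji to vector and average the vectors of any emoji tokens found in a message. The table below holds invented 3-dimensional placeholder values, not real emoji2vec vectors, and `emoji_features` is a hypothetical helper, not part of Rasa or emoji2vec.

```python
from typing import Dict, List, Optional

# Placeholder vectors invented for illustration only.
# Real vectors would be loaded from the emoji2vec pre-trained file.
TOY_EMOJI_VECTORS: Dict[str, List[float]] = {
    "🍺": [1.0, 0.0, 0.0],
    "😃": [0.0, 1.0, 0.0],
}


def emoji_features(tokens: List[str], table: Dict[str, List[float]]) -> Optional[List[float]]:
    """Average the vectors of all tokens present in the table; return None if none match."""
    found = [table[t] for t in tokens if t in table]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


print(emoji_features(["cheers", "🍺", "😃"], TOY_EMOJI_VECTORS))  # [0.5, 0.5, 0.0]
```

Averaging is only one choice; the result would still have to be combined with the text features in a way the downstream classifier expects.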

@amn41
Contributor

amn41 commented Mar 5, 2018

spacymoji would be helpful here: https://pypi.python.org/pypi/spacymoji/1.0.0

@giorgobiani

Any progress on this? 😊

@tmbo
Member

tmbo commented Aug 21, 2018

We have not been working on this, but it would be a great contribution 😉

@parthsharma1996
Contributor

Might be a little off-topic, but how are emojis currently handled by the NLU? I would like to include emojis in our current chatbot's training data.

Are they currently simply ignored while predicting the intent?

@tmbo
Member

tmbo commented Sep 24, 2018

ok i think that depends on the intent classification component used:

  • for svm + spacy: the emojis will be ignored because they do not have a word vector assigned and hence don't contribute anything to the sentence representation
  • for the embedding policy, actually I am not sure <- @Ghostvv

@parthsharma1996
Copy link
Contributor

parthsharma1996 commented Sep 24, 2018

@tmbo In the tensorflow_embedding pipeline my guess is that the vectors for the emojis should also be learned from the training data, since that pipeline learns the word embeddings from only the training data anyway (thus able to handle OOV words).
Guesses aside, would be interesting to know how it actually happens though

@giorgobiani

giorgobiani commented Sep 24, 2018

I tried with tensorflow_embedding but no luck so far, but I was trying with actual emojis. Have to try with codes/synonyms

@akelad
Contributor

akelad commented Sep 25, 2018

I think the tensorflow embedding policy also ignores them, because it only looks at words with a certain amount of characters. You can change that in the token_pattern parameter of the intent_featurizer_count_vectors though

@Ghostvv
Contributor

Ghostvv commented Sep 25, 2018

Yes, it depends whether a python string that stores emoji is falling under token_pattern or not
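
To see what "falling under token_pattern" means in practice, here is a small check using only Python's `re` module. It assumes the featurizer passes `token_pattern` on to scikit-learn's `CountVectorizer`, whose documented default is `(?u)\b\w\w+\b`. The extended pattern is a hypothetical example, and its two ranges cover only some common emoji blocks.

```python
import re

# scikit-learn's documented default token_pattern for CountVectorizer.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical extension: also accept single characters from two common emoji ranges.
# These ranges are an assumption and do not cover every emoji.
EMOJI_AWARE_PATTERN = r"(?u)\b\w\w+\b|[\u2600-\u27BF\U0001F300-\U0001FAFF]"

text = "great job 👍"

print(re.findall(DEFAULT_PATTERN, text))      # ['great', 'job']
print(re.findall(EMOJI_AWARE_PATTERN, text))  # ['great', 'job', '👍']
```

With the default pattern the emoji never becomes a token, so it cannot contribute a feature, however often it appears in the training data.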

@kirtisynap19

Hey @Ghostvv for now I am using token_pattern as "(?u)\b\w+\b" for words. However, I also want to add a capability to recognize emoji, what should be the token_pattern then?
Secondly, what would the training data be like for emojis, "👍" or "U+1F44D"?

@Ghostvv
Contributor

Ghostvv commented May 28, 2019

@kirtisynap19 in this case token pattern should correspond to regex that also picks emojis

@kirtisynap19

kirtisynap19 commented May 28, 2019

@Ghostvv thanks for your response. And what goes in the training data? Is it '👍' or 'U+1F44D' or 'u"\U0001F44D"' or "\uD83D\uDC4D"?

Token pattern: (?u)(\b\w+\b|(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff]))

@Ghostvv
Contributor

Ghostvv commented May 29, 2019

@kirtisynap19 I'm not sure, try both. You can print the vocabulary of CountVectorizer to check
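
A quick way to compare the candidate forms in plain Python 3, independent of Rasa. It relies on the fact that a Python 3 `str` is a sequence of code points rather than UTF-16 units, so a surrogate-pair pattern like the one quoted above does not match the emoji. The replacement range is an assumption and covers only part of the emoji space.

```python
import re

thumbs_up = "👍"

# The emoji is a single character, and the \U escape denotes the same string.
print(len(thumbs_up), thumbs_up == "\U0001F44D")  # 1 True

# The literal text "U+1F44D" is seven ASCII characters, not the emoji.
print(len("U+1F44D"), "U+1F44D" == thumbs_up)  # 7 False

# Surrogate-pair style (as used with UTF-16 strings) finds nothing in a Python 3 str.
surrogate_style = re.compile(r"\ud83d[\ud000-\udfff]")
print(surrogate_style.search(thumbs_up))  # None

# Matching on the code point range does find it (range chosen for illustration).
code_point_style = re.compile(r"[\U0001F300-\U0001FAFF]")
print(code_point_style.findall("ok 👍"))  # ['👍']
```

So, under these assumptions, the training data would contain the actual emoji character, and the token pattern would need code point ranges rather than surrogate pairs.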

@stale

stale bot commented Feb 3, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the status:stale label Feb 3, 2020
@tmbo tmbo removed the status:stale label Feb 3, 2020
@mbslet
Contributor

mbslet commented Oct 22, 2020

Hi, is this issue available? @lucasdutraf, @Henrike100, and I would like to work on it! :)

@saikiran2603

saikiran2603 commented Nov 14, 2020

@mbslet Are you working on this change? If not, I would like to pick it up!

@alwx alwx added the area:rasa-oss/ml/nlu-components Issues focused around rasa's NLU components label Jan 29, 2021
@alwx alwx added type:discussion 👨‍👧‍👦 Early stage of an idea or validation of thoughts. Should NOT be closed by PR. area:rasa-oss 🎡 Anything related to the open source Rasa framework labels Jan 29, 2021
@chiranshu14

Has there been any progress on this? Or should I just do rule-based intent classification for this? :)

@giorgobiani

We've used pretrained emoji embeddings (emoji2vec) and wrote a custom featurizer. After that, you just add it to your pipeline and provide relevant examples (as you do for text). It should also be noted that it's better to preprocess the text before featurizing it, because if you have something like "Hello there 😈😈😈" where text and emoji are mixed, the final classification result is not always good. So what we do, for example, is strip the emojis if they come with text and classify only the text; if there's no text and only emojis, we classify the emoji, which gives consistent results. Overall it may sound hard, but it's not; it's just a cycle of trial and error until you get the version that works best for you.
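
The preprocessing rule described here (classify only the text when both are present, classify the emoji only when there is no text) could be sketched roughly as below. This is not the commenter's actual code: `is_emoji_like` and `route_message` are hypothetical names, and the emoji check (Unicode category `So` plus the zero-width joiner, variation selector-16 and skin-tone modifiers) is a simplifying assumption.

```python
import unicodedata
from typing import Tuple


def is_emoji_like(ch: str) -> bool:
    """Rough check: symbol characters plus joiners and modifiers used in emoji sequences."""
    return (
        unicodedata.category(ch) == "So"
        or ch in ("\u200d", "\ufe0f")       # zero-width joiner, variation selector-16
        or 0x1F3FB <= ord(ch) <= 0x1F3FF    # skin-tone modifiers
    )


def route_message(message: str) -> Tuple[str, str]:
    """Return ("text", cleaned_text) if any text remains after stripping emoji,
    otherwise ("emoji", emoji_only_string)."""
    text_part = "".join(ch for ch in message if not is_emoji_like(ch))
    text_part = " ".join(text_part.split())  # normalize leftover whitespace
    if text_part:
        return "text", text_part
    emoji_part = "".join(ch for ch in message if is_emoji_like(ch))
    return "emoji", emoji_part


print(route_message("Hello there 😈😈😈")[1])  # Hello there
print(route_message("😈😈😈")[0])              # emoji
```

The first element of the result tells the caller which path to take: the usual text featurizers, or an emoji-specific featurizer such as one backed by emoji2vec.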

@rasabot-exalate rasabot-exalate added area:rasa-oss :ferris wheel: area:rasa-oss 🎡 Anything related to the open source Rasa framework type:discussion 👨‍👧‍👦 Early stage of an idea or validation of thoughts. Should NOT be closed by PR. help wanted type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR and removed type:discussion 👨‍👧‍👦 Early stage of an idea or validation of thoughts. Should NOT be closed by PR. area:rasa-oss 🎡 Anything related to the open source Rasa framework type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR help wanted area:rasa-oss :ferris wheel: labels Mar 17, 2022 — with Exalate Issue Sync
@sync-by-unito

sync-by-unito bot commented Dec 16, 2022

➤ Maxime Verger commented:

💡 Heads up! We're moving issues to Jira: https://rasa-open-source.atlassian.net/browse/OSS.

From now on, this Jira board is the place where you can browse (without an account) and create issues (you'll need a free Jira account for that). This GitHub issue has already been migrated to Jira and will be closed on January 9th, 2023. Do not forget to subscribe to the corresponding Jira issue!

➡️ More information in the forum: https://forum.rasa.com/t/migration-of-rasa-oss-issues-to-jira/56569.

@m-vdb m-vdb closed this as completed Jan 9, 2023
@mishra011

We've used pretrained emoji embeddings (emoji2vec) and wrote custom featurizer. After you just add it to your pipeline and provide relevant examples (as you do for the text). It also should be noted that it's better to do text preprocessing before featurizing it, because if you have something like "Hello there 😈😈😈" where text and emoji are mixed, the final classification result is not always good. So what we do for example, is that we strip emojis if they are with text and classify only text and if there's no text and only emojis, we classify emoji, which gives consistent results. Overall it may sound hard, but it's not, it's just a cycle of trial and error while you get the version that works for you best.

Please explain it with an example. That will be helpful. Thanks @giorgobiani
