Leftover utf-codes in emoji analysis texts #23

Open
MarijnRoelvink opened this issue Oct 21, 2022 · 0 comments

Hi there,

We used the emoji dataset for a paper, but during our research we found that many sentences in this dataset contain leftover emoji code points, which means the dataset is incompletely masked.

When using the dataset from https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/emoji/train_text.txt, you can find the code point 65039 (U+FE0F, VARIATION SELECTOR-16) hidden within the texts. These were once part of a combined emoji sequence depicting, for example, a couple (see picture of combinations).

[Image: heart emoji]
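
For reference, here is a minimal sketch, using only the Python standard library, of what this code point is and how it can end up stranded. The sample string and the naive `replace` step are illustrative assumptions, not the masking procedure actually used to build the dataset.

```python
import unicodedata

VS16 = chr(65039)  # the same character as "\ufe0f"

print(hex(ord(VS16)))          # code point in hexadecimal
print(unicodedata.name(VS16))  # official Unicode character name

# Illustrative example (not taken from the dataset): a red heart is commonly
# encoded as U+2764 followed by U+FE0F. Removing only the base character
# leaves the invisible variation selector behind in the text.
heart = "\u2764\ufe0f"
sample = f"love this {heart}"
masked = sample.replace("\u2764", "")  # hypothetical, naive masking step
has_leftover = VS16 in masked
print(has_leftover)
```

Because the variation selector has no visible glyph of its own, a leftover like this is easy to miss when reading the text files by eye.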

Due to these leftovers, NNs trained on this data are "surprisingly good" at predicting heart emoji. The code below detects which lines in the text contain this character.

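# Code point 65039 is U+FE0F (VARIATION SELECTOR-16).
LEFTOVER = chr(65039)

# The context manager closes the file even if an error occurs.
with open("example.txt", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        # Flag any line that still contains the leftover code point.
        if LEFTOVER in line:
            text = line.rstrip("\n")
            print(f"Line {i} has leftover emojis: \"{text}\"")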
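
The same check can be wrapped in small reusable functions, for example to count the affected lines or to produce a cleaned copy. This is only a sketch: the names `find_leftovers` and `strip_leftovers` and the sample lines are hypothetical, and stripping the character is just one possible cleanup. Dropping the affected lines entirely may be the better choice, since the stripped text has still lost its original emoji.

```python
from typing import Iterable, List, Tuple

VS16 = "\ufe0f"  # code point 65039, VARIATION SELECTOR-16


def find_leftovers(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (zero-based index, line) pairs for lines containing U+FE0F."""
    return [(i, line) for i, line in enumerate(lines) if VS16 in line]


def strip_leftovers(lines: Iterable[str]) -> List[str]:
    """Return a copy of the lines with every U+FE0F removed."""
    return [line.replace(VS16, "") for line in lines]


# Made-up sample lines for illustration; not taken from the dataset.
sample = ["good morning\n", "date night \ufe0f @user\n", "no emoji here\n"]

hits = find_leftovers(sample)
cleaned = strip_leftovers(sample)
print(f"{len(hits)} of {len(sample)} lines affected")
```

To run this on the real file, pass the result of `f.readlines()` in place of `sample`.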

For a further explanation of the problem, see this excerpt from our paper:
[Image: excerpt from the paper]
