We used the emoji dataset for a paper, and during our research we found that many sentences in this dataset contained leftover emoji code points, resulting in a dataset that was incompletely masked.
Because of these leftovers, NNs trained on it are "surprisingly good" at predicting heart emoticons. With the code below you can detect which lines in the text contain this character (U+FE0F, decimal 65039).
# Report lines that still contain U+FE0F (VARIATION SELECTOR-16, decimal 65039)
with open("example.txt", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        leftovers = [ch for ch in line if ord(ch) == 65039]
        if leftovers:
            print(f"Line {i} has leftover emoji code points: \"{line.strip()}\"")
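Once the affected lines are found, the leftovers can also be stripped. The sketch below removes U+FE0F; it additionally removes U+200D (ZERO WIDTH JOINER) on the assumption that it can linger from the same multi-codepoint emoji sequences, which the issue itself does not verify:

```python
# Strip invisible characters left behind by incomplete emoji masking.
# U+FE0F (VARIATION SELECTOR-16, decimal 65039) is the leftover reported above;
# U+200D (ZERO WIDTH JOINER) is removed as an assumption, since it glues
# multi-codepoint emoji sequences together.
LEFTOVERS = {"\ufe0f", "\u200d"}

def clean_line(line: str) -> str:
    return "".join(ch for ch in line if ch not in LEFTOVERS)

print(clean_line("I love you \ufe0f!"))  # -> "I love you !"
```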
For further explanation of the problem, see this excerpt of our paper:
When using the dataset from https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/emoji/train_text.txt, you can find the code point 65039 (U+FE0F) hidden within the texts. These were once part of a combined emoticon depicting, for example, a couple (see picture of combinations).
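The leftover is easy to see when a heart emoji is decomposed into its code points. The red heart "❤️" is a two-codepoint sequence, so masking only the base character leaves the invisible selector behind:

```python
# "❤️" = U+2764 (HEAVY BLACK HEART) + U+FE0F (VARIATION SELECTOR-16, 65039).
# Masking only U+2764 leaves U+FE0F in the text, which correlates with hearts.
heart = "\u2764\ufe0f"
print([f"U+{ord(ch):04X} ({ord(ch)})" for ch in heart])
# -> ['U+2764 (10084)', 'U+FE0F (65039)']
```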