We used the emoji dataset for a paper, and during our research we found that many sentences in this dataset contained leftover emoji code points, resulting in a dataset that was incompletely masked.
Because of these leftovers, NNs trained on it are "surprisingly good" at predicting heart emoticons. With the code below you can detect which lines in the text contain this character (U+FE0F, decimal 65039).
# Report lines that still contain U+FE0F (VARIATION SELECTOR-16, decimal 65039)
with open("example.txt", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        leftovers = [ch for ch in line if ord(ch) == 65039]
        if leftovers:
            print(f"Line {i} has leftover emoji code points: \"{line.strip()}\"")
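Once the affected lines are found, the leftovers can also be stripped. The sketch below removes U+FE0F; it additionally removes U+200D (ZERO WIDTH JOINER) on the assumption that it can linger from the same multi-codepoint emoji sequences, which the issue itself does not verify:

```python
# Strip invisible characters left behind by incomplete emoji masking.
# U+FE0F (VARIATION SELECTOR-16, decimal 65039) is the leftover reported above;
# U+200D (ZERO WIDTH JOINER) is removed as an assumption, since it glues
# multi-codepoint emoji sequences together.
LEFTOVERS = {"\ufe0f", "\u200d"}

def clean_line(line: str) -> str:
    return "".join(ch for ch in line if ch not in LEFTOVERS)

print(clean_line("I love you \ufe0f!"))  # -> "I love you !"
```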
For further explanation of the problem, see this excerpt of our paper:
When using the dataset from https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/emoji/train_text.txt, you can find the code point 65039 (U+FE0F) hidden within the texts. These were once part of a combined emoticon depicting, for example, a couple (see picture of combinations).
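The leftover is easy to see when a heart emoji is decomposed into its code points. The red heart "❤️" is a two-codepoint sequence, so masking only the base character leaves the invisible selector behind:

```python
# "❤️" = U+2764 (HEAVY BLACK HEART) + U+FE0F (VARIATION SELECTOR-16, 65039).
# Masking only U+2764 leaves U+FE0F in the text, which correlates with hearts.
heart = "\u2764\ufe0f"
print([f"U+{ord(ch):04X} ({ord(ch)})" for ch in heart])
# -> ['U+2764 (10084)', 'U+FE0F (65039)']
```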