
This dataset contains all the 2021 COVID-19 related data from the paper "An Augmented Multilingual Twitter Dataset for Studying the COVID-19 Infodemic"


wesseloblink/COVID19_Tweets_Dataset

 
 


This repo contains only the data and statistics for 2021. For the 2020 data, please visit: https://github.com/lopezbec/COVID19_Tweets_Dataset_2020


The repository contains an ongoing collection of tweets associated with the novel coronavirus COVID-19 since January 22nd, 2020.

As of 09/25/2021, a total of 2,230,640,905 tweets had been collected. The tweets are collected using Twitter’s trending topics and selected keywords. In addition, the tweets from Chen et al. (2020) were used to supplement the dataset by hydrating non-duplicated tweets. These tweets are only a sample, provided by Twitter, of all the tweets generated, and they do not represent the whole population of tweets at any given point.

Citation

Christian Lopez, and Caleb Gallemore (2020) An Augmented Multilingual Twitter Dataset for Studying the COVID-19 Infodemic. DOI: 10.21203/rs.3.rs-95721/v1 https://www.researchsquare.com/article/rs-95721/v1

Data Organization

The dataset is organized by hour (UTC), by month, and by table. The features of each table are described below. For example, the path “./Summary_Details/2020_01/2020_01_22_00_Summary_Details.csv” contains all the summary details of the tweets collected on January 22nd at 00:00 UTC.
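
The path convention above can be sketched as a small helper. This is a minimal sketch, assuming every table follows the same pattern as the Summary_Details example; the function name `hourly_csv_path` and its parameters are hypothetical and not part of the repository.

```python
from datetime import datetime, timezone


def hourly_csv_path(table: str, hour: datetime, root: str = ".") -> str:
    """Build the relative path of one hourly CSV file.

    Assumed pattern (inferred from the single example in the text):
    <root>/<table>/<YYYY_MM>/<YYYY_MM_DD_HH>_<table>.csv
    """
    # Pass a timezone-aware datetime; a naive one would be read as local time.
    hour_utc = hour.astimezone(timezone.utc)
    month_dir = hour_utc.strftime("%Y_%m")
    file_stem = hour_utc.strftime("%Y_%m_%d_%H")
    return f"{root}/{table}/{month_dir}/{file_stem}_{table}.csv"


example = hourly_csv_path(
    "Summary_Details", datetime(2020, 1, 22, 0, tzinfo=timezone.utc)
)
print(example)
```

For the hour used in the text, this reproduces the example path given above.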

Features Description
| Table | Feature Name | Description |
|---|---|---|
| Primary key | Tweet\_ID | Integer representation of the tweet's unique identifier |
| 1.Summary\_Details | Language | When present, indicates a BCP47 language identifier corresponding to the machine-detected language of the tweet text |
| 1.Summary\_Details | Geolocation\_cordinate | Indicates whether or not the geographic location of the tweet was reported |
| 1.Summary\_Details | RT | Indicates if the tweet is a retweet (YES) or original tweet (NO) |
| 1.Summary\_Details | Likes | Number of likes for the tweet |
| 1.Summary\_Details | Retweets | Number of times the tweet was retweeted |
| 1.Summary\_Details | Country | When present, indicates a list of uppercase two-letter country codes from which the tweet comes |
| 1.Summary\_Details | Date\_Created | UTC date and time the tweet was created |
| 2.Summary\_Hastag | Hashtag | Hashtag (\#) present in the tweet |
| 3.Summary\_Mentions | Mentions | Mention (@) present in the tweet |
| 4.Summary\_Sentiment | Sentiment\_Label | Most probable tweet sentiment (neutral, positive, negative) |
| 4.Summary\_Sentiment | Logits\_Neutral | Non-normalized prediction for neutral sentiment |
| 4.Summary\_Sentiment | Logits\_Positive | Non-normalized prediction for positive sentiment |
| 4.Summary\_Sentiment | Logits\_Negative | Non-normalized prediction for negative sentiment |
| 5.Summary\_NER | NER\_text | Text stating a named entity recognized by the NER algorithm |
| 5.Summary\_NER | Start\_Pos | Initial character position within the tweet of the NER\_text |
| 5.Summary\_NER | End\_Pos | End character position within the tweet of the NER\_text |
| 5.Summary\_NER | NER\_Label Prob | Label and probability of the named entity recognized by the NER algorithm |
| 6.Summary\_Sentiment\_ES | Sentiment\_Label | Most probable tweet sentiment (neutral, positive, negative) |
| 6.Summary\_Sentiment\_ES | Probability\_pos | Probability of the tweet's sentiment being positive (\<=0.33 is negative, \>0.33 and \<0.66 is neutral, otherwise positive) |
| 7.Summary\_NER\_ES | NER\_text | Text stating a named entity recognized by the NER algorithm |
| 7.Summary\_NER\_ES | Start\_Pos | Initial character position within the tweet of the NER\_text |
| 7.Summary\_NER\_ES | End\_Pos | End character position within the tweet of the NER\_text |
| 7.Summary\_NER\_ES | NER\_Label Prob | Label and probability of the named entity recognized by the NER algorithm |
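
Because Tweet_ID is the primary key shared by all tables, rows from different tables can be combined by matching on it. The following is a minimal sketch using only the Python standard library. The sample rows are made up, the helper `join_on_tweet_id` is hypothetical, and the exact CSV header names in the real files are an assumption based on the feature names above.

```python
import csv
import io

# Made-up sample data standing in for two hourly files of the same hour.
details_csv = io.StringIO(
    "Tweet_ID,Language,RT,Likes\n"
    "1001,en,NO,12\n"
    "1002,es,YES,0\n"
)
sentiment_csv = io.StringIO(
    "Tweet_ID,Sentiment_Label\n"
    "1001,positive\n"
)


def join_on_tweet_id(left_file, right_file):
    """Inner-join two CSV files on Tweet_ID (assumes one right row per ID)."""
    # Tweet IDs stay as strings so large identifiers are never rounded.
    right_rows = {row["Tweet_ID"]: row for row in csv.DictReader(right_file)}
    joined = []
    for row in csv.DictReader(left_file):
        match = right_rows.get(row["Tweet_ID"])
        if match is not None:
            joined.append({**row, **match})
    return joined


joined_rows = join_on_tweet_id(details_csv, sentiment_csv)
print(joined_rows)
```

Tables such as the hashtag, mention, and NER tables can hold several rows per tweet, so for those a list of rows per Tweet_ID would be needed instead of a single dictionary entry.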

For more information visit: Twitter API and the Documentation for API Tweet-object

Data Statistics

General Statistics

As of 09/25/2021:

Total Number of tweets: 2,230,640,905

Average daily number of tweets: 150,228

Summary Statistics per Month
| Year | Month | Daily Avg. Original | Daily Avg. Retweets | Daily Avg. Tweets | Total of Original | Total of Retweets | Total of Tweets | Total with Geolocation | Max No. Retweets | Max No. Likes |
|---|---|---|---|---|---|---|---|---|---|---|
| 2020 | 1 | 5,947 | 30,576 | 35,501 | 1,958,346 | 7,852,504 | 9,810,850 | 1,773 | 674,151 | 334,802 |
| 2020 | 2 | 10,978 | 29,918 | 40,604 | 7,624,648 | 21,944,443 | 29,568,948 | 8,103 | 469,739 | 637,589 |
| 2020 | 3 | 13,095 | 44,714 | 56,283 | 12,610,824 | 46,659,589 | 59,270,412 | 19,952 | 1,064,693 | 1,255,858 |
| 2020 | 4 | 30,091 | 89,513 | 119,859 | 20,591,357 | 60,301,889 | 80,893,244 | 38,213 | 649,823 | 662,005 |
| 2020 | 5 | 35,163 | 99,928 | 135,709 | 26,258,213 | 73,618,083 | 99,876,289 | 47,684 | 1,007,616 | 929,811 |
| 2020 | 6 | 51,033 | 142,569 | 193,096 | 34,786,076 | 95,171,388 | 129,957,461 | 58,138 | 790,652 | 882,693 |
| 2020 | 7 | 53,720 | 155,042 | 209,738 | 39,611,015 | 111,876,344 | 151,487,359 | 56,808 | 615,768 | 1,287,117 |
| 2020 | 8 | 51,330 | 143,291 | 195,037 | 37,549,475 | 102,834,375 | 140,383,850 | 55,912 | 2,183,434 | 860,162 |
| 2020 | 9 | 50,068 | 132,040 | 182,947 | 35,861,979 | 92,957,247 | 128,819,226 | 32,381 | 1,925,489 | 839,689 |
| 2020 | 10 | 54,489 | 137,225 | 198,708 | 41,062,885 | 104,195,279 | 144,962,625 | 319,101 | 946,810 | 785,385 |
| 2020 | 11 | 64,125 | 111,686 | 177,062 | 45,096,171 | 77,885,575 | 122,981,746 | 26,488 | 1,187,438 | 619,643 |
| 2020 | 12 | 64,840 | 121,149 | 186,852 | 49,065,436 | 87,366,002 | 133,179,589 | 3,277,244 | 1,402,911 | 1,038,164 |
| 2021 | 1 | 58,225 | 134,387 | 192,272 | 40,878,618 | 92,341,359 | 133,219,977 | 24,293 | 1,437,164 | 867,275 |
| 2021 | 2 | 47,789 | 104,467 | 152,780 | 30,916,912 | 65,130,838 | 96,047,732 | 23,977 | 971,119 | 644,697 |
| 2021 | 3 | 51,889 | 117,776 | 168,768 | 37,803,773 | 83,103,448 | 120,907,221 | 28,788 | 1,083,628 | 599,385 |
| 2021 | 4 | 47,350 | 128,902 | 176,534 | 34,252,762 | 90,730,535 | 124,983,296 | 24,117 | 1,111,306 | 653,537 |
| 2021 | 5 | 45,779 | 120,864 | 166,235 | 34,427,222 | 89,269,622 | 123,696,843 | 22,669 | 3,194,460 | 697,980 |
| 2021 | 6 | 37,931 | 84,426 | 122,204 | 28,310,536 | 63,462,978 | 91,773,014 | 17,693 | 824,584 | 413,875 |
| 2021 | 7 | 47,667 | 107,313 | 156,563 | 34,944,432 | 77,322,490 | 112,265,717 | 16,277 | 1,108,703 | 633,347 |
| 2021 | 8 | 47,626 | 109,563 | 157,721 | 35,681,168 | 81,535,924 | 117,217,091 | 13,943 | 1,271,696 | 732,266 |
| 2021 | 9 | 40,210 | 89,489 | 130,323 | 24,831,001 | 54,507,414 | 79,338,415 | 9,575 | 1,107,188 | 378,328 |

There is a total of 4,123,129 tweets with geolocation information, which are shown on a map below:

Language Statistics

Tweets Language Summary
| Languages | Total No. Tweets | Percentage of Tweets |
|---|---|---|
| English | 1,442,630,792 | 64.78 |
| Spanish; Castilian | 274,311,820 | 12.32 |
| Portuguese | 97,078,676 | 4.36 |
| Bahasa | 70,425,043 | 3.16 |
| French | 68,159,825 | 3.06 |
| Others | 274,485,479 | 12.32 |
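
As a quick arithmetic check, the percentages in the language table can be recomputed from the tweet counts. The sketch below uses only the numbers from the table. It suggests that the percentages are taken relative to the sum of the listed rows rather than the headline total of 2,230,640,905; this is an inference from the arithmetic, not something the dataset documentation states.

```python
# Tweet counts copied from the language table above.
language_counts = {
    "English": 1_442_630_792,
    "Spanish; Castilian": 274_311_820,
    "Portuguese": 97_078_676,
    "Bahasa": 70_425_043,
    "French": 68_159_825,
    "Others": 274_485_479,
}

# Denominator: the sum of the listed rows (an inferred choice, see lead-in).
listed_total = sum(language_counts.values())

percentages = {
    language: round(100 * count / listed_total, 2)
    for language, count in language_counts.items()
}

for language, share in percentages.items():
    print(f"{language}: {share}%")
```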

English Sentiment Analysis

The sentiment of all the English tweets was estimated using BB_twtr, a state-of-the-art Twitter sentiment algorithm (see code here).
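
The Summary_Sentiment table stores non-normalized logits for the three sentiment classes. A minimal sketch of turning them into probabilities follows, assuming the logits are meant for a standard softmax (the source does not state this explicitly). The logit values and the `softmax` helper are illustrative only.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical values for Logits_Neutral, Logits_Positive, Logits_Negative.
labels = ["neutral", "positive", "negative"]
probs = softmax([0.2, 2.1, -1.3])

# The most probable class plays the role of Sentiment_Label.
predicted = labels[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```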

English Named Entity Recognition, Mentions, and Hashtags

The Named Entity Recognition algorithm of flairNLP was used to extract topics of conversation about PERSON, LOCATION, ORGANIZATION, and others. Below are the top 5 NER, mentions (@), and hashtags (#).

Top 5 Mentions, Hashtags, and NER
| Mentions | Hashtags | NER Person | NER Location | NER Organization | NER Miscellaneous |
|---|---|---|---|---|---|
| @realDonaldTrump | \#covid19 | trump | us | cdc | covid-19 |
| 14,106,218 | 119,608,149 | 20,963,213 | 17,679,605 | 12,841,729 | 25,278,623 |
| @realdonaldtrump | \#coronavirus | biden | uk | covid | covid |
| 7,159,966 | 43,004,769 | 12,996,856 | 11,181,359 | 8,017,439 | 20,174,488 |
| @mippcivzla | \#covid | covid | india | pfizer | americans |
| 4,217,090 | 14,722,111 | 12,920,565 | 11,049,324 | 3,463,031 | 10,921,402 |
| @joebiden | \#whatshappeninginmyanmar | fauci | china | senate | covid19 |
| 3,486,876 | 3,502,685 | 3,390,172 | 8,453,754 | 2,679,552 | 5,743,141 |
| @narendramodi | \#covid\19 | joe biden | florida | congress | republicans |
| 3,122,874 | 2,252,353 | 2,297,696 | 3,941,757 | 1,605,602 | 2,499,222 |

Spanish Sentiment Analysis

The sentiment of all the Spanish tweets was estimated using the neural-network-based Spanish sentiment analysis model of the Python library sentiment-analysis-spanish 0.0.25.
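
For the Spanish tweets, the feature description gives thresholds on Probability_pos for deriving the sentiment label. A minimal sketch of that rule follows. The function name is hypothetical, and the handling of a value of exactly 0.66 (treated as positive here) is an assumption.

```python
def label_from_probability(probability_pos: float) -> str:
    """Map Probability_pos to a label using the documented thresholds."""
    if probability_pos <= 0.33:
        return "negative"
    if probability_pos < 0.66:
        return "neutral"
    # Assumption: values of 0.66 and above count as positive.
    return "positive"


for value in (0.10, 0.50, 0.90):
    print(value, label_from_probability(value))
```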

Spanish Named Entity Recognition

The Spanish Named Entity Recognition algorithm of flairNLP was used to extract topics of conversation about PERSON, LOCATION, ORGANIZATION, and others. Below are the top 5 NER of all the Spanish tweets (* some special characters in Spanish, such as accented characters, are not correctly represented in the readme file).

Top 5 NER
| NER Person | NER Location | NER Organization | NER Miscellaneous |
|---|---|---|---|
| covid | venezuela | mippcivzla | covid-19 |
| 4,072,431 | 3,804,198 | 3,832,944 | 23,230,640 |
| nicolasmaduro | méxico | covid | covid |
| 2,186,125 | 3,016,636 | 1,960,002 | 14,632,052 |
| mippcivzla | españa | vtvcanal8 | covid19 |
| 1,185,259 | 1,560,026 | 1,922,380 | 13,139,186 |
| lopezobrador | argentina | gobierno | coronavirus |
| 294,089 | 680,109 | 958,936 | 4,157,520 |
| trump | madrid | oms | delta |
| 264,058 | 678,516 | 680,663 | 246,831 |

Data Collection Process Inconsistencies

Only tweets in English were collected from 22 January to 31 January 2020; after that, the algorithm collected tweets in all languages. There are also some known gaps in the data, shown below:

Known gaps
| Date | Time |
|---|---|
| 2020-08-06 | 07:00 UTC |
| 2020-08-08 | 07:00 UTC |
| 2020-08-09 | 07:00 UTC |
| 2020-08-14 | 07:00 UTC |
| 2021-05-06 | 16:00 UTC |
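
When iterating over the hourly files, the known gaps can be skipped up front. The sketch below assumes that each entry in the table marks a single missing hour, which the source does not state explicitly. The names `KNOWN_GAPS` and `hours_with_data` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Gap timestamps copied from the table above (assumed to be one hour each).
KNOWN_GAPS = {
    datetime(2020, 8, 6, 7, tzinfo=timezone.utc),
    datetime(2020, 8, 8, 7, tzinfo=timezone.utc),
    datetime(2020, 8, 9, 7, tzinfo=timezone.utc),
    datetime(2020, 8, 14, 7, tzinfo=timezone.utc),
    datetime(2021, 5, 6, 16, tzinfo=timezone.utc),
}


def hours_with_data(start: datetime, end: datetime):
    """Yield every hour in [start, end) that is not a known gap."""
    current = start
    while current < end:
        if current not in KNOWN_GAPS:
            yield current
        current += timedelta(hours=1)


window = list(
    hours_with_data(
        datetime(2020, 8, 6, 6, tzinfo=timezone.utc),
        datetime(2020, 8, 6, 9, tzinfo=timezone.utc),
    )
)
print([hour.isoformat() for hour in window])
```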

Hydrating Tweets

Using our TWARC Notebook

The notebook Automatically_Hydrate_TweetsIDs_COVID190_v2.ipynb allows you to automatically hydrate the tweet IDs from our COVID19_Tweets_dataset GitHub repository.

You can run this notebook directly in the cloud using Google Colab (see the how-to tutorials) and Google Drive.

In order to hydrate the tweet-IDs using TWARC you need to create a Twitter Developer Account.

The Twitter API’s rate limits make it difficult to fetch data from tweet IDs. We therefore recommend using Hydrator to convert the list of tweet IDs into a CSV file containing all the data and metadata relating to the tweets. Hydrator also manages the Twitter API rate limits for you.

For those who prefer a command-line interface over a GUI, we recommend using Twarc.

Using Hydrator

Follow the instructions on the Hydrator GitHub repository.

Using Twarc

Follow the instructions on the Twarc GitHub repository.
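
Both Hydrator and Twarc read a plain text file with one tweet ID per line. A minimal sketch of preparing such a file from one of the CSV tables follows, assuming the CSV has a header column named Tweet_ID (as in the feature description). The helper name and the sample rows are hypothetical.

```python
import csv
import io


def tweet_ids_from_csv(csv_file, id_column="Tweet_ID"):
    """Return the unique tweet IDs of a CSV file, in first-seen order."""
    seen = set()
    ids = []
    for row in csv.DictReader(csv_file):
        # IDs are kept as strings so they are written back unchanged.
        tweet_id = row[id_column].strip()
        if tweet_id and tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


# Made-up sample standing in for one hourly Summary_Details file.
sample = io.StringIO("Tweet_ID,Language\n1001,en\n1002,es\n1001,en\n")
ids = tweet_ids_from_csv(sample)

# One ID per line; write this text to a file such as ids.txt.
ids_text = "\n".join(ids) + "\n"
print(ids_text, end="")
```

With Twarc version 1, the resulting file can then typically be hydrated with `twarc hydrate ids.txt > tweets.jsonl`; check the Twarc documentation for the command that matches your installed version.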

Inquiries & Requests

If you would like to filter the tweet IDs based on metadata not provided in the repo (e.g., geolocation), run additional analyses on the full tweet text (e.g., sentiment analysis using another language model, topic modeling, etc.), or ask any questions about the dataset, please contact Dr. Christian Lopez at [email protected]

Filters that have already been performed are located in the ‘Tweets_ID_Filter_requests’ directory.

Licensing

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript:

Christian Lopez, and Caleb Gallemore (2020) An Augmented Multilingual Twitter Dataset for Studying the COVID-19 Infodemic. DOI: 10.21203/rs.3.rs-95721/v1 https://www.researchsquare.com/article/rs-95721/v1

References

Emily Chen, Kristina Lerman, and Emilio Ferrara. 2020. #COVID-19: The First Public Coronavirus Twitter Dataset. arXiv:cs.SI/2003.07372, 2020

https://github.com/echen102/COVID-19-TweetIDs
