- This repo only contains the data and statistics for 2021. For the 2020 data, please visit: https://github.com/lopezbec/COVID19_Tweets_Dataset_2020
- Data Organization
- Data Statistics
- Hydrating Tweets
- Inquiries & Requests
- Licensing
- References
The repository contains an ongoing collection of tweets associated with the novel coronavirus COVID-19 since January 22nd, 2020.
As of 09/25/2021, a total of 2,230,640,905 tweets had been collected. The tweets are collected using Twitter’s trending topics and selected keywords. Moreover, the tweets from Chen et al. (2020) were used to supplement the dataset by hydrating non-duplicated tweets. These tweets are only a sample of all the tweets generated, as provided by Twitter, and they do not represent the whole population of tweets at any given point.
Citation
Christian Lopez, and Caleb Gallemore (2020) An Augmented Multilingual Twitter Dataset for Studying the COVID-19 Infodemic. DOI: 10.21203/rs.3.rs-95721/v1 https://www.researchsquare.com/article/rs-95721/v1
The dataset is organized by table, by month, and by hour (UTC). The features of all seven tables are described below. For example, the path “./Summary_Details/2020_01/2020_01_22_00_Summary_Details.csv” contains the summary details of the tweets collected on January 22nd, 2020 at 00:00 UTC.
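The path convention in the example above can be sketched in a few lines of Python. This is a minimal sketch that assumes every table follows the same `<table>/<YYYY_MM>/<YYYY_MM_DD_HH>_<table>.csv` pattern as the `Summary_Details` example; the helper name `hourly_csv_path` is hypothetical and not part of the repository.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def hourly_csv_path(table: str, hour: datetime) -> str:
    """Build the relative path of one hourly CSV file for a given table.

    Assumes the pattern shown in the README example holds for every table.
    """
    month_dir = hour.strftime("%Y_%m")      # e.g. "2020_01"
    stamp = hour.strftime("%Y_%m_%d_%H")    # e.g. "2020_01_22_00"
    path = PurePosixPath(table) / month_dir / f"{stamp}_{table}.csv"
    return f"./{path}"


example = hourly_csv_path(
    "Summary_Details", datetime(2020, 1, 22, 0, tzinfo=timezone.utc)
)
print(example)
```

For the timestamp in the example, this reproduces the path quoted above. Check the directory listing before relying on the pattern for other tables.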
Table | Feature Name | Description |
---|---|---|
Primary key | Tweet\_ID | Integer representation of the tweet's unique identifier |
1.Summary\_Details | Language | When present, indicates a BCP47 language identifier corresponding to the machine-detected language of the Tweet text |
1.Summary\_Details | Geolocation\_cordinate | Indicates whether or not the geographic location of the tweet was reported |
1.Summary\_Details | RT | Indicates if the tweet is a retweet (YES) or original tweet (NO) |
1.Summary\_Details | Likes | Number of likes for the tweet |
1.Summary\_Details | Retweets | Number of times the tweet was retweeted |
1.Summary\_Details | Country | When present, indicates a list of uppercase two-letter country codes from which the tweet comes |
1.Summary\_Details | Date\_Created | UTC date and time the tweet was created |
2.Summary\_Hastag | Hashtag | Hashtag (\#) present in the tweet |
3.Summary\_Mentions | Mentions | Mention (@) present in the tweet |
4.Summary\_Sentiment | Sentiment\_Label | Most probable tweet sentiment (neutral, positive, negative) |
4.Summary\_Sentiment | Logits\_Neutral | Non-normalized prediction for neutral sentiment |
4.Summary\_Sentiment | Logits\_Positive | Non-normalized prediction for positive sentiment |
4.Summary\_Sentiment | Logits\_Negative | Non-normalized prediction for negative sentiment |
5.Summary\_NER | NER\_text | Text stating a named entity recognized by the NER algorithm |
5.Summary\_NER | Start\_Pos | Initial character position within the tweet of the NER\_text |
5.Summary\_NER | End\_Pos | End character position within the tweet of the NER\_text |
5.Summary\_NER | NER\_Label Prob | Label and probability of the named entity recognized by the NER algorithm |
6.Summary\_Sentiment\_ES | Sentiment\_Label | Most probable tweet sentiment (neutral, positive, negative) |
6.Summary\_Sentiment\_ES | Probability\_pos | Probability of the tweet's sentiment being positive (\<=0.33 is negative, \>0.33 AND \<0.66 is neutral, else positive) |
7.Summary\_NER\_ES | NER\_text | Text stating a named entity recognized by the NER algorithm |
7.Summary\_NER\_ES | Start\_Pos | Initial character position within the tweet of the NER\_text |
7.Summary\_NER\_ES | End\_Pos | End character position within the tweet of the NER\_text |
7.Summary\_NER\_ES | NER\_Label Prob | Label and probability of the named entity recognized by the NER algorithm |
For more information, see the Twitter API documentation and the documentation for the API Tweet object.
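Because `Tweet_ID` is the primary key shared by all tables, rows from different tables can be combined by joining on it. The sketch below uses only the standard library and small made-up CSV snippets in place of real hourly files; it assumes each file has a header row whose column names match the feature names in the table above, which should be verified against the actual files.

```python
import csv
import io

# Tiny in-memory stand-ins for two hourly files; the rows are made up.
details_csv = io.StringIO(
    "Tweet_ID,Language,RT,Likes\n"
    "1001,en,NO,12\n"
    "1002,es,YES,0\n"
)
sentiment_csv = io.StringIO(
    "Tweet_ID,Sentiment_Label\n"
    "1001,positive\n"
)


def join_on_tweet_id(left_file, right_file):
    """Inner-join two CSV tables on the Tweet_ID column.

    Assumes the right-hand table has at most one row per Tweet_ID.
    """
    right_rows = {row["Tweet_ID"]: row for row in csv.DictReader(right_file)}
    joined = []
    for row in csv.DictReader(left_file):
        match = right_rows.get(row["Tweet_ID"])
        if match is not None:
            joined.append({**row, **match})
    return joined


rows = join_on_tweet_id(details_csv, sentiment_csv)
print(rows)
```

Tables such as hashtags or mentions may hold several rows per tweet, in which case a one-to-many join (for example, grouping the right-hand rows in a list per `Tweet_ID`) would be needed instead.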
As of 09/25/2021:
Total Number of tweets: 2,230,640,905
Average daily number of tweets: 150,228
Year | Month | Daily Avg. Original | Daily Avg. Retweets | Daily Avg. Tweets | Total of Original | Total of Retweets | Total of Tweets | Total with Geolocation | Max No. Retweets | Max No. Likes |
---|---|---|---|---|---|---|---|---|---|---|
2020 | 1 | 5,947 | 30,576 | 35,501 | 1,958,346 | 7,852,504 | 9,810,850 | 1,773 | 674,151 | 334,802 |
2020 | 2 | 10,978 | 29,918 | 40,604 | 7,624,648 | 21,944,443 | 29,568,948 | 8,103 | 469,739 | 637,589 |
2020 | 3 | 13,095 | 44,714 | 56,283 | 12,610,824 | 46,659,589 | 59,270,412 | 19,952 | 1,064,693 | 1,255,858 |
2020 | 4 | 30,091 | 89,513 | 119,859 | 20,591,357 | 60,301,889 | 80,893,244 | 38,213 | 649,823 | 662,005 |
2020 | 5 | 35,163 | 99,928 | 135,709 | 26,258,213 | 73,618,083 | 99,876,289 | 47,684 | 1,007,616 | 929,811 |
2020 | 6 | 51,033 | 142,569 | 193,096 | 34,786,076 | 95,171,388 | 129,957,461 | 58,138 | 790,652 | 882,693 |
2020 | 7 | 53,720 | 155,042 | 209,738 | 39,611,015 | 111,876,344 | 151,487,359 | 56,808 | 615,768 | 1,287,117 |
2020 | 8 | 51,330 | 143,291 | 195,037 | 37,549,475 | 102,834,375 | 140,383,850 | 55,912 | 2,183,434 | 860,162 |
2020 | 9 | 50,068 | 132,040 | 182,947 | 35,861,979 | 92,957,247 | 128,819,226 | 32,381 | 1,925,489 | 839,689 |
2020 | 10 | 54,489 | 137,225 | 198,708 | 41,062,885 | 104,195,279 | 144,962,625 | 319,101 | 946,810 | 785,385 |
2020 | 11 | 64,125 | 111,686 | 177,062 | 45,096,171 | 77,885,575 | 122,981,746 | 26,488 | 1,187,438 | 619,643 |
2020 | 12 | 64,840 | 121,149 | 186,852 | 49,065,436 | 87,366,002 | 133,179,589 | 3,277,244 | 1,402,911 | 1,038,164 |
2021 | 1 | 58,225 | 134,387 | 192,272 | 40,878,618 | 92,341,359 | 133,219,977 | 24,293 | 1,437,164 | 867,275 |
2021 | 2 | 47,789 | 104,467 | 152,780 | 30,916,912 | 65,130,838 | 96,047,732 | 23,977 | 971,119 | 644,697 |
2021 | 3 | 51,889 | 117,776 | 168,768 | 37,803,773 | 83,103,448 | 120,907,221 | 28,788 | 1,083,628 | 599,385 |
2021 | 4 | 47,350 | 128,902 | 176,534 | 34,252,762 | 90,730,535 | 124,983,296 | 24,117 | 1,111,306 | 653,537 |
2021 | 5 | 45,779 | 120,864 | 166,235 | 34,427,222 | 89,269,622 | 123,696,843 | 22,669 | 3,194,460 | 697,980 |
2021 | 6 | 37,931 | 84,426 | 122,204 | 28,310,536 | 63,462,978 | 91,773,014 | 17,693 | 824,584 | 413,875 |
2021 | 7 | 47,667 | 107,313 | 156,563 | 34,944,432 | 77,322,490 | 112,265,717 | 16,277 | 1,108,703 | 633,347 |
2021 | 8 | 47,626 | 109,563 | 157,721 | 35,681,168 | 81,535,924 | 117,217,091 | 13,943 | 1,271,696 | 732,266 |
2021 | 9 | 40,210 | 89,489 | 130,323 | 24,831,001 | 54,507,414 | 79,338,415 | 9,575 | 1,107,188 | 378,328 |
There is a total of 4,123,129 tweets with geolocation information, which are shown on a map below:
Languages | Total No. Tweets | Percentage of Tweets |
---|---|---|
English | 1,442,630,792 | 64.78 |
Spanish; Castilian | 274,311,820 | 12.32 |
Portuguese | 97,078,676 | 4.36 |
Bahasa | 70,425,043 | 3.16 |
French | 68,159,825 | 3.06 |
Others | 274,485,479 | 12.32 |
The sentiment of all the English tweets was estimated using BB_twtr, a state-of-the-art Twitter sentiment algorithm (see code here).
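The `Logits_Neutral`, `Logits_Positive`, and `Logits_Negative` features are non-normalized scores. If probabilities are needed, a common approach is to pass the three logits through a softmax. The sketch below assumes that this is the appropriate normalization for these scores, which the source does not state explicitly; the logit values are hypothetical.

```python
import math


def softmax(logits):
    """Convert non-normalized scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits in the order (neutral, positive, negative).
labels = ["neutral", "positive", "negative"]
probs = softmax([0.2, 2.1, -1.3])
best = labels[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

Under this assumption, the label with the largest logit is also the one with the largest probability, which should agree with `Sentiment_Label`.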
The Named Entity Recognition algorithm of flairNLP was used to extract topics of conversation about PERSON, LOCATION, ORGANIZATION, and others. Below are the top 5 NER, Mentions (@), and Hashtags (#):
Mentions | Hashtags | NER Person | NER Location | NER Organization | NER Miscellaneous |
---|---|---|---|---|---|
@realDonaldTrump | \#covid19 | trump | us | cdc | covid-19 |
14,106,218 | 119,608,149 | 20,963,213 | 17,679,605 | 12,841,729 | 25,278,623 |
@realdonaldtrump | \#coronavirus | biden | uk | covid | covid |
7,159,966 | 43,004,769 | 12,996,856 | 11,181,359 | 8,017,439 | 20,174,488 |
@mippcivzla | \#covid | covid | india | pfizer | americans |
4,217,090 | 14,722,111 | 12,920,565 | 11,049,324 | 3,463,031 | 10,921,402 |
@joebiden | \#whatshappeninginmyanmar | fauci | china | senate | covid19 |
3,486,876 | 3,502,685 | 3,390,172 | 8,453,754 | 2,679,552 | 5,743,141 |
@narendramodi | \#covid\_19 | joe biden | florida | congress | republicans |
3,122,874 | 2,252,353 | 2,297,696 | 3,941,757 | 1,605,602 | 2,499,222 |
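The `Start_Pos` and `End_Pos` features locate each named entity inside the tweet text, which is only available after hydration. The sketch below shows one way to check an entity against a hydrated tweet. It is written defensively because the source does not say whether `End_Pos` is exclusive or inclusive, nor whether `NER_text` is stored lowercased, so it tries both end conventions and compares case-insensitively. The function name and the sample text are hypothetical.

```python
def locate_entity(text, ner_text, start_pos, end_pos):
    """Return the slice of `text` matching the entity, or None if offsets do not line up.

    Tries an exclusive end offset first, then an inclusive one.
    """
    for end in (end_pos, end_pos + 1):
        candidate = text[start_pos:end]
        if candidate.lower() == ner_text.lower():
            return candidate
    return None


# Made-up tweet text and offsets for illustration.
tweet_text = "The CDC updated its guidance today"
found = locate_entity(tweet_text, "cdc", 4, 7)
print(found)
```

A `None` result suggests that the hydrated text differs from the text the NER algorithm saw (for example, because of encoding or truncation), so the offsets should not be trusted for that tweet.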
The sentiment of all the Spanish tweets was estimated using the neural-network-based Spanish sentiment analysis model of the Python library sentiment-analysis-spanish 0.0.25.
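The mapping from `Probability_pos` to a sentiment label described in the feature table can be written as a small function. This is a sketch that reads the thresholds as: at most 0.33 is negative, strictly between 0.33 and 0.66 is neutral, and anything else is positive. The function name is hypothetical.

```python
def spanish_sentiment_label(probability_pos: float) -> str:
    """Map the probability of positive sentiment to a label using the README thresholds."""
    if probability_pos <= 0.33:
        return "negative"
    if probability_pos < 0.66:
        return "neutral"
    return "positive"


for p in (0.10, 0.50, 0.90):
    print(p, spanish_sentiment_label(p))
```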
The Spanish Named Entity Recognition algorithm of flairNLP was used to extract topics of conversation about PERSON, LOCATION, ORGANIZATION, and others. Below are the top 5 NER of all the Spanish tweets (* some special characters in Spanish, such as characters with accent marks, are not correctly represented in the readme file):
NER Person | NER Location | NER Organization | NER Miscellaneous |
---|---|---|---|
covid | venezuela | mippcivzla | covid-19 |
4,072,431 | 3,804,198 | 3,832,944 | 23,230,640 |
nicolasmaduro | méxico | covid | covid |
2,186,125 | 3,016,636 | 1,960,002 | 14,632,052 |
mippcivzla | españa | vtvcanal8 | covid19 |
1,185,259 | 1,560,026 | 1,922,380 | 13,139,186 |
lopezobrador | argentina | gobierno | coronavirus |
294,089 | 680,109 | 958,936 | 4,157,520 |
trump | madrid | oms | delta |
264,058 | 678,516 | 680,663 | 246,831 |
Only tweets in English were collected from 22 January to 31 January 2020. After that period, the algorithm collected tweets in all languages. There are also some known gaps in the data, shown below:
Date | Time |
---|---|
2020-08-06 | 07:00 UTC |
2020-08-08 | 07:00 UTC |
2020-08-09 | 07:00 UTC |
2020-08-14 | 07:00 UTC |
2021-05-06 | 16:00 UTC |
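When iterating over hourly files, it can help to skip the known gaps listed above rather than treat a missing file as an error. The sketch below assumes that each listed entry corresponds to one missing hour starting at the given UTC time; the source does not state the duration of each gap, so adjust this if a gap spans more than one hour.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# Known gaps from the table above, assumed to be one hour each.
KNOWN_GAPS = {
    datetime(2020, 8, 6, 7, tzinfo=UTC),
    datetime(2020, 8, 8, 7, tzinfo=UTC),
    datetime(2020, 8, 9, 7, tzinfo=UTC),
    datetime(2020, 8, 14, 7, tzinfo=UTC),
    datetime(2021, 5, 6, 16, tzinfo=UTC),
}


def hours_with_data(start, end):
    """Yield each hourly timestamp in [start, end) that is not a known gap."""
    current = start
    while current < end:
        if current not in KNOWN_GAPS:
            yield current
        current += timedelta(hours=1)


hours = list(hours_with_data(
    datetime(2021, 5, 6, 15, tzinfo=UTC),
    datetime(2021, 5, 6, 18, tzinfo=UTC),
))
print([h.strftime("%Y_%m_%d_%H") for h in hours])
```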
The notebook Automatically_Hydrate_TweetsIDs_COVID190_v2.ipynb allows you to automatically hydrate the tweet IDs from our COVID19_Tweets_dataset GitHub repository.
You can run this notebook directly in the cloud using Google Colab (see the how-to tutorials) and Google Drive.
In order to hydrate the tweet-IDs using TWARC you need to create a Twitter Developer Account.
The Twitter API’s rate limits make it difficult to fetch data from tweet-IDs. We therefore recommend using Hydrator to convert the list of tweet-IDs into a CSV file containing all data and metadata relating to the tweets. Hydrator also manages the Twitter API rate limits for you.
For those who prefer a command-line interface over a GUI, we recommend using Twarc.
Follow the instructions on the Hydrator github repository.
Follow the instructions on the Twarc github repository.
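Both tools take a plain list of tweet IDs as input, one ID per line. The sketch below shows one way to produce such a list from a summary table while filtering on the provided metadata. It assumes the CSV files have a header row with the feature names listed in the data organization table; the sample rows, the filter, and the function name are made up for illustration.

```python
import csv
import io


def extract_tweet_ids(csv_file, keep=lambda row: True):
    """Collect Tweet_ID values from the rows that pass the keep() filter."""
    return [row["Tweet_ID"] for row in csv.DictReader(csv_file) if keep(row)]


# Made-up stand-in for one hourly Summary_Details file.
details_csv = io.StringIO(
    "Tweet_ID,Language,RT\n"
    "1001,en,NO\n"
    "1002,en,YES\n"
    "1003,es,NO\n"
)

# Keep only original (non-retweet) English tweets.
ids = extract_tweet_ids(
    details_csv,
    keep=lambda row: row["Language"] == "en" and row["RT"] == "NO",
)

# Write one ID per line; a real script would write to a file such as ids.txt.
out = io.StringIO()
out.write("\n".join(ids) + "\n")
print(out.getvalue(), end="")
```

The resulting file can be loaded into Hydrator, or passed to Twarc on the command line. The Twarc documentation describes a `hydrate` command along the lines of `twarc hydrate ids.txt > tweets.jsonl`, but the exact syntax depends on the Twarc version, so check its repository first.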
If you would like to filter the tweets’ ID based on some metadata not provided on the repo (e.g., geolocation), if you would like to run some additional analyses on the full tweet text data (e.g., sentiment analysis using another language model, topic modeling, etc.), or if you have any questions about the dataset, please contact Dr. Christian Lopez at [email protected]
Filters that have already been performed are located in the ‘Tweets_ID_Filter_requests’ directory.
This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript:
Christian Lopez, and Caleb Gallemore (2020) An Augmented Multilingual Twitter Dataset for Studying the COVID-19 Infodemic. DOI: 10.21203/rs.3.rs-95721/v1 https://www.researchsquare.com/article/rs-95721/v1
Emily Chen, Kristina Lerman, and Emilio Ferrara. 2020. #COVID-19: The First Public Coronavirus Twitter Dataset. arXiv:cs.SI/2003.07372, 2020
https://github.com/echen102/COVID-19-TweetIDs