THIS REPO IS NO LONGER NECESSARY AND IS NOT BEING MAINTAINED GIVEN STAFFED RESOURCES SUCH AS https://factba.se/
orangetext
is an #rstats project to keep track of The 馃崐 One's speeches and include some code snippets for text analysis on them.
Gladly accepting PRs for legit new transcripts and more analysis scripts.
2016-01-19-presidential-candidacy-anouncement-NewYorkCity-NY.txt
2016-08-31-immigration-Phoenix-AZ.txt
2016-10-13-addressing-sexual-assault-WestPalmBeach-FL.txt
2017-01-20-inaugural.txt
2017-01-21-cia.txt
2017-01-28-may.txt
2017-01-29-weekly-address.txt
2017-01-31-gorsuch.txt
2017-02-01-black-history-month.txt
2017-02-032-national-prayer.txt
2017-02-03-weekly-address.txt
2017-02-07-major_cities_chiefs_association_conference
#
library(ngram)
library(tidyverse)
library(magrittr)
library(ggalt)
library(hrbrmisc)
library(stringi)
library(rprojroot)
Read all the speeches in:
rprojroot::find_rstudio_root_file() %>%
file.path("data", "speeches") %>%
list.files("*.txt", full.names=TRUE) %>%
map(read_lines) %>%
flatten_chr() %>%
stri_enc_toascii() %>%
stri_trim_both() %>%
discard(equals, "") %>%
paste0(collapse=" ") %>%
stri_replace_all_regex("[[:space:]]+", " ") %>%
preprocess(case="lower", remove.punct=TRUE,
remove.numbers=TRUE, fix.spacing=TRUE) -> texts
What have we got:
string.summary(texts)
## Chars: 127786
## Letters: 103672
## Whitespace: 23463
## Punctuation: 0
## Digits: 0
## Words: 23464
## Sentences: 0
## Lines: 1
## Wordlens: 728 869 898 1004 1784 1861 2879 3794 4634 5013
## 1 1 1 1 1 1 1 1 1 1
## Senlens: 0
## 10
## Syllens: 0 8 19 192 829 2174 5859 14331
## 3 1 1 1 1 1 1 1
The 1-grams are kinda useless but this makes a big tibble for 1:8-grams.
map_df(1:8, ~ngram(texts, n=.x) %>%
get.phrasetable() %>%
tbl_df() %>%
rename(words=ngrams) %>%
mutate(words=stri_trim_both(words)) %>%
mutate(ngram=sprintf("ngrams: %s", .x))) %>%
mutate(ngram=factor(ngram, levels=unique(ngram))) %>%
select(ngram, freq, prop, words) -> grams
glimpse(grams)
## Observations: 154,149
## Variables: 4
## $ ngram <fctr> ngrams: 1, ngrams: 1, ngrams: 1, ngrams: 1, ngrams: 1, ...
## $ freq <int> 984, 903, 654, 492, 458, 420, 383, 355, 311, 299, 291, 2...
## $ prop <dbl> 0.041936584, 0.038484487, 0.027872486, 0.020968292, 0.01...
## $ words <chr> "the", "and", "to", "of", "a", "i", "we", "that", "our",...
filter(grams, ngram=="ngrams: 3")
## # A tibble: 20,791 脳 4
## ngram freq prop words
## <fctr> <int> <dbl> <chr>
## 1 ngrams: 3 30 0.0012786634 the united states
## 2 ngrams: 3 27 0.0011507970 going to be
## 3 ngrams: 3 24 0.0010229307 one of the
## 4 ngrams: 3 21 0.0008950644 were going to
## 5 ngrams: 3 20 0.0008524422 we have to
## 6 ngrams: 3 18 0.0007671980 by the way
## 7 ngrams: 3 16 0.0006819538 not going to
## 8 ngrams: 3 15 0.0006393317 and by the
## 9 ngrams: 3 15 0.0006393317 the american people
## 10 ngrams: 3 15 0.0006393317 of our country
## # ... with 20,781 more rows
filter(grams, ngram=="ngrams: 4")
## # A tibble: 22,630 脳 4
## ngram freq prop words
## <fctr> <int> <dbl> <chr>
## 1 ngrams: 4 12 0.0005114871 and by the way
## 2 ngrams: 4 10 0.0004262393 of the united states
## 3 ngrams: 4 9 0.0003836154 the new york times
## 4 ngrams: 4 9 0.0003836154 we are going to
## 5 ngrams: 4 9 0.0003836154 all over the place
## 6 ngrams: 4 9 0.0003836154 thank you thank you
## 7 ngrams: 4 8 0.0003409914 we will make america
## 8 ngrams: 4 8 0.0003409914 we have people that
## 9 ngrams: 4 7 0.0002983675 make america great again
## 10 ngrams: 4 6 0.0002557436 is going to be
## # ... with 22,620 more rows
filter(grams, ngram=="ngrams: 5")
## # A tibble: 23,181 脳 4
## ngram freq prop words
## <fctr> <int> <dbl> <chr>
## 1 ngrams: 5 5 0.0002131287 all you have to do
## 2 ngrams: 5 5 0.0002131287 the new york times and
## 3 ngrams: 5 4 0.0001705030 will make america great again
## 4 ngrams: 5 4 0.0001705030 we will vote for the
## 5 ngrams: 5 4 0.0001705030 that i can tell you
## 6 ngrams: 5 4 0.0001705030 we will bring back our
## 7 ngrams: 5 4 0.0001705030 the united states supreme court
## 8 ngrams: 5 4 0.0001705030 movement the likes of which
## 9 ngrams: 5 4 0.0001705030 we will make america great
## 10 ngrams: 5 4 0.0001705030 we have people that are
## # ... with 23,171 more rows
filter(grams, ngram=="ngrams: 6")
## # A tibble: 23,350 脳 4
## ngram freq prop words
## <fctr> <int> <dbl> <chr>
## 1 ngrams: 6 4 0.0001705103 all you have to do is
## 2 ngrams: 6 4 0.0001705103 we will make america great again
## 3 ngrams: 6 3 0.0001278827 make america great again thank you
## 4 ngrams: 6 3 0.0001278827 you have to do is look
## 5 ngrams: 6 3 0.0001278827 were going to bring our jobs
## 6 ngrams: 6 3 0.0001278827 bless you and god bless america
## 7 ngrams: 6 3 0.0001278827 going to bring our jobs back
## 8 ngrams: 6 3 0.0001278827 god bless you and god bless
## 9 ngrams: 6 3 0.0001278827 to bring our jobs back home
## 10 ngrams: 6 3 0.0001278827 have to do is look at
## # ... with 23,340 more rows