-
Notifications
You must be signed in to change notification settings - Fork 0
/
Last_Statements.Rmd
372 lines (299 loc) · 15.2 KB
/
Last_Statements.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
---
title: "Last Statements by Texas Death Row Inmates"
author: "Zeynep Aslan"
output:
github_document:
toc: true
---
<!-- IMPORTANT: To be able to reproduce this report, you need the death_row.csv data. Use Data_Scraping.R file to generate this data before trying to reproduce this report.
-->
# Overview
This report analyses the [last statements of Texas death row inmates](https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html). Specifically, this report examines the last words of the inmates to reveal the common topics underlying their statements and to explore overall emotional tone of these statements. The report also includes specific analyses looking at the relationship between the inmates' last statements and their demographic information, such as race and age.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE)
# Load required packages
library(tidyverse)
library(tidytext)
library(textrecipes)
library(topicmodels)
library(here)
library(ggwordcloud)
library(textdata)
library(tidymodels)
library(themis)
library(ranger)
library(kableExtra)
# Set seed
set.seed(100)
# Set a global theme for the plots
theme_set(theme_minimal())
# Read data
death_row <- read_csv(here("death_row.csv"), show_col_types = FALSE)
```
```{r remove-no-statement}
# Remove inmates that made no statement
death_row <- death_row %>%
filter(!Statement %in% c("None", "None.", "No statement was made.", "Spoken:No",
"Spoken: No.", "No statement given.", "June 27, 2018", "No",
"This inmate declined to make a last statement.",
"No last statement.", "Profanity directed toward staff."))
```
```{r tidy-data}
# Remove the unrelated parts of the text
death_row <- death_row %>%
# Remove parts in parentheses - these either specify modality (spoken vs. written)
# or the language of the statement or describes the behaviors of the inmate
mutate(Statement = str_remove_all(Statement, pattern = "\\s*\\([^\\)]+\\)")) %>%
# Remove modality info that are not in parentheses
mutate(Statement = str_remove_all(Statement, pattern = "Spoken:")) %>%
mutate(Statement = str_remove_all(Statement, pattern = "Written:")) %>%
mutate(Statement = str_remove_all(Statement, pattern = "Written")) %>%
mutate(Statement = str_remove_all(Statement, pattern = "Verbal:"))
# Tidy the data and remove the stop words
tidy_death_row <- death_row %>%
unnest_tokens(output = word, input = Statement) %>%
anti_join(stop_words)
# Select numbers that are identified as words
nums <- tidy_death_row %>%
filter(str_detect(word, "^[0-9]")) %>%
select(word) %>%
unique()
# Remove the numbers
tidy_death_row <- tidy_death_row %>%
anti_join(nums, by = "word")
```
## Most frequent words
Before getting into more detailed analyses, let's check first the most frequent words used by the inmates in their last statements.
```{r word-cloud, fig.align='left'}
# Build a word cloud with the most frequent 100 words
word_cloud <- tidy_death_row %>%
count(word) %>%
slice_max(order_by = n, n = 100) %>%
mutate(angle = 90 * sample(c(0, 1), n(), replace = TRUE, prob = c(70, 30))) %>%
ggplot(aes(label = word, size = n, angle = angle, color = n)) +
geom_text_wordcloud(rm_outside = TRUE) +
scale_size_area(max_size = 20) +
labs(title = "Figure 1: Most frequent 100 tokens in the last statements of Texas death row \ninmates") +
scale_color_gradient(low = "mediumpurple1", high = "darkslateblue")
# Print the graph
word_cloud
```
The word cloud in Figure 1 demonstrates 100 most frequent words used by the Texas death row inmates in their last statements. Darker colors and bigger fonts suggest higher frequency. Looking at the graph, we can see that the word "Love" appears most frequently in the last statements of inmates.
```{r common-words, fig.align='left'}
# Visualize the most frequent words spoken by the inmates
common_words <- tidy_death_row %>%
# only count each word once per statement
distinct(Execution, `First Name`, `Last Name`,
TDCJNumber, Age, Date, Race, County,
word, .keep_all = TRUE) %>%
count(word) %>%
arrange(desc(n)) %>%
top_n(10) %>%
ggplot(mapping = aes(x = reorder(word, n), y = n)) +
geom_col(fill = "darkseagreen3", color = "darkseagreen4") +
labs(x = "",
y = "Frequency",
title = "Figure 2: Most frequent 10 words in the last statements of death row inmates") +
coord_flip()
#Print graph
common_words
```
Similar to Figure 1, Figure 2 shows the most frequent 10 words used in the last statements of Texas death row inmates. Even from this preliminary analysis, we can observe some clear patterns: Most inmates use their last statements as a chance to express their *love* for their *families* and friends and seek *forgiveness* for their crimes by turning to *religion*.
### Topic Modeling
Now, let's apply *topic modeling* to the data to uncover the potential themes underlying the last statements of the inmates.
```{r topic-modeling}
# Find out the underlying topic structure of the statements
# Build a recipe to prepare the data
statements_rec <- recipe(~., data = death_row) %>%
# Tokenize the statements
step_tokenize(Statement) %>%
# Remove the stopwords
step_stopwords(Statement) %>%
# Calculate the n-grams
step_ngram(Statement, num_tokens = 5, min_num_tokens = 1) %>%
# Keep only the most commonly used 2500 tokens
step_tokenfilter(Statement, max_tokens = 2500) %>%
# Calculate term-frequency for each token
step_tf(Statement)
# Prepare the recipe and extract the data frame
statements_prep <- prep(statements_rec)
statements_df <- bake(statements_prep, new_data = NULL)
# Convert the data frame into DocumentTermMatrix
statements_dtm <- statements_df %>%
# Tidy the data frame
pivot_longer(
cols = -c(Execution:County),
names_to = "token",
values_to = "n"
) %>%
# Filter out tokens with frequency = 0
filter(n != 0) %>%
# Clean the token column name
# Drop empty rows
mutate(
token = str_remove(string = token, pattern = "tf_Statement_"),
id = fct_drop(f = as.factor(TDCJNumber))
) %>%
# Convert the data frame into a DTM
cast_dtm(document = id, term = token, value = n)
# Estimate an LDA model, setting k = 6
statements_lda4 <- LDA(statements_dtm, k = 6, control = list(seed = 100))
statements_lda4_td <- tidy(statements_lda4)
# Visualize the top terms for each topic
top_terms <- statements_lda4_td %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta) %>%
mutate(
topic = factor(topic),
term = reorder_within(term, beta, topic)
) %>%
ggplot(aes(term, beta, fill = topic)) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
scale_x_reordered() +
scale_fill_brewer(palette = "Set2") +
facet_wrap(facets = vars(topic), scales = "free", ncol = 3) +
labs(title = "Figure 3: Latent topics underlying the last statements of death row inmates",
x = "",
y = "") +
coord_flip()
# Print the graph
top_terms
```
Figure 3 shows 6 potential topics underlying the last statements of the death row inmates. All 6 topics look pretty much the same, there are no clear patterns that can differentiate the topics from each other. 5 out of 6 topics include the word *love* (topics 1, 2, 4, 5, and 6) and half of the topics reveal the remorse of the inmate with the words *sorry* and *forgive* (topics 2, 3, and 6). Some of the topics also include religious references such as words *holy* and *god* (topics 3 and 6). It is important to note that the topics does not become any more distinguishable/interpretable by increasing or decreasing the number of models specified in model structure.
## Most Frequent Words by Race
In this part of the report, we will analyze whether or not we can estimate the races of inmates depending on their last statements. Before running a predictive model to test this question, let's check the most frequent words used by inmates of each race.
```{r by-race, fig.align='left'}
# Visualize the most frequent words by race
word_byrace <- tidy_death_row %>%
# filter out the race "other"
filter(Race != "Other") %>%
distinct(Execution, `First Name`, `Last Name`,
TDCJNumber, Age, Date, Race, County,
word, .keep_all = TRUE) %>%
count(Race, word) %>%
group_by(Race) %>%
slice_max(order_by = n, n = 10) %>%
mutate(word = reorder_within(word, n, Race)) %>%
ggplot(aes(x = word, y = n, fill = Race)) +
geom_col(show.legend = FALSE) +
scale_x_reordered() +
scale_fill_brewer(palette = "Pastel1") +
labs(x = "",
y = "Frequency",
title = "Figure 4: Most frequent words in the last statements of death row inmates",
subtitle = "By race") +
facet_wrap(facets = vars(Race), scales = "free") +
coord_flip()
# Print the graph
word_byrace
```
Figure 4 demonstrates the most frequent words used by inmates of different races in their last statements. When we compare the graphs, we can see that there is a great deal of consistency between race groups in terms of the themes of the last statements: As also shown in Figure 1 and 2, inmates of all races express their love for their families and ask for forgiveness from the God in their last statements.
### Predictive Modeling
Now, let's estimate a *random forest model* with 10-folds cross-validation to test if we can predict the race of an inmate based purely on their last statements.
```{r predictive-modeling}
# Based on their last statements, can we predict the races of the inmates?
# Filter out the "other" as there very few data points
death_row_race <- death_row %>%
filter(Race != "Other")
# Split the data into training and testing sets
statements_split <- initial_split(data = death_row_race, strata = Race, prop = 0.75)
statements_train <- training(statements_split)
statements_test <- testing(statements_split)
# Create cross-validation folds
statements_folds <- vfold_cv(data = statements_train, strata = Race)
# Build the recipe
statements_rec <- recipe(Race ~ Statement, data = death_row_race) %>%
# Tokenize the last statements
step_tokenize(Statement) %>%
# Remove stop words
step_stopwords(Statement) %>%
# Keep top 500 tokens
step_tokenfilter(Statement, max_tokens = 500) %>%
# Calculate tf-idf scores
step_tfidf(Statement) %>%
# Correct the names of the scores
step_rename_at(starts_with("tfidf_Statement_"),
fn = ~ str_replace_all(
string = .,
pattern = "tfidf_Statement_",
replacement = ""
)
) %>%
# Downsample the observations by race
step_downsample(Race)
# Define a random forest model
ranger_spec <- rand_forest(trees = 1000) %>%
set_mode("classification") %>%
set_engine("ranger")
# Build the workflow
ranger_workflow <- workflow() %>%
add_recipe(statements_rec) %>%
add_model(ranger_spec)
# Fit the model
ranger_cv <- ranger_workflow %>%
fit_resamples(
resamples = statements_folds,
control = control_resamples(save_pred = TRUE)
)
ranger_cv_metrics <- collect_metrics(ranger_cv)
# Build a table to demonstrate the output
kable(ranger_cv_metrics[1, 1:5], align = 'c',
col.names = c("Metric", "Estimator", "Mean", "n", "Standard Error"),
caption = "Table 1: Random Forest Model (10-folds)",
digits = 2)
# Draw a confusion matrix
conf_mat_resampled(x = ranger_cv, tidy = FALSE) %>%
autoplot(type = "heatmap") +
scale_fill_gradient(low = "wheat", high = "darkorange1", trans = "log") +
labs(title = "Figure 5: Confusion Matrix")
```
Table 1 shows the performance of the model. The average accuracy of the model was `r formatC((ranger_cv_metrics$mean[1])*100, digits =2)`%, which is not great at all. Confusion matrix also displays the sub-optimal performance of the model. We can see that the model most accurately predicts the races of White individuals but even that performance is very bad. Overall, the results of the predictive model suggests that the last statements of inmates do not carry much informative value about the races of their speakers.
## Sentiment Analyses
Finally, let's estimate the sentiment of the last statements of the death row inmates.
```{r bing}
# Visualize the most frequent positive and negative words
death_row_bing <- tidy_death_row %>%
# filter for emotional words using `bing` dataset
inner_join(get_sentiments("bing")) %>%
distinct(Execution, `First Name`, `Last Name`,
TDCJNumber, Age, Date, Race, County,
word, sentiment, .keep_all = TRUE) %>%
group_by(sentiment) %>%
# count the number of occurrences
count(word) %>%
group_by(sentiment) %>%
slice_max(order_by = n, n = 10) %>%
mutate(word = reorder_within(word, n, sentiment)) %>%
ggplot(mapping = aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
scale_x_reordered() +
scale_fill_brewer(palette = "Set2") +
facet_wrap(facets = vars(sentiment), scales = "free_y") +
labs(
title = "Figure 6: Sentimental words used in the last statements of the death row inmates",
x = NULL,
y = "Frequency"
) +
coord_flip()
# Print graph
death_row_bing
```
Figure 6 demonstrates the top 10 most frequently used positive and negative words in the last statements of the death row inmates. The last statements of the inmates generally have a positive rather than negative tone. This is interesting and quite tragic, considering that these words were uttered by people who were in the last few minutes of their lives. However, when positive words are examined more closely, it can be understood where this positivity stems from: It seems like inmates made peace with the fact that their lives were about to end and they were ready to pass to the 'other side'. On another note, the word ***love*** outweighs in the last statements of inmates (i.e., more than half of the inmates who made a statement articulated the word *love* in their statements) and this is because inmates also use their last statements to express their love for their families who were there for their execution.
```{r age-sentiment}
# Visualize the relationship between sentiment of words and age of the inmates
death_row_bing <- tidy_death_row %>%
# filter for emotional words using `afinn` dataset
inner_join(get_sentiments("afinn")) %>%
group_by(Age) %>%
summarise(mean_sentiment = mean(as.numeric(value))) %>%
ggplot(mapping = aes(x = Age, y = mean_sentiment)) +
geom_point() +
geom_smooth(method = "lm", color = "magenta4") +
labs(y = "Sentiment Score",
title = "Figure 7: The relationship between age and sentiment score of last statements")
# Print graph
death_row_bing
```
Lastly, Figure 7 demonstrates the relationship between the ages of the inmates at the time of their execution and the sentiment level of their last statements. The sentiment level of each word in a last statement is coded using the [AFINN](https://search.r-project.org/CRAN/refmans/textdata/html/lexicon_afinn.html) dataset. AFINN is a lexicon of English words rated for valence with a range between -5 (negative) to 5 (positive). As can be seen from Figure 1, across all ages, the last statements of the inmates are slightly positive. However, there seems to be no relationship between inmates' ages and the sentiment level of their last statements.