---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
```
# Lexicon-based Sentiment Analysis for Economic and Financial Applications
The `FiGASR` package allows R users to leverage cutting-edge NLP techniques to easily run sentiment analysis on economic news content:
this package is a wrapper of the [`SentiBigNomics`](https://github.com/sergioconsoli/SentiBigNomics) Python package.
Given a list of texts and a list of tokens of interest (ToI) as input, the algorithm analyses the texts and computes the economic sentiment associated with each ToI.
Two key features characterize this approach.
First, it is *fine-grained*, since words are assigned a dictionary-based
polarity score in the interval [-1,1].
Second, it is *aspect-based*, since the algorithm selects the chunk of text that relates to the ToI based on a set of semantic rules and calculates the sentiment only on that text, rather than the full article.
The package includes additional features, such as automatic negation handling, tense detection, location filtering, and the option to exclude specific words from the sentiment computation.
`FiGASR` only supports English, as it relies on the *en_core_web_lg* language model from the `spaCy` Python module.
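For instance, once the package is set up (see the installation steps below), word exclusion can be combined with a standard call; a minimal sketch, where the `exclude` argument mirrors the ECB example further down and negation handling requires no extra option:
```{r features, eval = FALSE}
library(FiGASR)
## Negation handling is automatic: no extra argument is needed for
## sentences such as "The economy is not improving".
text <- list("The economy is not improving")
## `exclude` drops the listed expressions from the sentiment computation.
get_sentiment(text = text,
              include = list("economy"),
              exclude = list("stock market"))
```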
## Installation
You can install the package from GitHub as follows:
```{r gh-installation, eval = FALSE}
install.packages("devtools")
devtools::install_github("lucabarbaglia/FiGASR")
```
If this is the first time you are using `FiGASR`, set up the associated Python environment:
```{r env, eval = FALSE}
FiGASR::figas_install()
```
## A start-up example
Let's assume that you want to compute the sentiment associated with two tokens of interest, namely *unemployment* and *economy*, given the following two sentences.
```{r test}
library(FiGASR)
text <- list("Unemployment is rising at high speed",
"The economy is slowing down and unemployment is booming")
include = list("unemployment", "economy")
get_sentiment(text = text, include = include)
```
The output of the function `get_sentiment` is a list containing two objects:
- a tibble `sentiment` with the average sentiment computed for each text;
- a tibble `sentiment_by_chunk` with the sentiment computed for each chunk detected in the texts.
The first element of the output list provides the overall average sentiment score of each text, while the second provides the detailed score of each chunk of text that relates to one of the ToI.
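Continuing the start-up example, the two tibbles can be accessed by name; a minimal sketch (the `Doc_id` and `Average_sentiment` columns of `sentiment` are the ones used in the plots below, while the exact columns of `sentiment_by_chunk` are best checked by printing it):
```{r output-access, eval = FALSE}
res <- get_sentiment(text = text, include = include)
res$sentiment          ## one row per text: Doc_id and Average_sentiment
res$sentiment_by_chunk ## one row per chunk detected around a ToI
```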
## ECB Economic Bulletin
Among the available data sets, the package provides access to `senti_bignomics`, a fine-grained dictionary customized for economic sentiment analysis; to `ecb_bulletin`,
the [ECB Economic Bulletin](https://www.ecb.europa.eu/pub/economic-bulletin/html/index.en.html)^[ECB Economic Bulletin copyright notice disclaimer: "All rights reserved. Reproduction for educational and non-commercial purposes is permitted provided that the source is acknowledged".] released between 1999 and 2019; and to `beige_book`,
the [FED Beige Book](https://www.federalreserve.gov/monetarypolicy/beige-book-default.htm) released between 1983 and 2019.
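Each data set ships with the package and can be loaded with `data()`; for example:
```{r datasets, eval = FALSE}
data("senti_bignomics") ## fine-grained economic dictionary
data("ecb_bulletin")    ## ECB Economic Bulletin releases, 1999-2019
data("beige_book")      ## FED Beige Book releases, 1983-2019
```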
Let's illustrate some additional features of the package: assume that we want to extract the sentiment about "economic activity" from the ECB Economic Bulletin releases in 2007-2013. The figure below plots the economic sentiment computed by `FiGASR`, which promptly identifies the recessionary periods indicated by the shaded areas following the [EABCN business cycle reference dates](https://eabcn.org/dc/chronology-euro-area-business-cycles).
```{r ECB Economic Bulletin, fig.height=4, fig.width=5}
data("ecb_bulletin")
ecb_sub <- ecb_bulletin[ecb_bulletin$Date >= as.Date("2007-01-01") & ecb_bulletin$Date <= as.Date("2013-01-01"), ]
text <- as.list(as.data.frame(ecb_sub)[, "Text"])
## Compute sentiment about "economic activity"
ecb_sent <- get_sentiment(text = text,
                          include = list("economic activity")) ## optionally: exclude = list("stock market")
## Add dates for original Doc_id
library(ggplot2)
library(dplyr, warn.conflicts=FALSE)
ecb_dates <- ecb_sub %>%
mutate(Doc_id = row_number()) %>%
select(Doc_id, Date) %>%
left_join(ecb_sent$sentiment, by="Doc_id")
## Plot the time series of the average sentiment
ecb_dates %>%
ggplot(aes(x = Date, y = Average_sentiment)) +
geom_line(color="#000080") +
theme_bw() +
annotate("rect", xmin = as.Date("2008-03-01"), xmax = as.Date("2009-06-01"),
ymin=-Inf, ymax=Inf, alpha = .2) +
annotate("rect", xmin = as.Date("2011-09-01"), xmax = as.Date("2013-03-01"),
ymin=-Inf, ymax=Inf, alpha = .2) +
geom_hline(yintercept = 0, col="grey") +
ylab("Average sentiment")
```
The `FiGASR` algorithm leverages a set of semantic rules to identify the part of the text that relates to and characterizes the token of interest. The argument `oss` allows running a *naive* sentiment computation that assigns a score to each word in the text without using semantic rules (i.e., an overall sentiment score). The figure below shows the sentiment computed with the proposed algorithm (in blue) and in the naive way (in red): the former captures the recessionary periods in a more **timely** and **accurate** way than the latter.
```{r OSS ECB Economic Bulletin, fig.height=4, fig.width=6}
## Overall sentiment score
ecb_sent_OSS <- get_sentiment(text = text,
include = list("economic activity"),
oss = TRUE)
ecb_sent_comparison <- left_join(ecb_sent$sentiment, ecb_sent_OSS$sentiment, by="Doc_id")
colnames(ecb_sent_comparison) <- c("Doc_id", "FiGASR", "Naive")
## Plot the time series of the average sentiment
library(tidyr)
cbind(ecb_sent_comparison, ecb_sub) %>%
gather(var, val, 'FiGASR', Naive) %>%
ggplot(aes(x = Date, y = val, color=var, group=var)) +
geom_line() +
theme_bw() +
annotate("rect", xmin = as.Date("2008-03-01"), xmax = as.Date("2009-06-01"),
ymin=-Inf, ymax=Inf, alpha = .2) +
annotate("rect", xmin = as.Date("2011-09-01"), xmax = as.Date("2013-03-01"),
ymin=-Inf, ymax=Inf, alpha = .2) +
geom_hline(yintercept = 0, col="grey") +
ylab("Average sentiment") +
scale_colour_manual(values=c("#000080", "#E41A1C")) +
theme(legend.title = element_blank())
```
## Daily sentiment data
The daily sentiment indicators for the US by [Barbaglia et al. (2023)](https://doi.org/10.1080/07350015.2022.2060988) and for Europe by [Barbaglia et al. (202X)](https://doi.org/10.1002/jae.3027) can be accessed with the command `data("sentiment")`.
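The data set is a wide panel, with one row per day and country and one column per topic (this structure mirrors the plotting code used for the figure below); a minimal sketch to inspect it:
```{r sentiment-peek, eval = FALSE}
data("sentiment")
## Keep the US series and inspect the topic columns.
head(sentiment[sentiment$country == "US", ])
```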
The figure below plots the economic sentiment indicators for the US, which promptly identify the recessionary periods indicated by the shaded areas following the [NBER business cycle reference dates](https://www.nber.org/cycles.html).
```{r US sentiment, echo=FALSE, message=FALSE, warning=FALSE, out.width='80%'}
nber.min <- c("1990-07-01", "2001-03-01", "2007-12-01", "2020-01-01")
nber.max <- c("1991-03-01", "2001-11-01", "2009-06-01", "2020-12-31")
nber.min <- as.Date(nber.min)
nber.max <- as.Date(nber.max)
data("sentiment")
library(ggplot2)
library(dplyr, warn.conflicts=FALSE)
sentiment %>%
filter(country == "US") %>%
select(-country) %>%
tidyr::pivot_longer(!date, names_to="topic", values_to="sentiment") %>%
## QUARTERLY
mutate(date = zoo::as.yearqtr(date)) %>%
group_by(date, topic) %>%
summarise(sentiment = mean(sentiment, na.rm = TRUE)) %>%
ggplot(aes(x=date, y=sentiment, group=topic)) +
geom_line(color="#000080") +
theme_bw() +
facet_wrap(.~topic, scales = "free") +
annotate("rect", xmin = zoo::as.yearqtr(nber.min), xmax = zoo::as.yearqtr(nber.max),
ymin=-Inf, ymax=Inf, alpha = .2)
```
## Economic Lexicon
Within the package, we also provide access to the **Economic Lexicon (EL)**, a dictionary with fine-grained scores in [-1,+1] for 4,165 terms.
The EL is built with human annotation and specifically targets economic applications.
More details are provided in [Barbaglia et al. (2022)](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4106936).
```{r EL}
data("EL")
EL
```
## Citation
If you use this package, please *cite* the following references:
<!-- ## References: -->
* Barbaglia, Consoli, Manzan (202X). Forecasting GDP in Europe with Textual Data. *Journal of Applied Econometrics*: [https://doi.org/10.1002/jae.3027](https://doi.org/10.1002/jae.3027)
* Barbaglia, Consoli, Manzan (2023). Forecasting with Economic News. *Journal of Business & Economic Statistics*: [https://doi.org/10.1080/07350015.2022.2060988](https://doi.org/10.1080/07350015.2022.2060988)
* Consoli, Barbaglia, Manzan (2022). Fine-Grained Aspect-Based Sentiment Analysis on Economic and Financial Lexicon. *Knowledge-Based Systems*: [https://doi.org/10.1016/j.knosys.2022.108781](https://doi.org/10.1016/j.knosys.2022.108781)