Detecting Financial Fraud with First Digit Law

Feng Xi, 2201212354

1. Introduction to First Digit Law

Have you noticed that numbers in real life often begin with 1 and seldom begin with 9?

The First Digit Law, also known as Benford's Law, says that if we collect the first digits of positive numbers, we will probably see the following distribution:

[Figures: Benford's Law with stock price; Benford's Law with GDP; Benford's Law with physics constants; Benford's Law with many numbers]

First Digit Law:

The probability that a positive number written in base $b$ begins with a digit in $[d,d+l)$ can be expressed as:

$$ P_{b,l,d}= \log_b\left(1+\frac{l}{d}\right) $$

Letting $b=10$ (the decimal system) and $l=1$ gives

$$ P_{d}= \log_{10}\left(1+\frac{1}{d}\right) $$

[Image]
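As a quick check of the two formulas above, here is a minimal Python sketch that evaluates $P_{b,l,d}$ and tabulates the decimal case. The function name `benford_probability` is only illustrative.

```python
import math


def benford_probability(d, l=1.0, b=10.0):
    """P_{b,l,d} = log_b(1 + l/d), the First Digit Law probability."""
    return math.log(1 + l / d, b)


# Decimal case (b = 10, l = 1): one probability per leading digit 1..9
probs = {d: benford_probability(d) for d in range(1, 10)}
for d, p in probs.items():
    print(d, f"{p:.3f}")
```

The nine probabilities sum to 1, with digit 1 at about 30.1% and digit 9 at about 4.6%.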

2. Proof of First Digit Law

2.1 Intuition

To grow from 1 to 2, you need to grow by 100%.

To grow from 8 to 9, you just need to grow by 12.5%.

So it is harder to grow from 1 to 2 than from 8 to 9, which means a growing quantity spends more time in [1,2).
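The intuition can be checked with a small simulation. The sketch below assumes a value that grows by a fixed rate per time step (the 0.1% rate and the three decades are arbitrary choices, not from the original analysis) and counts how many steps it spends on each leading digit.

```python
import math
from collections import Counter


def leading_digit(x):
    """First significant decimal digit of a positive finite number."""
    if not x > 0:
        raise ValueError("x must be positive")
    # repr() gives the shortest decimal string of the float, e.g. '0.03' or '1e-07';
    # the first character in 1..9 is the leading digit.
    return next(int(ch) for ch in repr(float(x)) if ch in "123456789")


def growth_digit_shares(rate=0.001, decades=3):
    """Share of time steps a steadily growing value spends on each leading digit."""
    steps = round(decades * math.log(10) / math.log(1 + rate))
    value = 1.0
    counts = Counter()
    for _ in range(steps):
        counts[leading_digit(value)] += 1
        value *= 1 + rate
    return {d: counts[d] / steps for d in range(1, 10)}


shares = growth_digit_shares()
for d in range(1, 10):
    # simulated share of time next to log10(1 + 1/d)
    print(d, round(shares[d], 3), round(math.log10(1 + 1 / d), 3))
```

Under steady percentage growth, the share of time spent on each leading digit should land close to $\log_{10}(1+\frac{1}{d})$.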

2.2 Proof with Laplace Transform

Reference: First Digit Law from Laplace Transform

$F(x)$: the PDF (probability density function) of $x$

$P_d$: the probability that $x \in [d\cdot 10^n,(d+1)\cdot 10^n)$ for some integer $n$, i.e. that the first digit of $x$ is $d$

$$ P_{d}= \sum_{n=-\infty}^{+\infty}\int_{d\cdot 10^n}^{(d+1)\cdot 10^n} F(x) {\rm{d}} x $$

$$ =\int_0^{\infty}F(x)g_d(x){\rm{d}}x$$

where

$$ g_d(x) = \sum_{n=-\infty}^{+\infty} [\eta(x-d\cdot 10^n)-\eta(x-(d+1)\cdot 10^n)] $$

$$ \eta(x) = \begin{cases} 1 \quad {\rm{if}} \quad x\geq 0 \\ 0 \quad {\rm{if}} \quad x<0 \\ \end{cases} $$

In the interval [1,30), the gap between the shaded areas in $g_2(x)$ is wider than that in $g_1(x)$:

[Image]

The figure above intuitively explains the inequality among the 9 digits: smaller leading digits are more likely to appear.
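To make the shaded areas concrete, the following sketch lists the sub-intervals of a range on which $g_d(x)=1$ and the gaps between them. The helper names `shaded_intervals` and `gaps` are illustrative, and only exponents $n$ from -3 to 3 are scanned, which is enough for [1,30).

```python
def shaded_intervals(d, lo, hi, exponents=range(-3, 4)):
    """Sub-intervals of [lo, hi) on which g_d(x) = 1, i.e. x has leading digit d."""
    intervals = []
    for n in exponents:
        start = max(d * 10.0 ** n, lo)
        end = min((d + 1) * 10.0 ** n, hi)
        if start < end:
            intervals.append((start, end))
    return intervals


def gaps(intervals):
    """Widths of the unshaded gaps between consecutive shaded intervals."""
    return [nxt[0] - cur[1] for cur, nxt in zip(intervals, intervals[1:])]


for d in (1, 2):
    parts = shaded_intervals(d, 1, 30)
    print(d, parts, gaps(parts))
```

In [1,30), $g_1$ is shaded on [1,2) and [10,20) with a gap of 8, while $g_2$ is shaded on [2,3) and [20,30) with a gap of 17.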

The idea of the proof is to use

  • $G(t)$: the Laplace transform of $g(x)$

  • $f(t)$: the inverse Laplace transform of $F(x)$,

  • and the following property of the Laplace transform:

$$ \int_{0}^{+\infty} F(x)g(x) {\rm{d}} x = \int_{0}^{+\infty} f(t)G(t) {\rm{d}} t $$

which finally yields:

$$ P_d = \log_{10}\left(1+\frac{1}{d}\right) + \int_{-\infty}^{+\infty}\tilde{f} (s) \tilde{\Delta} (s) {\rm{d}}s$$

where the second term is a small error term.
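To see the size of the error term in one concrete case, the sketch below evaluates the sum of integrals that defines $P_d$ for an exponential PDF $F(x)=\lambda e^{-\lambda x}$, where each integral has a closed form. The choice of the exponential distribution, the rate $\lambda=1$ and the truncation of $n$ to [-20, 20] are assumptions made for illustration only; they are not part of the referenced proof.

```python
import math


def first_digit_prob_exponential(d, rate=1.0, n_min=-20, n_max=20):
    """P_d for an exponential PDF: sum over n of the mass on [d*10^n, (d+1)*10^n)."""
    total = 0.0
    for n in range(n_min, n_max + 1):
        lo = d * 10.0 ** n
        hi = (d + 1) * 10.0 ** n
        # Integral of rate*exp(-rate*x) over [lo, hi) in closed form
        total += math.exp(-rate * lo) - math.exp(-rate * hi)
    return total


probs = {d: first_digit_prob_exponential(d) for d in range(1, 10)}
for d in range(1, 10):
    benford = math.log10(1 + 1 / d)
    # exact P_d, Benford value, and their difference (the error term)
    print(d, round(probs[d], 4), round(benford, 4), round(probs[d] - benford, 4))
```

For this distribution the probabilities follow the Benford values closely but not exactly; the printed differences are the error term for this particular PDF.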

2.3 Proof with Fourier Transform

My undergraduate thesis proves the First Digit Law with the Fourier transform.

The basic idea is similar.

  • Benford's Law with physics constants

The proof with the Laplace/Fourier transform requires that the PDF of x has a Laplace/Fourier transform.

The conditions for having a Fourier transform are weaker than those for having a Laplace transform.

3. Motivation

The First Digit Law can be used to detect financial fraud, because numbers in financial statements also follow the First Digit Law. If they do not, it is possible that someone has manipulated the numbers.

Below we show the first-digit distribution of all the positive numbers in the 2021Q3 quarterly financial statements of two companies, BYD and Gotion High-TECH, which both make EV batteries:

[Image]

Gotion's distribution seems to violate the First Digit Law, and the company was indeed caught committing financial fraud in July 2022.

So we hypothesize that if a company's first-digit frequency differs a lot from Benford's Law, the company is more likely to have committed financial fraud.

We use several machine learning methods to test this hypothesis, where the independent variables are the differences between a company's first-digit frequency and the Benford frequency, and the dependent variable is whether the company has committed financial fraud.

In previous literature,

To the best of my knowledge, our method, which applies Benford's Law to all the positive original numbers in the three financial statements to detect financial fraud in publicly traded companies, has not been tried before.

4. Data

4.1 Variable

X: the difference between

  • the distribution of the first digits in a company's 3 financial statements

  • and the Benford distribution

y: whether the company was reported for financial fraud

  • 1: Yes

  • 0: No

y is determined by the auditor's opinion on the financial statements:

  • y = 0, standard unqualified opinion

  • y = 1, unqualified opinion with an emphasis-of-matter paragraph

  • y = 1, qualified opinion

  • y = 1, disclaimer of opinion

  • y = 1, adverse opinion

4.2 Time period

X: annual financial statements in 2015~2019

y: whether the company was reported for financial fraud in 2015~2022

  • The average time interval between a company's financial fraud and its discovery is 2.97 years

Source: *Research on Financial Fraud Identification of Listed Companies Based

4.3 Companies

We choose the CSI 500 constituents, excluding finance stocks:

  • CSI 500: the China Securities index of stocks ranked 301~800 by market cap
  • Why exclude finance stocks: finance companies have many unique financial accounts that do not apply to non-finance companies

4.4 Data acquisition

We get the ~300 positive numbers (financial accounts) of each financial statement through the WIND API.

Then we compute the first digit frequency, and the distribution is as follows:

Samples_Benford's Law
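The frequency computation can be sketched in plain Python as follows. The `sample` values are made-up numbers, not data retrieved from WIND, and the helper names are illustrative.

```python
from collections import Counter


def leading_digit(x):
    """First significant decimal digit of a positive finite number."""
    # repr() gives the shortest decimal string of the float; the first
    # character in 1..9 is the leading digit.
    return next(int(ch) for ch in repr(float(x)) if ch in "123456789")


def first_digit_frequency(values):
    """Relative frequency of leading digits 1..9 among the positive values."""
    positives = [v for v in values if v > 0]
    counts = Counter(leading_digit(v) for v in positives)
    return {d: counts[d] / len(positives) for d in range(1, 10)}


# Hypothetical account values; zeros and negatives are ignored
sample = [1523.4, 18.2, 0.0473, 2890000.0, 112.0, -35.0, 0.0, 9.6, 1.07e9, 3.3]
freq = first_digit_frequency(sample)
print(freq)
```

Only positive numbers are counted, matching the restriction to positive accounts described above.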

We get the auditor's opinion on the financial statements through the WIND Excel plug-in.

4.5 Data processing

$X_i$ measures how far the real frequency deviates from the Benford frequency, expressed as a ratio:

$$ X_i = \frac{{\rm{frequency \ \ of \ \ beginning \ \ with \ \ digit \ \ i}}}{{\rm{Benford \ \ frequency\ \ [i]}}} $$

We use the largest $X_i$ within the 5 years (2015~2019).
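A minimal sketch of this step is shown below. It assumes that "the largest $X_i$" is taken per digit across the yearly statements, which gives nine features per company; that reading, the function names and the two toy years are assumptions, not taken from the original code.

```python
import math

# Benford frequency for each leading digit 1..9
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def benford_ratios(freq):
    """X_i = observed frequency of digit i / Benford frequency of digit i."""
    return {d: freq[d] / BENFORD[d] for d in range(1, 10)}


def max_ratio_over_years(freq_by_year):
    """For each digit, keep the largest X_i across the given years (assumed reading)."""
    yearly = [benford_ratios(freq) for freq in freq_by_year.values()]
    return {d: max(r[d] for r in yearly) for d in range(1, 10)}


# Toy input: one year that matches Benford exactly, one year with uniform digits
freq_by_year = {
    2015: dict(BENFORD),
    2016: {d: 1 / 9 for d in range(1, 10)},
}
x = max_ratio_over_years(freq_by_year)
print({d: round(v, 3) for d, v in x.items()})
```

A ratio of 1 means the digit appears exactly as often as Benford's Law predicts; values far above 1 flag over-represented digits.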

The first-digit frequency of the Fraud group differs more from the Benford frequency than that of the No Fraud group:

[Image: output]

A big issue is that the dataset is imbalanced: there are too few samples with y = 1.

So within the training set we do under-sampling using RandomUnderSampler.

We randomly drop negative samples (y = 0) until the number of negative samples equals the number of positive samples.

  • Before RandomUnderSampler, 36/279 = 13% of samples are y = 1.
  • After RandomUnderSampler, 36/72 = 50% of samples are y = 1.
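The idea behind random under-sampling can be illustrated without any library. The sketch below is a plain-Python stand-in for what RandomUnderSampler does, not the imbalanced-learn implementation itself; the toy features and the seed are arbitrary.

```python
import random


def random_under_sample(X, y, seed=0):
    """Randomly drop majority-class rows until both classes have the same size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority row plus an equally sized random subset of the majority
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data mirroring the counts above: 36 positives among 279 samples
y = [1] * 36 + [0] * 243
X = [[float(i)] for i in range(len(y))]
X_res, y_res = random_under_sample(X, y)
print(len(y_res), sum(y_res))  # 72 36
```

Under-sampling is applied to the training set only, so the test set keeps its original class balance.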

4.6 The Processed Data

[Image]

5. Model Results

Our task is to tell whether there is financial fraud, so we care about two metrics:

    1. recall
    1. auc
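For reference, both metrics can be computed by hand as in the sketch below. Recall is the share of true fraud cases that the model flags, and AUC is computed here as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one. The labels, scores and the 0.5 threshold are toy values, not model output from this project.

```python
def recall(y_true, y_pred):
    """Share of actual positives (y = 1) that are predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def auc(y_true, scores):
    """Probability that a positive outranks a negative; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted fraud scores
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(round(recall(y_true, y_pred), 3), auc(y_true, scores))
```

Recall depends on the chosen decision threshold, while AUC summarizes the ranking quality over all thresholds.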

Logistic Regression and SVM give the best results, while Decision Tree gives the worst:

| | Logistic Regression | MLPClassifier | SVM | Decision Tree | Random Forest |
| --- | --- | --- | --- | --- | --- |
| Recall | 69% | 62% | 69% | 44% | 62% |
| AUC | 63% | 62% | 63% | 56% | 63% |

img

5.1 Logistic Regression: Recall = 69%, AUC = 63%

5.2 MLPClassifier: Recall = 62%, AUC = 62%

5.3 SVM: Recall = 69%, AUC = 63%

5.4 Decision Tree: Recall = 44%, AUC = 56%

5.5 Random Forest: Recall = 62%, AUC = 63%