Have you noticed that numbers in real life often begin with 1 and seldom begin with 9?
The First Digit Law, also known as Benford's Law, says that if we collect the first digits of positive numbers from real-world data, they tend to follow the distribution below:
First Digit Law:
The probability that a positive number written in base $b$ begins with a digit in $[d,d+l)$ can be expressed as:
$$ P_{b,l,d}= \log_b(1+\frac{l}{d}) $$
Setting $b=10$ (the decimal system) and $l=1$ gives
$$ P_{d}= \log_{10}(1+\frac{1}{d}) $$
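As a quick numerical check of the formula, the sketch below evaluates $P_{b,l,d}$ for the decimal case. The function name `benford_probability` and its parameter names are illustrative choices, not part of the original text.

```python
import math

def benford_probability(d: int, base: int = 10, width: int = 1) -> float:
    """P_{b,l,d} = log_b(1 + l/d), with b = base and l = width."""
    return math.log(1 + width / d, base)

# Decimal system, single digits 1..9.
probs = {d: benford_probability(d) for d in range(1, 10)}
for d, p in probs.items():
    print(f"first digit {d}: {p:.1%}")
```

The nine probabilities sum to 1, with about 30.1% for a leading 1 and about 4.6% for a leading 9.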
To grow from 1 to 2, you need to grow by 100%.
To grow from 8 to 9, you just need to grow by 12.5%.
So a quantity growing at a steady rate takes longer to get from 1 to 2 than from 8 to 9, which means it spends more time in [1,2).
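The growth argument can be illustrated with a small simulation. This is only a sketch under an assumed model (a quantity compounding at a fixed rate per step); the rate, the number of decades, and the function names are hypothetical choices.

```python
from collections import Counter

def leading_digit(x: float) -> int:
    """Leading decimal digit of x, for x >= 1."""
    while x >= 10:
        x /= 10
    return int(x)

def time_share_by_digit(rate: float = 0.001, decades: int = 6) -> dict:
    """Grow x from 1 up to 10**decades at a fixed rate per step and
    return the fraction of steps spent on each leading digit."""
    counts = Counter()
    x = 1.0
    limit = 10.0 ** decades
    while x < limit:
        counts[leading_digit(x)] += 1
        x *= 1 + rate
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}

shares = time_share_by_digit()
for d, share in shares.items():
    print(f"leading digit {d}: {share:.1%} of the time")
```

Under this assumed growth model the time shares come out close to the Benford probabilities, since the time needed to grow from $d$ to $d+1$ is proportional to $\log(1+1/d)$.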
Reference: First Digit Law from Laplace Transform
F(x) : PDF (probability density function)
$P_d$ : the probability that $x \in [d\cdot 10^n,(d+1)\cdot 10^n)$ for some integer $n$
$$ P_{d}= \sum_{n=-\infty}^{+\infty}\int_{d\cdot 10^n}^{(d+1)\cdot 10^n} F(x) {\rm{d}} x $$
$$ =\int_0^{\infty}F(x)g_d(x){\rm{d}}x$$
where
$$ g_d(x) = \sum_{n=-\infty}^{+\infty} [\eta(x-d\cdot 10^n)-\eta(x-(d+1)\cdot 10^n)] $$
$$ \eta(x) = \begin{cases} 1 \quad {\rm{if}} \quad x\geq 0 \\ 0 \quad {\rm{if}} \quad x<0 \\ \end{cases} $$
In the interval [1,30), the gap between the shaded areas in $g_2(x)$ is wider than that in $g_1(x)$.
This intuitively explains the inequality among the 9 digits, where smaller leading digits are more likely to appear.
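To make the gap comparison concrete, the sketch below lists the intervals on which $g_d(x)=1$ inside [1,30) and measures the gaps between them. The helper names and the range of exponents $n$ scanned are assumptions made for illustration.

```python
def support_intervals(d, lo, hi, exponents=range(-3, 4)):
    """Intervals [d*10^n, (d+1)*10^n) on which g_d equals 1, clipped to [lo, hi)."""
    intervals = []
    for n in exponents:
        start = max(d * 10.0 ** n, lo)
        end = min((d + 1) * 10.0 ** n, hi)
        if start < end:
            intervals.append((start, end))
    return intervals

def gaps(intervals):
    """Distances between consecutive intervals."""
    return [nxt[0] - cur[1] for cur, nxt in zip(intervals, intervals[1:])]

g1 = support_intervals(1, 1, 30)  # [1, 2) and [10, 20)
g2 = support_intervals(2, 1, 30)  # [2, 3) and [20, 30)
print("g_1 gaps:", gaps(g1))
print("g_2 gaps:", gaps(g2))
```

Inside [1,30), $g_1$ covers [1,2) and [10,20) with a gap of 8, while $g_2$ covers [2,3) and [20,30) with a gap of 17.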
The idea of the proof is to use
G(t): the Laplace transform of g(x)
f(t): the inverse Laplace transform of F(x),
and the property of the Laplace transform:
$$ \int_{0}^{+\infty} F(x)g(x) {\rm{d}} x = \int_{0}^{+\infty} f(t)G(t) {\rm{d}} t $$
The paper finally proves that:
$$ P_d = \log_{10}(1+\frac{1}{d}) + \int_{-\infty}^{+\infty}\tilde{f} (s) \tilde{\Delta} (s) {\rm{d}}s$$
where the second term is a small error term.
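The size of the error term depends on the PDF. As an empirical illustration only (not part of the proof), the sketch below draws samples from one assumed PDF, a log-normal with a wide spread, and compares the leading-digit frequencies with $\log_{10}(1+1/d)$. The distribution, its parameters, the sample size, and the seed are all hypothetical choices.

```python
import math
import random
from collections import Counter

def leading_digit(x: float) -> int:
    """Leading decimal digit of x > 0, via the fractional part of log10(x)."""
    mantissa = 10 ** (math.log10(x) % 1)  # lies in [1, 10) up to rounding
    return min(int(mantissa), 9)          # guard against rounding up to 10

def empirical_digit_freq(samples):
    """Share of samples starting with each digit 1..9."""
    counts = Counter(leading_digit(x) for x in samples)
    return {d: counts[d] / len(samples) for d in range(1, 10)}

rng = random.Random(0)
samples = [rng.lognormvariate(0.0, 3.0) for _ in range(100_000)]
freq = empirical_digit_freq(samples)

# Largest deviation from the Benford probability over the nine digits.
max_error = max(abs(freq[d] - math.log10(1 + 1 / d)) for d in range(1, 10))
print(f"largest deviation from Benford: {max_error:.4f}")
```

For a PDF this spread out over many orders of magnitude, the observed deviation is dominated by sampling noise. A narrow PDF would show a larger gap.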
My undergraduate thesis proves the First Digit Law with the Fourier transform.
The basic idea is similar.
The proof with the Laplace/Fourier transform requires that the PDF of x have a Laplace/Fourier transform.
The requirements for having a Fourier transform are weaker than those for having a Laplace transform.
The First Digit Law can be used to detect financial fraud, because numbers in financial statements also tend to follow it. If they do not, someone may have manipulated the numbers.
Below we show the first digit distribution of all the positive numbers in the 2021Q3 quarterly financial statements of two companies, BYD and Gotion High-TECH, which both make EV batteries:
Gotion's distribution seems to violate the First Digit Law, and the company was indeed caught for financial fraud in July 2022.
So we hypothesize that if a company's first digit frequency differs substantially from Benford's Law, the company is more likely to have committed financial fraud.
We will use different machine learning methods to test this hypothesis, where the independent variables are the differences between a company's first digit frequencies and the Benford frequencies, and the dependent variable is whether the company has committed financial fraud.
In previous literature,
To the best of my knowledge, our method, which applies Benford's Law to all the original positive numbers in the three financial statements to detect financial fraud at publicly traded companies, has not been tried before.
X: the difference between
the distribution of the first digits in a company's 3 financial statements
and the Benford distribution
y: whether the company was reported for financial fraud
1: Yes
0: No
y is determined by the auditor's opinion on the financial statements:
y = 0, standard unqualified opinion
y = 1, unqualified opinion with emphasis-of-matter paragraph
y = 1, qualified opinion
y = 1, disclaimer of opinion
y = 1, adverse opinion
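The labeling rule above amounts to "anything other than a standard unqualified opinion counts as y = 1". A minimal sketch of that rule follows. The function names, the exact opinion strings, and the way several years are combined into one label (y = 1 if any year has a non-standard opinion) are assumptions for illustration, not details confirmed by the data source.

```python
def label_from_opinion(opinion: str) -> int:
    """Return 0 for a standard unqualified opinion, 1 for any other opinion."""
    return 0 if opinion.strip().lower() == "standard unqualified opinion" else 1

def label_company(opinions_by_year: dict) -> int:
    """Assumed aggregation: y = 1 if any year has a non-standard opinion."""
    return max(label_from_opinion(o) for o in opinions_by_year.values())

# Hypothetical opinions for one company, for illustration only.
example = {2019: "Standard unqualified opinion", 2020: "Qualified opinion"}
y = label_company(example)
print(y)
```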
X: annual financial statements in 2015~2019
y: whether the company was reported for financial fraud in 2015~2022
- The average time interval between a company's financial fraud and its discovery is 2.97 years
Source: *Research on Financial Fraud Identification of Listed Companies Based
We use the CSI 500 constituents, excluding finance stocks:
- CSI 500: Chinese listed stocks ranked 301~800 by market cap
- Why exclude finance stocks: finance companies have many unique financial accounts that do not apply to non-finance companies
We get the ~300 positive numbers (financial accounts) in each financial statement through the WIND API.
Then we compute the first digit frequency, and the distribution is as follows:
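The first digit frequency can be computed along these lines. This is a sketch, not the code used for the study: the helper names are illustrative and `statement_values` is made-up data standing in for the numbers pulled from the financial statements.

```python
from collections import Counter

def first_digit(value) -> int:
    """First significant digit of a non-zero number, read from its text form."""
    digits = str(abs(value)).lstrip("0.")  # drop leading zeros and the decimal point
    return int(digits[0])

def first_digit_frequency(values) -> dict:
    """Share of positive values that start with each digit 1..9."""
    digits = [first_digit(v) for v in values if v > 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

# Hypothetical statement values, for illustration only.
statement_values = [1520.3, 18.0, 243000, 0.0375, 912, 1.07e9, -55, 0]
freq = first_digit_frequency(statement_values)
print(freq)
```

Non-positive entries are skipped, matching the restriction to positive numbers.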
We get the auditor's opinion on the financial statements through the WIND Excel plug-in.
$X_i$ describes the difference between the real frequency and the Benford frequency, expressed as a ratio:
$$ X_i = \frac{{\rm{frequency \ \ of \ \ beginning \ \ with \ \ digit \ \ i}}}{{\rm{Benford \ \ frequency\ \ [i]}}} $$
We use the largest $X_i$ within the 5 years (2015~2019).
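A sketch of the feature construction is shown below. It assumes that "the largest $X_i$" means the maximum taken separately for each digit $i$ across the yearly statements. That reading, the function names, and the toy frequencies are assumptions.

```python
import math

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def benford_ratio(freq: dict) -> dict:
    """X_i = observed frequency of digit i divided by the Benford frequency of i."""
    return {d: freq[d] / BENFORD[d] for d in range(1, 10)}

def largest_ratio_over_years(freq_by_year: dict) -> list:
    """Feature vector [X_1, ..., X_9], taking the per-digit maximum over years."""
    ratios = [benford_ratio(f) for f in freq_by_year.values()]
    return [max(r[d] for r in ratios) for d in range(1, 10)]

# Toy input: one year that matches Benford exactly and one year with too many 1s.
skewed = {d: 0.05 for d in range(1, 10)}
skewed[1] = 0.60
features = largest_ratio_over_years({2015: dict(BENFORD), 2016: skewed})
print([round(x, 2) for x in features])
```

A ratio of 1 means the digit appears exactly as often as Benford's Law predicts. Values far above 1 flag digits that are over-represented in at least one year.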
The first digit frequency of the Fraud Group differs more from the Benford frequency than that of the No Fraud Group:
A big issue is that the dataset is imbalanced: there are too few samples with y=1.
So within the training set we under-sample using RandomUnderSampler:
we randomly drop negative samples (y=0) until the number of negative samples equals the number of positive samples.
- Before RandomUnderSampler, 36/279≈13% samples are y=1.
- After RandomUnderSampler, 36/72=50% samples are y=1.
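The under-sampling step can be sketched in plain Python as follows. This is a stand-in that mimics the idea of RandomUnderSampler rather than a call into that library; the function name, the seed, and the toy data (built only to match the class counts above) are assumptions.

```python
import random

def random_under_sample(X, y, seed=0):
    """Keep every positive sample and an equally sized random subset of negatives.
    Assumes the negatives (y=0) outnumber the positives (y=1)."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    kept_neg = rng.sample(neg, k=len(pos))
    keep = sorted(pos + kept_neg)
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data with the same class counts as above: 36 positives out of 279.
y = [1] * 36 + [0] * 243
X = [[float(i)] for i in range(len(y))]
X_bal, y_bal = random_under_sample(X, y)
print(len(y_bal), sum(y_bal))
```

Only the training set should be resampled this way, so that the test set keeps the original class balance.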
Our task is to tell whether there is financial fraud, so we care about two metrics:
- recall
- auc
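For reference, the two metrics can be computed directly from their definitions, as in the sketch below. The labels and scores are made up for illustration, and the 0.5 decision threshold is an assumption.

```python
def recall(y_true, y_pred):
    """Share of actual positives (y=1) that the model flags as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def auc(y_true, scores):
    """Probability that a random positive scores above a random negative
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(f"recall = {recall(y_true, y_pred):.2f}, AUC = {auc(y_true, scores):.2f}")
```

Recall matters here because a missed fraud case is costly, while AUC summarizes ranking quality without committing to a single threshold.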
Logistic Regression and SVM give the best results, while Decision Tree gives the worst:

| | Logistic Regression | MLPClassifier | SVM | Decision Tree | Random Forest |
|---|---|---|---|---|---|
| Recall | 69% | 62% | 69% | 44% | 62% |
| AUC | 63% | 62% | 63% | 56% | 63% |