Update and rename README.rst to README.md
Florents-Tselai authored Apr 8, 2024
1 parent 61b74c9 commit 7cfcded
Showing 2 changed files with 351 additions and 295 deletions.
351 changes: 351 additions & 0 deletions README.md
# vasco: discover hidden correlations in your data

**vasco** is a Postgres extension that helps you discover hidden
correlations in your data. It is based on the [MINE family of
statistics](http://www.exploredata.net).


### Exploring a table

The generic approach is the following.

``` sql
SELECT * FROM vasco_explore('my_schema.my_table')
```

This explores the relationships between all possible column pairs in
the table and returns a detailed table of the results, including all MINE
statistics and additional metadata.

Coming up: an option to reduce the set of columns to consider.

### Exploring association strength

The **Maximal Information Coefficient (MIC)** measures how strong the
association is.

``` sql
SELECT mic(rand_x, rand_y), -- 0.13, approaching 0 as the sample grows (independent random variables)
       mic(x, ident),       -- 1: the identity function is easy to estimate even with a small sample
       mic(x, cubic),       -- 0.999, approaching 1 as the sample grows
       mic(x, periodic)     -- 1
FROM vasco_data;
```

### Exploring the nature of the association

No algorithm can magically detect the function behind the relationship
between two variables, but MINE statistics can shed some light on the
nature of that relationship.

The **Maximum Asymmetry Score (MAS)** measures how much the relationship
deviates from monotonicity.

``` sql
SELECT mas(X, Y)
```
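
As a rough illustration of what MAS summarizes, the Python sketch below derives it from a characteristic matrix `M` (see *How it works*), using the definition from the MINE literature: the largest absolute difference between `M[x, y]` and `M[y, x]`. The dictionary layout and the numbers are invented for the example and say nothing about how vasco stores `M` internally.

```python
def mas(M):
    """Maximum Asymmetry Score: max |M[x, y] - M[y, x]| over the stored grid sizes."""
    return max(abs(v - M[(y, x)]) for (x, y), v in M.items() if (y, x) in M)


# Hypothetical characteristic matrix keyed by (x_bins, y_bins); values are made up.
M = {(2, 2): 0.40, (2, 3): 0.90, (3, 2): 0.50, (2, 4): 0.95, (4, 2): 0.55}

print(round(mas(M), 3))
```

The more `M` differs from its transpose, the larger the score.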

The **Maximum Edge Value (MEV)** measures the degree to which the
dataset appears to be sampled from a continuous function.

``` sql
SELECT mev(X, Y)
```
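
A minimal Python sketch of the same idea, assuming the definition from the MINE literature: MEV is the largest entry of the characteristic matrix `M` among grids that have exactly two rows or two columns. The dictionary layout and the values are hypothetical.

```python
def mev(M):
    """Maximum Edge Value: largest entry of M whose grid has 2 rows or 2 columns."""
    return max(v for (x, y), v in M.items() if x == 2 or y == 2)


# Hypothetical characteristic matrix keyed by (x_bins, y_bins); values are made up.
M = {(2, 2): 0.40, (2, 3): 0.70, (3, 2): 0.60, (3, 3): 0.90}

# The 3-by-3 entry is the overall maximum (the MIC) but is not an "edge" entry.
print(mev(M), max(M.values()))
```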

The **Minimum Cell Number (MCN)** measures the complexity of the
association.

``` sql
SET vasco.mine_mcn_eps = 0.0; -- default
SELECT mcn(X, Y);
```

The **Minimum Cell Number General (MCNG)** returns the MCN with
`eps = 1 - MIC`.

``` sql
SELECT mcn_general(X, Y)
```
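
To make the role of `eps` concrete, here is a small Python sketch of both MCN variants. It assumes the definition from the MINE literature: MCN is `log2(x * y)` for the smallest grid whose entry in the characteristic matrix `M` reaches `(1 - eps) * MIC`. The dictionary layout and the values are hypothetical, and any numerical tolerance a real implementation may apply is left out.

```python
import math


def mcn(M, eps=0.0):
    """log2 of the cell count of the smallest grid with M[x, y] >= (1 - eps) * MIC."""
    mic = max(M.values())
    return min(math.log2(x * y) for (x, y), v in M.items() if v >= (1.0 - eps) * mic)


def mcn_general(M):
    """MCN evaluated with eps = 1 - MIC."""
    return mcn(M, eps=1.0 - max(M.values()))


# Hypothetical characteristic matrix keyed by (x_bins, y_bins); values are made up.
M = {(2, 2): 0.40, (2, 3): 0.70, (3, 2): 0.60, (3, 3): 0.90, (4, 4): 0.90}

print(mcn(M), mcn(M, eps=0.25), mcn_general(M))
```

A larger `eps` accepts coarser grids that only come close to the MIC, so the reported complexity can only stay the same or go down.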

The **Total Information Coefficient (TIC)** can be returned either normalized or unnormalized.

``` sql
SET vasco.mine_tic_norm = true; -- normalized or not (default = true)
SELECT tic(X, Y);
```
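
For orientation, the Python sketch below shows one way to read the two modes. It assumes the definition used in the MINE literature, where TIC sums the entries of the characteristic matrix `M` instead of taking their maximum, and it assumes that normalizing means dividing by the number of entries. Both the normalization rule and the dictionary layout are assumptions made for illustration, not a description of vasco's internals.

```python
def tic(M, norm=True):
    """Sum of all entries of M; divided by the number of entries when norm is True."""
    total = sum(M.values())
    return total / len(M) if norm else total


# Hypothetical characteristic matrix keyed by (x_bins, y_bins); values are made up.
M = {(2, 2): 0.5, (2, 3): 0.75, (3, 2): 0.25, (3, 3): 1.0}

print(tic(M, norm=False), tic(M, norm=True))
```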

The **Generalized Mean Information Coefficient (GMIC)** is a
generalization of MIC that incorporates a tuning parameter, which can be
used to modify the complexity of the association favored by the measure
[\[Luedtke2013\]](#Luedtke2013){.citation}.

``` sql
SET vasco.mine_gmic_p = 0.0;
SELECT gmic(X, Y);
```


As an example, take the Automobile dataset found in `demo/data`.

``` sql
SELECT vasco_corr_matrix('vasco_demo."Automobile_data"', 'auto_corr_matrix')
```

**vasco** will explore the table `Automobile_data` for correlations
between its column pairs. A symmetric matrix of these correlations will
be stored in the table `auto_corr_matrix`. You can use that table for BI
and analytics.

You can also use the utility script below to plot a heatmap of that
matrix.

``` sh
./scripts/plot_corr_matrix.py 'public.auto_corr_matrix'
```

![image](docs/img/public.automob_corr_matrix_heatmap.png)

The main workhorse behind vasco is the
[MIC](https://en.wikipedia.org/wiki/Maximal_information_coefficient)
[\[Reshef2011\]](#Reshef2011){.citation}: an information theory-based
measure of association that can capture a wide range of functional and
non-functional relationships between variables.

`MIC(X,Y)` is a symmetric score normalized to the range `[0, 1]`. A
high MIC value suggests a dependency between the investigated variables,
whereas `MIC=0` describes the relationship between two independent
variables.

![image](docs/img/mic_comparison.png)

## Installation

``` sh
cd /tmp
git clone git@github.com:Florents-Tselai/vasco.git
cd vasco
make all # WITH_PGVECTOR=1 to enable pgvector support
make install # may need sudo
```

Then, in a Postgres session, run:

``` sql
CREATE EXTENSION vasco;
```

## Usage

**vasco** exposes a set of Postgres functions to compute MINE statistics
between two series `(X, Y)`. In Postgres terms, `X` and `Y` can be
arrays, vectors or columns.

Thus, each score function is available in three flavors: taking Postgres
arrays as arguments, `f(float8[], float8[])`;
[pgvector](https://github.com/pgvector/pgvector) vectors,
`f(vector, vector)`; or columns (in which case `f` is an aggregate function).
Necessary MINE parameters can be set as
[GUC](https://www.postgresql.org/docs/current/config-setting.html) settings
(prefixed with `vasco.`).

Let's discuss the supported statistics and their interpretation. Start
by creating a sample dataset:

``` sql
SET extra_float_digits = 0;

CREATE TABLE vasco_data
AS (SELECT RANDOM() AS rand_x,
RANDOM() AS rand_y,
x AS x,
x AS ident,
4 * pow(x, 3) + pow(x, 2) - 4 * x AS cubic,
COS(12 * PI() + x * (1 + x)) AS periodic
FROM GENERATE_SERIES(0, 1, 0.001) x);
```

### Choosing an estimator

A number of algorithms have been proposed to estimate the MIC.
Currently in **vasco** you can choose between `ApproxMIC` from
[\[Reshef2011\]](#Reshef2011){.citation} and `MIC_e` from
[\[Reshef2016\]](#Reshef2016){.citation}.

``` sql
SET vasco.mic_estimator = ApproxMIC;
-- or
SET vasco.mic_estimator = MIC_e;
```

### pgvector support

**vasco** can be built with
[pgvector](https://github.com/pgvector/pgvector) support.

In that case all MINE statistics can be computed between `vector` types
too.

``` sql
SELECT mic( ARRAY [0,1.3,2,0,1.3,20,1.3,20,1.3,20,1.3,20,1.3,2]::float4[]::vector,
ARRAY [0,1.3,2,0,1.3,20,1.3,20,1.3,20,1.3,20,1.3,2]::float4[]::vector
)
```

### Configuration parameters

The following MINE parameters can be set via GUC.

- `vasco.mine_c`
- `vasco.mine_alpha`
- `vasco.mic_estimator`
- `vasco.mine_mcn_eps`
- `vasco.mine_tic_norm`
- `vasco.mine_gmic_p`

## How it works

As described in [\[Reshef2011\]](#Reshef2011){.citation}:

> The maximal information coefficient (MIC) is a measure of two-variable
> dependence designed specifically for rapid exploration of
> many-dimensional data sets. MIC is part of a larger family of maximal
> information-based nonparametric exploration (MINE) statistics, which
> can be used not only to identify important relationships in data sets
> but also to characterize them.
>
> Intuitively, MIC is based on the idea that if a relationship exists
> between two variables, then a grid can be drawn on the scatterplot of
> the two variables that partitions the data to encapsulate that
> relationship.
>
> Thus, to calculate the MIC of a set of two-variable data, we explore
> all grids up to a maximal grid resolution, dependent on the sample
> size, computing for every pair of integers `(x,y)` the largest possible
> mutual information achievable by any x-by-y grid applied to the data.
> We then normalize these mutual information values to ensure a fair
> comparison between grids of different dimensions and to obtain
> modified values between 0 and 1.
>
> These different combinations of grids form the so-called
> **characteristic matrix M(x,y)** of the data. Each element `(x,y)` of
> M stores the highest normalized mutual information achieved by any
> x-by-y grid. Computing `M` is the core of the algorithmic process and
> is computationally expensive. The maximum of `M` is the MIC and the
> rest of the MINE statistics are derived from that matrix as well.

**TL;DR**: Computing the *characteristic matrix* is the expensive part; once
that is done, computing the statistics is trivial.
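
The grid idea can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the `ApproxMIC` or `MIC_e` estimator and not vasco's code: for every grid size it tries only one grid (equal-frequency bins on both axes) instead of searching for the best one. The function names are hypothetical, and `alpha=0.6` is the exponent of the grid-size limit `n ** alpha` used in the original paper.

```python
import math
import random
from collections import Counter


def equal_freq_bins(values, k):
    """Label each value with one of k equal-frequency bins, based on its rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * k // len(values)
    return labels


def mutual_information(bx, by):
    """Mutual information (in bits) of two equally long sequences of bin labels."""
    n = len(bx)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(c / n * math.log2(c * n / (px[i] * py[j])) for (i, j), c in joint.items())


def characteristic_matrix(xs, ys, alpha=0.6):
    """Approximate M[(a, b)] for every grid size with a * b <= n ** alpha."""
    budget = len(xs) ** alpha
    M = {}
    for a in range(2, int(budget // 2) + 1):
        bx = equal_freq_bins(xs, a)
        for b in range(2, int(budget // a) + 1):
            mi = mutual_information(bx, equal_freq_bins(ys, b))
            # Normalize by log2(min(a, b)) so that every entry lies in [0, 1].
            M[(a, b)] = mi / math.log2(min(a, b))
    return M


rng = random.Random(0)
xs = [i / 200 for i in range(200)]
noise = [rng.random() for _ in xs]

mic_identity = max(characteristic_matrix(xs, xs).values())
mic_noise = max(characteristic_matrix(xs, noise).values())
print(mic_identity, mic_noise)
```

Because the grids here are fixed rather than optimized, each entry is only a lower bound on the true value of `M` for that grid size. The identity relationship still scores close to 1 and the independent noise stays far below it.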

![image](docs/img/mine_family.png)

![image](docs/img/computing_mic.jpg)

## Next Steps

- Try out ChiMIC [\[Chen2013\]](#Chen2013){.citation} and BackMIC
  [\[Cao2021\]](#Cao2021){.citation}.
- Currently `M` is re-computed every time a score function is called.
  That's a huge waste of resources. Caching `M` or sharing it between
  runs should be the first optimization.
- A potential next step would be to continuously update the
  characteristic matrix as columns are updated (think a trigger or a
  background worker process).
- Make an extension for SQLite and DuckDB as well.
- Build convenience functions to create variable pairs and explore
tables in one pass.

## Thanks

For MINE statistics, **vasco** currently uses the implementation
provided by [\[Albanese2013\]](#Albanese2013){.citation} via the
[minepy](https://github.com/minepy/minepy) package.

Alternative implementations are coming up.

## Resources

::: {#citations}

[Albanese2013]{#Albanese2013 .citation-label}

: Albanese, D., Filosi, M., Visintainer, R., Riccadonna, S., Jurman,
G., & Furlanello, C. (2013). Minerva and minepy: a C engine for the
MINE suite and its R, Python and MATLAB wrappers. Bioinformatics,
29(3), 407-408.

[Albanese2018]{#Albanese2018 .citation-label}

: Davide Albanese, Samantha Riccadonna, Claudio Donati, Pietro
Franceschi; A practical tool for Maximal Information Coefficient
analysis, GigaScience, giy032,
<https://doi.org/10.1093/gigascience/giy032>

[Cao2021]{#Cao2021 .citation-label}

: Cao, D., Chen, Y., Chen, J., Zhang, H., & Yuan, Z. (2021). An
improved algorithm for the maximal information coefficient and its
application. Royal Society open science, 8(2), 201424.
[PDF](https://royalsocietypublishing.org/doi/pdf/10.1098/rsos.201424)
[GitHub](https://github.com/Caodan82/BackMIC)

[Chen2013]{#Chen2013 .citation-label}

: Chen Y, Zeng Y, Luo F, Yuan Z. 2016 A new algorithm to optimize
maximal information coefficient. PLoS ONE 11, e0157567.
(doi:10.1371/journal.pone.0157567)
[GitHub](https://github.com/chenyuan0510/Chi-MIC)

[Ge2016]{#Ge2016 .citation-label}

: Ge, R., Zhou, M., Luo, Y. et al. McTwo: a two-step feature selection
algorithm based on maximal information coefficient. BMC
Bioinformatics 17, 142 (2016).
<https://doi.org/10.1186/s12859-016-0990-0>

[Luedtke2013]{#Luedtke2013 .citation-label}

: Luedtke A., Tran L. The Generalized Mean Information Coefficient
<https://doi.org/10.48550/arXiv.1308.5712>

[Matejka2017]{#Matejka2017 .citation-label}

: J. Matejka and G. Fitzmaurice. Same Stats, Different Graphs:
Generating Datasets with Varied Appearance and Identical
Statistics through Simulated Annealing. ACM SIGCHI Conference on
Human Factors in Computing Systems, 2017.

[Reshef2011]{#Reshef2011 .citation-label}

: Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R.,
McVean, G., Turnbaugh, P. J., ... & Sabeti, P. C. (2011). Detecting
novel associations in large data sets. Science, 334(6062),
1518-1524.

[Reshef2016]{#Reshef2016 .citation-label}

: Yakir A. Reshef, David N. Reshef, Hilary K. Finucane, Pardis C.
Sabeti and Michael Mitzenmacher. Measuring Dependence Powerfully and
Equitably. Journal of Machine Learning Research, 2016.
[PDF](https://jmlr.csail.mit.edu/papers/volume17/15-308/15-308.pdf)

[Shao2021]{#Shao2021 .citation-label}

: Shao, F. & Liu, H. (2021). The Theoretical and Experimental Analysis
of the Maximal Information Coefficient Approximate Algorithm.
Journal of Systems Science and Information, 9(1), 95-104.
<https://doi.org/10.21078/JSSI-2021-095-10>

[Xu2016]{#Xu2016 .citation-label}

: Xu, Z., Xuan, J., Liu, J., & Cui, X. (2016, March). MICHAC: Defect
prediction via feature selection based on maximal information
coefficient with hierarchical agglomerative clustering. In 2016 IEEE
23rd International Conference on Software Analysis, Evolution, and
Reengineering (SANER) (Vol. 1, pp. 370-381). IEEE.
<http://cstar.whu.edu.cn/paper/saner_16.pdf>

[Zhang2014]{#Zhang2014 .citation-label}

: Zhang Y, Jia S, Huang H, Qiu J, Zhou C. 2014 A novel algorithm for
the precise calculation of the maximal information coefficient. Sci.
Rep.-UK 4, 6662. (doi:10.1038/srep06662)
<http://lxy.depart.hebust.edu.cn/SGMIC/SGMIC.htm>
:::