Implementation of ID3 algorithm in Python
Thomas MARTIN, Victor CAVERO, Séraphin HENRY
In the CLI mode, we have a CSV file which is already filled with data. We use it to generate the tree using the ID3 algorithm. The CLI will ask questions to the user, until a point when it announces which result was predicted. The user has the ability to confirm, or not, the predicted result.
The web app works the same, except that the data is stored in a postgres SQL database.
If the predicted result is confirmed by the user, then it's all fine. If not, then the CLI asks the user to enter some more additional data.
It is composed of a client made with React, two microservices (py-id3
made with Python, and pg-controller
made with Go), as well as a Postgres SQL Database.
The React client interacts with these two microservices to retrieve a decision tree (py-id3
) and post data to the database (pg-controller
).
Documentation regarding py-id3, the ID3 tree generator exposed as a REST API
Documentation regarding pg-controller, the postgres db controller, exposed as a REST API
The two microservices are exposed as REST APIs, fully open, because we prefered focusing on ID3 rather than security for now. However, the admin access to the postgres database is fully secured, by using docker secrets and docker swarm.
Here you can find the database model that the webApp uses : (link)
The apps runs at https://furio.team and has been deployed thanks to Docker and Docker Swarm.
Run : docker-compose -f docker-compose.base.yml -f docker-compose.dev.yml up
and cd client && yarn start
To build our tree we follow a few key steps :
- Read the .csv to stock it in a variable
df
- To calcule the entropy :
- Isolate the column result to count the occurences of
yes
andno
- Calculate general entropy
- Isolate the column result to count the occurences of
- To build the tree we use the function
buildTree()
- Find the attribute with the best information gain
find_winner()
- Get distinct values of this attribute
- If an attribute only have
yes
or onlyno
, then it becomes a leaf and is marked byyes
orno
- If an attribute have
yes
andno
, then the algorithm call the functionbuildTree()
recursively with the subtable.
- Find the attribute with the best information gain
Function | Parameters | Description |
---|---|---|
calc_df_entropy |
df : records of the .csv ; attribute : stock result unique values |
Calculates the entropy of the whole dataset |
calc_subtable_entropy |
df : dataset of a subtable |
Calculate the entropy of a subtable |
calc_subtable_attribute_entropy |
df : dataset of a subtable, attribute : unique attribute of a subtable |
Calculates the entropy of a subtable for an unique attribute |
find_best_attribute |
df : dataset of a subtable |
Find the attribute wich give the most information |
get_subtable |
df : dataset of a subtable ; attribute : which attribute we will explore ; value : data |
Filters the dataset to get only the explored branch |
build_tree |
df : dataset explored |
Build the tree stocked in a variable tree[][] |
generate_decision_tree |
csv_filename : Name of the .csv which provide data |
General function to read the csv, calculate the entropy and build the tree to return the final tree |
git clone
the project (or dowload andunzip
the archive)cd
inside of itcd py-id3
and
You'll need pipenv
, unless you're ready to install the dependencies by hand.
pip install pipenv
pipenv install
and pipenv shell
python3.8 main.py
or pipenv run python3.8 main.py
docker build -f Dockerfile.cli -t id3py .
Then, if you're using Linux/MacOS/Unix :
docker run -v $(pwd)/db:/app/db -it id3py
Or if you're using Windows :
docker run -v ${PWD}/db:/app/db -it id3py
The data.csv
file in the db
folder will be overwritten when the program estimates that it's necessary to update the db.
You can also specify another folder, which contains a file named data.csv
that you want to use :
docker run -v PATH_TO_YOUR_FOLDER:/app/db -it id3py
For now, the program uses the file called db/data.csv
as a knowledge base.
You can use yours, as long as the column for the output is called result
, and that it contains yes
and no
.