Ensure you have CMake installed.
Install CMake on macOS:
$ brew install cmake
$ cmake --version
This project has been successfully compiled with the clang++ compiler on macOS (AppleClang 11.0.3). Ensure you have Clang installed.
Compiling the project:
$ mkdir build
$ /usr/local/bin/cmake -S . -B ./build/ -D CMAKE_CXX_COMPILER=/usr/bin/clang++ -DCMAKE_VERBOSE_MAKEFILE=ON
$ cd build && make
To run the tests after a successful compilation:
$ cd test && ./DecisionTests && cd ../..
An example Jupyter notebook that trains this classifier on the Titanic dataset can be found here.
As shown in the notebook, the classifier implemented here does not produce results identical to the scikit-learn Decision Tree Classifier, but it reaches 85% accuracy on a test data set, comparable with the scikit-learn implementation.
A decision tree is a simple machine learning algorithm that uses a series of features of an observation to predict a target outcome class.
For example, the target outcome class could be whether a company should interview a candidate for a job, and the series of features could be:
- The university they attended
- The subject they studied at university
- Their highest degree qualification
- Their previous employer
- Their previous job title
The decision tree is built ("trained") using a series of observations (training data) with their corresponding outcome class to create a tree that best fits the training data. This tree can then be used to make predictions on a series of observations without knowing the target outcome class.
For the interview screening tool outlined above, the training data might look like:
PersonID | University | UniSubject | Degree | PrevEmployer | PrevTitle | Interviewed |
---|---|---|---|---|---|---|
1 | Harvard | Math | MSc | | Data Scientist | Yes |
2 | Stanford | Math | MSc | Microsoft | Data Analyst | Yes |
3 | Cornell | English | MA | New York Times | Reporter | No |
Here the outcome class used for training is whether or not the candidate was interviewed.
In this repo, a basic categorical-only decision tree classifier is implemented in C++ as an exercise to learn a low-level language.
- Start with a number of observations of categorical feature vectors and their corresponding outcome classes.
- For each feature, calculate the Gini gain.
- Select the feature to split on that maximizes the Gini gain.
- Split the data based on that feature to create two new sub-trees.
- Repeat the splitting of nodes in each sub-tree until only one outcome class is predicted after splitting (Gini gain = 0); a sketch of this recursive loop is shown below.
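A minimal C++ sketch of this recursive loop, assuming categories are already integer-encoded (as described in the input format section below). All names here (Dataset, Node, buildTree, splitOnCategory) are illustrative only and are not necessarily the actual classes or functions in this repo.

```cpp
#include <cstddef>
#include <memory>
#include <set>
#include <utility>
#include <vector>

struct Dataset {
    std::vector<std::vector<int>> features;  // features[f][i] = feature f of observation i
    std::vector<int> outcomes;               // outcome class of observation i
};

struct Node {
    std::unique_ptr<Node> left, right;
    int splitFeature = -1;   // feature used at the split (-1 for a leaf)
    int splitCategory = -1;  // category routed to the right child; predicted class if leaf
    double giniGain = 0.0;   // gain achieved by this split
};

// Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k.
double giniImpurity(const std::vector<int>& outcomes) {
    if (outcomes.empty()) return 0.0;
    std::vector<int> counts;
    for (int c : outcomes) {
        if (c >= static_cast<int>(counts.size())) counts.resize(c + 1, 0);
        ++counts[c];
    }
    double impurity = 1.0;
    const double n = static_cast<double>(outcomes.size());
    for (int count : counts) impurity -= (count / n) * (count / n);
    return impurity;
}

// Split observations on whether feature == category (matches go right).
std::pair<Dataset, Dataset> splitOnCategory(const Dataset& d, std::size_t feature, int category) {
    Dataset left, right;
    left.features.resize(d.features.size());
    right.features.resize(d.features.size());
    for (std::size_t i = 0; i < d.outcomes.size(); ++i) {
        Dataset& dst = (d.features[feature][i] == category) ? right : left;
        for (std::size_t f = 0; f < d.features.size(); ++f)
            dst.features[f].push_back(d.features[f][i]);
        dst.outcomes.push_back(d.outcomes[i]);
    }
    return {left, right};
}

// Recursively grow the tree: pick the (feature, category) split with the
// largest Gini gain, split the data, and recurse until no split improves purity.
std::unique_ptr<Node> buildTree(const Dataset& d) {
    auto node = std::make_unique<Node>();
    const double parentImpurity = giniImpurity(d.outcomes);
    const double n = static_cast<double>(d.outcomes.size());

    double bestGain = 0.0;
    std::size_t bestFeature = 0;
    int bestCategory = -1;
    for (std::size_t f = 0; f < d.features.size(); ++f) {
        for (int category : std::set<int>(d.features[f].begin(), d.features[f].end())) {
            auto [left, right] = splitOnCategory(d, f, category);
            if (left.outcomes.empty() || right.outcomes.empty()) continue;
            const double gain = parentImpurity
                - (left.outcomes.size() / n) * giniImpurity(left.outcomes)
                - (right.outcomes.size() / n) * giniImpurity(right.outcomes);
            if (gain > bestGain) { bestGain = gain; bestFeature = f; bestCategory = category; }
        }
    }

    if (bestGain <= 0.0) {
        // Stopping rule from the list above: a Gini gain of 0 means the node is
        // pure, so the remaining outcome is stored as the predicted class.
        node->splitCategory = d.outcomes.front();
        return node;
    }

    node->splitFeature = static_cast<int>(bestFeature);
    node->splitCategory = bestCategory;
    node->giniGain = bestGain;
    auto [left, right] = splitOnCategory(d, bestFeature, bestCategory);
    node->left = buildTree(left);
    node->right = buildTree(right);
    return node;
}
```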
To simplify the problem to the core business logic of a categorical decision tree classifier, we will assume that all categories are encoded as integers and are then input to the decision tree as a train csv and a test csv. The header (feature names) will be omitted from the input. Each row will contain all the observations for one feature, and each column will represent one individual observation. Note this is rotated 90 degrees from a traditional csv representation.
Assume training data features contain at least one instance of every class that can exist and that classes are indexed from zero.
The last row of the training dataset will be assumed to be the outcomes/target variable. The test dataset will have one fewer row, as the outcomes would be unknown.
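As an illustration (the values here are made up), a train csv for two features and four observations might look like the following, where the first two rows hold the integer-encoded features and the last row holds the outcome class of each observation:

```
0,1,1,0
2,0,1,2
1,1,0,0
```

The matching test csv would contain only the feature rows.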
An array of pointers, one per feature; each pointer points to an array holding the value of that feature for each training or test observation.
An array of outcome classes for each training and test observation.
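A small sketch of this layout, using the same toy values as the csv example above (variable names are illustrative, not the repo's actual identifiers):

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    const std::size_t nFeatures = 2;

    // Feature-major storage: featureValues[f][i] is feature f of observation i.
    std::vector<std::vector<int>> featureValues = {
        {0, 1, 1, 0},   // feature 0 for observations 0..3
        {2, 0, 1, 2}    // feature 1 for observations 0..3
    };

    // Array of pointers, one per feature, each pointing to that feature's values.
    std::vector<const int*> features(nFeatures);
    for (std::size_t f = 0; f < nFeatures; ++f) {
        features[f] = featureValues[f].data();
    }

    // Outcome class per training observation (last row of the train csv).
    std::vector<int> outcomes = {1, 1, 0, 0};

    // Example access: feature 1 of observation 2 and its outcome class.
    std::cout << features[1][2] << " -> class " << outcomes[2] << "\n";
    return 0;
}
```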
Tree architecture.
Each node has the following attributes:
- ChildLeft - pointer to left child node
- ChildRight - pointer to right child node
- SplitFeature - feature used at the split
- SplitCategory - category of that feature used at the split. Observations containing that category are passed to the right child tree. If the node is a leaf, this holds the predicted class.
- GiniGain - Gini gain resulting from the split
Tree class
- Tree()
- ~Tree()
- Node* head
- Traverse()
- Fit()
- Transform()
- CSVReader
Stretch nice-to-haves:
- Load()
- Save()
- Look into the HDF5 format for saving
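A rough, declaration-only sketch of what this interface could look like; the signatures are assumptions for illustration and may differ from the actual header:

```cpp
#include <string>
#include <vector>

class Node;       // see the Node class below
class DataFrame;  // container for the feature/outcome arrays

class Tree {
public:
    Tree();
    ~Tree();                                    // frees all nodes reachable from head
    void Fit(const DataFrame& train);           // build the tree from training data
    std::vector<int> Transform(const DataFrame& test) const;  // predict a class per test observation
    void Traverse() const;                      // walk (e.g. print) the fitted tree

    // Stretch nice-to-haves:
    // void Save(const std::string& path) const;
    // void Load(const std::string& path);

private:
    Node* head = nullptr;                                // root node of the fitted tree
    DataFrame CSVReader(const std::string& path) const;  // load the rotated csv format
};
```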
Node class
- Node()
- ~Node()
- Node* children
- DataFrame data
- int splitFeature
- int splitCategory
- float giniImpurity
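A corresponding declaration-only sketch of the Node, following the member list above and the left/right child attributes listed earlier; again, names and types are assumptions rather than the repo's actual code:

```cpp
#include <vector>

// Minimal stand-in for the DataFrame member listed above.
struct DataFrame {
    std::vector<std::vector<int>> features;  // features[f][i] = feature f of observation i
    std::vector<int> outcomes;               // outcome class per observation
};

class Node {
public:
    Node();
    ~Node();                      // recursively deletes the child nodes

    Node* childLeft = nullptr;    // observations without splitCategory
    Node* childRight = nullptr;   // observations containing splitCategory
    DataFrame data;               // subset of observations reaching this node
    int splitFeature = -1;        // feature used at the split
    int splitCategory = -1;       // category routed right; predicted class if leaf
    float giniImpurity = 0.0f;    // impurity of the data at this node
};
```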
Gini impurity is the probability of incorrectly classifying a randomly chosen element in the dataset if it were randomly labeled according to the class distribution in the dataset.
Gini gain is calculated as the original Gini impurity of the dataset minus the weighted Gini impurities of the subsets resulting from the split.
We choose the split that maximizes the Gini gain.
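In symbols (standard definitions, stated here for reference):

$$G(S) = 1 - \sum_{k} p_k^{2}, \qquad \text{GiniGain} = G(S) - \frac{|S_L|}{|S|}\,G(S_L) - \frac{|S_R|}{|S|}\,G(S_R)$$

where $p_k$ is the fraction of observations in node $S$ with class $k$, and $S_L$, $S_R$ are the subsets sent to the left and right children. For example, a node with outcomes {Yes, Yes, No, No} has $G = 1 - (0.5^2 + 0.5^2) = 0.5$; a split that separates the two classes perfectly gives two pure children with $G = 0$, so the Gini gain of that split is $0.5$.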
Testing uses CMake and the GoogleTest framework in VSCode, based on this video.
Build the binaries with F7. Run tests from the ribbon at the bottom of the VSCode UI.