Build your own workflow
If you are a data science teacher, a data scientist, or a researcher, you may have a new data set and a predictive problem for which you want to build a starting kit. In this subsection we walk you through what you need to do.
Your goal is not necessarily to launch an open RAMP: you may just want to organize your local experiments, make reusable building blocks, or log your local submissions. But once you have a working starting kit, launching a RAMP is also quite easy.
The basic gist is that each starting kit contains a Python file problem.py that parametrizes the setup. It uses building blocks from this library (ramp-workflow), like choosing from a menu. As an example, we will walk you through the problem.py of the titanic starting kit. Other problems may use more complex workflows or cross-validation schemes, but this complexity is usually hidden in the implementation of those elements in ramp-workflow. The goal was to keep the script problem.py as simple as possible.
problem_title = 'Titanic survival classification'
The prediction types are in rampwf/prediction_types
_prediction_label_names = [0, 1]
prediction_type = rw.prediction_types.make_multiclass(
    label_names=_prediction_label_names)
Typical prediction types are multiclass and regression. For multiclass (or multi-target regression) you need to pass the label names, which we usually put into a local variable _prediction_label_names. It can be a list of strings or integers.
Available workflows are in rampwf/workflows.
workflow = rw.workflows.FeatureExtractorClassifier()
Typical workflows are a single classifier or, as used here, a feature extractor followed by a classifier, but we have more complex workflows, named after the first problem that used them (e.g., drug_spectra: two feature extractors, a classifier, and a regressor; or air_passengers: a feature extractor followed by a regressor, plus an external_data.csv file that the feature extractor can merge with the training set). Each workflow implements a class which has train_submission and test_submission member functions that train and test submissions, and a workflow_element_names field containing the file names that test_submission expects in submissions/starting_kit or submissions/<new-submission_name>.
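To make the workflow interface concrete, here is a minimal sketch of a class exposing train_submission, test_submission, and workflow_element_names. This is a simplified stand-in, not the actual rampwf implementation: the real workflow imports the submission files from module_path, whereas this sketch uses a hard-coded majority-class model purely for illustration.

```python
import numpy as np

class SimpleClassifierWorkflow:
    """Sketch of a RAMP-style workflow (not the real rampwf class)."""

    def __init__(self):
        # File names that test_submission expects in submissions/<name>/
        self.workflow_element_names = ['classifier']

    def train_submission(self, module_path, X, y, train_is=None):
        # The real workflow imports classifier.py from module_path;
        # here we fit a majority-class model for illustration.
        if train_is is None:
            train_is = slice(None)
        y_train = np.asarray(y)[train_is]
        values, counts = np.unique(y_train, return_counts=True)
        majority = values[np.argmax(counts)]
        return {'majority': majority, 'classes': values}

    def test_submission(self, trained_model, X):
        # Return one probability row per test instance, all mass on
        # the majority class.
        classes = list(trained_model['classes'])
        proba = np.zeros((len(X), len(classes)))
        proba[:, classes.index(trained_model['majority'])] = 1.0
        return proba
```

The dictionary returned by train_submission stands in for the trained model object that test_submission later consumes; rampwf workflows follow the same train-then-test contract.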
Score types are metrics from rampwf/score_types.
score_types = [
rw.score_types.ROCAUC(name='auc'),
rw.score_types.Accuracy(name='acc'),
rw.score_types.NegativeLogLikelihood(name='nll')]
Typical score types are accuracy or RMSE. Each score type implements a class with member functions score_function and __call__ (the former usually using the latter) and fields:
- name, which ramp_test_submission uses in the logs; also the column name in the RAMP leaderboard,
- precision: the number of decimal digits,
- is_lower_the_better: a boolean which is True if the score is the lower the better, False otherwise,
- minimum: the smallest possible score,
- maximum: the largest possible score.
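As an illustration, a hand-rolled score type carrying these fields might look like the following. This is a sketch of the interface described above, not the rampwf base class; in rampwf, score_function additionally unwraps Predictions objects before delegating to __call__.

```python
import numpy as np

class SketchAccuracy:
    """Sketch of a RAMP-style score type with the fields listed above."""
    is_lower_the_better = False   # higher accuracy is better
    minimum = 0.0                 # worst possible score
    maximum = 1.0                 # best possible score

    def __init__(self, name='acc', precision=2):
        self.name = name            # column name in logs and leaderboard
        self.precision = precision  # number of decimal digits displayed

    def __call__(self, y_true, y_pred):
        # Fraction of correctly predicted labels.
        return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

    def score_function(self, ground_truths, predictions):
        # The former usually uses the latter: delegate to __call__.
        return self.__call__(ground_truths, predictions)
```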
Define a function get_cv returning a cross-validation object:
from sklearn.model_selection import StratifiedShuffleSplit

def get_cv(X, y):
    cv = StratifiedShuffleSplit(n_splits=8, test_size=0.2, random_state=57)
    return cv.split(X, y)
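The object returned by get_cv is a generator of (train indices, test indices) pairs. To keep the example self-contained, the sketch below mimics that interface with a plain, unstratified shuffle split (a hypothetical stand-in for StratifiedShuffleSplit, for illustration only):

```python
import numpy as np

def get_cv_sketch(X, y, n_splits=8, test_size=0.2, random_state=57):
    """Plain shuffle split mimicking get_cv's interface (no stratification)."""
    rng = np.random.RandomState(random_state)
    n = len(X)
    n_test = int(round(test_size * n))
    for _ in range(n_splits):
        perm = rng.permutation(n)
        # Yield (train indices, test indices), like cv.split(X, y).
        yield perm[n_test:], perm[:n_test]

X = np.arange(20).reshape(-1, 1)
y = np.array([0, 1] * 10)
splits = list(get_cv_sketch(X, y))
```

Downstream code simply iterates over these pairs to train and score each fold, which is exactly how the split produced by get_cv is consumed.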
The workflow needs two functions that read the training and test data sets.
import os
import pandas as pd

_target_column_name = 'Survived'
_ignore_column_names = ['PassengerId']

def _read_data(path, f_name):
    data = pd.read_csv(os.path.join(path, 'data', f_name))
    y_array = data[_target_column_name].values
    X_df = data.drop([_target_column_name] + _ignore_column_names, axis=1)
    return X_df, y_array

def get_train_data(path='.'):
    f_name = 'train.csv'
    return _read_data(path, f_name)

def get_test_data(path='.'):
    f_name = 'test.csv'
    return _read_data(path, f_name)
The convention is that these sets are found in data/ and are called train.csv and test.csv, but we kept this element flexible to accommodate a large number of possible input data connectors.
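You can check the reader logic against a toy data/train.csv written on the fly; the column values below are made up for illustration, and only the column names (Survived, PassengerId) come from the problem above:

```python
import os
import tempfile
import pandas as pd

# Build a toy data/train.csv in a temporary directory.
path = tempfile.mkdtemp()
os.makedirs(os.path.join(path, 'data'), exist_ok=True)
pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Age': [22, 38, 26],
    'Survived': [0, 1, 1],
}).to_csv(os.path.join(path, 'data', 'train.csv'), index=False)

# Same logic as _read_data above: target becomes y, ignored
# columns are dropped from the feature DataFrame.
data = pd.read_csv(os.path.join(path, 'data', 'train.csv'))
y_array = data['Survived'].values
X_df = data.drop(['Survived', 'PassengerId'], axis=1)
```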
The script problem.py is used by testing.py, which reads the files, implements the cross-validation split, instantiates the workflow with the submission, and trains and tests it. It is rather instructive to read this script to understand how we train the workflows; it is quite straightforward, so we do not detail it here.
Copyright (c) 2014 - 2018 Paris-Saclay Center for Data Science (http://www.datascience-paris-saclay.fr/)