Build your own workflow

Balazs Kegl edited this page May 10, 2018 · 4 revisions

If you are a data science teacher, a data scientist, or a researcher, you may have a new data set and a predictive problem for which you want to build a starting kit. This page walks you through what you need to do.

Your goal is not necessarily to launch an open RAMP: you may just want to organize your local experiments, build reusable building blocks, or log your local submissions. But once you have a working starting kit, it is also quite easy to launch a RAMP.

The basic idea is that each starting kit contains a Python file problem.py that parametrizes the setup. It picks building blocks from this library (ramp-workflow), like choosing from a menu. As an example, we walk you through the problem.py of the titanic starting kit. Other problems may use more complex workflows or cross-validation schemes, but this complexity is usually hidden in the implementation of those elements in ramp-workflow; the goal is to keep problem.py as simple as possible.

1. Choose a title.

problem_title = 'Titanic survival classification'

2. Choose a prediction type.

The available prediction types are in rampwf/prediction_types.

_prediction_label_names = [0, 1]
prediction_type = rw.prediction_types.make_multiclass(
    label_names=_prediction_label_names)

Typical prediction types are multiclass and regression. For multiclass (or multi-target regression), you need to pass the label names, which we usually put into a local variable _prediction_label_names; it can be a list of strings or integers.

3. Choose a workflow.

Available workflows are in rampwf/workflows.

workflow = rw.workflows.FeatureExtractorClassifier()

Typical workflows are a single classifier, or, as used here, a feature extractor followed by a classifier. We also have more complex workflows, named after the first problem that used them: drug_spectra uses two feature extractors, a classifier, and a regressor; air_passengers uses a feature extractor followed by a regressor, plus an external_data.csv file that the feature extractor can merge with the training set. Each workflow implements a class with train_submission and test_submission member functions that train and test submissions, and a workflow_element_names field containing the file names that test_submission expects in submissions/starting_kit or submissions/<submission_name>.
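The interface described above can be sketched with a toy class. This is a schematic, not the actual FeatureExtractorClassifier implementation: the real workflows import the submission files from the submission directory, while the "model" here is just a majority-class predictor.

```python
# Schematic of the workflow interface; real workflows live in
# rampwf/workflows. ToyClassifierWorkflow is a hypothetical name.
class ToyClassifierWorkflow:
    def __init__(self):
        # File names expected in submissions/<submission_name>/
        self.workflow_element_names = ['classifier']

    def train_submission(self, module_path, X, y):
        # The real workflow imports classifier.py from module_path;
        # here we just "fit" a trivial majority-class model.
        return max(set(y), key=list(y).count)

    def test_submission(self, trained_model, X):
        # Predict the majority class for every sample.
        return [trained_model] * len(X)


wf = ToyClassifierWorkflow()
model = wf.train_submission(
    'submissions/starting_kit', [[0], [1], [2]], [1, 1, 0])
print(wf.test_submission(model, [[3], [4]]))  # [1, 1]
```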

4. Choose score types.

Score types are metrics from rampwf/score_types.

score_types = [
    rw.score_types.ROCAUC(name='auc'),
    rw.score_types.Accuracy(name='acc'),
    rw.score_types.NegativeLogLikelihood(name='nll')]

Typical score types are accuracy or RMSE. Each score type implements a class with member functions score_function and __call__ (the former usually calling the latter) and the following fields:

  1. name: used by ramp_test_submission in the logs; also the column name on the RAMP leaderboard,
  2. precision: the number of decimal digits displayed,
  3. is_lower_the_better: a boolean, True if a lower score is better, False otherwise,
  4. minimum: the smallest possible score,
  5. maximum: the largest possible score.
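A minimal score type following this interface could look like the sketch below. The field names are the documented ones; the class name and the plain-accuracy logic are illustrative (the real score types in rampwf/score_types operate on Predictions objects).

```python
# Hypothetical score type sketch; real ones live in rampwf/score_types.
class ToyAccuracy:
    is_lower_the_better = False
    minimum = 0.0
    maximum = 1.0

    def __init__(self, name='acc', precision=2):
        self.name = name
        self.precision = precision

    def __call__(self, y_true, y_pred):
        # Fraction of matching labels.
        correct = sum(t == p for t, p in zip(y_true, y_pred))
        return correct / len(y_true)

    def score_function(self, ground_truths, predictions):
        # In ramp-workflow this unwraps Predictions objects and
        # typically delegates to __call__; we pass plain lists here.
        return self.__call__(ground_truths, predictions)


acc = ToyAccuracy()
print(round(acc([0, 1, 1, 0], [0, 1, 0, 0]), acc.precision))  # 0.75
```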

5. Write the cross-validation scheme.

Define a function get_cv that returns a cross-validation object, more precisely an iterator over (train indices, test indices) pairs:

from sklearn.model_selection import StratifiedShuffleSplit


def get_cv(X, y):
    cv = StratifiedShuffleSplit(n_splits=8, test_size=0.2, random_state=57)
    return cv.split(X, y)
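The returned object is consumed by iterating over it, one fold at a time. A quick check with toy data (assuming scikit-learn, which the snippet above already relies on; the data shapes here are made up):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit


def get_cv(X, y):
    cv = StratifiedShuffleSplit(n_splits=8, test_size=0.2, random_state=57)
    return cv.split(X, y)


# Toy data: 10 samples, balanced binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

for fold_i, (train_is, test_is) in enumerate(get_cv(X, y)):
    # Each fold yields index arrays into X and y;
    # test_size=0.2 of 10 samples leaves 2 test indices per fold.
    assert len(train_is) == 8 and len(test_is) == 2
```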

6. Write the I/O methods.

The workflow needs two functions that read the training and test data sets.

import os

import pandas as pd

_target_column_name = 'Survived'
_ignore_column_names = ['PassengerId']


def _read_data(path, f_name):
    data = pd.read_csv(os.path.join(path, 'data', f_name))
    y_array = data[_target_column_name].values
    X_df = data.drop([_target_column_name] + _ignore_column_names, axis=1)
    return X_df, y_array


def get_train_data(path='.'):
    f_name = 'train.csv'
    return _read_data(path, f_name)


def get_test_data(path='.'):
    f_name = 'test.csv'
    return _read_data(path, f_name)

The convention is that these files are found in the data/ directory and are called train.csv and test.csv, but we kept this element flexible to accommodate a wide variety of input data connectors.
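To see the round trip end to end, one can write a tiny train.csv into a temporary directory laid out by this convention and read it back with _read_data (the two-row toy data below is made up):

```python
import os
import tempfile

import pandas as pd

_target_column_name = 'Survived'
_ignore_column_names = ['PassengerId']


def _read_data(path, f_name):
    data = pd.read_csv(os.path.join(path, 'data', f_name))
    y_array = data[_target_column_name].values
    X_df = data.drop([_target_column_name] + _ignore_column_names, axis=1)
    return X_df, y_array


# Round trip with a toy train.csv under <path>/data/.
with tempfile.TemporaryDirectory() as path:
    os.mkdir(os.path.join(path, 'data'))
    pd.DataFrame({
        'PassengerId': [1, 2],
        'Age': [22.0, 38.0],
        'Survived': [0, 1],
    }).to_csv(os.path.join(path, 'data', 'train.csv'), index=False)
    X_df, y_array = _read_data(path, 'train.csv')
    # The target and ignored columns are stripped from X_df.
    print(list(X_df.columns), list(y_array))  # ['Age'] [0, 1]
```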

The script problem.py is used by testing.py, which reads the data files, implements the cross-validation split, instantiates the workflow with the submission, and trains and tests it. Reading this script is rather instructive for understanding how we train the workflows; it is quite straightforward, so we do not detail it here.
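The core loop can be summarized as the sketch below: for each cross-validation fold, train the workflow on the fold's training indices, predict on the held-out indices, and apply each score type. All names are the problem.py conventions described above; run_folds, the toy workflow, and the toy data are hypothetical stand-ins, not the actual testing.py code.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def run_folds(problem, submission_path='submissions/starting_kit'):
    """Simplified core loop of testing.py: split, train, test, score."""
    X_train, y_train = problem.get_train_data()
    fold_scores = []
    for train_is, valid_is in problem.get_cv(X_train, y_train):
        trained = problem.workflow.train_submission(
            submission_path, X_train.iloc[train_is], y_train[train_is])
        y_pred = problem.workflow.test_submission(
            trained, X_train.iloc[valid_is])
        fold_scores.append({score.name: score(y_train[valid_is], y_pred)
                            for score in problem.score_types})
    return fold_scores


# Toy stand-ins for the problem.py pieces, just to exercise the loop.
class _MeanWorkflow:
    def train_submission(self, path, X, y):
        return float(np.mean(y))  # "model" = mean label

    def test_submission(self, model, X):
        return np.full(len(X), model >= 0.5, dtype=int)


class _Accuracy:
    name = 'acc'

    def __call__(self, y_true, y_pred):
        return float(np.mean(y_true == y_pred))


toy = SimpleNamespace(
    get_train_data=lambda: (pd.DataFrame({'x': range(6)}),
                            np.array([0, 0, 0, 1, 1, 1])),
    get_cv=lambda X, y: [(np.array([0, 1, 3, 4]), np.array([2, 5]))],
    workflow=_MeanWorkflow(),
    score_types=[_Accuracy()],
)
print(run_folds(toy))  # [{'acc': 0.5}]
```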