Prediction

This repository contains runners and models for building and using pattern recognition techniques (based on machine learning and effort estimation) for classification and estimation of features that describe Scrum sprints.

The runners and contexts for the classification/estimation models ensure that the data set is provided in a consistent format, select features, and split the data into train/test/validation sets. This allows runs on combinations of features, as well as time travel through the data set (providing different training slices split temporally).

The prediction models are based on TensorFlow and Keras. Some models are implemented by hand from the equations that describe them (such as analogy-based effort estimation, ABE), while others are based on existing TensorFlow or Keras models (such as MLP and DNN).

Installation

The predictions run using Python 3.6 or 3.7. The required Python libraries are listed in requirements.txt. To install them, run pip3 install tensorflow==$TENSORFLOW_VERSION followed by pip3 install -r requirements.txt, replacing the $TENSORFLOW_VERSION variable with a supported TensorFlow version; this assumes pip3 is a Pip executable for the Python 3 environment. If you do not have Pip, install it through your package manager, or see Pip installation for other sources to obtain Pip. If you do not have permission to install system-wide libraries, then either add --user after the install command or use a virtual environment.

The predictions can also be run using Docker. This repository contains a Dockerfile which can be used to build a Docker image. If you have Docker installed, run docker build -t gros-prediction . to build this image. To use a different TensorFlow version, add --build-arg TENSORFLOW_VERSION=$TENSORFLOW_VERSION, replacing the $TENSORFLOW_VERSION variable with an appropriate tag of the upstream tensorflow image for a supported TensorFlow version. If you append -gpu to the tag, then you can select a GPU device to pin data and models to, but in that case you must have nvidia-docker, the CUDA toolkit and an NVIDIA driver for your GPU installed. Details for setting this up are outside the scope of this repository, but some documentation for GPU-enabled Jenkins agents may be found elsewhere in the Grip on Software documentation.

To run a Docker image for a prediction, use the following command:

docker run -v /path/to/data:/data -v $PWD:$PWD -w $PWD -u $(id -u):$(id -g) \
    gros-prediction python tensor.py --filename /data/sprint_features.arff \
    --results /data/sprint_results.json \
    --label num_not_done_points+num_removed_points+num_added_points --binary \
    --roll-sprints --roll-validation --roll-labels --replace-na --model dnn \
    --test-interval 200 --num-epochs 1000 --log INFO

Adjust the -v arguments to volume specifications with valid paths, and adjust the arguments at the end to configure the runner and model; the example here is for a DNN run. For a GPU-enabled image, you must instead use docker run --runtime=nvidia ... and also select an appropriate GPU device by adding --device /gpu:0 at the end. The number in this argument selects the GPU to pin to if there are multiple, starting from index 0. With multiple GPUs, it is also recommended to set the environment variable CUDA_VISIBLE_DEVICES to the same index, so that TensorFlow does not reserve memory on the other GPUs. This allows multiple GPUs to be used concurrently (for example on a Jenkins agent with multiple executors, or simply to keep the other GPU available for other users or a graphical desktop).

TensorFlow version

For both the Docker-based and the direct (pip/virtualenv) installation, a specific version of TensorFlow must be installed to work with the current models; the most recent version will not function. The code has been tested with TensorFlow versions 1.12.0 and 1.13.2, but may work with later TensorFlow versions before version 2. Note that these versions have become stale and do not support recent Python versions, so the PyPI registry may not provide packages that pip can install for your Python version. In practice, this means Python 3.7 is currently required in order to install TensorFlow properly. If this hinders the direct installation because your package manager no longer provides Python 3.7, then consider using Docker instead.

Older versions of Python and TensorFlow are typically not supported. However, earlier versions of the code have worked with Python 2.7 and 3.6 and with TensorFlow 1.3+.

Configuration

The configuration of the models and runners takes place through command line arguments provided to the tensor.py script. The --help argument indicates the available options.

When the --store argument is set to owncloud, then additional configuration is loaded from a config.yml file (the file path can be changed with the --config argument). This YAML file must have an object with an owncloud key where the value is an object with the following keys:

  • url: The URL to the ownCloud instance from which to load files.
  • verify: A boolean that indicates whether to verify SSL certificates when connecting to the ownCloud instance.
  • username: The username to log in as.
  • password: The password for the username to log in as. If the keyring configuration item is true, then this value is ignored.
  • keyring: Whether to obtain the password from a keyring (e.g. GNOME). The keyring must have a section called owncloud and the password is obtained from the username determined by the username configuration item.
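Putting these keys together, a minimal config.yml might look as follows (the URL, username and password are placeholders, not real values):

```yaml
owncloud:
  url: https://owncloud.example.com/
  verify: true
  username: prediction-user
  password: "ignored when keyring is true"
  keyring: false
```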

Data and running

The prediction runner expects an input dataset to be provided in the form of an ARFF file. The first two columns of the ARFF file must be project_id and sprint_id, respectively, and if there is a column named organization, then it must be the last column and be a nominal attribute of quoted strings.

All other columns (attributes) are considered to be numeric. Each row (instance) in the data set describes features of a single sprint. Except for the three metadata attribute fields, the attributes can have missing values (question marks) and may be selected by the runner.
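As an illustration, a minimal ARFF file following these rules could look like this (the two feature attributes are hypothetical examples, not a required set; note the missing value marked with a question mark):

```
@relation sprint_features

@attribute project_id numeric
@attribute sprint_id numeric
@attribute num_done_points numeric
@attribute num_not_done_points numeric
@attribute organization {'example-org'}

@data
1,42,30,5,'example-org'
1,43,28,?,'example-org'
```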

A time attribute, if provided and indicated by the --time argument, may be used to split the dataset temporally for time travel. It should be a number based on a sprint's start date, at a high enough resolution (e.g. the number of days since an origin) to make a realistic split; further binning to reduce the number of splits, or to combine them, is controlled by the --time-size and --time-bin arguments.
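For instance, a suitable time attribute could count the days between a sprint's start date and a fixed origin. A minimal sketch (the origin date here is an assumption for illustration, not the one used by the actual feature extraction):

```python
from datetime import date

# Hypothetical origin date; the real feature extraction may use a different one.
ORIGIN = date(2010, 1, 1)

def time_attribute(start_date: date) -> int:
    """Number of days since the origin, usable as a temporal split attribute."""
    return (start_date - ORIGIN).days

print(time_attribute(date(2010, 1, 31)))  # 30
```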

Selection by the runner is controlled by the --index, --remove and --label arguments. Like --time, these should indicate indexes to add, remove or consider as a label to predict. For --index and --remove, multiple indexes can be provided, separated by commas or spaces. An index may be a positive number (from 0 up to, but not including, the number of columns), a negative number (from -1 for the last column down to the negative of the number of columns, inclusive), or the name of a column.
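This index interpretation can be sketched in a few lines of Python (this mirrors the description above; the runner's actual implementation may differ):

```python
def resolve_index(spec, columns):
    """Resolve a column specification (number, negative number, or name)
    to a zero-based column index."""
    try:
        index = int(spec)
    except ValueError:
        # Not a number: treat the specification as a column name.
        return columns.index(spec)
    if -len(columns) <= index < len(columns):
        # Negative indexes count from the end, as in Python sequences.
        return index % len(columns)
    raise IndexError(f"Column index out of range: {spec}")

columns = ["project_id", "sprint_id", "num_done_points", "organization"]
print(resolve_index("2", columns))                # 2
print(resolve_index("-1", columns))               # 3 (the last column)
print(resolve_index("num_done_points", columns))  # 2
```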

Additional columns can be generated using --assign, where the argument must be a Python assignment expression. The expression can refer to the names of other columns as variables, and a limited number of functions is available. Note that it may be preferable to perform this calculation beforehand, which may allow tracking more metadata outside of this repository. The --label argument may also be an expression, but without an assignment.
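Conceptually, such an expression computes a new column from existing ones for each row. A simplified sketch of how an assignment expression could be evaluated (the column names, the set of allowed functions, and this evaluation approach are illustrative assumptions, not the runner's actual API):

```python
import math

def apply_assignment(expression, row):
    """Evaluate an assignment expression such as
    'velocity = num_done_points / num_sprint_days' against one row of
    column values, returning the new column name and its computed value."""
    target, _, formula = expression.partition("=")
    allowed = {"log": math.log, "sqrt": math.sqrt}  # a limited set of functions
    # Evaluate with builtins disabled; column values act as variables.
    value = eval(formula, {"__builtins__": {}}, {**allowed, **row})
    return target.strip(), value

row = {"num_done_points": 30, "num_sprint_days": 15}
print(apply_assignment("velocity = num_done_points / num_sprint_days", row))
# ('velocity', 2.0)
```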

The ARFF file can be provided by the data-analysis repository, using the features.r script to collect, analyze and output features. This repository contains a Jenkinsfile with appropriate steps for a Jenkins CI deployment, with example arguments to the scripts.

After a prediction has finished, the tensor.py runner outputs a JSON file to sprint_results.json (or another path determined by the --results argument) which contains predictions for a validation set, model metrics, metadata and configuration. This format can be read by the sprint_results.r script from the data-analysis repository in order to combine the prediction model results with other sprint data, so that it can be used in an API for the prediction-site visualization.