The Midas Class

This page serves as the primary guide to the components of Midas.


Contents

Class Overview
A Note on VAE Functionality
Code Example
Model Hyperparameters
Match Model to Data
Training Hyperparameters
Imputation Generation
VAE-specific Methods
Misc/Utility Functions


Class Overview

Midas is implemented as a class similar (though not identical) to the Sklearn pattern. Because of the number of hyperparameters a user can specify, the procedure is broken down into model-instantiation parameters (the call to .__init__()), data-specific parameters (the call to .build_model()) and training-loop parameters (the call to .train_model()). This split also allows datasets to be streamed from disk, obviating the need to hold the whole dataset in memory while the model is fit. Finally, there are several ways to extract imputations, ranging from bulk generation into a list of imputed datasets to building a generator which yields an imputed dataset on each loop iteration. This order must be obeyed: attempting to train the model before it is built will raise an error, while imputing from an untrained model will not raise an error, so take care to complete each stage in sequence.

Many hyperparameters exist for either planned expansions or debugging. These parameters are not designed for use by end-users, and will be documented in a separate page.

Midas is built to integrate solely with Pandas. This enables better indexing and data manipulation.

A Note on VAE Functionality

The variational autoencoder (VAE) is a modification to a conventional autoencoder which leverages variational inference to improve semi-supervised learning. Rather than encoding a vector into a compressed hidden state, the variational autoencoder maps a vector into an isotropic Gaussian distribution which can be efficiently sampled and explored. This distribution can then be sampled either independently or conditional on some input. The VAE component of Midas was added on request of a user looking to generate "new observations" as part of an inferential check. Should the VAE capability be enabled, it will be constructed between the encode and decode layers of the model.

Beyond generating samples based on a learned latent distribution, VAEs are also particularly well-suited to dimensionality reduction because they yield a robust latent representation. Finally, by 'sweeping' samples through the latent space in a non-random manner, distinct regions of covariance can be explored. Users looking to leverage this capacity are advised to familiarise themselves with VAEs before attempting to make inferences based on generated samples; a number of introductory blog posts (of varying technicality) have been penned for the non-expert and provide a good starting point.
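
If this functionality is desired, the VAE layer is switched on at construction time. A minimal sketch, using the VAE-specific constructor parameters documented under Model Hyperparameters below (the specific values are illustrative only):

from midas import Midas

# Illustrative only: a small network with the VAE layer enabled. The
# isotropic Gaussian between the encode and decode layers will have
# 8 dimensions.
imputer = Midas(layer_structure=[256, 256],
                vae_layer=True,
                latent_space_size=8,
                vae_alpha=1.0)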

Code Example

The standard workflow for using Midas is as follows:

import pandas as pd
from midas import Midas

df = pd.read_csv('/path/to/file/containing/missing/data.csv')

# Convert categorical columns, normalise data
cats = pd.get_dummies(df['categorical_column'])
df.drop('categorical_column', axis=1, inplace=True)
norm_mean = df.mean()
norm_std = df.std()
df = df.sub(norm_mean).div(norm_std)
df = pd.concat([df, cats], axis=1)

# Build a two layer neural network with 128 units
imputer = Midas([128])
# Match data to the model. The dummies generated from a single categorical
# column are mutually exclusive, so they are passed as one softmax group.
imputer.build_model(df, softmax_columns=[list(cats.columns)])
# Train model. Model complexity is generally assessed here before training
# by means of the .overimpute() method
imputer.train_model(training_epochs= 50)

# Generate 5 imputations (stored in the .output_list attribute)
imputer.generate_samples(m=5)

# Conduct inference on each imputation
for imputed_df in imputer.output_list:
    pass  # inference code goes here

The rest of the guide will be broken into the broad sections outlined above, followed by miscellaneous methods.


Model Hyperparameters

Midas(
    layer_structure=[256, 256, 256],
    learn_rate=1e-4,
    input_drop=0.8,
    train_batch=16,
    savepath='tmp/MIDAS',
    seed=None,
    output_layers='reversed',
    dropout_level=0.5,
    weight_decay='default',
    act=tf.nn.elu,
    # VAE-specific parameters
    vae_layer=False,
    latent_space_size=4,
    vae_alpha=1.0
    )

The class constructor defines the main features of the imputation model. Each parameter will be defined, then followed by a tuning tip.

  • layer_structure: List of integers. Specifies the pattern by which the model construction methods will instantiate a neural network. The ordering defined in this list specifies the shape of the input 'encode' layers. Under the default output layer settings, the output 'decode' layers follow the same pattern in reverse. For reference, the default layer structure is a six-layer network, with each layer containing 256 units. Larger networks can learn more complex representations of data, but also require longer training times. Due to the large degree of regularisation used by MIDAS, making a model "too big" is less of a problem than making it "too small". If training time is relatively quick but you want improved performance, try increasing the size of the layers or, as an alternative, adding more layers. (More layers generally corresponds to learning more complex relationships.) As a rule of thumb, I keep the layers to powers of two - not only does this narrow the range of potential size values to search through, but it reportedly also helps with assigning the network to memory.

  • learn_rate: Float. Specifies the default learning rate for the training stage. In general, larger numbers give faster training, smaller numbers more accurate results. If the cost is exploding (i.e. increasing rather than decreasing), the first fix to try is reducing the learning rate.

  • input_drop: Float between 0 and 1. The 'keep' probability of each input column per training batch. A higher value will allow more data into MIDAS per draw, while a lower number seems to render the aggregated posterior more robust to bias from the data. (This effect will require further investigation, but the central tendency of the posterior seems to fall closer to the true value.) Empirically, a number between 0.7 and 0.95 works best. Numbers close to 1 reduce or eliminate the regularising benefits of input noise, as well as converting the model to a simple neural network regression.

  • train_batch: Integer. The batch size of each training pass. Larger batches mean more stable loss, as biased sampling is less likely to present an issue. However, the noise generated from smaller batches, as well as the greater number of training updates per epoch, allows the algorithm to converge to better optima. Research suggests the ideal number lies between 8 and 512 observations per batch. I've found that 16 seems to be ideal, and would only consider reducing it on enormous datasets where memory management is a concern.

  • savepath: String. Specifies the location to which Tensorflow will save the trained model.

  • seed: Integer. Initialises the pseudorandom number generator to a set value. Important for replicability.

  • output_layers: 'reversed' or list. If 'reversed', the output decode layers simply mirror layer_structure in reverse, as described under that parameter. If a list is passed, the decode network will be constructed according to the layer order specified.

  • dropout_level: Float between 0 and 1. In general, this parameter need not be modified; much research suggests that 0.5 is the optimal value for a dropout scheme used as it is here. If the dataset is particularly small, however, raising this value to around 0.8 (i.e. applying less dropout) may enable a better fit.

  • weight_decay: 'default' or float. The regularization term for the Adam-compatible weight penalty. This term penalises the squared magnitude of weights, thus helping the function defined by the model to remain a smooth curve. 'Default' is a strong penalty at 1/number_of_samples. Generally, numbers around 1e-5 give good results.

  • act: Tensorflow activation function. Most users should not modify this term unless they have a specific effect they are hoping to achieve. Per Gal, this is analogous to the covariance function of a deep GP, defining how variance should behave as the model function travels beyond the range of the data. tf.identity renders the DAE linear; tf.nn.relu is another popular activation function.

VAE-specific parameters:

  • vae_layer: Boolean. Determines whether or not the model will encode into an explicit distribution. See the above section for further details concerning what this entails.

  • latent_space_size: Integer. The number of dimensions that the isotropic Gaussian in the centre will possess. The greater the number, the more information can be encoded into the latent space, but the greater the chance of overfitting.

  • vae_alpha: Float. The strength of the penalty used to encode the inputs into the isotropic Gaussian. Serves as a regularisation term. Should model performance prove limited, it is generally advisable to increase the latent dimensionality rather than decrease vae_alpha.
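
To illustrate the tuning advice above, a sketch of a constructor call for a larger model is shown below. The values are illustrative rather than recommendations; the weight_decay of 1e-5 follows the rule of thumb mentioned in its entry.

import tensorflow as tf
from midas import Midas

# Three encode layers of 512 units (decode layers mirrored via the default
# output_layers='reversed'), a fixed seed for replicability, and an explicit
# weight penalty in place of 'default'.
imputer = Midas(layer_structure=[512, 512, 512],
                learn_rate=1e-4,
                input_drop=0.8,
                train_batch=16,
                seed=89,
                weight_decay=1e-5,
                act=tf.nn.elu)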


Match Model to Data

Midas().build_model(
    imputation_target,
    categorical_columns= None,
    softmax_columns= None,
    unsorted= True,
    additional_data = None,
    verbose= True,
    )

This method is called to construct the neural network that is the heart of MIDAS. This includes the assignment of loss functions to the appropriate data types.

THIS FUNCTION MUST BE CALLED BEFORE ANY TRAINING OR IMPUTATION OCCURS. Failing to do so will simply raise an error.

The categorical columns should be a list of column names, and softmax columns should be a list of lists of column names. This allows the model to dynamically assign cost functions to the correct variables (see the sketch after the parameter list below). If, however, the data comes pre-sorted, unsorted can be set to False, in which case the arguments can be passed in as integer sizes (i.e. the shape[1] attribute for each of the relevant categories).

In other words, if you're experienced with MIDAS and understand how its indexing works, pre-sort your data and pass in the integers so that specifying reindexing values doesn't become too onerous.

Alternatively, list(df.columns.values) will output a list of column names, which can easily be used in the 'for' loop that constructs your dummy variables.

  • imputation_target: DataFrame. Any data specified here will be rearranged and stored for the subsequent imputation process. The data must be preprocessed before it is passed to build_model.

  • categorical_columns: List of names. Specifies the binary (i.e. non-exclusive) categories to be imputed. If unsorted = False, this value should instead be an integer count of such columns.

  • softmax_columns: List of lists. Each inner list should contain column names and represent a set of mutually exclusive categories, such as day of the week. If unsorted = False, this should instead be a list of integers giving the size of each set.

  • unsorted: Boolean. The default, True, indicates that columns are specified by name and should be rearranged by MIDAS. Set to False to indicate that the data has been pre-sorted, in which case the column arguments are passed as integer sizes and indices can simply be appended to the size index.

  • additional_data: DataFrame (optional). Any data that should be included in the imputation model but is not required in the output. Data passed here will not be rearranged and will not generate a cost function. This forgoes some of the regularising effect of multiple loss functions, but reduces both network size requirements and training time.

  • verbose: Boolean. Set to False to suppress messages printing to terminal.
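
As referenced above, a sketch of matching a model to data containing both a binary indicator and a mutually exclusive set of dummies. The column names ('smoker' and the day-of-week dummies) are placeholders, and df is assumed to have been preprocessed as in the code example above:

# A single binary column is passed by name; the day-of-week dummies form
# one mutually exclusive set, so they are passed together as an inner list.
days = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
imputer.build_model(df,
                    categorical_columns=['smoker'],
                    softmax_columns=[days],
                    verbose=True)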


Training Hyperparameters

Midas().train_model(
    training_epochs= 100,
    verbose= True,
    verbosity_ival= 1,
    excessive= False
    )

This is the standard method for optimising the model's parameters. It must be called before imputation can be performed. The model is automatically saved upon conclusion of training.

  • training_epochs: Integer. Number of complete cycles through training dataset.

  • verbose: Boolean. Prints out messages, including loss values.

  • verbosity_ival: Integer. This number determines the interval between messages.

  • excessive: Boolean. Used for troubleshooting, this argument will cause the cost of each minibatch to be printed to the terminal.
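
Continuing the workflow from the code example above, a minimal training call (values illustrative). The number of epochs is typically chosen after inspecting the output of .overimpute(), documented under Misc/Utility Functions below:

# Train for 100 epochs, reporting the loss every 5 epochs.
imputer.train_model(training_epochs=100,
                    verbose=True,
                    verbosity_ival=5)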


Imputation Generation

Midas().generate_samples(
    m= 50,
    verbose= True
    )
Midas().yield_samples(
    m= 50,
    verbose= True
    )
Midas().batch_generate_samples(
    m= 50,
    b_size= 256,
    verbose= True
    )
Midas().batch_yield_samples(
    m= 50,
    b_size= 256,
    verbose= True
    )

Methods used to generate a set of m imputations. The "generate" family will construct m imputed datasets, storing them in the Midas().output_list attribute. The "yield" family will instantiate an iterable generator which can be used in a 'for' loop or comprehension.

If a model has been pre-trained, these methods can be called directly on subsequent runs without having to train again; an 'if' statement checking the default save location is useful for this. A sketch of both families follows the parameter list below.

  • m: Integer. Number of imputations to generate.

  • b_size: Integer. Number of data entries to process at once. For managing wider datasets, smaller numbers may be required.

  • verbose: Boolean. Prints out messages.
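
A sketch contrasting the two families, assuming the trained imputer from the code example above (the analysis step is a placeholder):

# 'Generate' family: construct all m imputed datasets up front and store
# them in the .output_list attribute.
imputer.generate_samples(m=10)
for imputed_df in imputer.output_list:
    pass  # run the analysis model on each completed dataset here

# 'Yield' family: produce one imputed dataset per loop iteration, avoiding
# holding all m datasets in memory at once.
for imputed_df in imputer.yield_samples(m=10):
    pass  # analysis code goes here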


VAE-specific Methods

Midas().inputs_to_z(
    b_size= 256,
    verbose= True
    )

Encodes the imputation target dataframe into the VAE latent space, then returns the parameterisation (mu and sigma) of each sample from within that space. These values can then be analysed for clusters, uncertainty, etc., or passed into a tool such as t-SNE for visualisation.

  • b_size: Integer. Number of data entries to process at once. For managing wider datasets, smaller numbers may be required.

  • verbose: Boolean. Prints out messages.

Returns: x_mu, x_log_sigma
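
A sketch of retrieving the latent parameterisation for downstream analysis. The t-SNE step is illustrative only and requires scikit-learn, which is not a MIDAS dependency:

from sklearn.manifold import TSNE

# Each row of x_mu is the mean of the Gaussian that the corresponding
# observation was mapped to, with x_log_sigma its (log) scale.
x_mu, x_log_sigma = imputer.inputs_to_z(b_size=256)

# Illustrative: project the latent means to 2D for visual inspection.
embedding = TSNE(n_components=2).fit_transform(x_mu)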

Midas().sample_from_z(
    sample_size= 256,
    verbose= True
    )

Takes sample_size draws from a Normal distribution and de-transforms these points, effectively creating new synthetic observations. These are primarily of interest for the purpose of posterior predictive checks, given VI's tendency to underestimate variance. This function enables the "generative" aspect of a VAE.

  • sample_size: Integer. Number of datapoints to sample from within the latent space.

  • verbose: Boolean. Prints out messages.
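
A minimal sketch of drawing synthetic observations; the method is assumed here to return the de-transformed draws as a single object:

# Draw 500 synthetic observations from the latent space. These are best
# treated as material for posterior predictive checks rather than as
# substitutes for real data.
synthetic = imputer.sample_from_z(sample_size=500)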

Midas().transform_from_z(
    data,
    b_size= 256,
    verbose= True
    )

This method enables direct interaction with the latent space for whatever purpose the user can imagine.

  • data: Pandas DataFrame or numpy array of width latent_space_size. These numbers can be sampled from some distribution, or can be structured so as to sweep through the latent space.

  • b_size: Integer. Number of data entries to process at once. For managing wider datasets, smaller numbers may be required.

  • verbose: Boolean. Prints out messages.
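
A sketch of a structured 'sweep' through the latent space, as discussed in the VAE section above. The width of 4 matches the default latent_space_size, and the returned object (assumed to contain the decoded observations) is assigned for inspection:

import numpy as np

# Sweep the first latent dimension from -3 to 3 while holding the remaining
# dimensions at zero, then decode the points back into data space.
sweep = np.zeros((100, 4))
sweep[:, 0] = np.linspace(-3, 3, 100)

decoded = imputer.transform_from_z(sweep, b_size=256)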


Misc/Utility Functions

Midas().overimpute(
    spikein = 0.1,
    training_epochs= 100,
    report_ival = 10,
    report_samples = 32,
    plot_all= True,
    verbose= True,
    verbosity_ival= 1,
    spike_seed= 42,
    cont_kdes = False,
    excessive= False
    )

This function spikes in additional missingness so that known values can be used to help adjust the complexity of the model. As conventional train/validation splits can still lead autoencoders to overtrain, the method used here for limiting complexity is overimputation combined with early stopping. This gives an estimate of how the model will react to unseen data.

Error is defined as RMSE for continuous variables, and classification error for binary and categorical variables (ie. 1 - accuracy). Note that this means that binary classification is inherently dependent on a selection threshold of 0.5, and softmax accuracy will naturally decrease as a function of the number of classes within the model. All three will be affected by the degree of imbalance within the dataset.

The accuracy measures provided here may not be ideal for all problems, but they are generally appropriate for selecting optimum complexity. Should the lines denoting error begin to trend upwards, this indicates overtraining and is a sign that the training_epochs parameter to the .train_model() method should be capped before this point.

The actual optimal point may differ from that indicated by the .overimpute() method for two reasons:

  • The additional missingness that is spiked in reduces the overall data available to the algorithm for learning the patterns inherent in the data, so there should be some improvement in performance when .train_model() is called on the full dataset. If this is a concern, it is possible to compare the behaviour of the loss figure between .train_model() and .overimpute().
  • The missingness inherent to the data may depend on some unobserved factor. In this case, the bias in the observed data may lead to inaccurate inference.

It is worth visually inspecting the distribution of the overimputed values against imputed values (using plot_all) to ensure that they fall within a sensible range.

  • spikein: Float between 0 and 1. The proportion of total values to remove from the dataset at random. As this is a random selection, the sample should be representative; because it falls equally on known and already-missing entries, the number effectively represents the percentage of known data removed. If concerns about sampling remain, adjusting this number or changing the seed allows for validation. Larger numbers mean greater amounts of removed data, which may skew estimates of optimal training time; this can be resolved by lowering the learning rate and aiming for a window rather than a precise point.

  • training_epochs: Integer. Specifies the number of epochs model should be trained for. It is often worth specifying longer than expected to ensure that the model does not overtrain, or that another, better, optimum exists given slightly longer training time.

  • report_ival: Integer. The interval between sampling from the posterior of the model. Smaller intervals mean a more granular view of convergence, but also drastically slow training time.

  • report_samples: Integer. The number of Monte-Carlo samples drawn for each check of the posterior at report_ival. Greater numbers of samples means a longer runtime for overimputation. For low numbers of samples, the impact will be reduced, though for large numbers of Monte-Carlo samples, report_ival will need to be adjusted accordingly. I recommend a number between 5 and 25, depending on the complexity of the data.

  • plot_all: Boolean. Generates plots of the distribution of spiked-in values vs. the mean of the imputations. Continuous values are shown as a density plot, categorical values as a bar plot of proportions. Only the mean is plotted at this point for simplicity's sake.

  • cont_kdes: Boolean. Determines whether or not a KDE should be generated for every individual imputation within a dataset. Useful to check for convergence of variance, or for subtle signs of overtraining.

  • verbose: Boolean. Prints out messages, including loss values.

  • verbosity_ival: Integer. This number determines the interval between messages.

  • spike_seed: Integer. A seed, separate from the one used in the main call, used to initialise the RNG for the missingness spike-in.

  • excessive: Boolean. Unlike .train_model()'s excessive argument, this prints the entire batch output to the terminal, allowing inspection for unusual values.
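
In practice, .overimpute() is usually run before .train_model() to choose a sensible training length. A sketch with illustrative values:

# Spike in 10% additional missingness and train for longer than expected,
# sampling the posterior every 25 epochs. The epoch at which the error
# lines begin to trend upwards is then used to cap training_epochs in the
# subsequent call to .train_model().
imputer.overimpute(spikein=0.1,
                   training_epochs=300,
                   report_ival=25,
                   report_samples=25,
                   plot_all=True)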

Midas().change_imputation_target(
    new_target,
    additional_data= None
    )

A small helper function written to facilitate standard train/test split routines. Realistically, imputation should be conducted on as much of the dataset as is feasible (to maximise total information), but in some circumstances it may be impossible to build a unified dataset. The only caveat is that, should additional_data have been passed to .build_model(), a matching set of columns must also be passed to this function.

  • new_target: Pandas DataFrame. Must have the same columns as the original input dataset.

  • additional_data: Pandas DataFrame (optional). Used to pass additional information into the imputation model regarding the new_target dataframe.
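
A sketch of the train/test routine this helper is intended for, assuming train_df and test_df are hypothetical splits sharing the same columns and preprocessing:

# Build and train on the training split, then swap in the held-out split
# and impute it with the already-trained model.
imputer.build_model(train_df)
imputer.train_model(training_epochs=100)

imputer.change_imputation_target(test_df)
imputer.generate_samples(m=10)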