Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update abstract PyFunc to utilise returned treatment_config #164

Merged

Conversation

deadlycoconuts
Copy link
Contributor

@deadlycoconuts deadlycoconuts commented Feb 8, 2022

Context

Turing SDK offers an abstract class PyFunc from which the concrete PyFuncEnsembler is implemented. In particular, this PyFunc class implements the abstract method predict from its parent mlflow.pyfunc.PythonModel, which gets called whenever the mlflow model is used to generate predictions (or in this case, ensembling results).

This predict method partially implements the batch ensembling logic, i.e. it performs some simple DataFrame manipulations by separating a single input DataFrame into DataFrames corresponding to features, predictions and treatment_config, that subsequently get passed to a user-defined ensemble method, which contains the rest (and the crux) of the ensembling logic.

Previously, the treatment_config field has always been unused, since in batch ensembling, a treatment can equivalently be defined in the features columns, hence the None value being passed to ensemble when it gets called by predict.

However, as we are moving to implement real-time pyfunc ensemblers that ultimately use the same PyFuncEnsembler class (which would allow a user to use the very same ensembler in both batch and real-time ensembling, when the same batch columns/live payload naming convention is used), there is a need to handle any treatment_config that appears in a live-ensembling request.

This PR thus aims to expose this treatment_config to the user in the ensemble method (which has previously always been a null object), allowing users to manipulate the payload from a live Turing router with a user defined ensembling method in a more intuitive manner.

Extra Context

In batch ensembling, predictions are currently passed to predict through the model_input argument. These predictions are contained in columns with the __predictions__ header, due to a tabular join operation upstream by other batch ensembling components. In order to not break the existing naming convention of columns in model_input, the future engine for the real-time ensembler will similarly pass predictions as well as treatment_config as columns in model_input with the same naming convention.

Features

  • Extraction of treatment_config data that gets passed to predict as part of the model_input argument into a separate DataFrame, which then gets passed to the ensemble method as the treament_config argument:
  • Refactoring of a few lines of existing code used to extract column names with the __[prefix_name]__ prefix into a static method:
@staticmethod
def _get_columns_with_prefix(df: pandas.DataFrame, prefix: str):
    selected_columns = {
        col: col[len(prefix):]
        for col in df.columns if col.startswith(prefix)
    }
    return selected_columns

Modifications

  • sdk/turing/ensembler.py - addition of logic to consider treatment_config being passed to predict as part of the model_input argument

@deadlycoconuts deadlycoconuts requested a review from a team February 8, 2022 04:14
Copy link
Contributor

@romanwozniak romanwozniak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, it LGTM, but I have some concerns such as:

  1. Even though it's expected that PyFunc ensemblers might be not as performant as custom ensemblers implemented with more performant languages, we should still try to squeeze the maximum out of it and avoid unnecessary pandas DF manipulation.
  2. We can afford to introduce breaking changes to the current PyFunc interface (ensemble method) if we need it to achieve better performance.
  3. treatment_config in the ensemble method is unstructured dict, however it's expected to be of a certain structure: https://github.com/gojek/turing/blob/main/docs/how-to/create-a-router/configure-ensembler.md. It could be a good time to update the PyFunc interface to reflect this.

sdk/turing/ensembler.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants