Description
Please fill out the form below.
System Information
- Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): SKLearn/Custom
- Framework Version: 0.20.0
- Python Version: 3.5
- CPU or GPU: CPU
- Python SDK Version: 1.18.2
- Are you using a custom image: No
Describe the problem
My prediction time is not proportional to the number of trees in a Random Forest
Minimal repro / logs
My estimation strategy consists of using a set of Random Forest models, each one covering a subset of the data (e.g. RF_A if feature == A). I mention this only for the sake of completeness, as I don't think it affects my issue.
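For completeness, a minimal sketch of this pattern (the splitting column, target name and model parameters below are illustrative, not my exact lib code):

```python
from sklearn.ensemble import RandomForestRegressor

def train_models_in_dict(raw_training_data):
    """Fit one Random Forest per value of a splitting feature (illustrative sketch).

    raw_training_data is a pandas DataFrame; "feature" and "target" are placeholder column names.
    """
    models = {}
    for key, subset in raw_training_data.groupby("feature"):
        rf = RandomForestRegressor(n_estimators=100)  # 100 or 300 trees depending on the scenario
        rf.fit(subset.drop(columns=["feature", "target"]), subset["target"])
        models[key] = rf
    return models
```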
My deployment strategy:
- Fit: return a pickle that contains a dictionary of fitted sklearn Random Forest models.
- Deploy: load this dictionary of models into memory.
- Inference:
-- map each observation to the correct model in the already loaded dictionary,
-- for each observation, compute the prediction given by each individual tree, in order to allow for an elementary confidence interval computation:
http://blog.datadive.net/prediction-intervals-for-random-forests/
Note that this last operation is the most time-consuming part of inference, and its duration is proportional to the number of trees in my RF (a Python loop over the trees); a minimal sketch is shown below.
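To illustrate, this is roughly what that per-tree loop looks like (a simplified sketch, not my exact prediction.predict code; model is one fitted RandomForestRegressor from the dictionary, and the percentile bounds are just examples):

```python
import numpy as np

def predict_with_interval(model, X, lower=5, upper=95):
    """Predict with an elementary confidence interval by querying every tree."""
    # One prediction per tree: this loop is what scales linearly with n_estimators.
    per_tree = np.stack([tree.predict(X) for tree in model.estimators_], axis=0)
    return {
        "mean": per_tree.mean(axis=0),
        "lower": np.percentile(per_tree, lower, axis=0),
        "upper": np.percentile(per_tree, upper, axis=0),
    }
```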
My code (my custom code lives in lib):

import argparse
import os
import sys

import pandas as pd
from sklearn.externals import joblib

# Make my custom modules importable inside the container.
module_path = os.path.abspath('/opt/ml/code')
if module_path not in sys.path:
    sys.path.append(module_path)

from lib import training, prediction
from data.transactions import raw


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    args = parser.parse_args()

    # Training: fit one Random Forest per subset and dump the dictionary of models.
    grid_models_dict = training.train_models_in_dict(raw_training_data=raw)
    joblib.dump(grid_models_dict, os.path.join(args.model_dir, "model"))


def model_fn(model_dir):
    # Loaded once per worker by the SageMaker scikit-learn serving container.
    grid_models_dict = joblib.load(os.path.join(model_dir, "model"))
    return grid_models_dict


def predict_fn(input_data, model):
    # Maps each observation to its model and loops over the trees (see the sketch above).
    predicted = prediction.predict(input_data, model)
    return predicted
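For reference, this is roughly how I fit and deploy this entry point with the Python SDK (a simplified sketch; the role, entry point name and payload are placeholders):

```python
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point='entry_point.py',        # placeholder name for the script above
    framework_version='0.20.0',
    train_instance_count=1,
    train_instance_type='ml.c5.4xlarge',
    role='my-sagemaker-execution-role',  # placeholder IAM role
)
estimator.fit()

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.c5.4xlarge',       # also tried ml.t2.xlarge
)
result = predictor.predict(observation)  # `observation` is a placeholder payload
```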
My problem:
I have two deployment scenarios: one with 100 trees/RF and one with 300 trees/RF.
Fit is performed without issues. On S3, the compressed 100 trees/RF pickle is 261 MB and the compressed 300 trees/RF pickle is 784 MB.
Deploy works but with some issues: some worker timeouts with the 300 trees/RF model, already reported for example in aws/amazon-sagemaker-examples#556, but it does deploy in the end.
Prediction is performed:
- with the 100 trees/RF: in around 500 ms, always, with the same observation;
- with the 300 trees/RF, on paper: with the same observation, since my prediction is a for loop over the trees, I should predict in at most about 1.5 seconds (3x the 100-tree time);
- with the 300 trees/RF, in practice, with the same observation:
-- sometimes (33% of cases) in 700 ms,
-- sometimes (33% of cases) in 40 to 50 seconds,
-- and sometimes (33% of cases) I get a timeout error (inference timeout is limited to 60 seconds).
This behavior persists when I deploy on a bigger/more recent instance (ml.t2.xlarge to ml.c5.4xlarge).
My guess is that there is a memory swapping mechanism at play, or that the container's memory is not fully privately allocated to my model beyond some threshold.
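One way I could check this hypothesis (illustrative only; psutil is an assumption and would have to be added to the container's requirements) would be to log the worker's resident memory from inside predict_fn:

```python
import os
import psutil  # assumption: not in the default image, would need to be added to requirements

def log_worker_memory(tag=""):
    """Print the resident memory of the current worker process (illustrative helper)."""
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1024.0 ** 2
    print("[{}] worker pid={} rss={:.0f} MB".format(tag, os.getpid(), rss_mb))
```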
Is there any solution to predict consistently with more than 100 trees/RF?
Thanks in advance.