
Batch transform: retain input values #358

Closed
chang2394 opened this issue Aug 20, 2018 · 18 comments
@chang2394

chang2394 commented Aug 20, 2018

I am doing a batch transform on a DeepAR model, but the output contains only the mean and quantile values. Is there some way to retain the input values from which the output was produced?
This is required because I have to plot prediction vs. actual for records matching specified filtering criteria.

@ChoiByungWook
Contributor

When you do a non-batch transform (inference), are you able to retain your input values?

From my understanding based on https://aws.amazon.com/blogs/aws/sagemaker-nysummit2018/, batch transform is meant to be inference over large amounts of data. Batch transform still uses the same inference code as non-batch transform, so if your inference code normally outputs the inputs it was given, they should show up in your batch transform output as well.

I don't know if it makes sense to output the input along with the inference. Is this a common practice?

@chang2394
Author

I was under the impression that this could be used for model analysis, in which case it does make sense to append the input to the output values.
If that is not the case, can you suggest some other way to solve this? I am looking for something similar to tensorflow-model-analysis for SageMaker models.

@ChoiByungWook
Contributor

Gotcha.

So from my understanding, you want the input values to be added to the output file.

I am not sure we will be able to honor this, but I'll add it as a feature request. My concern is that it may increase the size of the output file; however, maybe we can add an optional flag to output the inputs too.

Otherwise, I believe the output is ordered, meaning it follows the same order as the input, top down. So while annoying, you might be able to read from your input file and match each line with the corresponding line of your output.
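If that ordering guarantee holds, the correspondence can be reconstructed after the job finishes with a positional join. A minimal sketch in plain Python (file names and CSV layout are hypothetical; assumes exactly one prediction line per input line):

```python
def join_by_position(input_lines, output_lines):
    """Pair each input record with the prediction at the same position."""
    if len(input_lines) != len(output_lines):
        raise ValueError("input/output line counts differ; cannot join by position")
    return [f"{rec.rstrip()},{pred.rstrip()}"
            for rec, pred in zip(input_lines, output_lines)]

# Typical usage against a transform job's local copies of the files:
# with open("input.csv") as fin, open("input.csv.out") as fout:
#     joined = join_by_position(fin.readlines(), fout.readlines())
```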

@peacej

peacej commented Sep 5, 2018

This would definitely be useful. I have a postprocessing step that compares one of the input features to the prediction and tweaks the prediction if a certain condition is met. Including all the input columns might result in a very big file, so it would be nice to be able to choose only one or two of the input columns to include.
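Once an input row and its prediction are lined up, that kind of conditional tweak is a small pure function. A sketch with an entirely made-up rule (the column index and condition below are hypothetical, not from the thread):

```python
def postprocess(record: str, prediction: float) -> float:
    """Tweak the prediction when a condition on an input feature is met.

    Hypothetical rule: if the feature in CSV column 1 is negative,
    floor the prediction at zero.
    """
    feature = float(record.split(",")[1])
    return max(prediction, 0.0) if feature < 0 else prediction
```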

apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this issue Nov 15, 2018
Removed attachedModel in cleanup to avoid error
@Gloix

Gloix commented Jan 28, 2019

Working around this has been somewhat tedious. Our team settled on zipping lines from the output and input files together on AWS Glue or Lambda. We first tried a Glue job that matched lines from both files, but we parsed the files as Spark DataFrames, and there was no function to enumerate rows (due to the parallel nature of Spark, I guess) with which to match them.

We then thought Lambda would not be a problem, since the zipping should be an easy task, but we realized it was too much data to process: the Lambda timed out and sometimes ran out of memory due to an issue with Boto3 (boto/boto3#1670).

We went back to a Glue job, skipping the usual Glue APIs and just plain-loading the files from S3 and merging the lines; that way we could even perform further transformation to convert the time-series categories back into meaningful values.

The last issue we've had is an inconsistency in the number of lines output by DeepAR: sometimes the counts simply don't match.

Our team has lost about three days of work on this because the output files carry no meaningful (and custom) information linking them back to their corresponding input.

It would be nice to have these features in the short term to improve the usability of the product.
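When the line counts can disagree, as with the DeepAR inconsistency described above, a by-position merge should fail loudly rather than silently misalign rows. A defensive sketch in plain Python (the strict length check is the point; names are hypothetical):

```python
import itertools

_MISSING = object()  # sentinel that cannot collide with real lines

def strict_zip(inputs, outputs):
    """Merge two line streams positionally, raising if their lengths differ."""
    pairs = itertools.zip_longest(inputs, outputs, fillvalue=_MISSING)
    for i, (rec, pred) in enumerate(pairs):
        if rec is _MISSING or pred is _MISSING:
            raise ValueError(f"streams diverge at line {i}; refusing to misalign")
        yield f"{rec.rstrip()},{pred.rstrip()}"
```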

@yangaws
Contributor

yangaws commented Jan 29, 2019

Hi @chang2394 @peacej @Gloix ,

Thanks for contributing your thoughts. SageMaker is working on this feature request. We cannot provide any ETA now but we will get back to this issue when it's complete.

@joseramoncajide

Hi @chang2394!
This is just what I need: a way to return the features along with the model's predicted results!

@lincolmr

This would be extremely helpful! We are looking for a way to retain the input values (mainly for identification purposes) so that we can match each input to its corresponding output result.

@chrispruitt

Agreed! We would also benefit from this tremendously. Please keep us posted when you have an ETA @yangaws, thanks!

@saritajoshi9389

+1. We need to map the prediction results back to the original input.

@joseramoncajide

I solved it using tf.contrib.estimator.forward_features.
I asked the AWS team to update TF to 1.12 on their SageMaker instances, and now it works. You can tie batch predictions to incoming data using a feature key, as described here: https://www.tensorflow.org/api_docs/python/tf/contrib/estimator/forward_features
Hope it helps!
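For TensorFlow estimators, the wrap is essentially one line. A sketch (the estimator construction and the "record_id" key below are hypothetical; `tf.contrib` only exists in TF 1.x, so this does not apply to TF 2.x):

```python
import tensorflow as tf  # TF 1.12-era API; tf.contrib was removed in TF 2.x

estimator = tf.estimator.DNNRegressor(
    feature_columns=[tf.feature_column.numeric_column("x")],
    hidden_units=[16],
)
# Forward the "record_id" input feature into each prediction dict, so every
# batch prediction can be tied back to its originating row.
estimator = tf.contrib.estimator.forward_features(estimator, keys="record_id")
```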

@madmadmadman

madmadmadman commented Mar 27, 2019

I would also like an option that adds the original values, or a specified column, to the result, because in my case I do not use TensorFlow.

@MrYoungblood

Is there an update on the feature request? I am using the pre-built XGBoost algorithm and would love to see the feature there.

@tnaduc

tnaduc commented May 28, 2019

Is there any update on this feature?

Ideally, we could have an option to specify a list of columns (normally IDs and keys) to make sure the batch inference results are mapped correctly to the input. In addition, an option to operate on the predictions would be great as well (for calculating percentiles, generating probabilities, deriving labels from probabilities, etc.).
Thanks

@j3ffreyjohn

j3ffreyjohn commented Jul 8, 2019

Hi all, please check out last week's update to SageMaker Batch Transform, which satisfies these use cases. Currently, we support the CSV, JSON, and JSON Lines formats.

Feature Documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-data-processing.html

Blog Post showing an example with a standard UCI dataset: https://aws.amazon.com/blogs/machine-learning/associating-prediction-results-with-input-data-using-amazon-sagemaker-batch-transform/

Companion Notebooks: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker_batch_transform/batch_transform_associate_predictions_with_input

Please let us know your feedback on this feature!

@tnaduc @MrYoungblood @madmadmadman @saritajoshi9389 @chrispruitt @lincolmr @joseramoncajide @Gloix @peacej @chang2394
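In SDK terms, the update above surfaces as the `input_filter`, `output_filter`, and `join_source` arguments of `Transformer.transform` in the sagemaker Python SDK, all JSONPath-based. A sketch, with the bucket names, model name, and column layout entirely hypothetical (not runnable here, as it launches a real transform job):

```python
from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="my-model",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/transform-output/",
    accept="text/csv",
    assemble_with="Line",
)
transformer.transform(
    "s3://my-bucket/transform-input/data.csv",
    content_type="text/csv",
    split_type="Line",
    input_filter="$[1:]",     # drop the leading ID column before inference
    join_source="Input",      # append the prediction to each input record
    output_filter="$[0,-1]",  # keep only the ID and the prediction
)
```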

@mvsusp mvsusp closed this as completed Jul 9, 2019
@martinhammar

I think this is a great and needed feature.
Can't get it working with BlazingText and jsonlines data, though. Is that supported?

@adarsh-dattatri

Batch inference jobs on a SageMaker Autopilot model produce only labels as output. Can we also get scores? Is there an option for that? I need to do some post-processing on the scores.

@maddy2u

maddy2u commented Apr 28, 2020

We need to improve the overall solution for associating inputs with outputs.

A common use case is an inference pipeline; in my case it is a SparkML container followed by XGBoost. I pass a CSV file as input, which gets converted internally to a sparse vector frame that is passed on to the XGBoost image. At that point, I cannot associate inputs with outputs. Also, all the hosted algorithms would need to be enhanced to accept the same set of inputs: XGBoost accepts CSV and libsvm only, but association works for CSV only. I cannot use CSV because I have sparse vectors in my input frame, and making them dense is not an option given the size that results when parsing and saving as CSV.
