
Obtaining inverse transform values from tft.transform #185

Open
agonojo opened this issue Jul 8, 2020 · 17 comments

agonojo commented Jul 8, 2020

I'm thoroughly confused by the tft.transform output.
I've read dozens and dozens of documentation pages, and I've even tried (dumbly) to explore the saved_model.pb graph to see if I could find the constants computed during preprocessing with

tft.scale_to_z_score(outputs[key], name='z_scale_' + key)  # for the sake of an example, pretend key = 'height'

How can I obtain the std and mean computed during the AnalyzeAndTransformDataset step for my numeric column "height"?

I can clearly see in my transformed_data files that it has been transformed. In particular, this is important for my target prediction (regression problem). I feel silly. Can someone please point me in the right direction?

If it helps, I'm using the census_v2 code as an example; the only major difference is our model architecture and loss function (mine is custom).


agonojo commented Jul 9, 2020

If it's supposed to be saving something in my transform_model.variables or in the variables/ folder, it isn't.


agonojo commented Jul 9, 2020

Ah, it seems my pipeline is not producing this either, which I realize could give me those stats:
pre_transform_statistics_path <- not being created


zoyahav commented Jul 9, 2020

These constants are saved in the graph as constants used by the transformations.
If you want to see them in the trainer, you could also save those statistics as output features of your preprocessing_fn.

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  height = inputs['height']

  scaled_height = tft.scale_to_z_score(height, name='z_scale_height')
  height_mean = tft.mean(height)
  height_var = tft.var(height)

  # Broadcast the scalar statistics to a batch-sized feature so they are
  # emitted alongside every transformed example.
  batch_size = tf.shape(input=height)[0]
  def feature_from_scalar(value):
    return tf.tile(tf.expand_dims(value, 0), multiples=[batch_size])

  return {
    'scaled_height': scaled_height,
    'height_mean': feature_from_scalar(height_mean),
    'height_var': feature_from_scalar(height_var),
  }

{pre|post}_transform_statistics_path is where statistics about your data are stored for analysis purposes, not so much for using the contents during training or anything like that.
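To make the suggestion concrete, here is a minimal sketch (not from the thread) of how the broadcast statistics could be read back, e.g. in trainer code, and used to undo the scaling. The feature names match the snippet above, but the record and its values are invented, and it assumes TF 2.x eager mode.

```python
import tensorflow as tf

# A stand-in for one transformed record; the feature names match the
# preprocessing_fn sketch above, and the numbers are made up.
example = tf.train.Example(features=tf.train.Features(feature={
    'scaled_height': tf.train.Feature(float_list=tf.train.FloatList(value=[1.5])),
    'height_mean': tf.train.Feature(float_list=tf.train.FloatList(value=[170.0])),
    'height_var': tf.train.Feature(float_list=tf.train.FloatList(value=[64.0])),
}))

feature_spec = {
    'scaled_height': tf.io.FixedLenFeature([], tf.float32),
    'height_mean': tf.io.FixedLenFeature([], tf.float32),
    'height_var': tf.io.FixedLenFeature([], tf.float32),
}
parsed = tf.io.parse_single_example(example.SerializeToString(), feature_spec)

# Invert the z-score: original = scaled * std + mean
height = parsed['scaled_height'] * tf.sqrt(parsed['height_var']) + parsed['height_mean']
print(float(height))  # 1.5 * 8.0 + 170.0 = 182.0
```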


agonojo commented Jul 9, 2020

Thank you so much for such a timely response, Zoyahav. Indeed, I had checked the graph via:

pre_stats = tf.saved_model.load('/folder_path/tftransform_tmp/')

# print(pre_stats.graph.as_graph_def())
print(pre_stats.graph.as_graph_element("z_scale_height/mean_and_var/Const"))

but none of the constants were actually floats that look like a variance or mean (they were either int64 values or floats equal to 1.0).

If they really should be stored at all, they're definitely not being saved on my end, either in the graph or in any output folders, though my transformed data does appear to have undergone the transformations. I am testing to see if the few NaN values (which I imputed in a prior step) might be causing an issue.

But overall, having to compute something Beam is already calculating in order to perform z-scaling seems redundant and inefficient. I figured the statistics would be present in the output object or graph, but I can't seem to get them. Regardless, I can proceed as you're suggesting. Perhaps this is something a future update could provide? Or even just an option to reverse a transformation. Say, for example, I scale my numeric target: my prediction will then be scaled, and I have no idea what that means on my original scale, since I can't reverse the transformation without those statistics.

Now, that aside, do you find it concerning at all that my process isn't actually outputting the {pre|post}_transform_statistics folders? I would like those values.


agonojo commented Jul 9, 2020

Also, I included this in my pipeline following the decode step. I'll run again and update if I can get something working.

| 'GenerateStatistics' >> stats_api.GenerateStatistics(stats_options)
| 'WriteStatsOutput' >> stats_api.WriteStatisticsToTFRecord(output_path)

@agonojo agonojo closed this as completed Jul 14, 2020

agonojo commented Jul 14, 2020

> These constants are being saved in the graph as constants used for transformations. If you wanted to see these in the trainer, then you could also save those statistics as output features of your preprocessing_fn. […]

Side note: I was unable to get this to work. My preprocessing function returned an empty dict:

Traceback (most recent call last):
  File "tf_transform-2.py", line 1009, in <module>
    test_size=TEST_SIZE
  File "tf_transform-2.py", line 904, in main
    transform_data(TRAIN_FILES, TEST_FILES, working_dir)
  File "tf_transform-2.py", line 332, in transform_data
    raw_dataset | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/transforms/ptransform.py", line 562, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/pipeline.py", line 655, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/runners/runner.py", line 198, in apply
    return m(transform, input, options)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/runners/runner.py", line 228, in apply_PTransform
    return transform.expand(input)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_transform/beam/impl.py", line 1029, in expand
    dataset | 'AnalyzeDataset' >> AnalyzeDataset(self._preprocessing_fn))
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/transforms/ptransform.py", line 998, in __ror__
    return self.transform.__ror__(pvalueish, self.label)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/transforms/ptransform.py", line 562, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/pipeline.py", line 612, in apply
    return self.apply(transform, pvalueish)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/pipeline.py", line 655, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/runners/runner.py", line 198, in apply
    return m(transform, input, options)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/runners/runner.py", line 228, in apply_PTransform
    return transform.expand(input)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_transform/beam/impl.py", line 976, in expand
    None, input_metadata))
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_transform/beam/impl.py", line 835, in expand
    raise ValueError('The preprocessing function returned an empty dict')
ValueError: The preprocessing function returned an empty dict

@agonojo agonojo reopened this Jul 14, 2020

zoyahav commented Jul 15, 2020

Does your preprocessing_fn return an empty dict? (Perhaps check whether there's an indentation issue or something like that.)
In the snippet above it doesn't, so this error shouldn't occur.

@haifengkao

The pre_transform_statistics_path folder is empty. How should I get the mean and var?


zoyahav commented Sep 23, 2020

The mean and var are not in the pre_transform_statistics_path.
The mean and var can be obtained by calling tft.mean(), tft.var() in the preprocessing_fn.
Outside of the preprocessing_fn, for example in the trainer code, those can be obtained by returning a mean/var feature from your preprocessing_fn.

@haifengkao

So I have two options:

  1. store the same mean and var in every row of my data
  2. run the preprocessing_fn twice: the first pass for data standardization, and a second pass to compute the mean and var (which were already computed in the first pass, just with no way to get the values). I think this defeats the purpose of preprocessing_fn; it should compute everything in a single pass.


zoyahav commented Sep 24, 2020

Could you please explain the second option?
I was suggesting the first option above.


haifengkao commented Sep 25, 2020

Sure. The doc says AnalyzeAndTransformDataset "may be more efficient since it avoids multiple passes over the data".
No, it won't be more efficient, because:

Option 1 wastes space storing duplicated data if we want to preprocess the data in a single pass.
Option 2 wastes time because we need two passes over the data in order to get the statistics.

We could be more efficient, because the mean and var are already stored in the computation graph. We just don't have any API to get them.

https://www.tensorflow.org/tfx/transform/api_docs/python/tft_beam/AnalyzeAndTransformDataset
(screenshot of the AnalyzeAndTransformDataset documentation)
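On that point, the values of those graph constants can in fact be dug out of the GraphDef, though only via an internal, unsupported API. A sketch (the node-name pattern is an assumption, and a toy graph stands in for the loaded transform SavedModel):

```python
import tensorflow as tf
from tensorflow.python.framework import tensor_util  # internal, unsupported API

# In the thread this graph would come from the transform SavedModel, e.g.:
#   loaded = tf.saved_model.load('/folder_path/tftransform_tmp/')
#   graph_def = loaded.graph.as_graph_def()
# Here a toy graph holding one "mean" constant stands in:
g = tf.Graph()
with g.as_default():
    tf.constant(170.0, name='z_scale_height/mean_and_var/mean')
graph_def = g.as_graph_def()

# Pull the value out of every Const node whose name suggests it holds
# the analyzer's mean/var.
constants = {}
for node in graph_def.node:
    if node.op == 'Const' and 'mean_and_var' in node.name:
        constants[node.name] = tensor_util.MakeNdarray(node.attr['value'].tensor)
print(constants)
```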


zoyahav commented Sep 25, 2020

I'm still not understanding option 2. I'm assuming "run the preprocessing_fn twice" means a completely different pipeline with a different preprocessing_fn. That's not a good idea, because tracking compatibility between the pipelines is not trivial, and it defeats the purpose of hermetic preprocessing used for both training and serving in order to avoid training/inference skew.

Option 1 wastes some space, yes, but you also don't need to call TransformDataset() (or AnalyzeAndTransformDataset); you could just produce the TFT output in the form of a SavedModel and apply those transformations during training and serving, with access to the additional features.

@haifengkao

Ok, now I understand your point.
The problem is that I don't know how to apply the TFT output, in the form of a SavedModel, to TensorFlow Lite.
Do you have any tutorial or doc on this topic?


dimileeh commented Sep 6, 2021

I can't figure it out either. TFT has such nice functions as scale_to_z_score and scale_to_0_1, and I've been using them with AnalyzeDataset and TransformDataset in Beam, including for the label in a regression problem.

But after I run inference and get predictions, also through Beam, how do I apply the inverse transform to the labels to get the actual predictions?

I found a way in this thread: by saving label_mean and label_var (or label_min and label_max) into every tf.Example and then running the inverse transformation myself. But that's way too inelegant :(
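The manual inverse mentioned here can be sketched with plain NumPy. The statistic values and prediction arrays below are made-up placeholders; only the two inverse formulas (undoing tft.scale_to_z_score and tft.scale_to_0_1) are the point.

```python
import numpy as np

# Hypothetical saved statistics, e.g. read back from per-row features as
# suggested earlier in this thread (values are illustrative only).
label_mean, label_var = 3.2, 0.25   # stats behind scale_to_z_score
label_min, label_max = 0.0, 10.0    # stats behind scale_to_0_1

pred_z = np.array([0.4, -1.2])      # model output in z-score space
pred_01 = np.array([0.75, 0.1])     # model output in [0, 1] space

# Inverse of scale_to_z_score: x = z * std + mean
labels_from_z = pred_z * np.sqrt(label_var) + label_mean

# Inverse of scale_to_0_1: x = s * (max - min) + min
labels_from_01 = pred_01 * (label_max - label_min) + label_min

print(labels_from_z)   # [3.4 2.6]
print(labels_from_01)  # [7.5 1. ]
```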

Did anyone find a solution for such a problem?


zoyahav commented Sep 7, 2021

We can keep this in mind as a feature request, but yes, in the meantime it has to be done manually.
Regarding documentation for applying the TFT output, I don't believe we have any specific to TFLite; there's just the generic one here.

@UsharaniPagadala

@agonojo

Could you please confirm whether this issue can be closed. Thanks!
