
Obtaining inverse transform values from tft.transform #185

Open
agonojo opened this issue Jul 8, 2020 · 17 comments

agonojo commented Jul 8, 2020

I'm thoroughly confused by the tft.transform output.
I've read dozens and dozens of documentation pages, and I've even tried (dumbly) to explore the saved_model.pb graph to see if I could find the constants computed during preprocessing with

tft.scale_to_z_score(outputs[key], name='z_scale_' + key)  # for the sake of an example, pretend key = 'height'

How can I obtain the std and mean computed during the AnalyzeAndTransformDataset step for my numeric column "height"?

I can clearly see in my transformed_data files that it has been transformed. In particular, this is important for my target prediction (regression problem). I feel silly. Can someone please point me in the right direction?

If it helps, I'm using the census_v2 code as an example; the only major difference is our model architecture and loss function (mine is custom).


agonojo commented Jul 9, 2020

If it's supposed to be saving something in my transform_model.variables or in the variables/ folder, it isn't.


agonojo commented Jul 9, 2020

Ah, it seems my pipeline is not producing this either, which I realize could give me those stats:
pre_transform_statistics_path <- not being created


zoyahav commented Jul 9, 2020

These constants are saved in the graph as constants used by the transformations.
If you want to see them in the trainer, you could also save those statistics as output features of your preprocessing_fn.

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  height = inputs['height']

  scaled_height = tft.scale_to_z_score(height, name='z_scale_height')
  height_mean = tft.mean(height)
  height_var = tft.var(height)

  # Broadcast the scalar statistics to a batch-sized feature so they are
  # emitted alongside every transformed example.
  batch_size = tf.shape(input=height)[0]
  def feature_from_scalar(value):
    return tf.tile(tf.expand_dims(value, 0), multiples=[batch_size])

  return {
    'scaled_height': scaled_height,
    'height_mean': feature_from_scalar(height_mean),
    'height_var': feature_from_scalar(height_var),
  }

{pre|post}_transform_statistics_path is where statistics about your data are stored for analysis purposes, not so much for using the contents during training or anything like that.
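To make the suggestion concrete, here is a minimal sketch (not from the thread) of how the broadcast statistics could be read back, e.g. in trainer code, and used to undo the scaling. The feature names match the snippet above, but the record and its values are invented, and it assumes TF 2.x eager mode.

```python
import tensorflow as tf

# A stand-in for one transformed record; the feature names match the
# preprocessing_fn sketch above, and the numbers are made up.
example = tf.train.Example(features=tf.train.Features(feature={
    'scaled_height': tf.train.Feature(float_list=tf.train.FloatList(value=[1.5])),
    'height_mean': tf.train.Feature(float_list=tf.train.FloatList(value=[170.0])),
    'height_var': tf.train.Feature(float_list=tf.train.FloatList(value=[64.0])),
}))

feature_spec = {
    'scaled_height': tf.io.FixedLenFeature([], tf.float32),
    'height_mean': tf.io.FixedLenFeature([], tf.float32),
    'height_var': tf.io.FixedLenFeature([], tf.float32),
}
parsed = tf.io.parse_single_example(example.SerializeToString(), feature_spec)

# Invert the z-score: original = scaled * std + mean
height = parsed['scaled_height'] * tf.sqrt(parsed['height_var']) + parsed['height_mean']
print(float(height))  # 1.5 * 8.0 + 170.0 = 182.0
```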


agonojo commented Jul 9, 2020

Thank you so much for such a timely response, Zoyahav. Indeed, I had checked the graph via:

pre_stats = tf.saved_model.load('/folder_path/tftransform_tmp/')

# print(pre_stats.graph.as_graph_def())
print(pre_stats.graph.as_graph_element("z_scale_height/mean_and_var/Const"))

but none of the constants were actually floats that look like a variance or mean (they were either int64 values or floats equal to 1.0).

If they really should be stored at all, they're definitely not being saved on my end, either in the graph or in any output folders, though my transformed data does appear to have undergone the transformations. I am testing to see if the few NaN values (which I imputed in a prior step) might be causing an issue.

But overall, having to compute something Beam is already calculating in order to perform z-scaling seems redundant and inefficient. I figured the statistics would be present in the output object or graph, but I can't seem to get them. Regardless, I can proceed as you're suggesting. Perhaps this is something a future update could provide? Or even just an option to reverse a transformation. Say, for example, I scale my numeric target: my prediction will then be scaled, and I have no idea what that means on my original scale, since I can't reverse the transformation without those statistics.

Now, that aside, do you find it concerning at all that my process isn't actually outputting the {pre|post}_transform_statistics folders? I would like those values.


agonojo commented Jul 9, 2020

Also, I included this in my pipeline following the decode step. I'll run again and update if I can get something working.

| 'GenerateStatistics' >> stats_api.GenerateStatistics(stats_options)
| 'WriteStatsOutput' >> stats_api.WriteStatisticsToTFRecord(output_path)

@agonojo agonojo closed this as completed Jul 14, 2020

agonojo commented Jul 14, 2020

> These constants are being saved in the graph as constants used for transformations. If you wanted to see these in the trainer, then you could also save those statistics as output features of your preprocessing_fn. […]

Side note: I was unable to get this to work. My preprocessing function returned an empty dict:

Traceback (most recent call last):
  File "tf_transform-2.py", line 1009, in <module>
    test_size=TEST_SIZE
  File "tf_transform-2.py", line 904, in main
    transform_data(TRAIN_FILES, TEST_FILES, working_dir)
  File "tf_transform-2.py", line 332, in transform_data
    raw_dataset | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/transforms/ptransform.py", line 562, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/pipeline.py", line 655, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/runners/runner.py", line 198, in apply
    return m(transform, input, options)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/runners/runner.py", line 228, in apply_PTransform
    return transform.expand(input)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_transform/beam/impl.py", line 1029, in expand
    dataset | 'AnalyzeDataset' >> AnalyzeDataset(self._preprocessing_fn))
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/transforms/ptransform.py", line 998, in __ror__
    return self.transform.__ror__(pvalueish, self.label)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/transforms/ptransform.py", line 562, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/pipeline.py", line 612, in apply
    return self.apply(transform, pvalueish)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/pipeline.py", line 655, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/runners/runner.py", line 198, in apply
    return m(transform, input, options)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/apache_beam/runners/runner.py", line 228, in apply_PTransform
    return transform.expand(input)
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_transform/beam/impl.py", line 976, in expand
    None, input_metadata))
  File "/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow_transform/beam/impl.py", line 835, in expand
    raise ValueError('The preprocessing function returned an empty dict')
ValueError: The preprocessing function returned an empty dict

@agonojo agonojo reopened this Jul 14, 2020

zoyahav commented Jul 15, 2020

Does your preprocessing_fn return an empty dict? (Perhaps check whether there's an indentation issue or something like that.)
In the snippet above it doesn't, so this error shouldn't occur.

@haifengkao

The pre_transform_statistics_path folder is empty. How should I get the mean and var?


zoyahav commented Sep 23, 2020

The mean and var are not in the pre_transform_statistics_path.
The mean and var can be obtained by calling tft.mean(), tft.var() in the preprocessing_fn.
Outside of the preprocessing_fn, for example in the trainer code, those can be obtained by returning a mean/var feature from your preprocessing_fn.

@haifengkao

So I have two options:

  1. store the same mean and var in every row of my data
  2. run the preprocessing_fn twice: the first pass for data standardization, and a second pass to compute the mean and var (which were already computed in the first pass, just with no way to get the values). I think this defeats the purpose of preprocessing_fn; it should compute everything in a single pass.


zoyahav commented Sep 24, 2020

Could you please explain the second option?
I was suggesting the first option above.


haifengkao commented Sep 25, 2020

Sure. The doc says AnalyzeAndTransformDataset "may be more efficient since it avoids multiple passes over the data".
No, it won't be more efficient, because:

Option 1 wastes space storing duplicated data if we want to preprocess the data in a single pass.
Option 2 wastes time because we need two passes over the data in order to get the statistics.

We could be more efficient, because the mean and var are already stored in the computation graph. We just don't have any API to get them.

https://www.tensorflow.org/tfx/transform/api_docs/python/tft_beam/AnalyzeAndTransformDataset
(screenshot of the AnalyzeAndTransformDataset documentation)
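On that point, the values of those graph constants can in fact be dug out of the GraphDef, though only via an internal, unsupported API. A sketch (the node-name pattern is an assumption, and a toy graph stands in for the loaded transform SavedModel):

```python
import tensorflow as tf
from tensorflow.python.framework import tensor_util  # internal, unsupported API

# In the thread this graph would come from the transform SavedModel, e.g.:
#   loaded = tf.saved_model.load('/folder_path/tftransform_tmp/')
#   graph_def = loaded.graph.as_graph_def()
# Here a toy graph holding one "mean" constant stands in:
g = tf.Graph()
with g.as_default():
    tf.constant(170.0, name='z_scale_height/mean_and_var/mean')
graph_def = g.as_graph_def()

# Pull the value out of every Const node whose name suggests it holds
# the analyzer's mean/var.
constants = {}
for node in graph_def.node:
    if node.op == 'Const' and 'mean_and_var' in node.name:
        constants[node.name] = tensor_util.MakeNdarray(node.attr['value'].tensor)
print(constants)
```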


zoyahav commented Sep 25, 2020

I'm still not understanding option 2. I'm assuming "run the preprocessing_fn twice" means a completely different pipeline with a different preprocessing_fn. That's not a good idea, because tracking compatibility between the pipelines is not trivial, and it defeats the purpose of hermetic preprocessing used for both training and serving in order to avoid training/inference skew.

Option 1 wastes some space, yes, but you also don't need to call TransformDataset() (or AnalyzeAndTransformDataset); you could just produce the TFT output in the form of a SavedModel and apply those transformations during training and serving, with access to the additional features.

@haifengkao

Ok, now I understand your point.
The problem is that I don't know how to apply the TFT output, in the form of a SavedModel, to TensorFlow Lite.
Do you have any tutorial or doc on this topic?


dimileeh commented Sep 6, 2021

I can't figure it out either. TFT has such nice functions as scale_to_z_score and scale_to_0_1, and I've been using them with AnalyzeDataset and TransformDataset in Beam, including for the label in a regression problem.

But after I run inference and get predictions, also through Beam, how do I apply the inverse transform to the labels to get the actual predictions?

I found a way in this thread: by saving label_mean and label_var (or label_min and label_max) into every tf.Example and then running the inverse transformation myself. But that's way too inelegant :(
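The manual inverse mentioned here can be sketched with plain NumPy. The statistic values and prediction arrays below are made-up placeholders; only the two inverse formulas (undoing tft.scale_to_z_score and tft.scale_to_0_1) are the point.

```python
import numpy as np

# Hypothetical saved statistics, e.g. read back from per-row features as
# suggested earlier in this thread (values are illustrative only).
label_mean, label_var = 3.2, 0.25   # stats behind scale_to_z_score
label_min, label_max = 0.0, 10.0    # stats behind scale_to_0_1

pred_z = np.array([0.4, -1.2])      # model output in z-score space
pred_01 = np.array([0.75, 0.1])     # model output in [0, 1] space

# Inverse of scale_to_z_score: x = z * std + mean
labels_from_z = pred_z * np.sqrt(label_var) + label_mean

# Inverse of scale_to_0_1: x = s * (max - min) + min
labels_from_01 = pred_01 * (label_max - label_min) + label_min

print(labels_from_z)   # [3.4 2.6]
print(labels_from_01)  # [7.5 1. ]
```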

Did anyone find a solution for such a problem?


zoyahav commented Sep 7, 2021

We can keep this in mind as a feature request, but yes, in the meantime it has to be done manually.
Regarding documentation for applying the TFT output, I don't believe we have any specific to TFLite; there's just the generic one here.

@UsharaniPagadala

@agonojo

Could you please confirm whether this issue can be closed. Thanks!
