scale_to_z_score_per_key should give caller control over OOV behavior #252
Comments
In order to expedite the troubleshooting process, please provide a code snippet to reproduce the issue reported here. Thanks
Sure:

```python
"""Simple example of tf.Transform usage."""

import pprint
import tempfile

import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import schema_utils


def main():

  def preprocessing_fn(inputs):
    """Preprocess input columns into transformed columns."""
    x = inputs['x']
    s = inputs['s']
    return {
        'scale_per_key': tft.scale_to_z_score_per_key(x=x, key=s)
    }

  training_data = [
      {'x': 1, 's': 'hello'},
      {'x': 2, 's': 'world'},
      {'x': 3, 's': 'hello'},
      {'x': 4, 's': 'world'},
  ]

  test_data = [
      {'x': 2, 's': 'hello'},
      {'x': 5, 's': 'world'},
      {'x': 1000000, 's': 'foo'},
  ]

  raw_data_metadata = dataset_metadata.DatasetMetadata(
      schema_utils.schema_from_feature_spec({
          's': tf.io.FixedLenFeature([], tf.string),
          'x': tf.io.FixedLenFeature([], tf.float32),
      }))

  with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (  # pylint: disable=unused-variable
        (training_data, raw_data_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
    transformed_test_dataset = (
        ((test_data, raw_data_metadata), transform_fn)
        | tft_beam.TransformDataset())

  transformed_data, transformed_metadata = transformed_dataset  # pylint: disable=unused-variable
  transformed_test_data, transformed_metadata = transformed_test_dataset

  pprint.pprint(transformed_data)
  pprint.pprint(transformed_test_data)


if __name__ == '__main__':
  main()
```

Output:
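The original output is not reproduced here, but working the per-key statistics out by hand (population mean and variance per key over the training data: 'hello' has mean 2, 'world' has mean 3, both have std 1) should give roughly:

```python
# Transformed training data:
[{'scale_per_key': -1.0}, {'scale_per_key': -1.0},
 {'scale_per_key': 1.0}, {'scale_per_key': 1.0}]
# Transformed test data: 'foo' was never seen during analysis, so its value is
# passed through unscaled instead of being z-scored:
[{'scale_per_key': 0.0}, {'scale_per_key': 2.0}, {'scale_per_key': 1000000.0}]
```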
Note that this is not a bug. This is working as designed and as documented. My point is that it may not be desirable to simply pass through unscaled values for OOV keys.
Thanks for the feedback. I actually thought this behaviour had changed already, but I see that this was only the case for the scale_by_min_max / scale_to_0_1 per-key mappers. Perhaps a short-term workaround is to switch to this type of scaling:
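Something along these lines (a rough sketch, untested; the original snippet from this comment is not preserved here):

```python
def preprocessing_fn(inputs):
  return {
      # scale_to_0_1_per_key bounds the output to [0, 1]; per the later comment,
      # values for out-of-vocabulary keys are squashed with a sigmoid instead of
      # being passed through unscaled.
      'scale_per_key': tft.scale_to_0_1_per_key(x=inputs['x'], key=inputs['s'])
  }
```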
For mean/var we assume 0 for OOV keys, but perhaps we should scale those too, as is suggested here. @iindyk what do you think?
For my purposes, allowing the user to specify a default value to impute for OOV keys would be most preferable. However, I can understand if it is undesirable to make the function too complicated by passing along too many options. People with more specialized needs can always implement their own mapper that does what they want (which is what I ultimately had to do anyway).
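For illustration, a rough sketch of that kind of custom mapper (not my actual code; the helper name and `default_value` argument are made up here), which imputes a caller-supplied default for OOV keys by checking key membership against a vocabulary computed during analysis:

```python
import tensorflow as tf
import tensorflow_transform as tft


def scale_to_z_score_per_key_with_default(x, key, default_value):
  """Per-key z-scoring that imputes `default_value` for keys unseen in analysis."""
  # Learn the set of keys seen during analysis; unseen keys map to index -1.
  key_id = tft.compute_and_apply_vocabulary(key, default_value=-1)
  # Standard per-key z-scoring; OOV keys come back unscaled.
  scaled = tft.scale_to_z_score_per_key(x=x, key=key)
  # Replace the unscaled OOV values with the caller-supplied default.
  default = tf.fill(tf.shape(scaled), tf.cast(default_value, scaled.dtype))
  return tf.where(tf.equal(key_id, -1), default, scaled)
```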
sigmoid made sense for scale_to_0_1 because the result is bounded in a specific interval, and there's no such interval for scale_to_z_score (we could make one up, though). Specifying a default value does seem sound, but it adds to the list of args and may have rare usage. Moreover, we'd want to be consistent between all *_per_key mappers, so we'd need to add the arg to them all. Adding the arg in a backwards-compatible manner would make it confusing (e.g. if a default value is set, use it; if not, scale in the existing way).
Related to #220
Currently, if `tft.scale_to_z_score_per_key` is used and `key` is OOV at inference time, the value is returned unscaled, as documented. But this may not be the desired behavior for OOV entries. In some use cases, I may want OOV keys to be mapped to 0, or to some large negative number. It seems fairly application-dependent. It would be good to give the caller control over this behavior.
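To make the request concrete, a purely hypothetical signature (the `oov_default_value` argument does not exist in tf.Transform; it only illustrates the kind of control requested):

```python
# Hypothetical, for illustration only; no such argument exists today.
scaled = tft.scale_to_z_score_per_key(
    x=x,
    key=s,
    oov_default_value=0.0)  # value to impute when `key` was not seen in analysis
```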