Using tfdv to validate text based data #215

Open
Capsar opened this issue Jun 9, 2022 · 0 comments


Capsar commented Jun 9, 2022

Hi,

I searched online to find out whether tfdv can be used to validate data that contains text, for instance a dataset of sentences that have to be mapped to labels. I could not find any really useful tutorials: the ones I found only cover numerical features of a dataset, such as height and weight.

After looking around in the data-validation package, I found a couple of files that seem to be related to this:
https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/generators/natural_language_stats_generator.py
and
https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/statistics/generators/natural_language_domain_inferring_stats_generator.py

Furthermore, in the documentation for the StatsOptions class on the TensorFlow website I found the following:
https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/StatsOptions

| Argument | Description |
| --- | --- |
| enable_semantic_domain_stats | If True statistics for semantic domains are generated (e.g: image, text domains). |
| semantic_domain_stats_sample_rate | An optional sampling rate for semantic domain statistics. If specified, semantic domain statistics is computed over a sample. |
| vocab_paths | An optional dictionary mapping vocab names to paths. Used in the schema when specifying a NaturalLanguageDomain. The paths can either be to GZIP-compressed TF record files that have a tfrecord.gz suffix or to text files. |

These arguments and files indicate that tfdv can be used to analyze and validate data for NLP and text classification problems.
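
For reference, a minimal sketch of how the three arguments quoted above might be passed to tfdv. It assumes `tfdv.StatsOptions` accepts them as keyword arguments (as the linked documentation describes) and that the data is in a pandas DataFrame. The helper names, the vocab name `my_vocab`, and the path `vocab.txt` are hypothetical and only there for illustration.

```python
from typing import Any, Dict, Optional


def text_stats_kwargs(
    vocab_paths: Optional[Dict[str, str]] = None,
    sample_rate: Optional[float] = None,
) -> Dict[str, Any]:
    """Collect the StatsOptions arguments quoted above into keyword arguments."""
    kwargs: Dict[str, Any] = {"enable_semantic_domain_stats": True}
    if sample_rate is not None:
        # Optional: compute semantic domain statistics over a sample only.
        kwargs["semantic_domain_stats_sample_rate"] = sample_rate
    if vocab_paths:
        # Maps a vocab name to a text file or a .tfrecord.gz file.
        kwargs["vocab_paths"] = vocab_paths
    return kwargs


def generate_text_stats(dataframe, vocab_paths=None, sample_rate=None):
    """Generate statistics for a DataFrame with semantic domain stats enabled."""
    # Imported lazily so the helper above works without TFDV installed.
    import tensorflow_data_validation as tfdv

    options = tfdv.StatsOptions(**text_stats_kwargs(vocab_paths, sample_rate))
    return tfdv.generate_statistics_from_dataframe(dataframe, stats_options=options)
```

A call could look like `generate_text_stats(df, vocab_paths={"my_vocab": "vocab.txt"})`, where `df` holds the sentence and label columns.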

However, it is unclear to me how one would actually use these features to validate text-based data.
I have enabled the enable_semantic_domain_stats argument, and this does give information such as sequence length.
But how would one extend this to validate vocabularies, for example by checking the ratio of known to unknown words?
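
To make concrete what is meant by a known/unknown word ratio, here is a small sketch in plain Python. It does not use tfdv at all: the function name, the whitespace tokenization, and the lowercasing are assumptions made for illustration, not the way tfdv computes its natural language statistics.

```python
from typing import Iterable, Set


def unknown_token_ratio(sentences: Iterable[str], vocab: Set[str]) -> float:
    """Return the fraction of tokens that do not appear in the vocabulary."""
    total = 0
    unknown = 0
    for sentence in sentences:
        # Assumption: lowercase, whitespace-separated tokens.
        for token in sentence.lower().split():
            total += 1
            if token not in vocab:
                unknown += 1
    # An empty dataset has no unknown tokens by convention.
    return unknown / total if total else 0.0


vocab = {"the", "cat", "sat"}
sentences = ["The cat sat", "the dog sat"]
# Six tokens in total, of which only "dog" is out of vocabulary.
ratio = unknown_token_ratio(sentences, vocab)
```

A validation step could then compare `ratio` against a chosen threshold and flag the dataset when too many words are unknown. The question is whether tfdv can express this kind of check through the schema.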

Any tips or thoughts are highly appreciated!
Kind Regards,
Caspar


3 participants