Filter/Data Quality Component #6909

Open
pritamdodeja opened this issue Aug 29, 2024 · 0 comments

If the feature is related to a specific library below, please raise an issue in
the respective repo directly:

TensorFlow Data Validation Repo

TensorFlow Model Analysis Repo

TensorFlow Transform Repo

TensorFlow Serving Repo

System information

  • TFX Version (you are using): 1.15.1

  • Environment in which you plan to use the feature (e.g., Local
    (Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc..):
    Local, GCP

  • Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state.
Whenever we have a tabular dataset, we compute statistics, infer a schema, and then validate examples against it. We can fine-tune the schema to integrate domain knowledge into the validation process. Why can't we filter "bad" data in the pipeline itself?
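To make the request concrete, here is a minimal pure-Python sketch of per-example filtering against a schema. It is only an illustration of the idea: `FeatureSpec`, `violations`, and `split_examples` are hypothetical names, not TFX or TFDV APIs, and the schema is a plain dict rather than a real TFDV schema proto.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical stand-in for one schema entry (not a TFDV type)."""
    dtype: type
    required: bool = True
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    allowed: Optional[frozenset] = None


Schema = Dict[str, FeatureSpec]
Example = Dict[str, Any]


def violations(example: Example, schema: Schema) -> List[str]:
    """Return a human-readable list of schema violations for one example."""
    problems: List[str] = []
    for name, spec in schema.items():
        value = example.get(name)
        if value is None:
            if spec.required:
                problems.append(f"{name}: missing")
            continue
        if not isinstance(value, spec.dtype):
            problems.append(f"{name}: expected {spec.dtype.__name__}")
            continue
        if spec.min_value is not None and value < spec.min_value:
            problems.append(f"{name}: below {spec.min_value}")
        if spec.max_value is not None and value > spec.max_value:
            problems.append(f"{name}: above {spec.max_value}")
        if spec.allowed is not None and value not in spec.allowed:
            problems.append(f"{name}: unexpected value {value!r}")
    return problems


def split_examples(
    examples: Iterable[Example], schema: Schema
) -> Tuple[List[Example], List[Tuple[Example, List[str]]]]:
    """Split examples into those that pass the schema and those that do not."""
    kept: List[Example] = []
    rejected: List[Tuple[Example, List[str]]] = []
    for example in examples:
        problems = violations(example, schema)
        if problems:
            rejected.append((example, problems))
        else:
            kept.append(example)
    return kept, rejected


# Illustrative schema and data; feature names and bounds are made up.
schema = {
    "age": FeatureSpec(int, min_value=0, max_value=120),
    "country": FeatureSpec(str, allowed=frozenset({"US", "IN"})),
}
rows = [
    {"age": 34, "country": "US"},
    {"age": -1, "country": "US"},   # out of range
    {"age": 50, "country": "XX"},   # value not in domain
    {"country": "IN"},              # required feature missing
]
kept, rejected = split_examples(rows, schema)
```

Keeping the rejected examples together with their reasons, rather than silently dropping them, is one possible design choice; it would let a pipeline record both outputs as artifacts.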

Will this change the current API? How?
It would introduce a new component that filters examples based on the schema. The inputs to Transform and to other components that consume examples could change, depending on the user's preference.

Who will benefit with this feature?
Users who want to do the entire data processing under the management of ML Metadata.

Do you have a workaround or are completely blocked by this? :
Several workarounds exist: manually cleaning up the original data, writing Beam scripts to filter out bad data, or loading the data into a database and filtering it with SQL.
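As an illustration of the SQL workaround, the sketch below loads rows into an in-memory SQLite database and keeps only those that satisfy a filter condition. The table layout, column names, and the `filter_with_sql` helper are assumptions made for this example; the condition string is interpolated directly into the query, so it must come from a trusted source.

```python
import sqlite3
from typing import List, Optional, Sequence, Tuple

Row = Tuple[Optional[int], Optional[str]]


def filter_with_sql(rows: Sequence[Row], where_clause: str) -> List[Row]:
    """Load rows into an in-memory table and return those matching where_clause.

    where_clause is trusted SQL (hypothetical helper, no sanitisation).
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE examples (age INTEGER, country TEXT)")
        conn.executemany("INSERT INTO examples VALUES (?, ?)", rows)
        cursor = conn.execute(
            f"SELECT age, country FROM examples WHERE {where_clause} ORDER BY rowid"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Illustrative data and condition; the bounds mirror a hand-tuned schema.
raw_rows = [(34, "US"), (-1, "US"), (None, "IN"), (50, "IN")]
clean_rows = filter_with_sql(
    raw_rows, "age IS NOT NULL AND age BETWEEN 0 AND 120"
)
```

The drawback, and the motivation for this request, is that the filter condition lives outside the pipeline, so it duplicates the schema and is not tracked by ML Metadata.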

Name of your Organization (Optional)
GCP Partner (Intuitive.cloud)

Any Other info.
Filtering data would affect quantities such as the number of training steps. In my opinion, though, those should be more dynamic than they currently are in TFX anyway.
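For example, the step count could be derived from the number of examples that survive filtering instead of being hard-coded. The sketch below assumes that a final partial batch counts as one step in each epoch; `train_steps` is a hypothetical helper, not a TFX function.

```python
import math


def train_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Steps needed to see num_examples examples `epochs` times.

    Assumes a trailing partial batch still counts as one step per epoch.
    """
    if num_examples < 0:
        raise ValueError("num_examples must be non-negative")
    if batch_size <= 0 or epochs <= 0:
        raise ValueError("batch_size and epochs must be positive")
    return math.ceil(num_examples / batch_size) * epochs


# Illustrative numbers: 10,000 raw examples, 800 removed by filtering.
steps_before = train_steps(10_000, batch_size=32, epochs=3)  # 313 * 3 = 939
steps_after = train_steps(9_200, batch_size=32, epochs=3)    # 288 * 3 = 864
```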
