Create a diagram of the validation process using Great Expectations #175

Open
james-westwood opened this issue Jan 15, 2025 · 0 comments
james-westwood commented Jan 15, 2025

The task is to produce a diagram of the validation process, using Microsoft Whiteboard, for inclusion in the documentation.

This will help other developers and users understand the process.

Create a process box for each step, and show any (example) data inputs and any outputs. Also show a decision box for any conditional steps. Inside each process box, write a very brief summary of what happens in that process.

Use this whiteboard as an example and use the same key.

Process

Schema Creation and Schema Validation Process

1. Schema Creation (manual)

This is a manual process to create a TOML file that details the data source, all of its column names, their data types, and any constraints on the data.

2. Schema validation

This process is carried out by toml_schema_validator.py and validates the schema itself against various rules.
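The issue does not list the rules that are checked, so the following is only a sketch of the kind of check this step might perform. The function `validate_schema`, the `ALLOWED_TYPES` set, and the schema keys are all hypothetical and are not taken from toml_schema_validator.py.

```python
# Hypothetical rule set; toml_schema_validator.py may check different things.
ALLOWED_TYPES = {"int", "float", "str", "bool", "datetime"}


def validate_schema(schema: dict) -> list[str]:
    """Return a list of problems found; an empty list means the schema passed."""
    columns = schema.get("columns")
    if not isinstance(columns, dict) or not columns:
        return ["schema must define at least one column"]

    problems = []
    for name, spec in columns.items():
        # Rule 1 (assumed): every column declares a recognised data type.
        data_type = spec.get("data_type")
        if data_type not in ALLOWED_TYPES:
            problems.append(f"{name}: unknown data_type {data_type!r}")

        # Rule 2 (assumed): a declared range must not be inverted.
        low, high = spec.get("min_value"), spec.get("max_value")
        if low is not None and high is not None and low > high:
            problems.append(f"{name}: min_value {low} exceeds max_value {high}")
    return problems
```

Returning a list of problems rather than raising on the first one gives the diagram a natural decision box: an empty list continues to data validation, a non-empty list sends the schema back for manual correction.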

Data Validation Process

3. Data Source Connection

Establish a connection to your data source. This could be a Pandas DataFrame, a Spark DataFrame, a CSV file, a database, or other supported data sources. Great Expectations provides various methods for connecting to different data sources.

4. Expectation Suite Creation/Loading

Either create a new Expectation Suite or load an existing one. An Expectation Suite is a collection of expectations about your data, typically stored as a JSON file. If creating a new suite, you might use create_expectation_suite_from_toml from the GreatExpectationDataFrameValidator class.
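As a rough illustration of how a schema could be turned into expectations, the sketch below maps each column to plain dictionaries. It is not the implementation of create_expectation_suite_from_toml, whose signature is not shown in this issue. The helper `expectations_from_schema`, the schema keys, and the dictionary layout are assumptions; only the expectation names (`expect_column_to_exist`, `expect_column_values_to_not_be_null`, `expect_column_values_to_be_between`) are real Great Expectations expectations.

```python
def expectations_from_schema(schema: dict) -> list[dict]:
    """Map a parsed schema (hypothetical layout) to expectation descriptions."""
    expectations = []
    for name, spec in schema["columns"].items():
        # Every declared column is expected to be present.
        expectations.append(
            {"type": "expect_column_to_exist", "kwargs": {"column": name}}
        )
        # Columns marked non-nullable must not contain missing values.
        if not spec.get("nullable", True):
            expectations.append(
                {
                    "type": "expect_column_values_to_not_be_null",
                    "kwargs": {"column": name},
                }
            )
        # A declared range becomes a between-check.
        if "min_value" in spec or "max_value" in spec:
            expectations.append(
                {
                    "type": "expect_column_values_to_be_between",
                    "kwargs": {
                        "column": name,
                        "min_value": spec.get("min_value"),
                        "max_value": spec.get("max_value"),
                    },
                }
            )
    return expectations


# Usage with a small hypothetical schema.
example = {"columns": {"age": {"data_type": "int", "nullable": False, "min_value": 0}}}
suite = expectations_from_schema(example)
```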

5. Data Asset Definition

Define a Data Asset, which represents the data you want to validate. This typically involves specifying the data source connection and any relevant parameters (e.g., table name, file path).

6. Validation

This is the core step. Great Expectations takes the Data Asset and runs it against the expectations defined in the Expectation Suite. This produces a Validation Result object.

7. Validation Result

The Validation Result contains detailed information about the validation process. It includes whether each expectation passed or failed, along with statistics and other diagnostic information. The results are typically available as a JSON object.
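For the diagram's example output, it may help to see how such a result could be summarised once it is available as a plain dictionary. The dictionary shape below (`success`, `results`, `expectation_config`, `type`) is an assumption loosely modelled on the JSON form of a Validation Result; the exact key names vary between Great Expectations versions, so check them against the version the project uses. `summarize_result` is a hypothetical helper.

```python
def summarize_result(result: dict) -> dict:
    """Condense a validation result dict (assumed shape) into a short summary."""
    # Collect the expectation type of every entry that did not pass.
    failed = [
        entry["expectation_config"]["type"]
        for entry in result["results"]
        if not entry["success"]
    ]
    return {
        "success": result["success"],
        "evaluated": len(result["results"]),
        "failed": failed,
    }


# Example result in the assumed shape, with one passing and one failing check.
example_result = {
    "success": False,
    "results": [
        {"success": True, "expectation_config": {"type": "expect_column_to_exist"}},
        {
            "success": False,
            "expectation_config": {"type": "expect_column_values_to_not_be_null"},
        },
    ],
}
summary = summarize_result(example_result)
```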

8. Result Interpretation/Actions

Programmatically access and interpret the Validation Results. Common actions include:
- Logging: Record the validation outcomes.
- Reporting: Generate reports summarizing data quality.
- Alerting: Trigger notifications (e.g., email, Slack) based on validation failures.
- Conditional Workflows: Use validation results to determine subsequent steps in a data pipeline (e.g., proceed with processing if data quality is acceptable, otherwise halt or reroute the pipeline).
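The logging and conditional-workflow actions above can be sketched as a single decision function, which corresponds to a decision box in the diagram. The function `next_step`, the `max_failures` threshold, and the result dictionary shape are assumptions for illustration and are not part of Great Expectations or this project.

```python
import logging

logger = logging.getLogger("validation")


def next_step(result: dict, max_failures: int = 0) -> str:
    """Return "proceed" or "halt" based on how many expectations failed."""
    # Count entries that did not pass (result shape is assumed).
    failures = sum(1 for entry in result["results"] if not entry["success"])

    if failures <= max_failures:
        logger.info("Validation acceptable: %d failure(s)", failures)
        return "proceed"

    logger.error("Validation failed: %d failure(s), halting pipeline", failures)
    return "halt"


# Usage: one failing expectation halts by default but passes with a tolerance of 1.
example_result = {"results": [{"success": True}, {"success": False}]}
strict_decision = next_step(example_result)
lenient_decision = next_step(example_result, max_failures=1)
```

Alerting would hook into the "halt" branch, for example by sending a notification wherever the error is logged.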

9. Checkpoint (optional but recommended)

Checkpoints automate the validation process by storing configuration details (Data Asset location, Expectation Suite) and optionally persisting validation results. They streamline repeated validations.
