Create a diagram of the validation process using Great Expectations #175

james-westwood · 2025-01-15T17:49:34Z

The task is to produce a diagram, using Microsoft Whiteboard, that goes into the documentation.

This will help other developers and users understand the process.

Create a process box for each step, and show any (example) data inputs and outputs. Add a decision box for each conditional step. Inside each process box, write a very brief summary of what happens in that process.

Use this whiteboard as an example and use the same key.

Process

Schema Creation and Schema Validation Process

1. Schema Creation (manual)

This is a manual process to create a TOML file that describes the data source, all of its column names, their data types, and the constraints on that data.

2. Schema validation

This process is carried out by toml_schema_validator.py and validates the schema itself against various rules.
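The kind of checks this step could perform can be sketched as follows. This is not the code in toml_schema_validator.py: the rules, the allowed type names, and the schema keys are all assumptions chosen to illustrate what "validating the schema itself" could mean in the diagram.

```python
# Assumed set of type names; the real validator may accept others.
ALLOWED_TYPES = {"int", "float", "str", "bool", "datetime"}


def validate_schema(schema: dict) -> list[str]:
    """Return a list of problems found in a parsed schema (empty if none)."""
    errors = []

    # Hypothetical rule 1: the data source must be named.
    if not schema.get("source", {}).get("name"):
        errors.append("source.name is missing")

    # Hypothetical rule 2: at least one column must be defined.
    columns = schema.get("columns", {})
    if not columns:
        errors.append("no columns defined")

    for name, spec in columns.items():
        # Hypothetical rule 3: every column needs a known data type.
        data_type = spec.get("data_type")
        if data_type not in ALLOWED_TYPES:
            errors.append(f"{name}: unsupported data_type {data_type!r}")

        # Hypothetical rule 4: range constraints must be consistent.
        low, high = spec.get("min_value"), spec.get("max_value")
        if low is not None and high is not None and low > high:
            errors.append(f"{name}: min_value exceeds max_value")

    return errors


good = {"source": {"name": "customers"},
        "columns": {"age": {"data_type": "int", "min_value": 0, "max_value": 120}}}
bad = {"source": {},
       "columns": {"age": {"data_type": "integer", "min_value": 5, "max_value": 1}}}

print(validate_schema(good))
print(validate_schema(bad))
```

Returning a list of messages rather than raising on the first problem lets the diagram show a single decision box after this step: an empty list means continue, anything else means fix the schema.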

Data Validation Process

3. Data Source Connection

Establish a connection to your data source. This could be a Pandas DataFrame, a Spark DataFrame, a CSV file, a database, or other supported data sources. Great Expectations provides various methods for connecting to different data sources.

4. Expectation Suite Creation/Loading

Either create a new Expectation Suite or load an existing one. An Expectation Suite is a JSON file containing expectations about your data. If creating a new suite, you might use create_expectation_suite_from_toml from the GreatExpectationDataFrameValidator class.
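To show what the output of this step could look like in the diagram, here is a rough sketch that turns a parsed schema into a suite-shaped dictionary and serialises it as JSON. It is not the implementation of create_expectation_suite_from_toml. The schema keys are hypothetical, and the JSON key names (`name`, `type`, `kwargs`) are an assumption that varies between Great Expectations releases. The expectation names used are standard Great Expectations expectations.

```python
import json


def build_suite(schema: dict, suite_name: str) -> dict:
    """Map hypothetical schema constraints onto expectation entries."""
    expectations = []
    for column, spec in schema["columns"].items():
        # Every declared column is expected to be present.
        expectations.append(
            {"type": "expect_column_to_exist", "kwargs": {"column": column}}
        )
        # nullable = false becomes a not-null expectation.
        if not spec.get("nullable", True):
            expectations.append(
                {"type": "expect_column_values_to_not_be_null",
                 "kwargs": {"column": column}}
            )
        # unique = true becomes a uniqueness expectation.
        if spec.get("unique", False):
            expectations.append(
                {"type": "expect_column_values_to_be_unique",
                 "kwargs": {"column": column}}
            )
        # min_value / max_value become a range expectation.
        if "min_value" in spec or "max_value" in spec:
            expectations.append(
                {"type": "expect_column_values_to_be_between",
                 "kwargs": {"column": column,
                            "min_value": spec.get("min_value"),
                            "max_value": spec.get("max_value")}}
            )
    return {"name": suite_name, "expectations": expectations}


example_schema = {
    "columns": {
        "customer_id": {"data_type": "int", "nullable": False, "unique": True},
        "age": {"data_type": "int", "min_value": 0, "max_value": 120},
    }
}

suite = build_suite(example_schema, "customers_suite")
print(json.dumps(suite, indent=2))
```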

5. Data Asset Definition

Define a Data Asset, which represents the data you want to validate. This typically involves specifying the data source connection and any relevant parameters (e.g., table name, file path).

6. Validation

This is the core step. Great Expectations takes the Data Asset and runs it against the expectations defined in the Expectation Suite. This produces a Validation Result object.

7. Validation Result

The Validation Result contains detailed information about the validation process. It includes whether each expectation passed or failed, along with statistics and other diagnostic information. The results are typically available as a JSON object.

8. Result Interpretation/Actions

Programmatically access and interpret the Validation Results. Common actions include:
- Logging: Record the validation outcomes.
- Reporting: Generate reports summarizing data quality.
- Alerting: Trigger notifications (e.g., email, Slack) based on validation failures.
- Conditional Workflows: Use validation results to determine subsequent steps in a data pipeline (e.g., proceed with processing if data quality is acceptable, otherwise halt or reroute the pipeline).
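The logging and conditional-workflow actions above can be sketched as follows, which may help when drawing the decision box for this step. The result dictionary is a hand-written stand-in, not real library output. The keys it uses (`success`, `results`, `expectation_config`) reflect the general shape of a Great Expectations validation result, but exact key names differ between versions, so treat them as assumptions. The function names are hypothetical.

```python
import logging

logger = logging.getLogger("validation")

# Hand-written stand-in for a validation result; not real library output.
EXAMPLE_RESULT = {
    "success": False,
    "results": [
        {"success": True,
         "expectation_config": {"type": "expect_column_to_exist",
                                "kwargs": {"column": "age"}}},
        {"success": False,
         "expectation_config": {"type": "expect_column_values_to_not_be_null",
                                "kwargs": {"column": "age"}}},
    ],
}


def failed_expectations(result: dict) -> list[str]:
    """Collect a short label for every expectation that did not pass."""
    failures = []
    for item in result.get("results", []):
        if not item.get("success", False):
            config = item.get("expectation_config", {})
            # Some releases name this key "expectation_type" instead of "type".
            name = config.get("type") or config.get("expectation_type", "unknown")
            column = config.get("kwargs", {}).get("column", "?")
            failures.append(f"{name}({column})")
    return failures


def next_step(result: dict) -> str:
    """Decide whether the pipeline should proceed or halt."""
    if result.get("success", False):
        logger.info("Validation passed")
        return "proceed"
    for failure in failed_expectations(result):
        logger.warning("Failed expectation: %s", failure)
    return "halt"


print(next_step(EXAMPLE_RESULT))
```

Keeping the decision in one small function makes it easy to swap "halt" for a reroute or an alert without touching the code that reads the results.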

9. Checkpoint (optional but recommended)

Checkpoints automate the validation process by storing configuration details (Data Asset location, Expectation Suite) and optionally persisting validation results. They streamline repeated validations.

james-westwood mentioned this issue Jan 15, 2025

Create Issues to match all validation steps #176

Open