The task is to produce a diagram, using Microsoft Whiteboard, that goes into the documentation.
This will help other developers and users understand the process.
Create a process box for each step, show any (example) data inputs, and any outputs. Also show a decision box for any Conditional steps. Inside the process box, write a very brief summary of what's happening in that process.
Use this whiteboard as an example and use the same key.
Schema Creation and Schema Validation Process
1. Schema Creation (manual)
This is a manual process to create a `toml` file that details the data source, all of its column names, data types, and the constraints on that data.
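The layout of the schema file is project-specific; the sketch below shows an assumed, minimal layout (the section names, column names, and constraint keys are illustrative, not the project's actual format) and how such a file can be parsed in Python.

```python
# Minimal sketch of an assumed schema layout; the real format is defined by the project.
import tomllib  # standard library in Python 3.11+

EXAMPLE_SCHEMA = """
[data_source]
name = "customers"
format = "csv"

[columns.customer_id]
dtype = "int64"
nullable = false
unique = true

[columns.signup_date]
dtype = "datetime64[ns]"
nullable = false
"""

schema = tomllib.loads(EXAMPLE_SCHEMA)
print(schema["columns"]["customer_id"]["dtype"])  # -> int64
```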
2. Schema Validation
This process is carried out by `toml_schema_validator.py` and validates the schema itself against various rules.
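The concrete rules are defined in `toml_schema_validator.py` and are not reproduced here; the sketch below only illustrates the kind of structural checks such a validator might apply, assuming the schema layout from the previous example.

```python
# Illustrative checks only; the real rules live in toml_schema_validator.py.
ALLOWED_DTYPES = {"int64", "float64", "object", "bool", "datetime64[ns]"}

def validate_schema(schema: dict) -> list[str]:
    """Return a list of human-readable rule violations (empty list = valid)."""
    errors = []
    for name, spec in schema.get("columns", {}).items():
        if "dtype" not in spec:
            errors.append(f"column '{name}' is missing a dtype")
        elif spec["dtype"] not in ALLOWED_DTYPES:
            errors.append(f"column '{name}' has unsupported dtype '{spec['dtype']}'")
    return errors

print(validate_schema(schema))  # -> [] for the example schema above
```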
Data Validation Process
3. Data Source Connection
Establish a connection to your data source. This could be a Pandas DataFrame, a Spark DataFrame, a CSV file, a database, or other supported data sources. Great Expectations provides various methods for connecting to different data sources.
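As a concrete illustration, a pandas DataFrame can be registered as a data source. The exact method names differ between Great Expectations releases (e.g. `context.sources` vs. `context.data_sources`); the sketch below assumes the 0.17-style fluent API.

```python
import great_expectations as gx
import pandas as pd

# Example data to validate (illustrative).
df = pd.DataFrame(
    {"customer_id": [1, 2, 3], "signup_date": pd.to_datetime(["2024-01-01"] * 3)}
)

context = gx.get_context()
# Register an in-memory pandas data source (method name assumes the 0.17-style API).
datasource = context.sources.add_pandas(name="customers_source")
```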
4. Expectation Suite Creation/Loading
Either create a new Expectation Suite or load an existing one. An Expectation Suite is a JSON file containing expectations about your data. If creating a new suite, you might use `create_expectation_suite_from_toml` from the `GreatExpectationDataFrameValidator` class.
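A minimal sketch of creating a new suite by name, again assuming the 0.17-style API. The project's `create_expectation_suite_from_toml` helper is shown only as commented, hypothetical usage because its exact signature isn't documented here.

```python
# Create (or update) an empty suite by name; expectations are added to it later.
suite = context.add_or_update_expectation_suite(expectation_suite_name="customers_suite")

# Hypothetical usage of the project's helper (class and signature assumed, not confirmed):
# validator_helper = GreatExpectationDataFrameValidator()
# suite = validator_helper.create_expectation_suite_from_toml("schemas/customers.toml")
```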
5. Data Asset Definition
Define a Data Asset, which represents the data you want to validate. This typically involves specifying the data source connection and any relevant parameters (e.g., table name, file path).
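Continuing the pandas example, a DataFrame asset and a batch request pointing at the concrete DataFrame might be defined like this (0.17-style API assumed):

```python
# The asset names the data to validate; the batch request binds it to a concrete DataFrame.
asset = datasource.add_dataframe_asset(name="customers_asset")
batch_request = asset.build_batch_request(dataframe=df)
```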
6. Validation
This is the core step: Great Expectations takes the Data Asset and runs it against the expectations defined in the Expectation Suite, producing a Validation Result object.
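A sketch of running the suite against the batch defined above (0.17-style API assumed); the single expectation added here is purely illustrative:

```python
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="customers_suite",
)
validator.expect_column_values_to_not_be_null("customer_id")  # illustrative expectation
results = validator.validate()  # returns an ExpectationSuiteValidationResult
```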
7. Validation Result
The Validation Result contains detailed information about the validation process. It includes whether each expectation passed or failed, along with statistics and other diagnostic information. The results are typically available as a JSON object.
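For example, the result object from the previous step can be inspected directly or serialised to a plain dictionary for logging or persistence:

```python
print(results.success)            # overall pass/fail for the whole suite
print(results.statistics)         # counts of evaluated/successful expectations
as_json = results.to_json_dict()  # plain dict, safe to log or store
```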
8. Result Interpretation/Actions
Programmatically access and interpret the Validation Results. Common actions include:
- Logging: Record the validation outcomes.
- Reporting: Generate reports summarizing data quality.
- Alerting: Trigger notifications (e.g., email, Slack) based on validation failures.
- Conditional Workflows: Use validation results to determine subsequent steps in a data pipeline (e.g., proceed with processing if data quality is acceptable, otherwise halt or reroute the pipeline); see the sketch after this list.
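A minimal sketch of such a conditional workflow, building on the `results` object from step 7; the downstream calls are placeholders, not real functions:

```python
if results.success:
    print("Data quality acceptable; continuing pipeline")
    # run_downstream_processing(df)   # placeholder for the next pipeline step
else:
    failed = [r for r in results.results if not r.success]
    print(f"{len(failed)} expectation(s) failed; halting pipeline")
    # send_alert(failed)              # placeholder for alerting (email, Slack, etc.)
```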
9. Checkpoint (optional but recommended)
Checkpoints automate the validation process by storing configuration details (Data Asset location, Expectation Suite) and optionally persisting validation results. They streamline repeated validations.
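A sketch of defining and running a checkpoint, assuming the 0.17-style API (checkpoint configuration changed in later releases):

```python
checkpoint = context.add_or_update_checkpoint(
    name="customers_checkpoint",
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": "customers_suite",
        }
    ],
)
checkpoint_result = checkpoint.run()
print(checkpoint_result.success)
```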