This is the second article in our series on integrating Great Expectations (GX) into the data pipeline. In the first article, we explored the role of data validation in pipelines, where validation steps can be integrated, and what aspects of the data should be validated.
In this post, we’ll introduce how GX can support you in executing data validation throughout your pipeline.
What are GX Core and GX Cloud?
GX Core and GX Cloud are the two major components of the GX platform.
The central concept of the GX platform is the Expectation: a verifiable assertion about data. Expectations are a clear and straightforward way for data teams to validate their data, similar to unit tests in software development.
GX Core is the world’s most popular open source data quality framework. It provides a declarative framework for defining Expectations using human-readable Python methods and enables data validation across sources such as Pandas, Spark, and SQL databases. GX Core also generates comprehensive data quality reports and documentation.
GX Cloud is a SaaS platform powered by GX Core. It enables quick and efficient implementation of a robust data quality framework, unifying your data testing in one place. By creating a shared, reliable source of data quality information, GX Cloud improves transparency across your organization. It also fosters collaboration across nontechnical and technical stakeholders by making Expectations available without code, in readable plain language via a central UI.
GX Core and GX Cloud are complementary and can be used individually or together to introduce data validation into your organization’s pipelines. GX Cloud creates a no-code, collaborative portal through which you can easily create data validation workflows. GX Core can extend the capabilities of GX Cloud to implement custom validation workflows, or it can be used to deploy your data validation workflows programmatically.
Think of the GX platform as a car. GX Core is the engine: a powerful machine, but one that needs additional components to get you anywhere. GX Cloud is the rest of the car: everything you need to actually go places, harnessing the engine's power to make it happen.
How the GX platform supports pipeline data validation
Integration with modern data pipeline stacks
Popular tools for building a data pipeline infrastructure include open source orchestration options like Airflow, Prefect, and Dagster, as well as cloud platforms like AWS Glue, Azure Data Factory, and Google Cloud Dataflow.
These tools and most other modern options support customization using Python. Since GX Core is a Python library, it can be easily integrated into Python-enabled data pipeline tooling. With GX Core triggering data validation and programmatically handling the results, data quality test results can easily be sent downstream and support actions like stopping pipeline execution, identifying and fixing bad rows, quarantining or discarding bad rows from a dataset, and sending alerts.
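As a minimal sketch of what that programmatic handling might look like: the dictionary below mirrors the general shape of a GX Core validation result (a top-level `success` flag plus per-Expectation results), and the alerting print statement stands in for whatever notification or quarantine infrastructure your pipeline actually uses.

```python
# Sketch: acting on a validation result inside a pipeline step.
# The dict mirrors the general shape of a GX Core validation result
# (a "success" flag plus per-Expectation results); the print call is
# a hypothetical stand-in for real alerting (Slack, PagerDuty, etc.).

def handle_validation_result(result: dict) -> bool:
    """Return True if the pipeline may continue, False to halt it."""
    if result["success"]:
        return True
    for expectation_result in result["results"]:
        if not expectation_result["success"]:
            # e.g. route the failing Expectation to your alerting channel
            print(f"ALERT: {expectation_result['expectation_type']} failed")
    return False

# Example: a failed null check halts the pipeline.
result = {
    "success": False,
    "results": [
        {"expectation_type": "expect_column_values_to_not_be_null",
         "success": False},
    ],
}
proceed = handle_validation_result(result)
```

An orchestrator task can branch on the returned flag: continue the DAG when `True`, or fail the task (and stop downstream steps) when `False`.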
Human-readable data tests
The GX platform offers a collection of built-in, preconfigured Expectations for testing across the critical data quality dimensions.
GX Core defines Expectations declaratively, through human-readable Python methods, making them easy to create and interpret.
For example, here is an Expectation that verifies that the values in column `name` are not null:
```python
expectation = gx.expectations.ExpectColumnValuesToNotBeNull(column="name")
```
GX Cloud offers an intuitive no-code interface for creating and customizing Expectations:
Different business domains, organizations, and business units often have unique or custom requirements for their data quality testing.
In the GX platform, you can use row conditions to apply Expectations selectively within a dataset, and define completely custom Expectations using SQL. GX Core additionally allows data teams to define custom Expectation classes, while GX Cloud provides the ability to use dynamic parameters that set an Expectation’s parameters relative to previous results.
Data quality enforcement throughout the pipeline
Most data pipelines have hotspots that are particularly vulnerable to data degradation, including:
During data ingestion
Immediately before and after transformations
Immediately before the consumption layer
Using the GX platform to add data validation to the pipeline, data teams can prevent low-quality data from reaching and affecting downstream consumers.
Expectations can validate data across a variety of critical data dimensions:
Schema
Missingness
Set, numerical, and distribution
Cardinality and data volume
Scheduled data validations
While validating data in motion within the pipeline is valuable, it’s important to complement it with recurring validation of tables at rest in a database or data warehouse. Validating data at rest catches errors like row count differences and issues with numeric distribution or uniqueness that can be more difficult to detect in moving data.
Using GX Cloud, data teams can schedule data validation runs on a recurring basis from within its UI. This is a lightweight and easily accessible way to implement automated data validation that takes a minimal amount of time to set up.
GX Core can also validate data on a schedule by integrating with third-party orchestration tools such as Airflow. This allows data teams total control over their recurring data validation.
Data validation results over time
While snapshot-in-time data validation results are useful for at-the-moment alerting, it’s also important to be able to see how your data’s quality has changed (or not) over time. The GX platform makes it easy to get both types of test results.
GX Core delivers data quality test results through Data Docs, which express Expectations, their results, and other GX metadata in human-readable language on static web pages hosted by your organization. Data Docs create a history of your organization's data quality, and they can easily be made widely available, since they can be hosted like any other webpage on the service of your choice.
GX Cloud evolves the functionality of Data Docs to be accessible and interactive through its UI. You can isolate the results of individual runs, or view validation results as a historical timeline.
Advantages of GX Cloud
Most data teams are pressed for time, short on people power, or both. For teams that need to accelerate the process of implementing, building, and maintaining a data quality solution, GX Cloud offers multiple compelling benefits that streamline this process:
Fast, simple setup that can be done entirely in the UI. Connect data, create Expectations, and begin running tests without typing a single line of code.
Enhanced collaborative capabilities make it possible for nontechnical teams to engage meaningfully with the data quality process, preventing communications breakdowns and fostering organizational trust in the process.
Direct and secure data connections using read-only and encrypted methods mean that GX Cloud has a fast, seamless setup that's SOC 2-compliant. Alternate deployment patterns are also available for teams that need them.
Tests and results in plain language, paired with a clear and intuitive UI, make it easy for data team members and stakeholders to build a shared understanding of data quality and collaborate effectively.
Conclusion
The GX platform offers you powerful capabilities for maintaining high data quality across your organization. Whether you prefer the flexibility and control of GX Core or want to leverage the streamlined setup and collaborative environment of GX Cloud—or both—you can use the GX platform to ensure that your data is validated, documented, and reliable at every stage of the pipeline.
In other entries of this series, we’ll dive deeper into the specifics of using GX for data validation, and explore practical, hands-on examples and best practices to help you get the most out of this powerful platform.
Thanks to Bruno Gonzalez for his contributions to this article.