When Great Expectations tests my data, does it move it? Copy it? Change it?
We get these questions a lot, and the short answer is: no! Not at all.
In fact, GX actively avoids moving, downloading, copying, or altering the data it tests.
In this post, we’ll answer some of the most common questions we get about this.
Performance: GX does not require downloading or moving your data
We understand why people wonder about this.
But GX only ever uses resources you already control. Whenever possible, that means working in place, in the location where the data is already stored.
Here are some specifics of how that manifests for different Execution Engines.
Spark Data Source behavior
GX works with Spark by using Spark-native functions and building Spark queries, which it executes against the data in your Spark instance.
GX uses Spark’s native methods for accessing data, whether from a cloud storage bucket or a metastore or catalog on Databricks. Compute always happens somewhere you already control, and the data is not removed from its raw storage location during compute.
With Spark, GX pushes compute back to your Spark cluster whenever possible. If it’s not possible, GX will perform the compute on the machine local to your GX installation.
Summary: most compute needed for a GX test takes place in your Spark cluster, with only some final metadata computation happening on the machine local to your GX installation.
Pandas Data Source behavior
GX works with Pandas by leveraging the Pandas tools that operate on your data locally.
GX uses Pandas’ native methods for accessing data, whether from a cloud storage bucket or a local filesystem.
Throughout compute, your original data stays wherever you decided to put it.
For Pandas, the compute location is always the machine local to your GX installation. This is the standard behavior of a Pandas DataFrame.
Summary: with Pandas, all data and compute involved in GX testing stay on the machine local to your GX installation.
SQL Data Source behavior
GX works with SQL by using SQLAlchemy to build and execute queries against your data in your DB.
Compute is pushed back to your SQL DB whenever possible. If necessary, GX will temporarily persist some data to the machine local to your GX installation as part of calculating some customized metrics.
Those are the only two places where SQL Data Source data and compute will be. GX does not move the data from your SQL DB to a Pandas DataFrame. We aren’t sure where this idea is coming from, but it’s somewhat common, and it’s wrong.
Summary: with SQL, all data and compute involved in GX testing stay in your SQL DB and (only if needed) on the machine local to your GX installation.
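As a rough illustration of what pushing compute back to the database means, the sketch below runs a null-check aggregate inside the database and brings back only two integers. It is not GX's implementation: GX builds its queries with SQLAlchemy, while this sketch uses Python's built-in sqlite3 module as a stand-in for your SQL DB. The orders table, the customer_id column, and the count_unexpected_nulls helper are hypothetical names chosen for the example.

```python
import sqlite3


def count_unexpected_nulls(conn, table, column):
    """Run the aggregate inside the database and return only two integers."""
    # Identifiers are interpolated for brevity; this sketch assumes trusted names.
    query = (
        f"SELECT COUNT(*), "
        f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) "
        f"FROM {table}"
    )
    total, unexpected = conn.execute(query).fetchone()
    # SUM over an empty table yields NULL, so fall back to 0.
    return total, unexpected or 0


# In-memory database standing in for your SQL DB (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10), (2, None), (3, 12), (4, None)],
)

total, unexpected = count_unexpected_nulls(conn, "orders", "customer_id")
print(total, unexpected)
```

The rows themselves never leave the database in this sketch; only the two counts cross over to the machine running the check.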
Metadata
To construct your Validation Results and Data Docs, an Expectation typically has to use the metadata it created for some kind of last-mile operation. For example: to calculate the percentage of unexpected values, the Expectation uses the count of unexpected values and the count of total values.
These last-mile operations happen on the machine local to your GX installation by necessity.
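To make the last-mile idea concrete, here is a minimal sketch of the percentage example: the two counts arrive from the backend as plain integers, and the only local work is a division. The function name and dictionary keys are illustrative assumptions, not GX's internal code.

```python
from typing import Optional


def unexpected_percent(unexpected_count: int, total_count: int) -> Optional[float]:
    """Combine two backend-computed counts into a percentage, locally."""
    if total_count == 0:
        return None  # nothing was evaluated, so a percentage is undefined
    return 100.0 * unexpected_count / total_count


# Counts like these would come back from the backend; no rows are involved.
summary = {
    "unexpected_count": 2,
    "total_count": 4,
    "unexpected_percent": unexpected_percent(2, 4),
}
print(summary)
```

Nothing in this step needs access to the underlying rows, which is why it can safely run wherever GX is installed.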
Calculations a specific backend doesn’t support
The core Expectations can all be carried out as described above. Some of the experimental Expectations rely on complex calculations that may not be natively supported by a given backend.
In this circumstance, the data being operated on might be temporarily brought to the machine local to your GX installation. GX then completes the calculation there, using Python-native objects, numpy arrays, or Pandas DataFrames, depending on the preference of the community member who contributed the Expectation. When the calculation is complete, the temporary data is deleted.
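The fallback pattern can be sketched as follows. Assume, purely for the sake of the example, that the backend has no native median function: the sketch pulls only the one column it needs, computes the median with Python's statistics module, and then discards the temporary copy. It uses sqlite3 as a stand-in backend, and the readings table, value column, and local_median helper are hypothetical.

```python
import sqlite3
import statistics


def local_median(conn, table, column):
    """Fetch one column temporarily, compute the median locally, then discard."""
    # Identifiers are interpolated for brevity; this sketch assumes trusted names.
    query = f"SELECT {column} FROM {table} WHERE {column} IS NOT NULL"
    values = [row[0] for row in conn.execute(query)]
    result = statistics.median(values) if values else None
    values.clear()  # drop the temporary local copy once the result exists
    return result


# In-memory database standing in for your backend (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (value INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?)",
    [(3,), (1,), (4,), (None,), (1,), (5,)],
)

median = local_median(conn, "readings", "value")
print(median)
```

Only the single column required for the calculation is brought over, and the source table is read but never written.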
How does that work with GX Cloud?
GX Cloud is a managed SaaS product, but it is similar to GX Core (described above) in behavior.
GX Cloud manages your Expectations’ configuration and metadata about your environment, including Data Source information. But GX Cloud does not host your data, and does not copy your data or Data Sources to a GX-controlled machine or environment. Your primary data (that is, the data you are testing) is never passed through or persisted to any GX system.
Similarly, GX Cloud pushes compute to your environment for both performance and security reasons. GX Cloud users run an agent or use a GX-hosted runner to orchestrate compute.
Because GX Cloud displays your Validation Results and Data Docs in its user interface, GX Cloud does maintain and store the metadata used to generate them. This includes table and column names, as well as other metadata that is generated from your data in the course of creating your Expectation Suites and validating your data.
Summary: GX Cloud hosts Expectation Suites, Data Docs, and metadata (some of which you can choose to keep out of the GX-hosted environment). Your actual data and compute for GX Cloud stay where described above for the relevant Execution Engine.
GX does not change your data
GX never modifies your data in situ.
If you’re using a SQL DB, GX will create temporary tables in it to enable certain Expectations and improve performance. Typically, these tables are released after Validation or within 24 hours of their creation. They aren’t required, so you can disable them.
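For readers unfamiliar with temporary tables, the general pattern looks roughly like this: an intermediate result is materialized once inside the database, queried, and dropped, while the source table is left untouched. This is a generic sketch using sqlite3, not the SQL that GX emits; the events table and the tmp_positive name are hypothetical.

```python
import sqlite3

# In-memory database standing in for your SQL DB (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 5.0), (2, -1.0), (3, 7.5)],
)

# Materialize an intermediate result once, inside the database.
conn.execute(
    "CREATE TEMP TABLE tmp_positive AS "
    "SELECT id, amount FROM events WHERE amount > 0"
)
positive_count = conn.execute("SELECT COUNT(*) FROM tmp_positive").fetchone()[0]

# Remove the temporary table when the check is done.
conn.execute("DROP TABLE tmp_positive")

# The source table still holds every original row.
source_rows = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(positive_count, source_rows)
```

The temporary table lives and dies inside the database, so the original rows are neither moved nor modified.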
This blog post has more information about why we don’t do data transformations as part of Expectations. Since its publication in 2021, we’ve actually taken this idea even further: we no longer allow runtime querying of your data as described in that post. You now have to define your assets with a query if needed.
Summary
Great Expectations doesn’t download or move your data, host your compute outside of your systems, or modify your data.
In fact, GX is purposefully engineered so it can test your data without removing it from your systems. Everything possible happens in the location where your data is actually being stored. Everything else takes place on the machine local to your GX installation.
No matter which deployment of GX you use, you can be confident that your data never leaves your control. With GX Core, no GX-hosted or -controlled system ever stores, copies, or computes on your data. With GX Cloud, only metadata (some of which you can configure) is stored on the GX-hosted platform.
If you had any questions about how GX interacts with your data, we hope this answers them!
You can reach the GX team via our Discourse forum or community Slack.
This post is an update of an earlier version.