
Why we don't do transformations for Expectations (and when we do)

In the V3 API, we are deprecating the arguments parse_strings_as_datetimes and allow_cross_type_comparisons.

Nathan Farmer
December 01, 2021

TL;DR: In the V3 API, we are deprecating the arguments parse_strings_as_datetimes and allow_cross_type_comparisons. We want to encourage transformations like these to be positioned in the ETL processes of your data pipeline, not the data validation processes.

Recently, there has been quite a bit of conversation in the community and internally about parameters such as parse_strings_as_datetimes and allow_cross_type_comparisons. This conversation has led us to try to appreciate why users are advocating for their inclusion in the V3 API, and to rethink what exactly these parameters are meant to accomplish in the battle against pipeline debt.

A Tale of Two Transformation Types

It is helpful to differentiate between two types of transformations: Metric Transformations and Input Transformations. Consider a typical data pipeline scenario, as seen in the figure below.

[Figure: flow chart of a typical data pipeline]

In this example, various Data Assets have been produced as the product of extract, transform, load (ETL) processes. Your organization wants to know about the quality of each of the Data Assets.

Great Expectations uses Metrics to evaluate an Expectation and produce Validation Results. Metrics can be almost anything that is important for understanding data quality: they could be things like the mean value of numeric columns, or the number of values in a column that do not match a particular regular expression. To build those Metrics, we sometimes need to transform raw data, for example by computing a histogram. A “Metric Transformation” is a type of calculation that is core to the way we communicate our mental models about the shape of the data. It is part of the Metric itself.

[Figure: flow chart from Data Asset to Expectation Validation]

Examples of Metric Transformations are:

  • Rolling averages
  • Binning for histograms
  • Minimum, maximum, and standard deviation
  • etc.
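
To make the binning example concrete, here is a minimal sketch of the kind of calculation a histogram Metric performs, using NumPy directly rather than Great Expectations, with made-up values. The point is that the transformation is inseparable from the Metric: without binning there is no histogram to reason about.

import numpy as np

# Made-up column values, standing in for a numeric column of a Data Asset.
delivery_weights = np.array([1.2, 3.4, 2.2, 9.8, 4.1, 5.5, 2.9])

# Binning the raw values is the Metric Transformation; the counts and bin
# edges together are the histogram Metric we actually reason about.
counts, bin_edges = np.histogram(delivery_weights, bins=4)
print(counts)      # how many values fall in each bin
print(bin_edges)   # the bin boundaries derived from the data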

“But if transformations are core to Metrics and Expectations, why aren’t all types of transformations acceptable?”

There is a class of transformation that we will refer to as “Input Transformations,” which can be an insidious force working against the objective of measuring data quality. An Input Transformation does not seek to validate a Data Asset as-is, but instead bends the Data Asset into a new shape it never inhabited prior to the validation stage. Applying this force to the Data Asset produces an Ephemeral Data Asset that lives in memory only until validation is complete. In other words, once an Input Transformation is applied to a Data Asset, the quality of that asset is no longer being measured.

[Figure: flow chart from Data Asset, through an Input Transformation, to Expectation Validation]

Let’s take the example of “stripping timestamps from datetimes” and ask ourselves, “why wouldn’t this be considered a Metric Transformation?” Another angle from which to view the distinction is to ask yourself whether the transformation changes the intent of the Expectation itself. In the case of the Expectation expect_column_pair_values_A_to_be_greater_than_B, stripping the timestamp from either or both columns results in an entirely different comparison than would be made if the grammar of the Expectation were taken literally.

  30 Sep 2021 12:30pm UTC is greater than 30 Sep 2021 12:00pm UTC

However

  30 Sep 2021 is not greater than 30 Sep 2021

So, instead of an Input Transformation stripping timestamps from datetimes, imagine deciding to treat this as a Metric Transformation. We could do that by creating the following new Expectation: expect_column_pair_values_truncated_date_A_greater_than_truncated_date_B.

The name of the new Expectation is much more explicit about what exactly you expect about the dataset. This verbosity enables teams to interpret each other’s Expectations in a straightforward way, and makes it easier to integrate post-validation actions that depend on which Expectation was validated. The name makes clear that it relies on a new Metric, “truncated date,” rather than on transforming data before applying an existing Metric.
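
As a concrete sketch of the difference, compare the two calls below. Assume validator is a Great Expectations Validator for the table in question; the column names are made up, and the second call assumes that a custom Expectation with the name above has been implemented and registered, so treat this as an illustration rather than a drop-in snippet.

# Input Transformation style (V2-era): the deprecated parameter quietly
# reshapes the raw string data before the comparison is made.
validator.expect_column_pair_values_A_to_be_greater_than_B(
    column_A="shipped_at",            # made-up column names
    column_B="ordered_at",
    parse_strings_as_datetimes=True,  # deprecated in the V3 API
)

# Metric Transformation style: the truncation is part of the Metric, and the
# Expectation's name says exactly what is being asserted (assumes the custom
# Expectation above has been implemented and registered).
validator.expect_column_pair_values_truncated_date_A_greater_than_truncated_date_B(
    column_A="shipped_at",
    column_B="ordered_at",
)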

From here, you can start to see how using Input Transformations is not that different from performing an additional step in the ETL process. The Ephemeral Data Asset you are actually validating exists only as an input to an Expectation, the Data Asset that feeds your downstream applications is not truly even being tested, and the language used in the name of the Expectation does not accurately describe what it is you expect about the data.

Well, how did we get here?

The fact that parameters like parse_strings_as_datetimes and allow_cross_type_comparisons exist shows that the core Great Expectations team has not always agreed on how to approach this issue. We know that users often lack the permissions to change the tables that they are validating due to dependencies on those tables in other areas of the organization. This means, for example, that if a table doesn’t conform to the data types you need for validation, an ETL process must be undertaken first. Unfortunately, performing ETL on a table that is the single source of truth (SSOT) may defeat the purpose of testing, because applications use the SSOT as an input, not the transformed data you created.

In fact, our example shows that it’s usually possible to translate an Input Transformation into a Metric Transformation, or vice versa. Instead of viewing that semantic equivalence as a weakness, we’ve come to view it as a strength: explicit is the name of the game in data quality. And there’s a simple rule of thumb: if you’re about to add a new parameter to an Expectation, pause. Let the name of the Expectation carry the real meaning, and avoid parameters that transform data.

How does malformed data end up in production tables?

Organizational silos are not always aligned on who bears the cost of addressing pipeline debt that affects more than one team. This misalignment creates perverse incentives, under which changes to legacy processes can surface data quality issues. It is helpful to understand how something like this might happen with a real-world example. It’s based on a real case, but we’ve removed the names.

Back in 2009, The Company had various vendors who made deliveries for them. These vendors were required to provide a data element for each delivery known as delivery_number. Since The Company had many different vendors who all had their own way to identify a delivery, it was determined that in the production table delivery_number would be of data type VARCHAR. This would allow for letters and special characters in the field provided by the vendors.

Fast forward to 2012, and The Company has decided to standardize delivery_numbers across all vendors. This is primarily to increase trust in the data by ensuring the validity of delivery_number, as it will now be generated by The Company, not external sources. The new delivery_number will be an integer that increases over time. This would typically be added as a new field on the master table, but unfortunately, three years later, dozens of downstream applications rely on the original delivery_number field. In order to avoid making many costly changes to downstream applications, the leadership team decides to cut over to the “new style” delivery_number on December 21st, 2012 (the Mayan Apocalypse).

[Image: Mayan calendar, captioned “pipeline debt eliminated!”]

Now we have a field where we expect the values to be increasing integers, but the data type is still VARCHAR. Furthermore, all new employees without tribal knowledge of the “old style” delivery_numbers will instinctively create new production tables that have a delivery_number with data type INTEGER. Suddenly, we find ourselves wanting to do things like the following (illustrated in the sketch after this list):

  • Compare a VARCHAR with an INTEGER
  • Check that an increasing integer kept in a field with data type VARCHAR is in fact increasing
  • Check that a datetime kept in a field with data type STRING is greater than some reference datetime
  • etc.
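
To see why the second check in the list above is awkward to express on the raw column, here is a minimal pandas sketch with made-up delivery_number values; it is not Great Expectations code, just an illustration of the underlying problem.

import pandas as pd

# Made-up "new style" delivery_numbers, still stored as strings (VARCHAR).
deliveries = pd.DataFrame({"delivery_number": ["8", "9", "10", "11"]})

# String comparison is lexicographic, so "9" > "10" and the check fails
# even though the underlying integers are increasing.
print(deliveries["delivery_number"].is_monotonic_increasing)              # False

# Only after casting to integers does the check match our mental model.
print(deliveries["delivery_number"].astype(int).is_monotonic_increasing)  # True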

What to do when our mental model doesn’t align with an Expectation

Instead of treating malformed data like something to be “handled” (read: worked around) during the validation process, our aim should be to identify that data and take action to address it. What actions can be taken?

Add or Propose an Addition to the Grammar of Expectations

In the previous example, the VARCHAR field delivery_number could be validated with an Expectation called expect_column_values_to_be_string_integers_increasing. We acknowledge that using a new Expectation is functionally identical to using an existing Expectation with a parameter. The reasoning behind leaning into the former strategy is to embed this functional logic into the “grammar” of Expectations. Having Expectations with precise naming allows us to provide clearer translations for Expectations between systems and presents a more transparent code path for what your data quality assertions really are. Given this new path forward, we are happy to introduce a new Expectation, expect_column_values_to_be_string_integers_increasing, in our library today.
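
As a sketch of how that reads in practice, assuming validator is a Great Expectations Validator for the deliveries table and that this experimental Expectation is installed and registered in your environment (availability and exact signature depend on your version):

# The Expectation's name states the assertion directly: the column holds
# string-encoded integers whose values are increasing. No input-transforming
# parameters are needed.
validator.expect_column_values_to_be_string_integers_increasing(
    column="delivery_number",
)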

Why not produce different Expectation Suites for different time ranges of Batches?

We like this option! It is perfectly acceptable to have Expectations about the data that change over time. Your Expectations may be valid or not, depending on which batch they are applied to. Identifying a point in time when Expectations should change may be a key insight that your team needs in order to consume the data downstream.

If possible, this option is actually preferable to using the new Expectation.
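
For instance, here is a minimal sketch of that idea; the suite names, the helper function, and the routing logic are all made up for illustration, with the cutover date taken from the story above.

import datetime

# Hypothetical cutover date separating "old style" and "new style" batches.
CUTOVER = datetime.date(2012, 12, 21)

def suite_name_for_batch(batch_date: datetime.date) -> str:
    # Route each batch to the Expectation Suite that matches its era.
    if batch_date < CUTOVER:
        return "deliveries.vendor_style_delivery_numbers"
    return "deliveries.company_style_delivery_numbers"

print(suite_name_for_batch(datetime.date(2011, 6, 1)))   # vendor-era suite
print(suite_name_for_batch(datetime.date(2013, 1, 15)))  # company-era suite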


Produce a Failed Validation

One phrase that is regularly heard across diverse organizations is “if you see something, say something.” Encountering data that seems to go against your mental model is an opportunity to course correct. If you aren’t empowered to change a production table, producing Data Docs with failed validations may just get the attention of someone with the resources to implement the change. As we discussed in the conversation about Input Transformations, you can avoid producing a failed validation by producing a new dataset and validating that instead. This should only be done if the new dataset will be the SSOT for other downstream systems; otherwise you are validating data that is never used.

Don’t forget that your Expectations might actually be incorrect

In the vast majority of cases, the person who is validating the data does not have all of the context regarding why the data looks like it does. Expectations are meant to be just that - what we expect. We often encounter facts in life that are regarded as unintuitive, yet objectively correct. Those facts reshape how we expect the world to behave moving forward.


So long, Input Transformations

It is tempting to say that options like allow_cross_type_comparisons and parse_strings_as_datetimes put a band-aid on a data quality wound, but that analogy does not hold up. Using Input Transformations like these is more akin to dressing the wound with make-up and hoping infection doesn’t creep in. Having data with the wrong data type is a data quality issue that needs to be measured. Having Great Expectations transform data for an Expectation is, in essence, creating a new Ephemeral Data Asset that was never stored in the Datasource. We need to measure the quality of our data at every stage of the data pipeline, and that means Expectation input data should be raw; otherwise the raw form of the data will tend to go unmeasured.
