Regular data quality testing is a great practice for any organization, but not all tests are created equal.
Choosing the right testing framework will help you create a data quality program that produces real, actionable insights. And it helps you build a data quality culture that includes and is accessible to everyone at your organization – not just data engineers.
Here are the three key factors – at Great Expectations, we call them the three E’s – to look for when evaluating data quality tests and their frameworks, and why GX’s Expectations embody them all.
Factor #1: the first E
The first factor to evaluate is the language and concepts that a framework allows you to use for tests. You should be able to tell your tests exactly what to check, no matter how precise the requirement.
If one column only needs to be at least 80% populated but another needs to be completely full, your test should be able to tell you whether those criteria are met.
The test should also be compatible with concepts and language you and other stakeholders already understand. You should not have to redefine simple things like ‘the column,’ and you should be able to use layperson-friendly concepts and phrasings like ‘there should be between 1 and 1.2 million rows.’
Critically, this test framework also needs to be composable – you should be able to combine simple, precise tests to evaluate complex concepts.
If a test does all this, it has the first E of great data quality testing: expressiveness.
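The criteria above can be sketched in plain Python. This is an illustrative sketch, not GX's actual API: the helper names (`column_fill_rate`, `check_mostly_filled`, `check_row_count_between`) are hypothetical, and frameworks like GX express the same ideas declaratively instead.

```python
# Hypothetical helpers sketching expressive, composable data quality checks.
# Rows are plain dictionaries for simplicity.

def column_fill_rate(rows, column):
    """Fraction of rows where `column` is present and non-null."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(column) is not None)
    return filled / len(rows)

def check_mostly_filled(rows, column, mostly):
    """Pass if at least `mostly` (e.g. 0.8) of the column is populated."""
    return column_fill_rate(rows, column) >= mostly

def check_row_count_between(rows, min_rows, max_rows):
    """Pass if the row count falls within the expected range."""
    return min_rows <= len(rows) <= max_rows

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
    {"id": 4, "email": "d@example.com"},
    {"id": 5, "email": "e@example.com"},
]

# Compose simple, precise checks into one larger assertion.
results = {
    "id fully populated": check_mostly_filled(rows, "id", 1.0),
    "email at least 80% populated": check_mostly_filled(rows, "email", 0.8),
    "row count in range": check_row_count_between(rows, 1, 1_200_000),
}
print(all(results.values()))  # → True
```

Because each check is small and precise, combining them is just building a dictionary of named results, which is the composability the factor calls for.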
Why expressiveness?
The expressiveness factor ensures that tests are rooted in domain knowledge.
For example, a machine learning model should be trained on data that is representative of the population it will encounter in the wild. Good tests will allow you to check against population characteristics like age and gender distribution with specificity.
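A representativeness check like that could look roughly like the following sketch. The bucket boundaries, reference shares, and tolerance are illustrative assumptions, not values from any specific framework.

```python
# Hedged sketch: check that a training sample's age distribution is close to
# a reference population's, within a tolerance. Helper names are hypothetical.

def distribution(values, buckets):
    """Share of values falling into each labeled bucket."""
    counts = {label: 0 for label in buckets}
    for v in values:
        for label, (lo, hi) in buckets.items():
            if lo <= v < hi:
                counts[label] += 1
                break
    total = len(values) or 1
    return {label: c / total for label, c in counts.items()}

def check_representative(sample, reference_shares, buckets, tolerance=0.05):
    """Pass if each bucket's share is within `tolerance` of the reference."""
    shares = distribution(sample, buckets)
    return all(abs(shares[b] - reference_shares[b]) <= tolerance
               for b in buckets)

buckets = {"18-34": (18, 35), "35-54": (35, 55), "55+": (55, 120)}
reference = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}
ages = [22] * 30 + [45] * 36 + [60] * 34  # 30% / 36% / 34%

print(check_representative(ages, reference, buckets))  # → True
```

The domain knowledge (which buckets matter, what the population looks like) lives in the test parameters, which is exactly what rooting tests in domain knowledge means in practice.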
Factor #2: the second E
The next thing to look for is how easy it is to understand the results of a test, as well as what led to those results.
The inner workings of a testing framework and its tests should be accessible, both in the sense that stakeholders from outside the engineering team should be able to understand them and that you, as a data engineer, should literally be able to look inside and see their internal mechanisms.
A good litmus test is to show the test to a stakeholder who has never seen it before and see if they understand what the test is checking for and how it determines a pass or fail. If this person can immediately understand the test’s behavior, it’s a keeper.
If a test does all this, it has the second E of great data quality testing: explicitness.
Why explicitness?
Useful tests will make assertions about the way the data is generated, bringing in external knowledge instead of relying just on the data itself.
A common scenario in which explicitness is useful is checking against the possible values for content types to make sure you are observing all the kinds of data that you expect—and only those kinds.
For example, it’s a good idea to check the specific actions that users take in a SaaS application to make sure that a data type hasn’t been added or removed without your team knowing it. This step is especially important when working with a new type of data or a new event-type code, because extra context frequently needs to be conveyed alongside it: why was it added, and does it change the meaning of the other events?
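An explicit version of that event-type check might look like this sketch. The event names and the helper are hypothetical stand-ins; the point is that anyone reading the function can see exactly what it asserts and why it passes or fails.

```python
# Illustrative "explicit" check: event types observed in the data must come
# from a known set, and every expected type must actually appear.

EXPECTED_EVENT_TYPES = {"login", "logout", "export", "invite"}

def check_event_types(events):
    """Return (unknown_types, missing_types) so a stakeholder can see
    exactly which event types were added or removed unexpectedly."""
    observed = {e["type"] for e in events}
    unknown = observed - EXPECTED_EVENT_TYPES   # types added without our knowledge
    missing = EXPECTED_EVENT_TYPES - observed   # types that silently disappeared
    return unknown, missing

events = [{"type": "login"}, {"type": "export"}, {"type": "2fa_challenge"}]
unknown, missing = check_event_types(events)
print(sorted(unknown))  # the unrecognized new event type
print(sorted(missing))  # expected types absent from this batch
```

Both failure modes are surfaced by name rather than as a bare pass/fail, so a reviewer outside the engineering team can tell what changed in the data.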
Factor #3: the third E
Finally, no data quality testing framework is of much value unless implementing it and its tests is smooth and easy.
Tests of the same kind should be defined and presented in the same way so you can share, compare, and combine them with minimal effort and no confusion. That lets you focus on what the tests mean instead of how to get them working in your environment.
The mechanisms for expressing and running them should be independent of, but interoperable with, the other parts of your data stack. They should also be efficient: if a test is defined relative to the data itself rather than to a specific backend, you can reuse it to test data from multiple sources, as long as the data is all of the same type.
If your tests do all this, they have the third E of great data quality testing: extensibility.
Why extensibility?
You should be able to take action based on the results of your tests.
For example, a test failure should be able to stop a pipeline run from loading bad data into production or send an alert to a team that can investigate.
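Wiring test results into pipeline control flow can be sketched like this. `send_alert`, `DataQualityError`, and the inline check are hypothetical stand-ins for your own notification and orchestration tooling.

```python
# Sketch: gate a pipeline's load step on a data quality check, so a failure
# stops bad data from reaching production and alerts a team to investigate.

class DataQualityError(Exception):
    pass

def send_alert(message):
    # Stand-in for a Slack/PagerDuty/email integration.
    print(f"ALERT: {message}")

def run_pipeline(rows, load):
    # The quality check runs before the load and aborts it on failure.
    null_ids = [r for r in rows if r.get("id") is None]
    if null_ids:
        send_alert(f"{len(null_ids)} rows with null id; load aborted")
        raise DataQualityError("id column must be fully populated")
    load(rows)

loaded = []
run_pipeline([{"id": 1}, {"id": 2}], loaded.extend)
print(len(loaded))  # → 2 (both rows loaded when the check passes)
```

The same failure signal can drive either action from the paragraph above: raising the exception halts the run, and the alert gives a human the context to investigate.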
You should also be able to let other data stakeholders add their own knowledge to the tests so the tests can encompass everything your organization knows about the dataset.
Both of these actions need your test framework to be comparable, efficient, and interoperable with your data stack and workflows.
What you get when you have all three E’s
Here are the benefits of using a data quality test framework and tests that embody the three E’s of expressiveness, explicitness, and extensibility:
Precision
You can be highly specific in the aspect or aspects of your data you choose to test, thereby minimizing the kind of ambiguity that can skew results.
Flexibility
Data testing needs are rarely static. A good test is one that you can adjust as your needs change.
Continuity
When your tests are accessible and easy to understand, they serve as a shared repository of knowledge that stakeholders from any team can count on to help them advance their goals.
Expectations embody all three E’s of data quality testing. They’re declarative statements that set clear, specific parameters around what aspect of your data you’re testing. They’re written in plain language, so you don’t have to be a data engineer to understand them. And implementing them couldn’t be easier with GX Cloud.
Try GX Cloud to take advantage of Expectations at your organization.