Great Expectations Case Study:
How Provectus uses GX to monitor data pipelines, validate accuracy, and produce observable test results
Combining GX with Allure and AWS services to run GX tests for every dataset
This case study describes how the data team at Provectus, an Artificial Intelligence consultancy and solutions provider, uses Great Expectations to monitor and fix data errors and reduce technical debt for data pipelines.
About Provectus
Provectus is an Artificial Intelligence (AI) consultancy and solutions provider, helping businesses achieve their objectives through AI. They focus on building Machine Learning (ML) Infrastructure that drives end-to-end AI transformations. They enable businesses to adopt the best AI use cases and scale their AI initiatives organization-wide across a wide variety of industries, including Healthcare, Life Sciences, Retail, CPG, Media, Entertainment, and Manufacturing.
The Challenge
As an AI consultancy, Provectus knows that clean, high-quality data is the lifeblood of any effective AI solution and any accurate ML model. They understand that data needs to be easily discoverable, manageable, observable, reliable, and secure. So, the data team at Provectus uses Great Expectations to tackle the many data quality assurance challenges they've faced.
While building AI solutions for its customers, Provectus collects data from different internal sources and aggregates it into internal data lakes. Management then uses this data to monitor key business indicators and make strategic and tactical decisions. One challenge in this process is that data coming from different sources typically contains multiple errors, ranging from duplicates and empty values to inputs in incorrect formats. Such errors affect end users, whether engineering teams or business users (business intelligence, marketing, sales, etc.).
Data engineers can identify and resolve these issues, but because data degrades over time, they must repeatedly revisit their fixes. This significantly increases the technical debt for data pipelines and taxes the team’s resources.
How the Provectus Team Uses Great Expectations
Improving the observability of test results
Provectus needs to closely monitor and observe both the data itself and the tests run on the data. To improve the observability of test results, they rely on reports from both Great Expectations and Allure. Allure, a well-established tool for test reporting, makes it easier to report test results from Great Expectations. It suits managers and non-technical professionals because it presents the results in a form they can work with comfortably. A self-written adapter converts Great Expectations results into Allure results, from which the reports are generated.
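The Provectus adapter itself is not shown in this case study. As a rough illustration of what such an adapter might do, the sketch below maps one Great Expectations expectation result (as a plain dictionary) onto an Allure-style test result. The function name `gx_result_to_allure` is hypothetical, the input shape follows the commonly seen GX validation result JSON (`expectation_config`, `success`, `result`), and the output field names follow the usual Allure result JSON layout; both should be checked against the versions in use.

```python
import json
import time
import uuid


def gx_result_to_allure(gx_result, suite_name):
    """Map one GX expectation result (a dict) onto an Allure-style result dict.

    Hypothetical helper: this is not the Provectus adapter and not part of
    either library's API.
    """
    config = gx_result.get("expectation_config") or {}
    # Older GX releases use "expectation_type"; newer ones use "type".
    expectation = config.get("expectation_type") or config.get("type", "unknown_expectation")
    kwargs = config.get("kwargs") or {}
    column = kwargs.get("column")

    # An expectation that raised an exception is "broken", not merely "failed".
    exception_info = gx_result.get("exception_info") or {}
    if exception_info.get("raised_exception"):
        status = "broken"
    elif gx_result.get("success"):
        status = "passed"
    else:
        status = "failed"

    now_ms = int(time.time() * 1000)
    return {
        "uuid": str(uuid.uuid4()),
        "name": f"{expectation} ({column})" if column else expectation,
        "status": status,
        "start": now_ms,
        "stop": now_ms,
        "labels": [{"name": "suite", "value": suite_name}],
        "parameters": [{"name": k, "value": str(v)} for k, v in kwargs.items()],
        # Keep the raw GX result payload so the report shows why a check failed.
        "statusDetails": {"message": json.dumps(gx_result.get("result") or {})},
    }


sample = {
    "success": False,
    "expectation_config": {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {"column": "id"},
    },
    "result": {"unexpected_count": 3},
}
allure_result = gx_result_to_allure(sample, suite_name="orders")
```

In a setup like the one described here, each converted result would typically be written to its own JSON file so that Allure can build a report from the directory.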
Monitoring of data pipelines
The efficiency of data pipelines depends heavily on their monitoring capabilities. To ensure that only clean, high-quality data passes through the pipeline, Provectus takes advantage of a combination of AWS services and Great Expectations.
The data is retrieved from available data sources using AWS Lambda and AWS Glue. At the same time, AWS Lambda runs Pandas Profiling and a GX test suite for every dataset, and the resulting reports are stored in Amazon S3 or served from it as a static website. The results from Great Expectations are converted into Allure results in AWS Lambda and stored in Amazon S3. Another AWS Lambda function uses the Allure results to generate Allure reports and sends them to a Slack channel to keep the data team updated on any errors. The metadata is also pushed to Amazon DynamoDB (or to Amazon S3 to reduce costs) so that Amazon Athena can query it quickly and efficiently. The data can then be explored in Amazon QuickSight, which makes it easy for business users to understand.
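The notification step at the end of this pipeline could look something like the following sketch. It only builds and posts a short text summary to a Slack incoming webhook; the function names, the message wording, and the `report_url` parameter are assumptions made for illustration, not details from the Provectus implementation.

```python
import json
import urllib.request


def build_slack_message(dataset, results, report_url):
    """Summarize GX expectation results (a list of dicts) as a Slack webhook payload."""
    failed = [r for r in results if not r.get("success")]
    status = "passed" if not failed else "FAILED"
    lines = [
        f"Data quality checks for `{dataset}`: {status} "
        f"({len(results) - len(failed)}/{len(results)} expectations passed)"
    ]
    # List each failed expectation so the team can see what broke at a glance.
    for r in failed:
        config = r.get("expectation_config") or {}
        name = config.get("expectation_type") or config.get("type", "unknown")
        lines.append(f"- {name} {config.get('kwargs') or {}}")
    lines.append(f"Report: {report_url}")
    return {"text": "\n".join(lines)}


def post_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Example payload; the URL is a placeholder and nothing is sent here.
message = build_slack_message(
    "orders",
    [
        {"success": True},
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_be_unique",
                "kwargs": {"column": "id"},
            },
        },
    ],
    "https://example.com/allure/orders",
)
```

Inside a Lambda function, `post_to_slack` would be called with a webhook URL read from configuration rather than hard-coded.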
Comparing producer and consumer data
To validate data accuracy, the Provectus data team runs profiling for the producer data source and generates tests on the fly. These tests are then run against the consumer database. The tests should pass on both databases.
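The idea of profiling the producer and replaying the generated tests on the consumer can be sketched in plain Python. This is a simplified stand-in, not the GX profiler: rows are dictionaries, only three kinds of checks are derived, and the check names merely borrow the names of the corresponding GX expectations. The helper names `profile_producer` and `run_checks` are hypothetical.

```python
def _is_number(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def profile_producer(rows, columns):
    """Derive simple checks from the producer rows: nullability, uniqueness, range."""
    checks = []
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        if len(present) == len(values):
            checks.append({"type": "expect_column_values_to_not_be_null", "column": col})
        if len(set(present)) == len(present):
            checks.append({"type": "expect_column_values_to_be_unique", "column": col})
        if present and all(_is_number(v) for v in present):
            checks.append({
                "type": "expect_column_values_to_be_between",
                "column": col,
                "min": min(present),
                "max": max(present),
            })
    return checks


def run_checks(rows, checks):
    """Run the generated checks against a set of rows and record success per check."""
    results = []
    for check in checks:
        values = [row.get(check["column"]) for row in rows]
        present = [v for v in values if v is not None]
        if check["type"] == "expect_column_values_to_not_be_null":
            ok = len(present) == len(values)
        elif check["type"] == "expect_column_values_to_be_unique":
            ok = len(set(present)) == len(present)
        elif check["type"] == "expect_column_values_to_be_between":
            ok = all(_is_number(v) and check["min"] <= v <= check["max"] for v in present)
        else:
            ok = False
        results.append({**check, "success": ok})
    return results


producer = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.5}]
consumer = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]

checks = profile_producer(producer, ["id", "amount"])
producer_results = run_checks(producer, checks)
consumer_results = run_checks(consumer, checks)
failed_on_consumer = [r for r in consumer_results if not r["success"]]
```

Here the consumer lost an `amount` value in transit, so the not-null check generated from the producer fails on the consumer side while every check still passes on the producer.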
The Provectus team reports that the biggest benefit of using Great Expectations has been the flexibility of the tool. By combining Great Expectations with Allure and AWS services, the team has been able to ensure data quality through extensive data tests, monitor data pipelines, and give both engineers and business users the ability to understand data and the insights that are generated from it.
Learn more about Provectus here
We would like to thank the Provectus team (and specifically Andrew Khakhariev, Bogdan Volodarskiy, and Aleksei Chumagin) for their support in creating this case study!
Wide variety of industries
Healthcare, Life Sciences, Retail, CPG, Media, Entertainment, Manufacturing, and more
Better observability
Reports from both GX and Allure using a self-written adapter
Improved data accuracy
Validation by profiling producer data and comparing it to consumer data
Components
- Allure
- AWS Lambda
- AWS Glue
- Pandas
- Amazon S3
- Amazon DynamoDB
- Amazon Athena
- Amazon QuickSight