Great Expectations Case Study:
How Calm uses GX to create data quality alerts and avert critical data issues in Airflow DAGs
Semi-automatic creation of Expectation Suites, Airflow operator for validation
About Calm
Calm is the #1 app for sleep, meditation, and relaxation. Over 75 million people have downloaded the app, and it continues to add 100,000 new downloads a day. Data is absolutely vital to everything Calm does, whether that's optimizing products through A/B testing or modeling content recommendations to create the best experience for users. Data at Calm also supports user acquisition and advertising, and has been crucial in helping Calm's business team bring the product to more businesses.
The Calm data engineering team is building and supporting a state-of-the-art data system using SQS, Firehose, Spark, Redshift, and Airflow. It runs in the cloud on AWS, is deployed via Docker and Kubernetes, and has a codebase written in Go, Python, and SQL. The team manages pipelines that handle data coming from the app, third parties, and other teams at the company, with its primary customer being Calm's data science team.
The Challenge
Prior to implementing Great Expectations, data validation was mostly done in an ad hoc manner and only covered certain aspects, such as checking table row counts, but it wasn't comprehensive enough to catch some severe issues. One of the key challenges Calm's data engineering team faced was the late detection of data quality issues: stakeholders would spot problems before the data engineering team was aware of them, which meant that data problems immediately turned into fires.
How Calm Uses Great Expectations
Semi-automated creation of Expectation Suites
Calm was one of the early adopters of Great Expectations, which they tightly integrated into their data pipelines to semi-automatically create Expectations and validate the data. They build Expectation Suites using a custom script: whenever the team adds a new table to their pipelines, the script checks the table's DDL and, based on the data type of each column, creates a JSON file of all "default Expectations" for that data type. An engineer or data scientist then reviews the generated suite, manually adds any missing, content-specific Expectations, and checks the resulting JSON file into version control. A sketch of this approach is shown below.
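The following is a minimal sketch of this kind of default-suite generation, not Calm's actual script. The column names, the table name, and the type-to-Expectation mapping are illustrative assumptions; only the suite JSON layout and the Expectation names follow standard Great Expectations conventions.

```python
"""Sketch: generate a "default" Expectation Suite from a table's column types."""
import json

# Illustrative default Expectations per (Redshift-style) column data type.
DEFAULT_EXPECTATIONS = {
    "integer": [("expect_column_values_to_not_be_null", {})],
    "varchar": [
        ("expect_column_values_to_not_be_null", {}),
        ("expect_column_value_lengths_to_be_between", {"min_value": 1, "max_value": 65535}),
    ],
    "timestamp": [("expect_column_values_to_not_be_null", {})],
}


def build_default_suite(table_name, columns):
    """Build a JSON-serializable Expectation Suite from {column: data_type} pairs."""
    expectations = [
        {
            "expectation_type": "expect_table_columns_to_match_ordered_list",
            "kwargs": {"column_list": list(columns)},
        }
    ]
    for column, dtype in columns.items():
        for expectation_type, extra_kwargs in DEFAULT_EXPECTATIONS.get(dtype, []):
            expectations.append(
                {"expectation_type": expectation_type,
                 "kwargs": {"column": column, **extra_kwargs}}
            )
    return {"expectation_suite_name": f"{table_name}.default", "expectations": expectations}


if __name__ == "__main__":
    # Column names and types as they might be read from a new table's DDL.
    columns = {"user_id": "integer", "session_type": "varchar", "created_at": "timestamp"}
    suite = build_default_suite("app_sessions", columns)
    # The generated file is what an engineer or data scientist reviews, extends
    # with content-specific Expectations, and checks into version control.
    with open("app_sessions.default.json", "w") as f:
        json.dump(suite, f, indent=2)
```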
Validation in Airflow pipelines
On the validation side, the team uses a Great Expectations Airflow operator they built that validates a staging version of each table against those Expectation Suites and responds based on the validation result and the severity level of the checks: if validation passes, the Airflow DAG simply continues to run; if a "warning" level Expectation Suite fails, the DAG continues but sends a warning to Slack; and if a "critical" suite fails, the Airflow task fails and the team is notified. The data science and data engineering team members then review those notifications and decide how to rectify the issue. A sketch of such an operator is shown below.
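Below is a minimal sketch of a severity-aware validation operator, not Calm's actual code. The run_validation function stands in for the Great Expectations validation call (whose exact API depends on the GX version in use), and the Slack webhook URL, operator name, and parameters are illustrative assumptions.

```python
"""Sketch: an Airflow operator that validates a staging table and reacts by severity."""
import json
import urllib.request

from airflow.exceptions import AirflowException
from airflow.models import BaseOperator

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # hypothetical webhook


def run_validation(staging_table, suite_name):
    """Placeholder for the Great Expectations validation call.

    In practice this would load the suite JSON from version control, validate
    the staging table against it, and return whether all Expectations passed.
    """
    raise NotImplementedError


def notify_slack(message):
    """Post a simple text message to a Slack incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request)


class ValidateStagingTableOperator(BaseOperator):
    """Validate a staging table and react based on the suite's severity level."""

    def __init__(self, staging_table, suite_name, severity="warning", **kwargs):
        super().__init__(**kwargs)
        self.staging_table = staging_table
        self.suite_name = suite_name
        self.severity = severity  # "warning" or "critical"

    def execute(self, context):
        success = run_validation(self.staging_table, self.suite_name)
        if success:
            return  # validation passed, the DAG continues normally

        message = f"Validation failed: {self.suite_name} on {self.staging_table}"
        notify_slack(message)
        if self.severity == "critical":
            # Critical failures stop the pipeline before bad data is promoted.
            raise AirflowException(message)
        # Warning-level failures only notify; the DAG keeps running.
```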
The data engineering team at Calm reports that the biggest benefit of Great Expectations is simply knowing about data issues before stakeholders do, which in turn provides a better experience for their stakeholders.
“Our stakeholders know that we’ll be ahead of data quality issues, and can be assured their decisions are based on accurate data, because we can add an Expectation for it and they won’t have to deal with it again!”
We would like to thank the data engineering team at Calm (and specifically Kamla Kasichainula) for their support in creating this case study!
75 million
Total app downloads
100,000
Downloads per day
#1
App for sleep, meditation, and relaxation
Components
- SQS
- Firehose
- Spark
- Redshift
- Airflow
- AWS
- Docker
- Kubernetes
- Go
- Python
- SQL