Great Expectations Case Study:
How Vimeo uses GX to ensure data freshness and overcome their data quality issues
GX monitors data pipelines and validates data both pre- and post-ingestion
About Vimeo
Vimeo is one of the world's leading providers of HD streaming video, enabling over 200M professionals, teams, and organizations to create, collaborate, and communicate. It offers HD video hosting for video creators, as well as professional help for marketers creating and sharing promotional videos through Vimeo Create. The BI Engineering team at Vimeo works alongside other internal teams, such as CRM, finance, sales, and marketing analytics, to integrate streaming data from different advertising platforms and enable more accurate analytics. In doing so, the team deals with many of the typical issues data teams encounter when working with large amounts of data from a variety of sources. Here is how Vimeo uses Great Expectations to overcome its data quality challenges. If you’d like to learn more about the innovative work being done by the Engineering team at Vimeo, we recommend checking out the Vimeo Engineering Blog.
The Challenge
One of the major challenges the BI engineering team faced was detecting problems caused by service interruptions or upstream schema changes. In a typical scenario, such events would prevent data from being properly processed by the pipeline, leaving downstream tables with missing or out-of-date data. Without proper monitoring, the issue could go undetected, sometimes for days, until it surfaced as a broken report noticed by stakeholders. The team faced additional challenges during data pipeline migrations, which involved moving hundreds of DAGs while keeping the data up-to-date and consistent. The BI engineering team built an in-house validation framework to address some of these needs, but they still needed a more powerful tool: one that could validate data both pre- and post-ingestion, without depending on the data first being loaded into a database.
How the Vimeo Team Uses Great Expectations
Monitoring data pipelines that feed into the Data Warehouse
The data pipelines managed by the BI engineering team consume data from advertising platforms like Google and Facebook through Kafka. The data is extracted and transformed in Google Cloud using Airflow DAGs and Python, and loaded into a Snowflake Data Warehouse in 15-minute increments. Great Expectations is deployed as part of the Airflow DAGs on Google Cloud, where it monitors the data pipelines feeding into the Data Warehouse. If a problem prevents a DAG from consuming new data for longer than 24 hours, Great Expectations is configured to notify the team by email or Slack. As a result, the team can promptly reach out to the external team responsible for the issue, and alert stakeholders of any delays.
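To make the 24-hour freshness check concrete, here is a minimal plain-Python sketch of the logic described above. This is not the actual Great Expectations API; the function names and the `notify` callback are illustrative. In GX terms, the same idea is expressed as an expectation on the newest load timestamp, with failed validations routed to Slack or email.

```python
from datetime import datetime, timedelta, timezone

def is_fresh(latest_loaded_at: datetime,
             max_lag: timedelta = timedelta(hours=24)) -> bool:
    """Return True if the most recent load happened within the allowed lag."""
    return datetime.now(timezone.utc) - latest_loaded_at <= max_lag

def alert_if_stale(latest_loaded_at: datetime, notify) -> bool:
    """Check freshness and invoke the notify callback (e.g. a Slack or
    email sender) when the pipeline has gone stale. Returns the result."""
    fresh = is_fresh(latest_loaded_at)
    if not fresh:
        notify(f"No new data since {latest_loaded_at.isoformat()} "
               f"(threshold: 24 hours)")
    return fresh
```

Run inside the DAG after each 15-minute load, a check like this surfaces a stalled upstream feed within a day, rather than whenever a stakeholder notices a broken report.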
Ensuring data freshness during pipeline outages
Recently, Great Expectations helped the team during a large database modification that caused outages in a number of upstream pipelines. Great Expectations alerted the team through Slack that the threshold for percent-not-null values was not being met for specific tables, meaning they now had missing data. This allowed the team to investigate, uncover the underlying problem, and perform a fix before any dependent teams opened a ticket or alerted them of a data quality issue. One of the members of the BI engineering team was happy to report back about their use of Great Expectations: “I think that we can say with confidence that we have gotten great value out of Great Expectations. […] We are continuing to integrate Great Expectations into all of our new pipelines and will keep note of more incidents that we catch. Data freshness is a huge use case that Expectations has helped with.”
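The percent-not-null check that caught this incident corresponds to Great Expectations' `expect_column_values_to_not_be_null` expectation with its `mostly` threshold. As a rough sketch of the underlying semantics (plain Python with illustrative names, not GX's API):

```python
def fraction_not_null(values):
    """Fraction of values that are not None; 0.0 for an empty column."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)

def check_mostly_not_null(values, mostly=0.95):
    """Pass when at least `mostly` of the values are non-null,
    mirroring the `mostly` keyword on GX expectations."""
    observed = fraction_not_null(values)
    return {"success": observed >= mostly, "observed_fraction": observed}
```

A sudden drop in the observed fraction, as in the outage described above, flips `success` to False and triggers the alert.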
We would like to thank the BI engineering team at Vimeo (and specifically Evan Calzolaio) for their support in creating this case study!
- 200M: professionals, teams, and organizations use Vimeo
- Automatic alerting: Slack and email notifications
- Quick problem solving: GX helped them perform a fix before dependent teams opened a ticket
Components
- Kafka
- Google Cloud
- Airflow DAGs
- Python
- Snowflake