Great Expectations Case Study:
How HEINEKEN uses GX to provide instant data quality validation and feedback to upstream data providers
GX integrated into a data uploading tool to validate Pandas dataframes
About HEINEKEN and the Global Analytics Team
HEINEKEN is one of the leading brewing companies in the world. Led by the Heineken® brand, the group has a portfolio of more than 300 international, regional, local and specialty beers and ciders. It employs over 85,000 employees and operates breweries, malteries, cider plants and other production facilities in more than 70 countries.
Global Analytics is part of the HEINEKEN Digital & Technology function. It supports HEINEKEN’s operating companies by developing tools, creating machine learning models, setting guidelines, and functioning as a “Centre of Excellence” for data analytics. The team focuses on building scalable products that can be implemented many times, after first proving the value of each use case through experimentation and feasibility studies.
The Challenge
During the experimentation phase, the Global Analytics team deals with a variety of data from different internal and external sources, including the company’s Enterprise Resource Planning system, sales data, marketing and media data, and even weather data. With HEINEKEN being a federated company, the team integrates data coming from the regional operating companies in a multitude of ways, such as SQL access to an on-prem database, flat files that are uploaded on a regular basis, or via an API for weather data. Once a new approach to analyzing the data has proven valuable through experimentation, they then develop production data pipelines to scale these efforts globally.
The Global Analytics team deals with many of the typical issues data teams encounter when integrating data from a variety of sources: faulty formatting, incorrect data types, duplicated rows, and other issues ranging from obvious flaws to hard-to-detect problems such as shifted distributions of values. To deal with these inconsistencies, individual engineers on the team implemented their own validation rules, which in turn led to the same validation being written in many different ways.
“Everyone has different ways of implementing the same validation, that just isn’t very efficient!”
(Madelon Hulsebos, Data Scientist at HEINEKEN)
How HEINEKEN uses Great Expectations
The Global Analytics team decided to use Great Expectations to validate their incoming data and to standardize how validation was done across different engineers and datasets, so that the same tests were no longer re-implemented by different people. Even though they considered the learning curve of Great Expectations to be fairly steep, they decided to invest in deploying it in their pipelines based on the functionality it provides, as well as the fact that it is an open-source library.
The group working with the company’s commerce and marketing data implemented Great Expectations in a particularly novel way: they integrated it into a data uploading tool, which upstream data providers use to upload Excel files that are fed into machine learning models. The tool creates a Pandas dataframe from the uploaded data and runs validation on it using Great Expectations. The tool then instantly gives the upstream data provider feedback based on the validation result and accepts or rejects the uploaded file. This is a great example of how Great Expectations can be easily integrated into other tools to provide data validation, outside of the typical data pipeline use cases we usually see.
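The accept-or-reject flow described above can be sketched in a few lines. This is a minimal illustration and not HEINEKEN’s implementation: it uses plain Pandas checks as stand-ins for Great Expectations expectations (the comments name the expectation each check is analogous to), and the column names, rules, and the validate_upload function are all hypothetical. In a real upload tool the dataframe would typically come from pandas.read_excel on the uploaded file.

```python
import pandas as pd

# Hypothetical schema for an uploaded sheet; these column names are
# illustrative and do not come from the case study.
REQUIRED_COLUMNS = ["country", "week", "volume_hl"]


def validate_upload(df: pd.DataFrame) -> dict:
    """Return an accept/reject decision plus feedback for the uploader."""
    problems = []

    # Stop early if the sheet does not have the expected columns.
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"Missing columns: {', '.join(missing)}")
        return {"accepted": False, "problems": problems}

    # Analogous to expect_column_values_to_not_be_null.
    null_rows = int(df["country"].isna().sum())
    if null_rows:
        problems.append(f"{null_rows} row(s) have no country")

    # Analogous to expect_column_values_to_be_between with min_value=0;
    # values that cannot be parsed as numbers are also counted.
    volume = pd.to_numeric(df["volume_hl"], errors="coerce")
    bad_volume = int((volume.isna() | (volume < 0)).sum())
    if bad_volume:
        problems.append(f"{bad_volume} row(s) have an invalid volume_hl")

    # Fully duplicated rows, one of the issues named in the case study.
    dupes = int(df.duplicated().sum())
    if dupes:
        problems.append(f"{dupes} duplicated row(s)")

    return {"accepted": not problems, "problems": problems}


good = pd.DataFrame(
    {"country": ["NL", "MX"], "week": [1, 1], "volume_hl": [120.5, 80.0]}
)
bad = pd.DataFrame(
    {"country": ["NL", None, "NL"], "week": [1, 2, 1], "volume_hl": [120.5, -3.0, 120.5]}
)

good_result = validate_upload(good)
bad_result = validate_upload(bad)
print(good_result)
print(bad_result)
```

Returning a list of plain-language problems, rather than only a pass/fail flag, is what lets the tool give the data provider feedback they can act on before re-uploading.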
For experimentation, the team is also exploring a way to provide data validation with Great Expectations through an API that allows a user to send a data file and an Expectation Suite JSON file and receive back a validation result and observed values.
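As a rough sketch of what such a service might do with each request, the function below takes the text of a data file and of a suite file and returns an overall result plus per-expectation details. Everything here is an assumption for illustration: handle_request and the two small check functions are hypothetical and implemented in plain Pandas rather than by calling Great Expectations, and the suite layout (an "expectations" list whose items have "expectation_type" and "kwargs") loosely follows the JSON written by pre-1.0 releases of the library, so the key names may differ in other versions.

```python
import io
import json

import pandas as pd


def _not_null(df, column):
    # Stand-in for expect_column_values_to_not_be_null.
    unexpected = int(df[column].isna().sum())
    return unexpected == 0, {"unexpected_count": unexpected}


def _max_between(df, column, min_value=None, max_value=None):
    # Stand-in for expect_column_max_to_be_between.
    observed = float(df[column].max())
    ok = (min_value is None or observed >= min_value) and (
        max_value is None or observed <= max_value
    )
    return ok, {"observed_value": observed}


# Maps expectation names found in the suite to the local stand-in checks.
CHECKS = {
    "expect_column_values_to_not_be_null": _not_null,
    "expect_column_max_to_be_between": _max_between,
}


def handle_request(csv_text: str, suite_json: str) -> dict:
    """Validate CSV text against a suite and return results as a dict."""
    df = pd.read_csv(io.StringIO(csv_text))
    suite = json.loads(suite_json)

    results = []
    for expectation in suite["expectations"]:
        check = CHECKS[expectation["expectation_type"]]
        success, detail = check(df, **expectation["kwargs"])
        results.append(
            {
                "expectation_type": expectation["expectation_type"],
                "success": success,
                "result": detail,
            }
        )
    return {"success": all(r["success"] for r in results), "results": results}


csv_text = "country,volume_hl\nNL,120.5\nMX,80.0\n"
suite_json = json.dumps(
    {
        "expectations": [
            {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "country"},
            },
            {
                "expectation_type": "expect_column_max_to_be_between",
                "kwargs": {"column": "volume_hl", "min_value": 0, "max_value": 100},
            },
        ]
    }
)

response = handle_request(csv_text, suite_json)
print(json.dumps(response, indent=2))
```

With this sample input the first expectation passes and the second fails, because the largest volume_hl value is above the allowed maximum; the observed value is returned alongside the failure so the caller can see why.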
We’d like to thank the HEINEKEN team for their support in creating this case study, and for their participation in our inaugural Community Show & Tell!
Over 300
International, regional, local, and specialty beers and ciders
Wide variety of data sources
Enterprise Resource Planning system, sales data, marketing and media data, weather data, and more
Standardized validation
Consistent testing across different engineers and datasets
Components
- SQL
- Excel
- Pandas