In marketing technology (MarTech), we like to speak of a jungle, because the field is a multifaceted landscape of different terrains and disciplines. Given its sheer size, it is easy to lose track of the whole. Within this metaphor, Data Science is a useful toolkit that no expedition into the MarTech jungle should be without. In our blog post “Data Science – The toolkit for the first expedition into the MarTech jungle”, we already explained what Data Science is and what advantages it holds for customers and companies alike. Now we turn to the basic framework that an organization should put in place before using Data Science.
1. Data basis
The data basis is composed of historical customer data. This usually includes demographic data such as age, gender and marital status, and geographical data such as address and postal code. These base data are also referred to as independent variables.
Dependent variables, or target variables, are the variables that are to be predicted using machine learning. For example: Will a customer churn, yes or no? What is the customer lifetime value? Which data is relevant depends on the application context, for example:
For e-commerce companies:
- Log-ins to the customer account
- Purchases in the store
- Categories of purchased products
- Sessions on the website
- Newsletter clicks
- Newsletter subscriptions and unsubscriptions
- Service center requests
The decisive factor with regard to the data basis is the uniformity of the data. So-called data silos often exist in large companies. These are isolated databases with data from different sources that are collected per department and are not related to each other. These silos must be broken down by standardizing the data and making it accessible to all company units at a central location. Customer Data Platforms (CDP) are a common solution for this.
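To illustrate what breaking down silos means at the data level, here is a minimal sketch that merges records from two departmental sources into one profile per customer. The field names, the sample records and the `unify` helper are hypothetical; a real CDP also handles identity resolution, conflicting values and access control.

```python
from collections import defaultdict

# Hypothetical extracts from two departmental silos, both keyed by customer_id.
crm_records = [
    {"customer_id": 1, "age": 34, "postal_code": "10115"},
    {"customer_id": 2, "age": 51, "postal_code": "80331"},
]
shop_records = [
    {"customer_id": 1, "purchases": 3},
    {"customer_id": 3, "purchases": 1},
]


def unify(*sources):
    """Merge records from several sources into one profile per customer."""
    profiles = defaultdict(dict)
    for source in sources:
        for record in source:
            # Later sources add fields to (or overwrite) the existing profile.
            profiles[record["customer_id"]].update(record)
    return dict(profiles)


profiles = unify(crm_records, shop_records)
for customer_id, profile in sorted(profiles.items()):
    print(customer_id, profile)
```

In this toy data, customer 3 ends up without demographic fields. Gaps like this only become visible once the data sits in one place.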
Requirements for the data basis:
- it must be as large as possible and of sufficient quality. Only then can machine learning methods reliably identify relationships, patterns and correlations. As a general rule, the larger the data set, the better.
- it should be “gapless” and complete. Information about socioeconomic status, but also purchasing behavior, should be available for all customers and not just for some.
- the individual attributes (especially the dependent variables) should be evenly distributed. For example, a data set should contain men and women in balanced proportions. Keyword: data imbalance.
- the data must be heterogeneous in order to identify meaningful patterns. For example, if only middle-aged men are observed, no patterns concerning the behavior of young women can be derived.
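Two of these requirements, completeness and balance, are easy to check mechanically. The following is a minimal sketch; the sample records, the field names and the convention that `None` marks a missing value are assumptions for illustration.

```python
from collections import Counter

# Hypothetical customer records; None marks a missing value.
customers = [
    {"age": 34, "gender": "f", "churned": 0},
    {"age": None, "gender": "m", "churned": 0},
    {"age": 45, "gender": "m", "churned": 1},
    {"age": 29, "gender": None, "churned": 0},
]


def completeness(records, field):
    """Share of records in which the field is present (not None)."""
    filled = sum(1 for r in records if r.get(field) is not None)
    return filled / len(records)


def class_shares(records, field):
    """Relative frequency of each observed value of the field."""
    counts = Counter(r[field] for r in records if r.get(field) is not None)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


print(completeness(customers, "age"))      # 0.75
print(class_shares(customers, "churned"))  # {0: 0.75, 1: 0.25}
```

A completeness well below 1.0 points to gaps, and strongly unequal class shares in the target variable point to data imbalance.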
If necessary, descriptive metrics can be used to evaluate the data set. Individual metrics, such as the cancellation rate, are analyzed in order to derive recommendations for action.
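As an example of such a descriptive metric, the sketch below computes the rate at which customers cancel, broken down by age group. The records, the field names and the grouping are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: age group and whether the customer cancelled (1) or not (0).
customers = [
    {"age_group": "18-29", "cancelled": 1},
    {"age_group": "18-29", "cancelled": 0},
    {"age_group": "30-49", "cancelled": 0},
    {"age_group": "30-49", "cancelled": 0},
    {"age_group": "30-49", "cancelled": 1},
    {"age_group": "50+", "cancelled": 0},
]


def cancellation_rate_by(records, field):
    """Share of cancelled customers for each value of the given field."""
    totals = defaultdict(int)
    cancelled = defaultdict(int)
    for r in records:
        totals[r[field]] += 1
        cancelled[r[field]] += r["cancelled"]
    return {group: cancelled[group] / totals[group] for group in totals}


rates = cancellation_rate_by(customers, "age_group")
for group, rate in rates.items():
    print(f"{group}: {rate:.0%}")
```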
Tip: It is generally advisable to check data sets for human error before analyzing them.
2. Evaluation
Just as important as the data basis is the evaluation of the analysis model used. To this end, the quality of the algorithms must be checked regularly. A concrete case could look like this:
Suppose we examine how many customers dropped out during the purchasing process (customer churn). The obvious question to check is: How many of these customers did the analysis model recognize, and how many did it miss?
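This question corresponds to the metric usually called recall: of all customers who actually churned, what share did the model flag? A minimal sketch with hypothetical labels and predictions follows.

```python
# Hypothetical evaluation data: 1 = customer churned, 0 = customer stayed.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]


def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == 1:
            counts["tp" if p == 1 else "fn"] += 1
        else:
            counts["fp" if p == 1 else "tn"] += 1
    return counts


def recall(counts):
    """Share of actual churners that the model recognized."""
    return counts["tp"] / (counts["tp"] + counts["fn"])


counts = confusion_counts(actual, predicted)
print(counts)          # {'tp': 3, 'fn': 1, 'fp': 1, 'tn': 3}
print(recall(counts))  # 0.75
```

Here the model recognized three of the four churners and missed one. The false positive count (`fp`) shows the other side: customers flagged as churners who in fact stayed.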
Another important question in the evaluation: On what basis did the model make decisions, and which variables had the greatest influence on the prediction?
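One common, model-agnostic way to approach the second question is permutation importance: shuffle the values of one variable and measure how much the model's accuracy drops. The sketch below assumes a hypothetical data set and uses a fixed rule as a stand-in for a trained model; it illustrates the technique and is not necessarily the method the authors have in mind.

```python
import random

# Hypothetical data set: two features and a churn label (1 = churned).
rows = [
    {"logins": 0, "newsletter_clicks": 3, "churned": 1},
    {"logins": 1, "newsletter_clicks": 0, "churned": 1},
    {"logins": 2, "newsletter_clicks": 5, "churned": 1},
    {"logins": 8, "newsletter_clicks": 1, "churned": 0},
    {"logins": 9, "newsletter_clicks": 4, "churned": 0},
    {"logins": 10, "newsletter_clicks": 2, "churned": 0},
    {"logins": 3, "newsletter_clicks": 2, "churned": 1},
    {"logins": 7, "newsletter_clicks": 0, "churned": 0},
]


def model(row):
    """Stand-in for a trained model: predicts churn for rarely active customers."""
    return 1 if row["logins"] < 5 else 0


def accuracy(rows, predict):
    """Share of rows for which the prediction matches the label."""
    return sum(predict(r) == r["churned"] for r in rows) / len(rows)


def permutation_importance(rows, predict, feature, repeats=20, seed=0):
    """Average accuracy drop when the values of one feature are shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(rows, predict)
    drops = []
    for _ in range(repeats):
        values = [r[feature] for r in rows]
        rng.shuffle(values)
        shuffled = [{**r, feature: v} for r, v in zip(rows, values)]
        drops.append(baseline - accuracy(shuffled, predict))
    return sum(drops) / repeats


importance = {
    feature: permutation_importance(rows, model, feature)
    for feature in ("logins", "newsletter_clicks")
}
print(importance)
```

Because the stand-in model only looks at `logins`, shuffling `newsletter_clicks` leaves its accuracy unchanged, so that feature's importance is zero.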
Tip: Regardless of the context, the so-called Gini index (a statistical measure of the inequality of a distribution) is a suitable tool for evaluating machine learning models. However, it applies only to binary target variables, for example purchase: yes (0) or no (1).
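In model evaluation, the Gini coefficient is commonly derived from the area under the ROC curve (AUC) as Gini = 2 · AUC − 1. The sketch below assumes that this is the variant meant here. The labels and scores are hypothetical, and the value 1 marks the event to be predicted, which is an encoding chosen for this example.

```python
def auc(labels, scores):
    """Probability that a random positive case gets a higher score than a
    random negative case; ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def gini(labels, scores):
    """Gini coefficient of a binary classifier: 2 * AUC - 1."""
    return 2 * auc(labels, scores) - 1


# Hypothetical labels (1 = event occurred) and model scores.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1]
print(round(gini(labels, scores), 3))  # 0.778
```

A value of 1 means the model ranks every positive case above every negative one, 0 corresponds to random ranking, and negative values indicate a ranking that is worse than random.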
Data Science can be used in many ways as a toolkit for the first expedition into the MarTech jungle. However, certain conditions must first be met. We have presented two conditions for common machine learning methods: the data basis and the evaluation. The decisive factor for the data basis is its uniformity. Data silos, which are common in large companies, must be broken down by standardizing the data and making it accessible to all company units at a central location.
Just as important as the data basis is the regular evaluation of the analysis model used. The key questions are: How many of the relevant cases did the model recognize, and how many did it miss? On what basis did the model make its decisions, and which variables had the greatest influence on the prediction? If the conditions described here are fulfilled, the expedition into the MarTech jungle can start!