In the previous part we introduced the predictive model that we designed at Bonnier Broadcasting to assess the attrition rate of the customers at C More (our premium pay-TV service). In this last part, we will share some specific technical details of our solution, and some lessons learnt that we thought would be of interest to those who venture into automating one of the factors addressing customer retention.
A word of caution on the technical depth of this part: some basic knowledge of machine learning and distributed systems is assumed.
Data: preparation and cleansing
Data selection is one of the most important steps, if not the most important, when it comes to training and evaluating a predictive model. If there is no meaningful data describing the customer, the model will not be able to infer accurate behaviour either. Note that while data and information are related, and often used interchangeably in the literature, they are not the same: data is the encoded form of information, and more data does not necessarily imply more (or better) information.
The team that builds such a predictive model should sit together with the stakeholders in the organisation who work with the customer data, both directly and indirectly, to help identify which data sources convey the information that leads to the actions taken by each stakeholder in their daily work. Moreover, it is likely that the discussion with the stakeholders will surface new desired data sources, which may not be directly available in the organisation but which, if collected, would make their work easier and more meaningful.
The next task, after selecting the information that is going to be needed and identifying the data sources that can possibly provide it, is data exploration, preparation and cleansing. Cleansing is crucial: if done poorly (or skipped altogether), flaws in the available data will propagate to the model, whose output will be either simply wrong or inaccurate.
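To make the idea concrete, the following is a minimal sketch of the kind of sanity checks such a cleansing pass can apply to raw activity rows. The field names (`user_id`, `minutes_streamed`, `timestamp`) are made up for illustration and are not our actual schema.

```python
from datetime import datetime, timezone

def clean_activity_rows(rows, now=None):
    """Drop rows that would bias the model: missing identifiers,
    negative durations, or events logged in the future."""
    now = now or datetime.now(timezone.utc)
    cleaned = []
    for row in rows:
        if not row.get("user_id"):
            continue  # cannot attribute the activity to a customer
        if row.get("minutes_streamed", 0) < 0:
            continue  # corrupt duration, e.g. a clock-skew artefact
        if row.get("timestamp") and row["timestamp"] > now:
            continue  # timestamp from the future
        cleaned.append(row)
    return cleaned
```

Filters like these are cheap to run and make the corner cases explicit, instead of leaving them to silently skew the training data.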
For example, while we were cleansing some of the data we selected for our model, we discovered some corner cases that were not accounted for in our initial data exploration task. Had we not considered (and acted on) these scenarios, they would have introduced a bias in the results of our predictive model. In the following figure we show an extract of one of the data sources that we selected, related to the activity of our customers (note that some personally identifiable information has been either partially or totally blurred on purpose to protect the privacy of our users for this article).
We selected many more data sources, some of interest are the following,
- Financial data
Payment behaviour is extremely important when addressing the attrition rate, as it helps describe what it looks like when a customer actually cancels the service. In our case, we focused particularly on the tenure of the customer (how long a customer has been paying continuously), the type of subscription the customer is paying for, and how many different subscriptions the customer has had since she first joined the streaming service (on a side note, we detected some isolated subscription patterns that will be of great use for our future work on fraud detection models :).
- Behavioural data
Most of the knowledge was extracted from data sources tightly related to the activity of the customer, such as visited webpages or movies and series watched. We also derived new information by combining other data sources that, while not primarily aimed at describing the customer, we thought held some potential interest. For example, by examining how much a customer used our streaming services relative to other customers within a similar tenure category, we derived some flags that tell us whether a customer is ‘deviating’ from her main reference group (on a side note, we are quite similar to each other after all).
Our data warehousing, and most of the raw analytics, happens in BigQuery. Not only can we store hundreds of gigabytes seamlessly, but we can also process queries over terabytes of data (and we do have some of those) in seconds, by means of ANSI-SQL-compliant queries (for the interested reader, BigQuery supports SQL:2011, ISO/IEC 9075:2011).
Having such capabilities in our data warehouse allowed us to benefit from a rapid trial-and-error approach, because we have plenty of data but not all of it is relevant (some of the data stored in our data warehouse has not yet been used).
Lesson learnt: try to have clean and meaningful data sources from the very beginning; you will save time and effort when you need to understand them.
An extra lesson is data minimisation: retrieve only what you need, because “less is more” after all. Not only for privacy and ethical reasons, but also because documenting, maintaining and understanding large numbers of data sources is a great headache if there is no use for them. Collect the data you need today, reason about whether you truly need to collect other data that you believe you may need in the near future, but do not default to collecting “just in case”.
Feature engineering: extraction and derivation
Extracting the most important features that drive the attrition rate target is one of the most challenging tasks. Some say it is an art, and in a way it is, because on one side you need cleansed data sources to extract information from, and on the other you need domain experts to help understand those data sources: what they mean and what information they convey.
After discussions with some of the domain experts at C More, mostly within the analytical departments, we selected a set of target features to try in our model; some were a one-to-one mapping with a raw data source, such as the tenure of a customer, and most others were derived features that we believed had high predictive power.
For example, among the derived features we created some flags showing whether the most recent page view of a customer was her account or one of the FAQ pages (it seems that when you check your account details often, or visit the FAQ pages, you are more likely to cancel your subscription). We also did some data aggregations, usually to account for some sort of historical information; for example, we averaged the streaming minutes of each customer over different periods (3 days, 1 week, 2 months, etc…), or computed the device that had been most frequently used.
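A minimal sketch of how such derived features could be computed; the page names and window lengths are illustrative, not our production values.

```python
# Pages whose recent visits we hypothetically treat as attrition hints.
ATTRITION_HINT_PAGES = {"account", "faq"}

def recent_view_flag(page_views):
    """page_views: list of visited page names, most recent last.
    True if the latest view was the account or an FAQ page."""
    return bool(page_views) and page_views[-1] in ATTRITION_HINT_PAGES

def trailing_averages(daily_minutes, windows=(3, 7, 60)):
    """daily_minutes: minutes streamed per day, most recent last.
    Returns {window_days: average over the last `window_days` days}."""
    out = {}
    for w in windows:
        tail = daily_minutes[-w:]
        out[w] = sum(tail) / len(tail) if tail else 0.0
    return out
```

Each aggregation collapses a slice of history into a single number per customer, which is what a feed-forward model can consume.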
In the following figure we show some of the features and other information that we use as a source of data for our predictive model (once again, we are blurring some personal identifiable information). Do you notice anything? If not… no worries, in the next section we will talk about it 🙂
Predictive model: algorithms and scoring
We started prototyping our predictive model as a random forest classification estimator with the TensorFlow framework, in a Python notebook for quick development. TensorFlow is a relatively modern framework for numerical computation and machine learning in a data flow graph fashion, which is making a name for itself as a standard for building neural-network-based models. We chose this framework because of the requirements of our cloud platform provider for this purpose, Cloud Machine Learning Engine, whose environment, as of this writing, only supports and runs this framework.
The first results and insights with our random forest implementation were quite positive, although we struggled when we tried fitting the prototype to the settings and structure required by the serving part of TensorFlow. Moreover, the computation time of the training and validation tasks was rather long for the size of the sample we used. It is important to note, though, that this random forest implementation is an adaptation to the standard data flow graph that TensorFlow runs, and possibly not fully optimised. We should also remark that our random forest experiments were mostly run on modern laptops, and we barely optimised for the production environment that promises auto-scaling (although the corresponding random forest implementation should be prepared for that as well).
After some research, and many experiments in between, we decided to move to neural network models (which is what TensorFlow is optimised for anyway) and chose one of the canned estimators that TensorFlow provides: a deep neural network combined with a linear classifier. We thought we would benefit from the best of both worlds and take advantage of the combined strengths of memorization and generalization. Since our raw features are already classified, this was convenient for a supervised learning problem like ours. The implementation in TensorFlow with such an estimator is rather standard and we leave it as an exercise for the reader (we believe that hands-on is best).
Perhaps the reader has noticed something when looking back at the figure with the features in the previous section. In that figure time is very prominent, in particular the attrition date (as well as the corresponding flag for that day). However, with the implementation of the aforementioned canned estimator we cannot use time as it is shown there, because the estimator relies on a deep feed-forward neural network: each example is scored independently, so the network has no notion of sequence and cannot learn anything from the ordering of events in time.
That is why we reworked our features to account for time (recall the aggregations per time period mentioned earlier), so that such an estimator can do its job in our scenario.
Currently, we are working on a new model that accounts for time in its true sense, relying on the full time series. Luckily, TensorFlow includes a contributed implementation of the type of neural network that we need: recurrent neural networks.
Automation: distributed deployment
Most of our batch jobs in the data platform are managed with Apache Airflow, a dynamic, extensible and scalable platform to schedule, monitor and visualise workflows (pipelines) as directed acyclic graphs of tasks. We use it to schedule the extraction and processing of the raw data from our data sources, to run the predictive model in the distributed machine learning engine, and finally, to expose the results to third parties within the organisation. Apache Airflow implements many operators that expose the main functionality of the Google Cloud Platform, easing the automation of our work and, more importantly, the monitoring of the tasks.
Most data preparation, cleansing and processing happens as scheduled tasks that we run with Dataflow before anything else. Once the processed data is available in our data warehouse, BigQuery, we typically extract the features we need for the production model as the result of SQL queries. Thereafter, we run our predictive model, in the Cloud Machine Learning Engine, with a subset of the features that we have for each customer for the time period that we are predicting on.
One of the challenging tasks in the deployment of the predictive model was the training, because the data platform expects a fully trained and possibly tuned model. For our training, we ended up with a 70% (training), 20% (validation) and 10% (test) split of our historical data, covering the last 120 days (longer activity periods did not yield meaningful improvements in the accuracy of the model) and a subset of the features (recall that each model requires certain features, and even the same model may need new ones, because the currently used features will change over time and may affect the accuracy of the model).
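The 70/20/10 split above can be sketched as follows, assuming the examples are already collected in a list; in practice the split runs over the 120-day history in the data warehouse.

```python
def split_dataset(examples, train=0.7, validation=0.2):
    """Return (train, validation, test) slices; the test share is
    whatever remains (10% with the defaults). Cut points are rounded
    to avoid floating-point surprises with 0.7 + 0.2."""
    n = len(examples)
    train_end = round(n * train)
    valid_end = round(n * (train + validation))
    return (examples[:train_end],
            examples[train_end:valid_end],
            examples[valid_end:])
```

For time-ordered data like ours it matters that the slices are contiguous rather than randomly shuffled, so that the model is validated on periods it has not seen during training.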
Taking as a baseline some of the accuracy and area under the curve (AUC) results we got from our early random forest experiments, we optimised and fixed our model (and some of the data too) with the validation data subset. Thereafter, with a minimum accuracy and AUC in mind, we took advantage of the Cloud Machine Learning Engine environment to tune the different parameters of the chosen canned estimator. Cloud Machine Learning Engine provides a useful semi-automatic hyper-parameter tuning feature where the user defines the variables and thresholds to tune, and the target variable to optimise for, for example, minimising the loss or maximising the AUC. In the following figure we show the receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC) of our tuned predictive model,
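For reference, the AUC reported in such a figure can be understood (and computed) as the probability that a randomly chosen churner is scored above a randomly chosen non-churner, with ties counted as half; a small self-contained sketch:

```python
def auc(scores, labels):
    """scores: predicted churn probabilities; labels: 1 = churned.
    Pairwise (Mann-Whitney) formulation of the area under the ROC
    curve; fine for illustration, O(n^2) in the number of examples."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need both classes to compute the AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 means the scores are no better than chance; 1.0 means every churner is ranked above every non-churner.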
Finally, we deployed our trained and tuned model for both batch (our typical use via Apache Airflow) and live predictions in Cloud Machine Learning Engine, a rather seamless process well described in the documentation.
In this article (the current part and the previous one), we have explained how a business, C More at Bonnier Broadcasting, can benefit from machine learning not only to learn about the reasons why customers decide to end their service, but also, and more importantly, to score the likelihood that current customers will end the service in the near future, because customer retention is cheaper than customer acquisition.
Attrition rate is a complex problem where there are many actors and stakeholders involved. A whole organisation may end up getting involved, although in most cases it will rely on the teams that support the business and whose tasks include driving solutions to reduce the attrition rate. On the technical side, machine learning is one of the tools that help score the customers, based on their individual behaviour and their relative similarity to other customers, which allows for inferences and generalisations.
Modern machine learning takes great advantage of neural networks and large-scale distributed data processing, which was pretty much impossible a few years ago. However, machine learning in general, and neural networks in particular, should not be considered a replacement for the decisions and actions that the business units in the organisation must still take. After all, this is what data-driven means: support your decisions with evidence rather than intuition or personal experience.
For future work, we envision the following improvements, some of them already ongoing,
- More tests of unrelated business features that could affect the behaviour of the customer, for example, weather or geography.
- Better data preparation, for example, the assets that a customer watches at a specific time period must be represented as a multi-valued variable (sparse vector in fact) to account for the context of the 1-to-m relation.
- Improvement of performance when extracting features. It is important to optimise the time-series of each feature to avoid processing again data that was already processed, and aim at reusing what has already been calculated. For example, there are many features that require short historical windows, mostly daily, while there are some others using relative quantiles within a longer historical window. Therefore, customising for each feature (or set of features) may be more efficient rather than taking a global window for all.
- Implement and compare with more predictive models for cross-validation, for example, a selected choice of linear and non-linear algorithms.
- K-fold cross validation. Validating is important, and in our case we believe it is crucial as we have to account for seasonality as well (the hockey season doesn’t run all year long :).
- Aim at a more privacy-friendly data collection and analytics process. While we comply with the applicable legislation, and we are preparing for the upcoming General Data Protection Regulation (GDPR), we should aim at a much less individual data processing, and move to processes that classify the individual in a group. For example, differential privacy offers good trade-offs between privacy and results’ accuracy.
Guillermo Rodríguez-Cano & Maryam Olyaei
We would like to thank our colleagues, Kristoffer Adwent, Erik Ferm and Marcus Olsson, for their joint work on this project, and Josef Eriksson for helping us understand the quirks of some of the features we have derived for the predictive model, and many other people of other teams who have helped with their comments and efforts. We also thank David Hall for pointing out a privacy issue on some of our figures.