Churn Analysis Engine

Introduction

Churn is defined as the number of customers who cease doing business with a company over a fixed period of time. Businesses that rely on network effects are hit especially hard, both in lost revenue and in wasted marketing spend.

Sectors such as Telecom, Retail, Hospitality, and Subscription-based services are among those where customer churn is taken seriously. Being able to track and predict churn has implications for:

  1. Improved user experience, and thus revenue
  2. Optimized marketing expense

Churn Analysis is a collection of activities, ranging from collecting data out of CRMs to process improvements across Operations and Marketing, that reduce the number of customers leaving for the competition. If we can identify and predict dissatisfied customers, we can take corrective action.

Motivating Examples

Suppose you run a Fitness Tracker business: your business model is to discount the actual hardware and earn the bulk of your revenue from selling subscriptions to a coaching service. In this case you want not only to maximize unit sales but also to minimize subscription cancellations.

Since you have a limited marketing budget, you want to focus your efforts and give great offers to customers who will generate better future revenue, and thus profit. You have data in the form of:

  1. Structured Information:
    1. Metrics from the fitness tracker
    2. Demographic information from welcome calls
  2. Unstructured Information:
    1. Chat logs between Fitness Representatives and Customers
    2. Customer service tickets and the concerns they capture

You want to build a predictive model that identifies customers who are likely to discontinue their subscription in the next 90 days.

Implementation Approaches

From a Machine Learning and Data Science perspective, the problem of predicting churn can be modelled as:

  1. Rule Based: Often, domain knowledge built from past expertise can be collected as rules and programmed into software. This is called a Rule Based system, and it can be a good approach for problems that do not exhibit high variance.
  2. Neighbourhood Based: We can segment our customers by behaviour profile into clusters {these will be our representative profiles}. We can then propagate the label of the nearest cluster to the candidate profile that falls into it.
  3. Machine Learning Based:
    • Regression Based: Try to predict the score as a probability {between 0 and 1}. Zero can be interpreted as the customer won't churn; the closer the score is to one, the more likely churn is
    • Classification Based: You can try to assign a customer to a bucket. If there are only two buckets {Churn or Not Churn}, this is a binary classification problem. If we want to get more granular, say 5 risk levels, it is a multi-class classification problem
  4. Ensemble Based: Think of this set of algorithms as a committee of members where a collective vote is taken and merged into a final result. Ensembles have proven to produce better results than a single family of models; they exhibit what is known as the strength of "weak" learners.

Since Rule Based approaches are limited in their applicability, they will not be covered in depth under the current scope.
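Even without going deeper into them, a rule-based system can be as simple as a handful of hand-written checks. The sketch below is a toy in Python; the field names {days_since_last_login, open_tickets, payments_missed} and the thresholds are hypothetical, not from any real CRM.

```python
def rule_based_churn_flag(customer: dict) -> bool:
    """Flag a customer as at-risk using hand-written domain rules.

    The fields and cut-offs are illustrative assumptions only.
    """
    if customer.get("days_since_last_login", 0) > 30:
        return True  # inactivity is a classic churn signal
    if customer.get("open_tickets", 0) >= 3:
        return True  # many unresolved tickets suggests frustration
    if customer.get("payments_missed", 0) >= 2:
        return True  # repeated missed payments often precede churn
    return False
```

Because every rule is explicit, such a system is easy to audit but brittle: each new pattern of behaviour requires a new hand-written rule.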

Neighbourhood Based

As the old adage goes, "You are known by the company you keep", and this is at times true for data as well. Investing time and effort in Feature Engineering is what will get good results from your predictive models. Feature Engineering is a process by which you:

  1. Clean and extract data into a standard form. This lets us represent the data in a form our machine learning algorithms can process
  2. Create derived features based on existing features
  3. Prune features of low importance {this is called feature selection}
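The three steps above can be sketched roughly as follows; all field names, thresholds, and the selected-feature whitelist are illustrative assumptions, not a real schema.

```python
from datetime import date

def engineer_features(raw: dict, as_of: date) -> dict:
    """Turn a raw customer record into model-ready features.

    Field names here are invented for illustration.
    """
    # Step 1: clean and standardise {coerce to numeric, fill missing values}
    steps = float(raw.get("avg_daily_steps") or 0.0)
    tenure_days = (as_of - raw["signup_date"]).days

    # Step 2: create derived features from the cleaned ones
    features = {
        "tenure_days": tenure_days,
        "steps_per_tenure_day": steps / max(tenure_days, 1),
        "is_inactive": int(steps < 1000),
    }

    # Step 3: feature selection {keep only a whitelist deemed important}
    selected = {"tenure_days", "is_inactive"}
    return {k: v for k, v in features.items() if k in selected}
```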

Once we've represented our dataset as features, we can feed it into a family of algorithms called Clustering Algorithms.

One of the most popular algorithms {though not the most efficient} is the k-Means algorithm {note that k-Means, a clustering method, is distinct from the kNN classifier with which it is often confused}. The k parameter is the number of "buckets" you want the algorithm to divide the dataset into. The algorithm starts by selecting random points as cluster centres and keeps updating them so as to minimize the distance between each point in the dataset and its nearest centre {this is called the Error}. The algorithm terminates on convergence, i.e. when the error can no longer be improved.

The algorithm is extremely easy to understand and implement; on the downside, it is computationally expensive, so it doesn't scale well to large datasets.

Depending on how we model the problem {binary or multi-bucket}, we can choose an appropriate k and label a candidate instance accordingly as will churn or will not churn.
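A minimal, pure-Python sketch of the clustering loop described above, on 2-D points {toy code, not production-grade; real work would use an optimized library}:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Toy k-Means: assign each point to its nearest centre, then move
    each centre to the mean of its members, until assignments settle."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # random initial centres
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centre's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centres[c][0]) ** 2
                                + (p[1] - centres[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each centre to the mean of its cluster
        new = [(sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
               if cl else centres[i]
               for i, cl in enumerate(clusters)]
        if new == centres:  # convergence: the error can no longer improve
            break
        centres = new
    return centres, clusters
```

Each iteration is O(n * k), and many iterations may be needed, which is the scaling limitation mentioned above.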

Regression Based

Regression is one of the classic techniques in Statistics. In the context of Churn Analysis, our target variable can be modelled as a risk score, i.e. the probability with which the customer will discontinue the service.

A regression model is like fitting a line to a given set of points {features in our case}. We use training data to learn a line that minimizes the error {e.g. the popular Root Mean Squared Error, RMSE} and then use that line to predict the score of a candidate instance.

Regression models are easy to use and learn, but they exhibit limitations when learning from a complex dataset {which could be the result of unpredictable behaviour}.
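A common choice for predicting a 0-to-1 score like this is logistic regression. Below is a toy gradient-descent sketch, assuming each customer is a small numeric feature vector; it is illustrative only, not a production trainer.

```python
import math

def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit a logistic-regression churn scorer with plain gradient descent.

    X: list of feature vectors; y: 0/1 churn labels. Toy sketch only.
    """
    n = len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted churn probability
            g = p - yi                       # gradient of log-loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def churn_score(w, b, x):
    """Score a candidate instance: probability of churn between 0 and 1."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))
```

The sigmoid squashes the fitted line into the {0, 1} range, which is what lets us read the output as a probability rather than an unbounded value.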

Classification Based

A complementary technique to the Regression method is Classification. Depending on how we want to model this problem, it can be framed as:

  1. Binary classification
  2. Multi-class classification

Binary classification is relatively straightforward, so let's talk a bit more about the multi-class problem. Let's say the marketing team has come up with a scheme for segmenting customers into the following buckets:

  1. Satisfied
  2. Neutral
  3. Low Risk
  4. High Risk
  5. Hostile

They have also gone ahead and created a labelled sample of 10,000 customers based on surveys, payment history, and the experience of Customer Service Representatives.

Now they want to identify the Low Risk, High Risk, and Hostile customers and see if they can offer services or freebies to convert them back into the Satisfied or Neutral categories.
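As a rough sketch of how a model's risk score could be mapped onto these five buckets {the cut-off values are invented for illustration and would be tuned in practice}:

```python
SEGMENTS = ["Satisfied", "Neutral", "Low Risk", "High Risk", "Hostile"]

def segment(score: float) -> str:
    """Map a 0-1 churn-risk score onto the five marketing buckets.

    The cut-offs below are illustrative assumptions, not tuned values.
    """
    cuts = [0.2, 0.4, 0.6, 0.8]
    for label, cut in zip(SEGMENTS, cuts):
        if score < cut:
            return label
    return SEGMENTS[-1]  # anything at or above the last cut is Hostile
```

The marketing campaign would then target only the customers whose segment is Low Risk, High Risk, or Hostile.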

This kind of segmentation is possible using Classification algorithms. A plethora of classification algorithms exists; the trick is to quantify their performance. Some common metrics used for evaluating these models are:

  1. Precision and Recall {Recall is also known as Sensitivity}
  2. F1 Score {Think of this as a balanced score of Precision and Recall}
  3. ROC Curve
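Precision, recall, and F1 can be computed directly from the confusion counts; a small sketch for the binary {churn vs not-churn} case:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive {churn} class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of flagged, how many churned
    recall = tp / (tp + fn) if tp + fn else 0.0     # of churners, how many we caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

In a churn setting, recall matters when missing a churner is costly, while precision matters when each retention offer has a real price tag; F1 balances the two.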

Ensemble Based

Ensemble methods are a technique where we combine multiple models into a single model that produces better results than any single one. The variations of an ensemble can be:

  1. Using variations of the same algorithm
    • In terms of the dataset used to train them
    • In terms of the parameters used to train them
  2. Using combinations of different algorithms

In practice, ensemble models outperform any of the techniques mentioned above most of the time. There are a few downsides to using ensemble models:

  1. Increased complexity in terms of the time required to train and maintain them
  2. Figuring out the right configuration for an ensemble is challenging
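The committee idea can be sketched as a simple majority vote over several models; the "models" here are stand-in callables, but the same pattern applies to real trained classifiers.

```python
from collections import Counter

def majority_vote(models, x):
    """Combine several churn classifiers by taking a committee vote.

    Each model is any callable returning a label such as "churn" or "stay".
    """
    votes = [m(x) for m in models]           # each committee member votes
    return Counter(votes).most_common(1)[0][0]  # merge votes into one result
```

Even if each individual model is only slightly better than chance, the merged vote tends to be stronger, which is the "strength of weak learners" effect mentioned earlier.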