João Ataide

Predicting customer churn in telecommunications companies

Updated: Feb 15, 2023


 

Companies providing communications services often analyze customer dissatisfaction, identifying whether each customer is retained or not; these analyses are one of the main metrics used to keep existing buyers and acquire new ones. In general, companies in this sector have customer service departments whose main tasks include trying to win back lost customers, which can cost much more than acquiring new ones.


Because of this, companies invest in new technologies to perform these analyses, which commonly use the churn rate, or simply Churn, one of the ways to measure this customer attrition. Beyond quantifying the dropout rate, churn analysis helps identify future cancellations, producing predictions that support decision-making and targeted actions for at-risk customers. An everyday example is streaming companies like Spotify and Netflix.


Because it is such an important metric, Carlos Melo, in his Data Science in Practice course, gave us a challenge: implement a churn solution from scratch for a telecommunications company, using real data. The data used in this project was originally made available on the IBM Developer teaching platform; it represents a typical telecommunications problem, and the complete dataset can be seen in this link.


The data provided by IBM does not come with explicit documentation, but fortunately the column names make the problem easy to understand, as we can see in the first five entries below. The dataset has 7042 entries and 21 columns.
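A minimal sketch of loading and inspecting the data with pandas, assuming the IBM CSV has been saved locally (the file name here is a hypothetical placeholder):

# Load the dataset and inspect the first entries
# (the file name "telco_churn.csv" is a hypothetical placeholder)
import pandas as pd

df = pd.read_csv("telco_churn.csv")
print(df.shape)   # expected: (7042, 21), as reported above
print(df.head())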



As an initial step in any Data Science project, a good exploratory analysis is of paramount importance. I therefore performed the analyses I found most relevant, starting with an assessment of the unique values, as sketched below.
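A quick way to run this unique-value assessment, assuming the DataFrame is named df as above:

# Count distinct values per column to see which are binary and which are not
print(df.nunique())

# List the actual categories of the object-typed columns
for col in df.select_dtypes(include="object"):
    print(col, df[col].unique())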


In this evaluation, it was possible to notice that most of the columns in question have more than two categories, which means some transformations will be needed later on before feeding the model. Other assessments were also necessary, such as identifying outliers, computing descriptive statistics, and checking for the presence of bias.


However, the descriptive statistics and the outlier assessment did not reveal much. So I examined the Tenure column, which represents how long each customer has used the service and can be read as a measure of "service loyalty". I assume the time unit of the Tenure column is months.
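The distribution discussed below can be drawn with a simple histogram; a sketch using matplotlib, assuming the column is named tenure as in the IBM data:

# Histogram of tenure (assumed to be in months), one bin per month
import matplotlib.pyplot as plt

df["tenure"].plot(kind="hist", bins=72)
plt.xlabel("Tenure (months)")
plt.ylabel("Number of customers")
plt.show()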



Note that the number of customers drops sharply after the first month, with an increase in the final months (70 to 72), but the data otherwise looks fairly regular. Of all the procedures mentioned above, I will highlight the presence of bias, as the others did not produce results that could compromise the model.


The biases analyzed were in the columns gender, Partner (partnerships), and Churn (the attrition itself); however, only the last one showed a compromising imbalance, making it necessary to balance the classes, as you can see below:
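A quick check of this imbalance, assuming the target column is named Churn:

# Relative frequency of each Churn class; a strong skew signals imbalance
print(df["Churn"].value_counts(normalize=True))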



As for the other columns, one could arguably treat some of their distinct values as equivalent, especially in the service columns. Even though this hypothesis is perfectly valid, for the work in question I will treat each value as its own category.


In this step, I performed some manipulations to prepare the data for building a good model. First comes the pre-processing, using LabelEncoder and pandas get_dummies to transform the categorical columns into binary variables, leaving each class with a distinct numerical value, as you can see below:
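A minimal sketch of this step, assuming binary columns go through LabelEncoder and multi-category columns through get_dummies (the column lists below are illustrative, not the exact ones used):

# Binary columns via LabelEncoder, multi-category columns via one-hot encoding
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for col in ["gender", "Partner", "Churn"]:        # illustrative binary columns
    df[col] = le.fit_transform(df[col])

df = pd.get_dummies(df, columns=["Contract", "PaymentMethod"])  # illustrative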


And then we can generate the correlation matrix of the data.
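A sketch of this step, with everything already numeric, using pandas and a seaborn heatmap:

# Correlation matrix of the encoded data, rendered as a heatmap
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.corr(), cmap="coolwarm", center=0)
plt.show()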


With all this done, we can move on to defining our performance metrics. Here I use cross-validation as the error estimator, which I have already covered in another article, combined with the recall metric to find out which model best fits this problem.


So, to facilitate the evaluation of the models, I defined a validation function:


# Validation function: standardizes the features and estimates
# recall via cross-validation
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

def valid(X, y, model, quiet=False):
    X = np.array(X)
    y = np.array(y)
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, scoring="recall")

    if not quiet:
        print("Recall: {:.3f} (+/- {:.3f})".format(scores.mean(), scores.std()))

    return scores.mean()

This returns the mean recall, printing it together with its standard deviation. I then built a baseline for comparison, to serve as a reference when evaluating the models.


# Create the baseline model
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
score_baseline = valid(X_treino, y_treino, rfc)

Resulting in Recall: 0.485 (+/- 0.022). The validation strategy relies on under-sampling and, following recommendations from the literature, the data is standardized before this balancing technique is applied; to be clear, the balancing is done only on the training set, as sketched below.
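A minimal sketch of the under-sampling step, using imbalanced-learn's RandomUnderSampler on the training split (the _rus variable names match those used in the code further down):

# Under-sample the majority class on the training set only
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_treino_rus, y_treino_rus = rus.fit_resample(X_treino, y_treino)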


As I don't know which model will give the best result, I will perform cross-validation for the following models:

  • Random Forest

  • Decision Trees

  • Stochastic Gradient Descent (SGD)

  • SVC

  • Logistic Regression

  • XGBoost

  • LightGBM


It was then necessary to instantiate and configure them, as in the code below:

# Instantiate the models
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import pandas as pd

rc = RandomForestClassifier()
ad = DecisionTreeClassifier()
gde = SGDClassifier()
svc = SVC()
rl = LogisticRegression()
xgb = XGBClassifier()
lgbm = LGBMClassifier()

# Lists to collect the results
modelo = []
recall = []

# Evaluate the performance of each model on the balanced training set
for clf in (rc, ad, gde, svc, rl, xgb, lgbm):
    modelo.append(clf.__class__.__name__)
    recall.append(valid(X_treino_rus, y_treino_rus, clf, quiet=True))

Recall = pd.DataFrame(data=recall, index=modelo, columns=["Recall"])

Generating the following result:

Looking at the results, the XGBoost, Logistic Regression, and SVC models stood out, all of them still with default hyperparameters. With that, I will take the best model so far (XGBoost) and tune its parameters in search of the best possible result, as documented in the complete project.
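One common way to run such a search is a grid search over the hyperparameters; a sketch with scikit-learn's GridSearchCV, where the grid values are illustrative rather than the exact ones evaluated:

# Illustrative grid search over a few XGBoost hyperparameters,
# scored by recall to match the validation metric used above
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

grid = GridSearchCV(
    XGBClassifier(),
    param_grid={
        "learning_rate": [0.001, 0.01, 0.1],
        "n_estimators": [50, 100, 200],
        "max_depth": [1, 3, 5],
    },
    scoring="recall",
)
grid.fit(X_treino_rus, y_treino_rus)
print(grid.best_params_)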


Evaluating some XGBoost configurations, I found the best result to be:

# Final model
xgb = XGBClassifier(learning_rate=0.001, n_estimators=50, max_depth=1,
                    min_child_weight=1, gamma=0.0)
xgb.fit(X_treino_rus, y_treino_rus)

Resulting in the following predicted values:

And the Confusion Matrix:
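A sketch of how this matrix can be computed, assuming the hold-out split is named X_teste / y_teste (hypothetical names, not shown in the text):

# Predict on the hold-out set and compute the row-normalized confusion matrix
from sklearn.metrics import confusion_matrix, roc_auc_score

y_pred = xgb.predict(X_teste)
print(confusion_matrix(y_teste, y_pred, normalize="true"))
print("AUC: {:.2f}".format(roc_auc_score(y_teste, y_pred)))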

The result is an excellent model, correctly identifying 0.89 of the true positives and 0.57 of the true negatives, with an AUC of 0.72.


In this project, we worked on a churn prediction problem in which the main objective was to build a machine learning model capable of correctly identifying the largest possible number of customers likely to churn. To that end, several procedures were necessary, such as balancing and standardization, and different models were tested to obtain the best result, arriving at an XGBoost model with an excellent outcome.
