On a sunny day in the middle of the Brazilian semi-arid region, a representative of a financial institution, Fintech λλ, contacted me by email requesting a solution to improve its model for assessing customers' credit risk.
Among the areas that make up these companies, one of those that causes the most problems and discomfort is the default rate within the customer portfolio. This makes the evaluation of these portfolios mandatory and extremely important, since defaults can cause huge deficits on the company's balance sheet.
Investigating further, I found that, for these institutions, each customer is part of a credit portfolio that indicates how likely each customer is to be financially capable of paying their debt (loans, credit cards and others); this assessment of the portfolio is reported as a ranking.
Because of this, these institutions are increasingly investing in new technologies to develop and improve their assessment systems, always aiming to minimize the risk of default, that is, non-compliance with the obligations and/or conditions of a loan, and betting more and more on Machine Learning models.
Numerous solutions could be applied to this problem. In this project I will build an algorithm to estimate the probability that a client will not comply with their financial obligations. Note that this whole story is a fiction created by me, and this work is one of the practical exercises of Carlos Melo's Data Science in Practice course.
To start the work, it was necessary to evaluate the database provided by the institution. It is worth mentioning that, in general, these evaluations are made when the customer requests the card (usually in the first contact with the institution); here, however, we will use an already established dataset with 45,000 entries and 43 columns, provided as a CSV file.
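Before anything else, we load the data with pandas. This is just a quick sketch; the file name below is an illustrative assumption, since the original name is not given in the text.

import pandas as pd

# Load the credit portfolio dataset (file name is an assumption)
df = pd.read_csv("credit_portfolio.csv")
print(df.shape)  # expected: (45000, 43)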
In the case of this project, it is extremely important to identify the meaning of each of the variables that will be used in the work, which can be seen below:
id - anonymous identification; a unique value per customer.
target_default - is the target variable, which we will use to analyze default risk.
The columns score_3, score_4 and score_5 are numeric, while score_1 and score_2 are encoded in some way. We will have to check later whether they have a manageable number of classes that can be converted into useful information.
There are other variables with some kind of encoding, such as reason, state, zip, channel, job_name and real_state, which will also need further analysis to see whether any information can be extracted from them.
profile_tags - contains a dictionary of labels assigned to each customer.
target_fraud - would be the target variable for another model, whose objective would be fraud detection.
lat_lon - is in string format, containing a tuple with the coordinates.
As the first phase of every project, I will start with the exploratory data analysis, in the following order: missing values, data types, unique values, data balancing and descriptive statistics.
Performing this calculation, we can see that some columns, such as target_fraud, last_amount_borrowed, last_borrowed_in_months, ok_since and external_data_provider_credit_checks_last_2_year, have more than half of their data missing. The columns external_data_provider_credit_checks_last_year, credit_limit and n_issues have between 25% and 34% of their values missing. The count of unique values also shows that the columns external_data_provider_credit_checks_last_2_year and channel contain a single possible value, so they cannot contribute to the model and can be discarded.
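As a sketch of how these checks can be done, assuming the DataFrame is named df:

# Percentage of missing values per column, from highest to lowest
missing_pct = (df.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct.head(10))

# Number of unique values per column; a single unique value adds no information
print(df.nunique().sort_values().head(10))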
Following these steps, I analyzed the balance of the target variable target_default: the proportion is 77.95% False and 14.80% True (the remainder is missing values), so it will be necessary to balance the classes later.
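A minimal sketch of this check, keeping the missing labels visible:

# Class proportions of the target variable, including missing labels
print(df["target_default"].value_counts(normalize=True, dropna=False) * 100)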
Finally, from the main descriptive statistics, we can highlight some observations:
The column external_data_provider_credit_checks_last_2_year has minimum, maximum and standard deviation values of zero.
The reported_income column has inf values, which will interfere with the analysis and model. We will replace values of type np.inf with np.nan to work with the data.
The column external_data_provider_email_seen_before has a minimum value of -999, which is strange considering the other information. After checking in more depth, it was concluded that these data are outliers or were inadequately treated. We will replace values equal to -999 with np.nan.
Having done most of the exploratory analysis, the next step is to perform the manipulations necessary to build the best model, so we can start pre-processing the data.
For this, I performed the following cleaning manipulations (sketched in code after the list):
Replace inf with NaN in the reported_income column;
Drop the ids, target_fraud, external_data_provider_credit_checks_last_2_year, channel and profile_phone_number columns;
Replace -999 in external_data_provider_email_seen_before with NaN;
Eliminate columns with no apparent information or that require more research;
Eliminate entries where target_default is equal to NaN.
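A sketch of these cleaning steps, assuming the DataFrame is named df and the column names follow the list above:

import numpy as np

# Replace inf and the sentinel -999 with NaN
df["reported_income"] = df["reported_income"].replace(np.inf, np.nan)
df["external_data_provider_email_seen_before"] = (
    df["external_data_provider_email_seen_before"].replace(-999, np.nan)
)

# Drop columns that cannot contribute to the model
df = df.drop(columns=["ids", "target_fraud",
                      "external_data_provider_credit_checks_last_2_year",
                      "channel", "profile_phone_number"])

# Remove entries without a label for the target variable
df = df.dropna(subset=["target_default"])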
With this first pre-processing, guided by the analysis of the variables, I could then start pre-processing the input. This dataset has a lot of missing and null data, but we do not have much information about the reasons behind it. To address this, I will replace the missing values (NaN) with zero for variables where the absence of a value means there is no record, such as last_amount_borrowed, last_borrowed_in_months and n_issues. For the others, I will substitute the median when the variable is numerical and the mode when it is categorical.
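A sketch of this imputation, under the assumptions just described:

# Zero where the absence of a value means there is no record
for col in ["last_amount_borrowed", "last_borrowed_in_months", "n_issues"]:
    df[col] = df[col].fillna(0)

# Median for the remaining numeric columns, mode for the categorical ones
for col in df.columns[df.isnull().any()]:
    if df[col].dtype == "object":
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].median())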
With all this pre-processing performed, the following dataset remained:
This leaves 41,741 entries and 25 columns. Note that most columns are encoded, so I will use LabelEncoder and get_dummies to transform the categorical data into numeric data, applying LabelEncoder to the binary variables and get_dummies to split those with more than two categories into separate columns.
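A sketch of this encoding step, label-encoding the binary categorical columns and one-hot encoding the rest:

from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Label-encode categorical columns with exactly two classes
le = LabelEncoder()
for col in df.select_dtypes(include="object").columns:
    if df[col].nunique() == 2:
        df[col] = le.fit_transform(df[col].astype(str))

# One-hot encode the remaining categorical columns
df = pd.get_dummies(df)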
Arriving, as a result, at a dataset like this:
As we saw earlier, the target data is heavily unbalanced, so it was necessary to perform some manipulations, arriving at a balanced result of 35,080 False and 35,025 True entries.
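The text does not detail which balancing technique was used, so the sketch below is only one possible approach: splitting the data and oversampling the minority class of the training portion with SMOTE. The split parameters are assumptions; only the names X_treino_balanceado and y_treino_balanceado are reused later in the post.

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X = df.drop(columns=["target_default"])
y = df["target_default"]

# Hold out a test set before balancing, so the evaluation stays realistic
X_treino, X_teste, y_treino, y_teste = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Oversample the minority class of the training set only
smote = SMOTE(random_state=42)
X_treino_balanceado, y_treino_balanceado = smote.fit_resample(X_treino, y_treino)
print(y_treino_balanceado.value_counts())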
With all the manipulations done, I looked in the literature and at other data scientists' work to see which model best fits this type of problem. After a systematic search, I identified that the most used model for this kind of problem is the Random Forest (RF). So, to identify the best RF model, I created a validation function that uses recall as the scoring metric, establishing a baseline of how much the model could get wrong and confirming which learning model would be best to use.
# Validation function: cross-validated recall for a given model
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def valid(X, y, model, quiet=False):
    X = np.array(X)
    y = np.array(y)
    # Standardize the features and evaluate the model with cross-validated recall
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, scoring="recall")
    if not quiet:
        print("Recall: {:.3f} (+/- {:.3f})".format(scores.mean(), scores.std()))
    return scores.mean()

rfc = RandomForestClassifier()
score_baseline = valid(X_treino_balanceado, y_treino_balanceado, rfc)
This reached a recall of 0.967 (+/- 0.003), which allowed me to continue with the RF algorithm. I then ran tests with GridSearchCV to identify the best hyperparameters, arriving at the final model shown after the sketch below.
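A sketch of what this search could look like; the parameter grid and number of folds below are illustrative assumptions, not the exact grid used.

from sklearn.model_selection import GridSearchCV

# Search over a small grid, scoring by recall as in the baseline
param_grid = {
    "criterion": ["gini", "entropy"],
    "n_estimators": [50, 100, 200],
}
grid = GridSearchCV(RandomForestClassifier(), param_grid, scoring="recall", cv=5)
grid.fit(X_treino_balanceado, y_treino_balanceado)
print(grid.best_params_)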
forest = RandomForestClassifier(criterion='entropy', n_estimators=100)
forest.fit(X_treino_balanceado, y_treino_balanceado)
With the model ready, I could then make the predictions and compare them with the test set, obtaining an excellent result, as you can see below:
The confusion matrix shows a true positive rate of 96% and a true negative rate of 99%.
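A sketch of this evaluation, assuming the test split variables (X_teste, y_teste) from the balancing sketch above:

from sklearn.metrics import classification_report, confusion_matrix

y_pred = forest.predict(X_teste)

# Row-normalized confusion matrix: the diagonal holds the true negative and true positive rates
print(confusion_matrix(y_teste, y_pred, normalize="true"))
print(classification_report(y_teste, y_pred))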
In this project, I carried out several procedures to build a credit risk prediction model, which required many manipulations, from the exclusion of columns to balancing and standardization. The result is extremely satisfactory for the customer and could be put into production. However, there is still room to try other learning models on this task and verify whether Random Forest really is the best one for this problem; you can see the complete project in the notebook.
Considering the case of Fintech λλ, the result was indeed very satisfactory, as they could improve their evaluation service. We can see this with a simple example: for a customer credit portfolio of 1 million reais, my model could help ensure that the company received 990 thousand from the loans, with only 1% of the cases wrong. This makes the company's financial risk acceptable, as I would consider that it would keep its credit portfolio in a very healthy situation, with a default rate well below the national averages.
Did you like this structure for blog articles better? Send your feedback on my social networks or by email, and if you enjoyed it, share it to support the work. :)