Credit card fraud is one of the main concerns of financial institutions. In Brazil, around 12 million people have already been victims of fraud, with losses exceeding R$1.8 billion in 2019 alone.

However, failing to detect fraud should not concern financial institutions alone, since the problem can cause great damage to other financial sectors and even to consumers. Because of this, automated techniques are put in place to detect fraud, which in turn raises concerns such as false positives: legitimate purchases that are refused, or cards that are blocked, embarrassing countless users.

Investment in Artificial Intelligence applications is growing. With large volumes of historical data available, a machine learning algorithm that is only slightly better than its predecessors can already save millions of reais. Thus, as part of the Data Science course of the Carlos Melo Practice, we are faced with the challenge of continually improving the use of algorithms to inhibit or avoid fraudulent transactions. Since this is a course challenge, Carlos has already provided organized data without categorical variables, so that the work can focus on building the best possible model.

The data used here were made available by some European credit card companies. They represent the financial operations that took place over a period of two days, in which about 290 thousand transactions were recorded.

A good thing about this dataset is that its features are all numeric and anonymized (to protect users' privacy). Most of the columns are therefore named V1, V2... V28, and the remaining columns represent the following:

Time - Number of seconds elapsed between this transaction and the first transaction in the dataset;

Amount - Transaction amount;

Class - 1 for fraudulent transactions, 0 otherwise.

The first and last five entries of the dataset give a sense of this structure.

**First entries**

**Last entries**

The original data can be found on Kaggle, and its features have undergone a transformation known as Principal Component Analysis (PCA). This technique reduces dimensionality while keeping as much information as possible. To achieve this, the algorithm finds a new set of features, the so-called components, whose number is less than or equal to the number of original variables.
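As a rough illustration of the idea, the sketch below applies scikit-learn's `PCA` to a small synthetic matrix. The data, the seed and the choice of three components are assumptions made for the example; they are not the transformation that was applied to the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Hypothetical data: 200 samples with 5 features, where the last 3 features
# are noisy linear combinations of the first 2 (so the columns are redundant).
base = rng.normal(size=(200, 2))
derived = base @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(200, 3))
X_toy = np.hstack([base, derived])

# Keep 3 components out of the 5 original variables.
pca = PCA(n_components=3)
components = pca.fit_transform(X_toy)

# Share of the total variance captured by each component, in decreasing order.
explained = pca.explained_variance_ratio_
print(components.shape)
print(explained)
```

Because the toy columns are largely redundant, most of the variance should end up in the first components, which is what makes dropping the remaining ones cheap.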

With the data imported into a DataFrame and no further adjustment or configuration needed at this step, I was able to go straight to the exploratory analysis and then to the assembly of the Machine Learning model.

So I started by analyzing the basic statistics. This is an important part of the work, because it shows how each column is distributed.

However, because the dataset is large, the basic statistics alone are not very informative, except to hint at imbalanced distributions, which we will see later.

Before that, the next step was to check the quality of the dataset, an important indicator of whether the project can continue. The count of missing values showed that there were none.
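A missing-value check of this kind can be sketched with pandas as follows. The small `toy` DataFrame is made up for illustration (it deliberately contains one missing value), so the numbers are not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset with one missing Amount value.
toy = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0, 4.0],
    "Amount": [149.62, np.nan, 378.66, 123.50],
    "Class": [0, 0, 0, 1],
})

# Number of missing values per column.
missing_per_column = toy.isnull().sum()

# The same information as a fraction of the rows.
missing_share = missing_per_column / len(toy)

print(missing_per_column)
print(missing_share)
```

On the real data, every entry of `missing_per_column` would be zero, which is what allowed the project to move on.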

Next I evaluated the class balance: class 0 (non-fraudulent) has 284,315 entries and class 1 (fraudulent) has 492. Both graphically and in percentage terms, the number of fraudulent transactions is much smaller than the number of normal ones, at only 0.17% of the entire dataset.
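The percentage follows directly from the two class counts, as this small calculation shows (the variable names are only illustrative):

```python
# Class counts reported above: 0 = non-fraudulent, 1 = fraudulent.
counts = {0: 284315, 1: 492}

total = sum(counts.values())
fraud_share = counts[1] / total

print(f"Total transactions: {total}")
print(f"Fraudulent share: {fraud_share:.2%}")
```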

Then I plotted histograms of Time for each of the two classes, using different scales.

However, it was not possible to extract any information from these two frequency distributions. I then plotted the histograms of Amount for each class.

Here the histograms show an imbalance and a large number of possible *outliers*. We therefore need a box plot to carry out a more in-depth analysis.

The box plot showed several outliers and a clearly skewed distribution, with a mean of 118.13 and a median of 9.21. A mean that far above the median indicates that a few large values are pulling the average up.
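To make the outlier idea concrete, here is a minimal sketch of the 1.5 × IQR rule that box plots conventionally use for their whiskers. The `amounts` array is invented for the example and does not come from the dataset.

```python
import numpy as np

# Hypothetical transaction amounts with one very large value.
amounts = np.array([1.0, 2.5, 3.0, 5.0, 9.0, 12.0, 20.0, 35.0, 900.0])

# Quartiles and interquartile range.
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = amounts[(amounts < lower) | (amounts > upper)]

mean = amounts.mean()
median = np.median(amounts)

print(outliers)
print(mean, median)
```

Just as in the real data, a single extreme value is enough to push the mean far above the median.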

After evaluating the balance, I computed the correlations between the variables and plotted them as a correlation matrix, hiding the upper triangle, which only mirrors the lower one.

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Create a mask for the upper triangle (diagonal included)
    corr = df.corr()
    mask = np.triu(np.ones_like(corr, dtype=bool))

    # Plot the correlation matrix, hiding the masked cells
    plt.figure(figsize=(10, 6))
    plt.title("Correlation Matrix")
    sns.heatmap(corr, mask=mask, cmap='Blues')
    plt.show()

The matrix showed little correlation between the variables, which I attribute to the class imbalance.

To feed the Logistic Regression model that we will build, this preparation stage consists of three steps:

Standardize the Time and Amount features, which are on a different order of magnitude from the other features.

Split the data into training and test sets.

Balance the data to avoid poor performance on class 1 and overfitting.

First, I standardized the data, as I have done before here on the blog. This technique puts all the features on the same scale; in this case I used the StandardScaler from the scikit-learn library.

    from sklearn.preprocessing import StandardScaler

    # Standardize the Time and Amount columns
    scaler = StandardScaler()
    X['Time'] = scaler.fit_transform(X[['Time']]).ravel()
    X['Amount'] = scaler.fit_transform(X[['Amount']]).ravel()
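What the scaler does can be checked by hand: it subtracts the column mean and divides by the column standard deviation. The sketch below compares the two on a made-up `amount` column (the values are assumptions chosen only to span several orders of magnitude).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical column spanning several orders of magnitude.
amount = np.array([[1.0], [10.0], [100.0], [1000.0]])

# Standardization with scikit-learn.
scaled = StandardScaler().fit_transform(amount)

# The same computation by hand: z = (x - mean) / std.
manual = (amount - amount.mean()) / amount.std()

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

After the transformation the column has mean 0 and standard deviation 1, so it no longer dominates features that live on a smaller scale.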

After that, I split the data into training and test sets, which made it possible to balance the training data using the RandomUnderSampler algorithm from the imblearn library, leaving 369 entries for each class.

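    from imblearn.under_sampling import RandomUnderSampler

    # Balance the data by undersampling the majority class
    rUS = RandomUnderSampler()
    # fit_resample replaces the older fit_sample method
    X_rus, y_rus = rUS.fit_resample(X_train, y_train)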
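The idea behind random undersampling can be sketched in plain NumPy: keep every row of the rarest class and a random sample of the same size from each other class. The helper `random_undersample` and the toy arrays below are hypothetical and written only for illustration; this is not how imblearn implements it internally.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Randomly drop rows so that every class is as small as the rarest one."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # For each class, pick n_min row indices without replacement.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == label), size=n_min, replace=False)
        for label in labels
    ])
    keep.sort()
    return X[keep], y[keep]

# Hypothetical imbalanced data: 95 normal rows and 5 fraudulent rows.
y_toy = np.array([0] * 95 + [1] * 5)
X_toy = np.arange(100).reshape(-1, 1)

X_bal, y_bal = random_undersample(X_toy, y_toy)
print(np.bincount(y_bal))
```

The price of this approach is that most majority-class rows are thrown away, which is acceptable here only because the majority class is so large.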

After balancing, I calculated the correlations again. They were much higher than before, which shows the importance of balancing the data when building a good model.

    # Create a mask for the upper triangle (diagonal included)
    # X_new_df is assumed to hold the balanced data as a DataFrame
    corr_balanced = X_new_df.corr()
    mask = np.triu(np.ones_like(corr_balanced, dtype=bool))

    # Plot the correlation matrix of the balanced data
    plt.figure(figsize=(10, 6))
    plt.title("Balanced Correlation Matrix")
    sns.heatmap(corr_balanced, mask=mask, cmap='Blues')
    plt.show()

With everything ready and a good exploratory analysis performed, we can build the Logistic Regression model. This model is based on the concept of likelihood, which lets it estimate the probability of an event occurring given a set of explanatory variables.

    from sklearn.linear_model import LogisticRegression

    # Build the logistic regression model
    model = LogisticRegression()
    model.fit(X_rus, y_rus)
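To show how a fitted logistic regression turns features into a probability, here is a minimal sketch of the prediction step. The weights, the bias and the input values are invented for the example; a real model learns them from the training data.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fraud_probability(features, weights, bias):
    """Linear combination of the features, squashed by the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients for a model with two features.
weights = [0.8, -1.2]
bias = -0.5

p = fraud_probability([2.0, 0.5], weights, bias)

# A transaction is flagged when the probability reaches the usual 0.5 threshold.
is_fraud = p >= 0.5
print(p, is_fraud)
```

In scikit-learn, `model.predict_proba` returns these probabilities and `model.predict` applies the threshold.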

With the model trained and the predictions made, I was able to analyze its performance. However, when the data are imbalanced, as they are here, accuracy is not a suitable metric. Instead, I built a confusion matrix and a classification report.
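The problem with accuracy is easy to reproduce on toy labels. In the sketch below, a hypothetical "model" that never predicts fraud still reaches 95% accuracy, while the confusion matrix and the recall reveal that it misses every fraudulent case. The labels are made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    recall_score,
)

# Hypothetical ground truth: 95 normal transactions and 5 frauds.
y_true = [0] * 95 + [1] * 5

# A naive model that labels everything as normal.
y_naive = [0] * 100

acc = accuracy_score(y_true, y_naive)
cm = confusion_matrix(y_true, y_naive)  # rows: true class, columns: predicted class
fraud_recall = recall_score(y_true, y_naive)

print(acc)
print(cm)
print(classification_report(y_true, y_naive, zero_division=0))
```

The second row of the confusion matrix shows all five frauds classified as normal, something the accuracy figure alone would never show.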

The results were very interesting, with few false negatives and false positives: 96% for true negatives and 86% for true positives.

Note that this is not a trivial problem. Even though it was done with clean, well-treated data made available by the course teacher, it was still necessary to balance and standardize the data, on top of the PCA, before building the model and obtaining excellent results for the objective in question. However, this is a didactic project, which had neither categorical variables nor missing data. That leaves room to work with the full Kaggle dataset. You can see the complete project and all the code in my repository.

If you liked it, share ;)
