Cybercrime cases detected by Confusion Matrix

KARTHICK P
5 min read · Jun 7, 2021

What is a Confusion Matrix?

A confusion matrix is a summary comparing the predicted results with the actual results in any classification problem. This comparison is essential for determining the performance of a model after it has been trained on some training data.

  • Positive (P): the predicted result is positive (example: the image is a cat)
  • Negative (N): the predicted result is negative (example: the image is not a cat)
  • True Positive (TP): the predicted value and the actual value are both 1 (True)
  • True Negative (TN): the predicted value and the actual value are both 0 (False)

  • False Negative (FN): the predicted value is 0 (negative) but the actual value is 1. The two values do not match, hence it is a False Negative.
  • False Positive (FP): the predicted value is 1 (positive) but the actual value is 0. Again the two values do not match, hence it is a False Positive.
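To make these four components concrete, here is a minimal sketch (with made-up labels) showing how they can be counted with scikit-learn; the arrays are purely illustrative:

```python
# A minimal sketch showing how TP, TN, FP and FN fall out of a
# comparison between actual and predicted values (toy labels).
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = cat, 0 = not a cat
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```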

Accuracy and Components of Confusion Matrix

After the confusion matrix is created and all its component values are determined, it becomes quite easy to calculate the accuracy. So, let us look at the components to understand this better.

Classification Accuracy

Classification accuracy is calculated as: Accuracy = (TP + TN) / (TP + TN + FP + FN). The sum of TP (True Positive) and TN (True Negative) gives the correctly predicted results, so to get the accuracy as a percentage we divide it by the sum of all four components. However, accuracy has some problems and we cannot depend on it completely.
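As a quick sketch, here is the same formula in code with made-up counts (the numbers are illustrative, not from any real model):

```python
# Classification accuracy from the four components (toy counts, purely illustrative).
tp, tn, fp, fn = 3, 3, 1, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy: {accuracy:.0%}")   # 75%
```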

Let us consider that our dataset is heavily imbalanced, say 98% of the samples belong to one class and only 2% to the other. In this scenario, 98% accuracy can be good or bad based on the problem statement, because a model that always predicts the majority class already reaches 98% accuracy while missing every minority case, as the sketch below illustrates. Hence we have some more key terms which help us be sure about the accuracy we calculate; they are described after the sketch.
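Here is a small illustrative sketch of that imbalance problem, assuming 98 benign samples and 2 attacks; the numbers are made up for the example:

```python
# Illustrative imbalanced case: 98% of traffic is benign (0), 2% is an attack (1).
# A "model" that always predicts benign still scores 98% accuracy,
# yet it misses every single attack (recall for the attack class is 0).
from sklearn.metrics import accuracy_score, recall_score

y_actual    = [0] * 98 + [1] * 2
y_predicted = [0] * 100            # always predict "no attack"

print(accuracy_score(y_actual, y_predicted))   # 0.98
print(recall_score(y_actual, y_predicted))     # 0.0
```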

Type I error (False Positive):

This type of error is not very dangerous: the system is actually safe, but the model predicts an attack. The team gets notified, checks for malicious activity, finds nothing, and no harm is done. These cases can be termed false alarms.

Type II error (False Negative):

This type of error can prove to be very dangerous. The model predicts no attack, but in reality an attack takes place; no notification reaches the security team and nothing can be done to prevent it. The False Negative cases described above fall into this category, and one of the aims of the model is to minimize this value.
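As a rough sketch, both error rates can be read straight off the confusion matrix; the arrays below are made up for illustration, with 1 meaning "attack" and 0 meaning normal traffic:

```python
# False-alarm rate (Type I) and missed-attack rate (Type II) from the
# confusion matrix of an illustrative intrusion-detection model.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]   # 1 = attack, 0 = normal traffic
y_predicted = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
false_alarm_rate   = fp / (fp + tn)   # Type I error rate: harmless but noisy
missed_attack_rate = fn / (fn + tp)   # Type II error rate: the one we most want to minimize
print(false_alarm_rate, missed_attack_rate)
```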

Confusion Matrix in Cybersecurity

While exploring the confusion matrix I came across a research paper by Danique Sessink. The research aims to detect ICT involvement in criminal court cases and to classify these cases based on certain features, and it uses a confusion matrix to evaluate the proposed model. The exact outputs reported in the paper are:

The top ten features were extracted per class. The results are shown in Table 5. The features make sense, as these words are often associated with these sorts of court cases.

Top ten features per class in the paper's model

The confusion matrix obtained from the classifier is depicted in Figure 2. It is shown in normalized form since the classes are imbalanced; the darker the blue, the better the classifier is at predicting cases for that class. It is also clear where the classifier gets ‘confused’. The ‘identity theft’ class does not seem to do well, and there is a good reason for this: while reading court cases, the author discovered that ‘platform fraud’ is linked to ‘identity theft’, as stolen identities are often used to commit platform fraud. The confusion matrix shows that ‘identity theft’ is indeed often predicted as ‘platform fraud’.
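For anyone who wants to reproduce a plot of this kind, here is a generic sketch using scikit-learn's ConfusionMatrixDisplay with row normalization; the class labels below are placeholders I made up, not the paper's actual data:

```python
# Plot a row-normalized confusion matrix, useful when classes are imbalanced.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Placeholder labels standing in for a few of the paper's classes.
y_actual    = ["platform fraud", "identity theft", "child pornography",
               "identity theft", "platform fraud", "child pornography"]
y_predicted = ["platform fraud", "platform fraud", "child pornography",
               "identity theft", "platform fraud", "child pornography"]

ConfusionMatrixDisplay.from_predictions(
    y_actual, y_predicted,
    normalize="true",        # normalize each row of the matrix
    cmap="Blues",            # darker blue = higher fraction of correct predictions
    xticks_rotation=45,
)
plt.tight_layout()
plt.show()
```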

The weighted f1_score came out to 0.76, which means a criminal court case label can be predicted with an accuracy of roughly 76%; about 24% of all criminal court cases get misclassified as another class. However, since this figure is the weighted average of the per-class f1_scores, it is better to look at the score per class, as some classes perform better than others. The f1_score per class is shown in Table 6. The confusion matrix in Figure 2 clearly indicates into which classes the labels are misclassified, as well as the percentage per class; the per-class accuracies can also be read from the diagonal of the confusion matrix. It appears ‘child pornography’ can be determined with high accuracy.
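As a sketch of how such figures are typically computed, scikit-learn gives both the weighted average and the per-class breakdown; the labels below are placeholders, not the paper's data:

```python
# Weighted-average F1 versus per-class scores (placeholder labels, not the paper's data).
from sklearn.metrics import classification_report, f1_score

y_actual    = ["fraud", "theft", "fraud", "theft", "fraud", "fraud"]
y_predicted = ["fraud", "fraud", "fraud", "theft", "fraud", "theft"]

# A single weighted figure, comparable in spirit to the 0.76 reported above...
print(f1_score(y_actual, y_predicted, average="weighted"))
# ...and the per-class breakdown, which shows which labels actually perform well.
print(classification_report(y_actual, y_predicted, zero_division=0))
```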

Conclusion

A confusion matrix is a powerful tool for predictive analysis, enabling you to visualize predicted values against actual values. It takes some time to get used to interpreting a confusion matrix, but once you have done so it will become an important part of your toolkit.

Thank you for reading the article!
