Authors:
Roberto Saia
;
Salvatore Carta
;
Gianni Fenu
and
Livio Pompianu
Affiliation:
Department of Mathematics and Computer Science, University of Cagliari, Via Ospedale 72 - 09124 Cagliari, Italy
Keyword(s):
Business Intelligence, Decision Support System, Risk Assessment, Credit Scoring, Machine Learning.
Abstract:
The rating of users requesting financial services is a growing task, especially in this historical period of the COVID-19 pandemic characterized by a dramatic increase in online activities, mainly related to e-commerce. This kind of assessment is a task manually performed in the past that today needs to be carried out by automatic credit scoring systems, due to the enormous number of requests to process. It follows that such systems play a crucial role for financial operators, as their effectiveness is directly related to gains and losses of money. Despite the huge investments in terms of financial and human resources devoted to the development of such systems, the state-of-the-art solutions are transversally affected by some well-known problems that make the development of credit scoring systems a challenging task, mainly related to the unbalance and heterogeneity of the involved data, problems to which it adds the scarcity of public datasets. The Region-based Training Data Segmenta
tion (RTDS) strategy proposed in this work revolves around a divide-and-conquer approach, where the user classification depends on the results of several sub-classifications. In more detail, the training data is divided into regions that bound different users and features, which are used to train several classification models that will lead toward the final classification through a majority voting rule. Such a strategy relies on the consideration that the independent analysis of different users and features can lead to a more accurate classification than that offered by a single evaluation model trained on the entire dataset. The validation process carried out using three public real-world datasets with a different number of features, samples, and degree of data imbalance demonstrates the effectiveness of the proposed strategy, which outperforms the canonical training one in the context of all the datasets.
(More)