DEVELOPMENT OF DATA MESH DATA PLATFORM WITH ML DOMAIN OF DATA ANALYSIS

M. Fostyak, L. Demkiv

Abstract


A data mesh model with three input domains, A, B, and C, is proposed. Each domain has its own operational data source. Each domain team builds a data product for its domain that contains only cleaned, processed, and selected data. The domain data products are then combined into an aggregate that holds comprehensive data about the entities in the system. Consumer-aligned data products are built on top of this aggregate: a marketing data product, a company performance data product, and an ML data product. The data mesh thus provides a decentralized, distributed data architecture for a project that delivers financial services to clients.
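As a rough illustration of the flow described above, the sketch below merges three domain data products into one aggregate and then projects a consumer-aligned view from it. All field names, client identifiers, and the helper functions `build_aggregate` and `consumer_view` are hypothetical and are not taken from the paper; the code only shows the shape of the composition, assuming every domain keys its records by a shared client identifier.

```python
# Hypothetical types: a data product maps a client id to its cleaned fields.
Record = dict[str, float]
DataProduct = dict[str, Record]


def build_aggregate(*domain_products: DataProduct) -> DataProduct:
    """Merge per-domain records on the shared client id."""
    aggregate: DataProduct = {}
    for product in domain_products:
        for client_id, fields in product.items():
            aggregate.setdefault(client_id, {}).update(fields)
    return aggregate


def consumer_view(aggregate: DataProduct, columns: list[str]) -> DataProduct:
    """Project the aggregate onto the columns one consumer needs."""
    return {
        client_id: {c: fields[c] for c in columns if c in fields}
        for client_id, fields in aggregate.items()
    }


# Illustrative records only; the real domains and columns are not listed here.
domain_a = {"c1": {"age_band": 3.0}, "c2": {"age_band": 2.0}}
domain_b = {"c1": {"net_income": 420.0}, "c2": {"net_income": -35.0}}
domain_c = {"c1": {"open_loans": 1.0}}

aggregate = build_aggregate(domain_a, domain_b, domain_c)
ml_product = consumer_view(aggregate, ["net_income", "open_loans"])
```

In this sketch each consumer-aligned product is a projection of the same aggregate, so the marketing, company performance, and ML products could differ only in the column list they request.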

This study analyzes in detail how the data products of Domain B and the ML domain are created and how they interact. The data product for Domain B is built from Open Banking API data, which provides real-time information on clients' daily transactions, consistent with the information shown on their bank statements. The data were categorized, aggregated, and anonymized, resulting in fifteen data columns across three sections: categorized expenditures, risky expenditures, and categorized revenues. In addition, two new columns were derived to represent the net difference between income and expenditures.
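The categorize, aggregate, and anonymize steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the merchant-to-category map, the column names, the salt, and the function `build_domain_b_product` are all hypothetical, and a salted hash is strictly pseudonymization rather than full anonymization. Only one derived net column is shown for brevity, whereas the study derives two.

```python
import hashlib
from collections import defaultdict

# Hypothetical category map; the real taxonomy is not given in the abstract.
CATEGORY_BY_MERCHANT = {
    "grocer": "expense_food",
    "casino": "risky_gambling",
    "employer": "revenue_salary",
}


def anonymize(client_id: str, salt: str = "demo-salt") -> str:
    """Replace the client id with a truncated, salted SHA-256 digest."""
    return hashlib.sha256((salt + client_id).encode("utf-8")).hexdigest()[:12]


def build_domain_b_product(transactions):
    """Categorize, aggregate per client, anonymize, and derive a net column."""
    totals = defaultdict(lambda: defaultdict(float))
    for tx in transactions:
        category = CATEGORY_BY_MERCHANT.get(tx["merchant"], "expense_other")
        totals[tx["client_id"]][category] += abs(tx["amount"])

    product = {}
    for client_id, columns in totals.items():
        row = dict(columns)
        income = sum(v for k, v in row.items() if k.startswith("revenue_"))
        spending = sum(v for k, v in row.items() if not k.startswith("revenue_"))
        # Derived column: income minus all (ordinary and risky) expenditures.
        row["net_income_minus_expenses"] = income - spending
        product[anonymize(client_id)] = row
    return product


# Illustrative transactions only.
transactions = [
    {"client_id": "alice", "merchant": "employer", "amount": 2000.0},
    {"client_id": "alice", "merchant": "grocer", "amount": -300.0},
    {"client_id": "alice", "merchant": "casino", "amount": -50.0},
]
product = build_domain_b_product(transactions)
```

The raw client identifier never appears in the output; downstream consumers see only the hashed key and the aggregated columns.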

The data analysis layer includes the ML model domain, in which data classification is implemented with several classifiers. The highest classification accuracy (0.98) and the highest ROC AUC (0.98) are achieved with the XGBoost (XGB) and Random Forest (RF) classifiers on data obtained from Domain B after balancing and augmentation with a Generative Adversarial Network (GAN). The classification results and Principal Component Analysis (PCA) confirm that the data product built from Domain B supports high classification accuracy. The classification results were analyzed in detail, and clients were segmented into groups by their probability of obtaining a loan. It is proposed to use the results of the ML data analysis to improve client classification accuracy, analyze financial credit risks, and determine the optimal interest rate.
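To make the ROC AUC metric and the probability-based segmentation concrete, the sketch below computes ROC AUC from its pairwise definition (the fraction of positive/negative pairs in which the positive example receives the higher score) and assigns clients to groups by predicted probability. The labels, scores, group names, and the 0.3 and 0.7 cut-offs are illustrative assumptions, not values from the study, and no classifier is trained here.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def segment(probability, low=0.3, high=0.7):
    """Assign a client to a group; the cut-offs are hypothetical."""
    if probability >= high:
        return "high"
    if probability >= low:
        return "medium"
    return "low"


# Illustrative labels (1 = loan granted) and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.92, 0.81, 0.40, 0.35, 0.22, 0.05]

auc = roc_auc(labels, scores)               # 8 of 9 pairs ranked correctly
groups = [segment(s) for s in scores]
```

The pairwise form is quadratic in the number of examples and is meant only to show what the metric measures; a library implementation would normally be used on real data.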

Keywords: data storage models, data mesh, data domains, classification, ML data analysis.






DOI: http://dx.doi.org/10.30970/eli.27.2
