SUSTAINABLE OPTIMIZATION OF CONSOLIDATED DATA PROCESSING ALGORITHMS BASED ON MACHINE LEARNING AND GENETIC ALGORITHMS

Vasyl Lyashkevych

Abstract


Background. The automation of analytical report generation in industrial companies is gaining strategic importance due to the variety of document formats, increasing data volumes, and growing requirements for the rapid production of multi-component analytical materials. Traditional ETL pipelines cannot cope with the complexity of modern information flows, especially when machine learning (ML), large language models (LLMs), and agent systems are integrated into the process. Owing to rapid progress in code generation and in the ability of autonomous agents to perform complex analytical procedures, the task of automatically constructing reporting pipelines is becoming increasingly promising and scientifically well-grounded.

Materials and Methods. An evolutionary model based on genetic algorithms (GA) is proposed for constructing algorithms that process consolidated reports. For report generation, each such algorithm defines a pipeline that builds a single visual component. The population is organized as a tensor, enabling the parallel evolution of a set of independent workflows. The operations are classified into four groups: ETL, ML, LLM, and VIS. The fitness function evaluates the effective length of the pipeline, the coverage of key operation types, and their structural consistency.
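
To make the encoding concrete, the following minimal sketch (Python; the operation names, the 10-14 target length, the scoring weights, and the consistency rule are illustrative assumptions, not the implementation reported in the paper) represents each pipeline as a fixed-length chromosome padded with NOP genes, stores the population as a tensor, and scores chromosomes by effective length, coverage of the four operation groups, and a simple structural-consistency check.

    import numpy as np

    # Operation alphabet: four groups (ETL, ML, LLM, VIS) plus NOP padding (names are hypothetical).
    OPS = ["NOP", "ETL_LOAD", "ETL_CLEAN", "ML_REGRESS", "ML_CLUSTER",
           "LLM_EMBED", "LLM_SUMMARIZE", "VIS_CHART"]
    GROUP = {op: op.split("_")[0] for op in OPS}          # maps an operation to its group
    PIPE_LEN, POP_SIZE = 16, 64                           # fixed chromosome length, population size

    rng = np.random.default_rng(0)
    # The population is a tensor: POP_SIZE independent pipelines of PIPE_LEN gene indices each.
    population = rng.integers(0, len(OPS), size=(POP_SIZE, PIPE_LEN))

    def fitness(chromosome):
        ops = [OPS[i] for i in chromosome]
        active = [op for op in ops if op != "NOP"]                    # the effective pipeline
        length_score = 1.0 - abs(len(active) - 12) / PIPE_LEN         # prefer ~10-14 active operations
        groups = {GROUP[op] for op in active}
        coverage_score = len(groups & {"ETL", "ML", "LLM", "VIS"}) / 4
        # One illustrative consistency rule: start with ETL, end with a VIS operation.
        consistent = bool(active) and GROUP[active[0]] == "ETL" and GROUP[active[-1]] == "VIS"
        return length_score + coverage_score + (1.0 if consistent else 0.0)

    scores = np.array([fitness(ch) for ch in population])             # evaluate the whole tensor

Keeping the chromosome length constant while scoring only the non-NOP genes lets standard GA operators act on a regular tensor, while the effective pipeline length remains free to evolve.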

Results and Discussion. Experimental results show that the GA rapidly evolves random, NOP-dominated structures into stable, logically consistent, and functional pipelines of 10-14 operations. The best chromosomes produced three full-fledged visual components: a predictive regression model, an embedding-based semantic clustering, and a categorical diagram. This evolutionary pattern confirms that combined pipelines can be built automatically and adaptively, and that their capabilities will grow with the increasing complexity of operations in the “meaning” space (the vector space of embeddings) and with the development of code generation and agent architectures.
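
A plain evolutionary loop over this representation, sketched below as a continuation of the previous fragment (tournament selection, one-point crossover, and uniform mutation are generic GA operators assumed here for illustration; the paper does not specify its exact operators), reproduces the qualitative pattern described above: an initially random, NOP-heavy population drifts toward longer, structurally consistent pipelines.

    def evolve(population, fitness, generations=50, p_mut=0.05, seed=1):
        rng = np.random.default_rng(seed)
        n, length = population.shape
        for _ in range(generations):
            scores = np.array([fitness(ch) for ch in population])
            # Tournament selection: each parent slot gets the fitter of two random candidates.
            pairs = rng.integers(0, n, size=(n, 2))
            winners = np.where(scores[pairs[:, 0]] >= scores[pairs[:, 1]], pairs[:, 0], pairs[:, 1])
            parents = population[winners]
            # One-point crossover between consecutive parents.
            children = parents.copy()
            cuts = rng.integers(1, length, size=n // 2)
            for i, c in enumerate(cuts):
                children[2 * i, c:] = parents[2 * i + 1, c:]
                children[2 * i + 1, c:] = parents[2 * i, c:]
            # Uniform mutation: occasionally replace a gene with a random operation.
            mask = rng.random(children.shape) < p_mut
            children[mask] = rng.integers(0, len(OPS), size=int(mask.sum()))
            population = children
        return population

    evolved = evolve(population, fitness)
    best = evolved[np.argmax([fitness(ch) for ch in evolved])]
    print([OPS[i] for i in best if OPS[i] != "NOP"])      # best evolved pipeline, NOP genes stripped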

Conclusion. The proposed model demonstrates an effective mechanism for the automated synthesis of multi-visual reports based on evolutionary pipelines. The method's prospects will grow with the development of AI agent systems and with the expansion of the set of operations available in the meaning space, paving the way toward a complete new-generation system of autonomous analytical reporting.

Keywords: consolidated data, sustainable optimization, machine learning, genetic algorithms, LLM, data analytics.






DOI: http://dx.doi.org/10.30970/eli.32.3
