Hallucination-Reduced and Robust Accuracy Unit Test Generation

Metin Deder
Simay Sahin
Merve Yilmazer
Mehmet Karaköse

Abstract

Automated unit testing in software development is crucial for reducing costs, improving product quality, and ensuring system reliability. Although current Large Language Models (LLMs) are highly successful at general-purpose code generation, they can fall short in preserving structural integrity and producing executable code in industrial settings such as C++ and ROS 2, where memory management is critical and external dependencies are common. Unlike existing studies in the literature, which focus mainly on high-level languages, the proposed method also targets industrial embedded system architectures. It aims to generate high-accuracy unit tests with a reduced hallucination rate for systems that have no existing test coverage, and to extend the tests of systems that already have coverage by following the developers' existing test logic. The 7-billion-parameter Qwen 2.5 Coder model, which has recently distinguished itself in code generation, was selected as the base model. A multilingual dataset of over 13,000 unique code-test pairs was created to reduce the model's computational costs and improve test code generation speed. The model was fine-tuned using QLoRA (Quantized Low-Rank Adaptation), a parameter-efficient LLM fine-tuning method. The proposed method generates test code approximately 4 times faster than existing cloud-based approaches, saving time and increasing efficiency. Furthermore, unlike the functionality-focused black-box testing and raw text-based approaches in the literature, the method uses Abstract Syntax Trees (AST) to give the model an understanding of the project context, and it significantly reduces hallucination by applying white-box and structural testing principles that examine the internal structure and dependencies of the source code.
The proposed method addresses the limitations of using large language models to generate unit test code and highlights the key considerations in producing effective unit test code for industrial applications.
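
The AST-based context idea can be illustrated with a minimal sketch. The abstract does not describe the authors' pipeline, so the following is an assumption throughout: it uses Python's standard `ast` module on Python source, whereas the paper's C++ and ROS 2 targets would need a C++-capable parser. The function name `summarize_context` and the `SAMPLE` snippet are hypothetical. The sketch collects the imports, function signatures, and directly called names that actually exist in the code, which is the kind of structural summary that could be given to a model so that generated tests refer only to real symbols.

```python
import ast

def summarize_context(source: str) -> dict:
    """Collect imports, function signatures, and called names from source.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    tree = ast.parse(source)
    imports, functions = [], []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            imports.append(node.module or "")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Names called directly as plain identifiers inside this function.
            calls = sorted({
                n.func.id
                for n in ast.walk(node)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            })
            functions.append({
                "name": node.name,
                "params": [a.arg for a in node.args.args],
                "calls": calls,
            })
    return {"imports": sorted(set(imports)), "functions": functions}

# Hypothetical code under test, used only to exercise the sketch.
SAMPLE = '''
import math
from os import path

def area(radius):
    return math.pi * square(radius)

def square(x):
    return x * x
'''

if __name__ == "__main__":
    print(summarize_context(SAMPLE))
```

A summary like this is white-box information: it exposes internal dependencies (here, that `area` calls `square`) rather than only the observable behavior of the code.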

Article Details

Section

Regular Paper

How to Cite

Hallucination-Reduced and Robust Accuracy Unit Test Generation. (2026). International Journal of Management and Data Analytics (IJMADA), 6(1), 191-205. https://ijmada.com/index.php/ijmada/article/view/125

