Methodology to Identify Issues and Improve the Robustness of AI Agents


Aleksandr Meshkov

Abstract

AI agents based on large language models (LLMs) are becoming a key tool for automating complex tasks. Unlike general-purpose LLMs that simply generate text, modern agents can independently plan actions, call external tools and APIs, work with knowledge sources, and make decisions based on multi-stage analysis of a situation. As such systems grow more complex, however, ensuring their robustness becomes a critical problem. This work presents a systematic approach to identifying and classifying robustness problems in AI agents. The proposed taxonomy describes nine common problems that can arise during task execution in an AI agent. For practical use, a comprehensive evaluation methodology is proposed, including metamorphic testing to assess resistance to changes in input data, verification of how the agent works with information sources, analysis of the flow of tasks inside the agent, monitoring of tool usage, and assessment of the quality of final results. The methodology defines specific metrics with success criteria and approaches to implementing them. It is shown that the proposed system covers all identified error categories and makes it possible to evaluate the robustness of AI agents not only at the level of individual components, but also at the level of their interactions and of the system as a whole.
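The metamorphic-testing idea mentioned above can be illustrated with a minimal sketch: apply a semantics-preserving perturbation to each input, run the agent on both variants, and report the fraction of inputs whose outputs remain consistent. The names `run_agent`, `perturb`, and `metamorphic_score` are illustrative assumptions, not part of the paper's methodology; a real `run_agent` would wrap an LLM-based agent rather than the toy stand-in used here.

```python
# Illustrative sketch of a metamorphic robustness check for an AI agent.
# `run_agent` is a placeholder; a real implementation would invoke an
# LLM-based agent and may need a semantic (not lexical) similarity metric.

import difflib


def run_agent(prompt: str) -> str:
    # Toy stand-in for an agent call, used only to make the sketch runnable.
    return f"Answer to: {prompt.strip().lower()}"


def perturb(prompt: str) -> str:
    # Semantics-preserving perturbation: extra whitespace and casing changes.
    return "  " + prompt.upper() + "  "


def consistency(a: str, b: str) -> float:
    # Rough output-similarity score in [0, 1] based on character overlap.
    return difflib.SequenceMatcher(None, a, b).ratio()


def metamorphic_score(prompts: list[str], threshold: float = 0.9) -> float:
    # Fraction of prompts whose perturbed variant yields a consistent answer.
    passed = sum(
        consistency(run_agent(p), run_agent(perturb(p))) >= threshold
        for p in prompts
    )
    return passed / len(prompts)


score = metamorphic_score([
    "What is the capital of France?",
    "List three prime numbers",
])
print(score)
```

In practice the perturbations would include paraphrases and reorderings, and the success criterion would be a threshold on a semantic-similarity metric rather than raw character overlap; the structure of the check, however, stays the same.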

Article Details

Section

Regular Paper

How to Cite

Methodology to Identify Issues and Improve the Robustness of AI Agents. (2026). International Journal of Management and Data Analytics (IJMADA), 6(1), 1–28. http://ijmada.com/index.php/ijmada/article/view/108

