Research Article

AI-Assisted Incident Management in SRE: The Role of LLMs and Anomaly Detection

Authors

  • Karthickram Vailraj, KLNCE-Anna University, India

Abstract

This article examines how artificial intelligence is changing incident management in site reliability engineering (SRE) teams, with particular emphasis on financial platforms where reliability is paramount. It explores how large language models (LLMs) and advanced anomaly detection systems affect the entire incident lifecycle, from initial detection through resolution and documentation. Integrating these technologies enables a shift from reactive to proactive operations, in which potential issues are identified and addressed before they manifest as service disruptions. The discussion covers both technical foundations and practical implementations, drawing on case studies from trading platforms, payment processing systems, and client-facing financial applications. It traces the evolution from rule-based monitoring to intelligent observability, the application of supervised and unsupervised learning techniques to anomaly detection, and the capabilities LLMs bring to alert correlation and root cause analysis. Alongside the substantial benefits of these technologies, the article addresses critical challenges, including limited explainability, false-positive management, security concerns, and the need for organizational adaptation. It concludes with emerging research directions, including multimodal AI approaches, reinforcement learning applications, and autonomous remediation systems, and presents a comprehensive view of how AI is reshaping incident management in mission-critical financial environments.

Article information

Journal

Journal of Computer Science and Technology Studies

Volume (Issue)

7 (6)

Pages

649-658

Published

2025-06-17

How to Cite

Karthickram Vailraj. (2025). AI-Assisted Incident Management in SRE: The Role of LLMs and Anomaly Detection. Journal of Computer Science and Technology Studies, 7(6), 649-658. https://doi.org/10.32996/jcsts.2025.7.75

Keywords

Site Reliability Engineering, Large Language Models, Anomaly Detection, Financial Systems, Incident Management