AI in observability: Advancing system monitoring and performance

As modern IT environments grow increasingly complex, maintaining system performance and reliability has become more challenging than ever. Traditional monitoring tools, while effective in simpler contexts, often fall short in providing the deep insights required to manage today’s distributed and AI-driven systems. This is where observability comes into play—offering a more comprehensive approach to understanding system behavior and improving its performance.

At its core, observability is about gaining actionable insights from the telemetry data (metrics, events, logs, and traces, or MELT) generated by applications and infrastructure. However, as the volume and complexity of this data increase, manual analysis becomes impractical. AI is emerging as a key enabler here, changing how organizations approach observability by enhancing system monitoring, predicting potential issues, and optimizing performance. Intelligent observability, the next stage of this evolution, puts AI at its core so that you can understand and proactively manage a complex IT environment.

Understanding observability in AI-driven systems

Observability provides a detailed view of your system's health and performance. It involves collecting and analyzing telemetry data, such as MELT, to understand not just what’s happening within a system, but why it’s happening. This deeper level of insight is crucial for identifying and resolving issues in real time, ensuring that systems perform optimally under various conditions.

AI-driven systems introduce additional layers of complexity to observability. These systems often involve intricate data pipelines, model training and inference processes, and dynamic scaling based on real-time data. Observability in this context must extend beyond traditional MELT data to include the specific behaviors and performance characteristics of AI components. For example, monitoring the performance of a machine learning (ML) model in production requires tracking metrics like inference latency, model accuracy, and resource utilization during inference. Logs might include details about data inputs, model versioning, and any exceptions encountered during the inference process. Traces can be crucial for understanding how data flows through various preprocessing steps before reaching the model, as well as how downstream services consume the model's output. However, teams must also be vigilant about potential issues like model drift, where a model’s accuracy degrades over time due to changing input data, and the performance of data pipelines that feed these models. Continuous monitoring of model accuracy and the efficiency of these pipelines ensures that AI systems remain reliable and performant, allowing teams to take proactive measures when issues arise.
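To make the idea of drift more concrete, here is a minimal sketch of one common way to quantify it: the population stability index (PSI), which compares the distribution of a feature at training time with its distribution in production. This is an illustration only, not how New Relic computes drift. The function name, the bin count, and the smoothing floor are assumptions made for the example.

```python
import math
from typing import Sequence


def population_stability_index(
    baseline: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
) -> float:
    """Compare two samples of one feature; larger values mean more drift."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant data

    def bucket_shares(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the baseline range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps the logarithm defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    expected = bucket_shares(baseline)
    actual = bucket_shares(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the right threshold depends on the feature and the model, so treat it as a starting point rather than a fixed rule.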

Tools like New Relic play a key role in addressing these challenges by providing advanced observability features that help detect and respond to issues such as model drift and data pipeline inefficiencies. The image below shows model drift and data drift for an ML model in New Relic.

Intelligent observability: How AI is revolutionizing observability

As we navigate through an era dominated by AI advancements, it's clear that AI is not only a driving force behind new applications and systems but also a transformative element in how we manage and monitor those systems. The complexity of modern IT environments, especially those infused with AI, has outpaced the capabilities of traditional observability practices. Here, AI itself becomes the solution, revolutionizing how observability is approached, implemented, and utilized in today's tech landscape. With AI built in, the observability platform itself becomes intelligent enough to keep up with ever-growing digital complexity.

Automated anomaly detection

AI significantly enhances the ability to detect anomalies by automatically analyzing vast amounts of telemetry data and identifying deviations from normal behavior. In traditional systems, anomaly detection might involve tracking metrics like CPU usage and triggering alerts when predefined thresholds are breached. AI goes a step further by learning what "normal" looks like in a dynamic environment and detecting subtle issues that might be missed by static thresholds. For instance, in cloud infrastructure, AI can identify an unusual spike in resource consumption that could indicate a potential scaling issue or a security breach, even if it doesn’t cross standard thresholds. Similarly, AI can monitor user behavior in a web application, detecting subtle changes that might signal a degradation in user experience before it becomes noticeable. This automated approach significantly reduces the mean time to detection (MTTD), enabling faster response times and minimizing system downtime.
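The difference between a static threshold and a learned baseline can be sketched in a few lines. The example below uses a rolling z-score, which is a deliberately simple stand-in for the ML models a real platform would use. The function name, window size, and threshold are assumptions chosen for illustration.

```python
import statistics
from collections import deque
from typing import Iterable, Iterator


def rolling_anomalies(
    values: Iterable[float],
    window: int = 30,
    z_threshold: float = 3.0,
) -> Iterator[tuple[int, float]]:
    """Yield (index, value) for points far from the recent rolling baseline."""
    recent: deque[float] = deque(maxlen=window)
    for i, value in enumerate(values):
        if len(recent) == window:
            mean = statistics.fmean(recent)
            stdev = statistics.pstdev(recent)
            # Skip flat windows, where the z-score is undefined.
            if stdev > 0 and abs(value - mean) / stdev > z_threshold:
                yield i, value
        recent.append(value)


# CPU hovers around 41% and then jumps to 55%: far below a static 90% alert
# threshold, but well outside what this series has recently looked like.
cpu = [40.0, 42.0] * 30 + [55.0]
print(list(rolling_anomalies(cpu)))
```

Because the baseline is recomputed from the most recent window, the same code adapts to a service whose normal load changes over time, which is the property static thresholds lack.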

Predictive analytics for preventive monitoring

AI doesn’t just help in detecting current issues; it also plays a crucial role in predicting future problems. Predictive analytics, powered by ML, can analyze trends in telemetry data to forecast potential system failures or performance bottlenecks before they occur. For instance, in a typical server environment, AI can predict potential disk space depletion based on current usage trends, allowing teams to address the issue before it causes downtime. In AI-driven systems, predictive analytics might forecast when an ML model will need retraining based on changes in data patterns or forecast network congestion during peak usage times. By anticipating these issues, teams can take preventive actions, such as scaling resources or adjusting configurations, to ensure continuous system performance and reliability.
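The disk space case is easy to illustrate. The sketch below fits a least-squares trend line to recent usage samples and extrapolates it to the disk's capacity. It assumes roughly linear growth, which real forecasting models do not need to assume, and the function and parameter names are hypothetical.

```python
from statistics import fmean
from typing import Optional, Sequence


def hours_until_full(
    hours: Sequence[float],
    used_gb: Sequence[float],
    capacity_gb: float,
) -> Optional[float]:
    """Extrapolate a least-squares trend line to the disk's capacity."""
    x_mean, y_mean = fmean(hours), fmean(used_gb)
    sxx = sum((x - x_mean) ** 2 for x in hours)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(hours, used_gb))
    if sxx == 0:
        return None  # all samples share one timestamp; no trend to fit
    slope = sxy / sxx
    if slope <= 0:
        return None  # usage is flat or shrinking; nothing to forecast
    intercept = y_mean - slope * x_mean
    full_at = (capacity_gb - intercept) / slope
    # Report time remaining relative to the latest sample.
    return max(full_at - hours[-1], 0.0)
```

With samples growing by 2 GB per hour from 100 GB and a 120 GB disk, the forecast after the fifth sample is about six hours, which is enough lead time to expand the volume or clean up before anything fails.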

Root cause analysis

When issues do arise, determining their root cause can be a complex and time-consuming process, especially in distributed systems with many interdependent components. Imagine an ecommerce application experiencing performance degradation during a sales event. Multiple alerts are triggered across different services: the web application shows increased latency, the database reports high query times, and the payment gateway logs numerous timeouts. In traditional settings, engineers would manually examine logs, metrics, and traces from each service to identify the problem, which can be time-consuming and error-prone.

Intelligent observability tools enhance this process by employing AI-driven data correlation techniques that automatically analyze and correlate data from multiple sources, helping to surface the most likely root causes. In the ecommerce example, the spike in latency might be correlated with a recent deployment that altered database query patterns, leading to increased load and timeouts. By automatically linking related alerts and identifying significant changes in system behavior, observability tools can reduce the mean time to resolution (MTTR), whether the root cause lies in infrastructure, application logic, or external dependencies.
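One of the simplest correlation signals is time: which change happened shortly before the alerts began? The sketch below ranks change events, such as deployments, by how many alerts started within a lookback window after them. It is a toy heuristic rather than a description of any product's algorithm, and the `Event` type, function name, and 30-minute window are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class Event:
    service: str
    at: datetime
    description: str


def rank_suspect_changes(
    alerts: Sequence[Event],
    changes: Sequence[Event],
    lookback: timedelta = timedelta(minutes=30),
) -> list[Event]:
    """Rank change events by how many alerts started shortly after them."""
    scored = []
    for change in changes:
        hits = sum(
            1 for alert in alerts
            if timedelta(0) <= alert.at - change.at <= lookback
        )
        if hits:
            scored.append((hits, change))
    # Most alerts first; ties go to the more recent change.
    scored.sort(key=lambda pair: (pair[0], pair[1].at), reverse=True)
    return [change for _, change in scored]
```

A real system would combine this with other evidence, such as service dependencies and which queries or endpoints actually changed, but even this crude ranking would put the database-altering deployment at the top of the list in the scenario above.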

Alerting correlation and noise reduction

In complex IT environments, a single issue can trigger multiple alerts across various components, leading to "alert fatigue" where critical signals are buried in a flood of notifications. Consider a scenario in a microservices-based application during a peak traffic event. Multiple alerts start triggering across different services: abnormal CPU usage, high memory consumption, and increased error rates in the database. On their own, each of these alerts could indicate a potential issue, but when they occur simultaneously, they are often symptoms of a single underlying problem—such as a database bottleneck caused by a sudden surge in requests.

By using alert correlation techniques, these individual alerts can be grouped into a single incident, reflecting the broader issue rather than treating each symptom as an isolated problem. Modern observability practices can enhance this process by automatically correlating alerts based on patterns in the data, such as shared infrastructure components, timing, or similar error messages. This approach not only reduces alert noise but also provides a more coherent view of what’s happening in the system, which in turn reduces MTTR.
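As a rough illustration of grouping by shared component and timing, the sketch below folds alerts into incidents when they point at the same component and fire within a few minutes of each other. Production correlation engines use far richer signals. The `Alert` fields, the function name, and the five-minute window are assumptions for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass(frozen=True)
class Alert:
    name: str
    component: str  # shared infrastructure the alert points at
    at: datetime


def group_into_incidents(
    alerts: Iterable[Alert],
    window: timedelta = timedelta(minutes=5),
) -> list[list[Alert]]:
    """Group alerts that share a component and fire close together in time."""
    incidents: list[list[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.at):
        for incident in incidents:
            last = incident[-1]
            if last.component == alert.component and alert.at - last.at <= window:
                incident.append(alert)
                break
        else:
            # No open incident matched, so this alert starts a new one.
            incidents.append([alert])
    return incidents
```

In the peak traffic scenario, the CPU, memory, and error-rate alerts that all trace back to the same database would collapse into one incident, so the on-call engineer sees one notification instead of three.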

The image below shows multiple failure incidents that were monitored and reported, correlated across multiple locations, in New Relic:

Leveraging New Relic AI features for advanced observability

As AI continues to transform observability, New Relic has integrated several advanced AI-driven capabilities into its platform to help organizations better manage and monitor their complex systems.

New Relic AI monitoring

New Relic AI monitoring is specifically designed for AI applications that use large language models (LLMs) and similar advanced models. This tool provides comprehensive observability across the entire AI stack—from infrastructure and data processing to the models themselves. Engineers can monitor key metrics such as response times, token usage, and error rates for LLMs, ensuring these models perform optimally. For example, engineers can use AI Monitoring to track how efficiently their LLMs are processing requests, identify performance bottlenecks, and manage the cost implications of using these models.
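To show what these metrics look like when rolled up, here is a small sketch that summarizes a batch of LLM calls into average response time, error rate, token usage, and an estimated cost. It does not use the New Relic API. The `LlmCall` record, the function name, and the single flat price per 1,000 tokens are all simplifying assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LlmCall:
    duration_ms: float
    prompt_tokens: int
    completion_tokens: int
    ok: bool


def summarize_llm_calls(
    calls: Sequence[LlmCall],
    usd_per_1k_tokens: float,  # hypothetical flat rate; real pricing varies
) -> dict[str, float]:
    """Roll a non-empty batch of LLM calls up into headline metrics."""
    total_tokens = sum(c.prompt_tokens + c.completion_tokens for c in calls)
    return {
        "avg_response_ms": sum(c.duration_ms for c in calls) / len(calls),
        "error_rate": sum(1 for c in calls if not c.ok) / len(calls),
        "total_tokens": float(total_tokens),
        "estimated_cost_usd": total_tokens / 1000 * usd_per_1k_tokens,
    }
```

Tracking these figures per model or per endpoint makes it easier to spot which prompts are slow, which are failing, and which are driving the bill.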

The image below shows the full trace view of an AI chatbot transaction in New Relic.

New Relic AI

New Relic AI is the first generative AI assistant for observability designed to make observability more accessible and efficient. One of its standout features is the ability to convert everyday language queries into New Relic Query Language (NRQL). This allows users to fetch insights from their data without needing to write complex queries. For instance, a user could ask the AI to "show the average response time for the last 24 hours," and the system would automatically translate that into the appropriate NRQL query, delivering the results in seconds. It also provides quick explanations for errors, helps generate synthetic checks that simulate user interactions so that your monitoring aligns with real-world user behavior, and offers context-specific recommendations for optimizing performance. For example, you can ask New Relic AI "What's hot?" and it will provide an overview of current issues along with actionable explanations to speed up troubleshooting. These features empower teams to resolve issues faster and proactively manage their systems.
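For a sense of what such a translation could produce, the sketch below builds the kind of NRQL query that the "average response time for the last 24 hours" request might map to. The query New Relic AI actually generates may differ. The helper function is hypothetical, and it assumes response time is read from the `duration` attribute of `Transaction` events.

```python
def average_duration_nrql(
    event_type: str = "Transaction",
    since: str = "24 hours ago",
) -> str:
    """Build an NRQL query for average duration over a time range."""
    return f"SELECT average(duration) FROM {event_type} SINCE {since}"


print(average_duration_nrql())
# SELECT average(duration) FROM Transaction SINCE 24 hours ago
```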

The video shows how you can use New Relic AI to get insights from heaps of telemetry data using everyday language.

Machine learning operations (MLOps)

MLOps in New Relic focuses on the lifecycle management of custom machine learning models in production. It provides monitoring and diagnostic tools that help track model performance, detect data drift, and ensure models are functioning as expected in real-world conditions. It also lets data teams collaborate directly with DevOps teams, creating a continuous process of development, testing, and operational monitoring.

Artificial intelligence for IT operations (AIOps)

AIOps tools leverage ML to manage and reduce alert noise, automatically correlating related incidents to help teams focus on the most critical issues. These tools enhance incident management by prioritizing alerts that are most likely to indicate significant problems, allowing teams to respond more effectively and reduce downtime. In environments with high volumes of telemetry data, AIOps helps cut through the noise, ensuring that engineers can quickly identify and address the root causes of incidents.

These AI-driven features from New Relic are integral to modern observability practices, allowing organizations to effectively manage the complexities of today’s IT environments. By incorporating these tools, teams can enhance their ability to monitor, diagnose, and optimize their systems, ensuring that they remain robust and reliable even as they scale.

Conclusion

As AI continues to evolve, it plays an increasingly vital role in transforming observability practices. Traditional monitoring methods are no longer sufficient to manage the complexity and scale of modern IT environments, particularly those driven by distributed systems and AI applications. But at the same time, AI can be leveraged to gain deeper insights from your telemetry data.

The New Relic intelligent observability suite of AI-driven tools, including AIOps, New Relic AI, and AI monitoring, empowers organizations to maintain high-performance systems while efficiently managing the complexities of AI and modern infrastructure. By integrating these advanced capabilities, teams can ensure their systems remain reliable, scalable, and optimized for performance.

Next steps

Not already using New Relic intelligent observability? Sign up with New Relic for free and explore how modern observability tools can help you ensure that your systems remain robust and efficient.

Learn how to monitor and optimize AI applications, especially those using LLMs, with AI monitoring.
Discover how New Relic AI helps troubleshoot and manage observability data using generative AI.

By Mehreen Tahir, Software Engineer

Mehreen Tahir is a software engineer and technical writer at New Relic.

The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve or endorse the information, views or products available on such sites.
