As today’s systems grow increasingly distributed, ephemeral, and complex, the ability to understand the intricacies of your organization’s bespoke system architecture comes with a learning curve—as do the systems that help you observe them. While New Relic allows you to collect, report, and alert on metrics, events, logs, and traces from any source of telemetry data in one integrated platform experience, your ability to understand the health of your system may still be limited by your proficiency with New Relic Query Language (NRQL), our proprietary query language. New Relic AI reduces this learning curve by helping you onboard, analyze, troubleshoot, and debug using natural language prompts and platform-wide integrations. This enables both engineers and non-technical stakeholders to gain greater insight into their telemetry data and make more decisions informed by the health of their digital environments.

New Relic AI’s ability to convert natural language questions into NRQL is derived from cutting-edge large language models (LLMs), such as OpenAI’s GPT-4 Turbo. While these LLMs are robust and trained on extensive web data, including New Relic documentation, they don’t work perfectly with NRQL out of the box and require additional instructions to perform as expected. In this blog post, you’ll learn about some of the techniques our engineering team used to optimize New Relic AI’s ability to generate NRQL queries from natural language inputs, a skill we refer to internally as “NL2NRQL.”

User-based context and few-shot prompting

Prompt engineering involves various strategies for crafting and enhancing prompts to effectively harness the capabilities of LLMs. LLMs are typically shaped in phases: in addition to being taught politeness and constructive responses, they require fine-tuning to promote specific behaviors and tasks. Although LLMs can understand NRQL, they sometimes mix its syntax with SQL, especially for complex queries, because far more material on SQL is available to LLMs on the internet than New Relic documentation provides on NRQL. For instance, the NRQL clause that limits a query to the past week of data is SINCE 1 week ago. But when asked to generate an NRQL clause, LLMs tend to default to SQL syntax, returning something like WHERE timestamp >= (CURRENT_DATE - interval '1 week').

With each prompt that’s sent to the LLM via the assistant, we use several prompt engineering techniques to tailor the LLM output to our needs and convert your questions into valid NRQL queries:

  1. We first narrow down the task by getting a list of all New Relic database (NRDB) events accessible in your account and provide these, along with your query, to the LLM. This generates a list of the most relevant NRDB events for the LLM to choose from. 
  2. We then retrieve these events' schema from NRDB and supplement them with information from New Relic documentation. By using only NRDB data that’s available to your user permissions, we ensure no context is shared with other users or accounts. This also helps the AI manage situations where you may ask about events or attributes that are user-defined and/or not provided by New Relic.
  3. Next, we send the LLM a prompt asking it to generate an NRQL query. The prompt includes the following:
    • A general task description. For example, "You are an AI that translates user questions into New Relic Query Language (NRQL) queries. Your task is to generate a query based on a user’s question, user information, event schema descriptions, and example prompts."
    • An overview of how NRQL syntax contrasts with SQL.
    • A detailed schema of events from your account that best aligns with your question.
    • Examples of other user questions and expected NRQL answers.
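
To make step 3 concrete, here is a minimal Python sketch of how the four prompt sections might be assembled into one string. It is an illustration only, not our production code: the names `Example`, `SYNTAX_NOTES`, and `build_prompt`, the section labels, and the sample schema are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Example:
    """One few-shot pair: a user question and the NRQL we expect for it."""
    question: str
    nrql: str


# Task description quoted from the list above.
TASK_DESCRIPTION = (
    "You are an AI that translates user questions into New Relic Query "
    "Language (NRQL) queries. Your task is to generate a query based on a "
    "user's question, user information, event schema descriptions, and "
    "example prompts."
)

# Hypothetical wording for the NRQL-versus-SQL overview.
SYNTAX_NOTES = (
    "Express time ranges with SINCE and UNTIL (for example, SINCE 1 week ago), "
    "not with SQL date arithmetic on a timestamp column."
)


def build_prompt(question: str, schema: Dict[str, List[str]],
                 examples: List[Example]) -> str:
    """Join the four prompt sections and the user question into one string."""
    schema_lines = [f"- {event}: {', '.join(attrs)}" for event, attrs in schema.items()]
    example_lines = [f"Q: {ex.question}\nNRQL: {ex.nrql}" for ex in examples]
    sections = [
        TASK_DESCRIPTION,
        "NRQL vs. SQL:\n" + SYNTAX_NOTES,
        "Event schema:\n" + "\n".join(schema_lines),
        "Examples:\n" + "\n\n".join(example_lines),
        "User question: " + question,
    ]
    return "\n\n".join(sections)


prompt = build_prompt(
    "How many transaction errors happened in the last day?",
    {"Transaction": ["duration", "appName"], "TransactionError": ["error.message"]},
    [Example("How many errors occurred this week?",
             "SELECT count(*) FROM TransactionError SINCE 1 week ago")],
)
```

Keeping each section as a separate string makes it easy to swap in a different schema or example set per request without touching the fixed instructions.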

The strategy of providing examples of similar questions to the LLM is known as few-shot prompting. We provide the LLM with samples of questions people might ask and the corresponding NRQL queries we expect it to output for such questions, as a kind of cheat sheet. The volume and quality of examples in the pool are crucial. The example pairs are stored in Pinecone, a vector database. When you ask a question that needs to be translated into NRQL, we transform it into an embedding vector (a numerical representation of the text) and then query Pinecone to retrieve the examples whose question embeddings are most similar to it.
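
The retrieval step can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: a word-count vector stands in for a learned embedding model, an in-memory list stands in for Pinecone, and the example pairs and function names (`embed`, `cosine`, `top_k_examples`) are invented for the sketch.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system uses a learned model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical example pool of (question, expected NRQL) pairs.
EXAMPLE_POOL: List[Tuple[str, str]] = [
    ("how many transactions failed today",
     "SELECT count(*) FROM TransactionError SINCE today"),
    ("average response time per app",
     "SELECT average(duration) FROM Transaction FACET appName"),
    ("slowest pages in the browser",
     "SELECT average(duration) FROM PageView FACET pageUrl"),
]


def top_k_examples(question: str, k: int = 2) -> List[Tuple[str, str]]:
    """Return the k pool entries whose questions are most similar to `question`."""
    query_vec = embed(question)
    ranked = sorted(EXAMPLE_POOL,
                    key=lambda pair: cosine(query_vec, embed(pair[0])),
                    reverse=True)
    return ranked[:k]


best = top_k_examples("average response time for each app", k=1)
```

The retrieved pairs are then placed in the examples section of the prompt. A vector database does the same ranking at scale, with precomputed embeddings and an approximate nearest-neighbor index instead of a full sort.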

Even with such preparations, the LLM can still make mistakes with NRQL syntax. To address this, we've implemented an additional validation service that checks the generated queries. If a query is syntactically incorrect, it triggers a feedback loop: the error message is fed back into the LLM along with snippets of NRQL syntax examples and examples of corrections of common mistakes. This typically results in a corrected NRQL query on the second attempt, which we then present to the user complete with visualized results. If the correction is unsuccessful, we share the query with the user along with a disclaimer that we were unable to generate a valid NRQL query, which helps ensure transparency. We monitor such cases and use them to improve our NL2NRQL pipeline.
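
The control flow of this feedback loop can be sketched as follows. The function `generate_with_repair` and its callback signatures are hypothetical, and `fake_llm` and `fake_validator` (including the error text) are stand-ins so the sketch runs on its own; the real validation service and LLM calls are not shown here.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures: generate(question, feedback) returns a query string;
# validate(query) returns an error message, or None when the query is valid.
Generate = Callable[[str, Optional[str]], str]
Validate = Callable[[str], Optional[str]]


def generate_with_repair(question: str, generate: Generate, validate: Validate,
                         max_attempts: int = 2) -> Tuple[str, bool]:
    """Return the last generated query and whether it passed validation."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback: Optional[str] = None
    query = ""
    for _ in range(max_attempts):
        query = generate(question, feedback)
        error = validate(query)
        if error is None:
            return query, True
        # Feed the failed query and the validator's error into the next attempt.
        feedback = f"Previous query: {query}\nValidation error: {error}"
    # Still invalid: the caller shows the query together with a disclaimer.
    return query, False


def fake_llm(question: str, feedback: Optional[str]) -> str:
    """Stand-in model: answers in SQL style first, in NRQL once given feedback."""
    if feedback is None:
        return ("SELECT count(*) FROM Transaction "
                "WHERE timestamp >= (CURRENT_DATE - interval '1 week')")
    return "SELECT count(*) FROM Transaction SINCE 1 week ago"


def fake_validator(query: str) -> Optional[str]:
    """Stand-in validator: rejects the SQL-style date expression."""
    if "CURRENT_DATE" in query:
        return "unknown identifier CURRENT_DATE"
    return None


result = generate_with_repair("How many transactions last week?",
                              fake_llm, fake_validator)
```

Returning the validity flag alongside the query lets the caller decide whether to render results or attach the disclaimer.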

Learnings, challenges, and limitations

Preventing syntax hallucinations

If you've used an LLM to translate natural language into a programming language, you may have encountered syntax hallucinations, where a code block is perfectly structured but some keyword, function, or attribute doesn't exist. Because New Relic AI uses an LLM to translate natural language into NRQL, it's susceptible to the same errors. To catch syntax hallucinations, we test each generated NRQL query by running it through our parser and compiler for syntax evaluation, then against NRDB to ensure it returns results. New Relic AI thus performs syntax validation continuously in real time and presents a query as valid only when it passes these checks. You can read more about our methods for evaluating the performance of New Relic AI in this blog post by New Relic Senior Data Scientist Tal Reisfeld.

Addressing question ambiguity

Mapping a question in a natural language to a question in a well-defined query language requires our AI to perform the challenging task of understanding precisely what’s being searched for. For example, if you ask, “How many errors have happened recently?” our AI assistant has to make assumptions around which types of errors you’re asking about (mobile, transaction, or perhaps browser?), as well as the most reasonable time frame that can be considered “recent.”

To address potential ambiguities in user questions, we employ a two-pronged approach. First, we provide contextual information about what part of the UI you’re using when you ask the question. This context includes details such as the entity you’re analyzing and the selected time picker values; if you’re actively editing an NRQL query, it also includes the query itself. Second, we’re expanding our few-shot example set to better demonstrate the expected behavior and to provide more comprehensive information about the schemas of New Relic-provided events and attributes. With this contextual information and the augmented examples, our language model better understands your intent and provides more accurate and relevant responses.
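
As a rough illustration of the first prong, the UI context could be rendered into a few prompt lines like this. The `UIContext` fields, the labels, and the sample values are assumptions chosen for the sketch, not the assistant's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UIContext:
    """Hypothetical snapshot of what the user is looking at when they ask."""
    entity_name: Optional[str] = None
    time_range: Optional[str] = None
    current_query: Optional[str] = None


def render_context(ctx: UIContext) -> str:
    """Render only the fields that are set, one labeled line per field."""
    lines = []
    if ctx.entity_name:
        lines.append(f"Entity being viewed: {ctx.entity_name}")
    if ctx.time_range:
        lines.append(f"Selected time range: {ctx.time_range}")
    if ctx.current_query:
        lines.append(f"Query being edited: {ctx.current_query}")
    return "\n".join(lines)


context_text = render_context(
    UIContext(entity_name="checkout-service", time_range="last 30 minutes")
)
```

With this context in the prompt, a question such as “How many errors have happened recently?” can be resolved against the entity and time range already on screen rather than guessed.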

Predicting unknown metadata

NRDB’s schemaless architecture is a fundamental differentiator of New Relic telemetry databases from competing offerings, giving you greater flexibility in how your data is stored. And because NRQL can read and retrieve custom fields, you benefit from the same flexibility when you later access and analyze this data. But because these custom events and attributes lack standard documentation, it’s far more difficult to estimate the shape and content of the data stored in them. In such cases, the AI relies on the descriptive nature of user-defined event and attribute names to write correct queries.

Balancing cost and complexity

The research and development of highly accurate NL2NRQL translation entailed significant investments of time and money. It was important for us to balance the complexity of the NL2NRQL pipeline against our budgeted time and financial resources.

Here are a few examples of decisions we made to balance performance and resource usage:

  • We limit the time range of all NRQL queries in the backend to ensure that we constrain resource consumption and reduce the time required to execute the queries.
  • For prompts where we need to summarize the NRDB output, we select the representation format (for example, JSON or CSV) that minimizes the number of tokens used.
  • In non-user-facing LLM calls, we instruct the AI model to provide short, highly structured responses. This approach ensures that the model's responses are faster, more cost-effective, and easier to parse.
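
The format-selection decision can be illustrated with a short Python sketch that serializes the same rows as JSON and CSV and keeps the smaller one. The four-characters-per-token estimate is a crude assumption standing in for the model's real tokenizer, and the function names and sample rows are hypothetical.

```python
import csv
import io
import json
from typing import Dict, List, Tuple

Rows = List[Dict[str, object]]


def to_json(rows: Rows) -> str:
    """Compact JSON without extra whitespace."""
    return json.dumps(rows, separators=(",", ":"))


def to_csv(rows: Rows) -> str:
    """CSV with one header row; assumes every row has the same keys."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def estimate_tokens(text: str) -> int:
    """Crude proxy: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def most_compact(rows: Rows) -> Tuple[str, str]:
    """Return (format name, serialized text) with the lowest token estimate."""
    candidates = {"json": to_json(rows), "csv": to_csv(rows)}
    name = min(candidates, key=lambda n: estimate_tokens(candidates[n]))
    return name, candidates[name]


rows = [{"appName": "checkout", "count": 12}, {"appName": "search", "count": 7}]
chosen_format, payload = most_compact(rows)
```

For flat, tabular results CSV tends to win because column names appear once instead of once per row. Nested results may not fit CSV at all, which is one reason to choose per prompt rather than fix a single format.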

Optimizing performance and resource usage is a continuous endeavor, and we persistently explore avenues to refine and enhance this process.

Ongoing improvements

Continuous testing of essential skills, including NL2NRQL translation, remains central to extending our AI assistant with new tools and features while keeping the end-user experience seamless. One key focus area for this stage of development is building more integrations that proactively surface full-stack insights through intuitive buttons and prompts embedded directly in user workflows. Recent examples include an "Explain this error" button that helps you understand stack traces of errors from errors inbox, and a similar experience that helps you diagnose and remediate error logs. With one click, these integrations send the AI assistant a curated prompt, along with specific context and guardrails, to streamline troubleshooting. These capability-specific tools are just one part of our compound AI system, which democratizes access to deep telemetry insights for more data-driven decisions across the organization.