As artificial intelligence (AI) continues to evolve, chatbots have become indispensable tools for businesses, providing real-time assistance, personalizing recommendations, and automating customer service across various platforms. As these chatbots handle increasingly complex tasks, such as answering questions, retrieving specific information, and guiding users through multi-step processes, their performance becomes crucial.

If your chatbot is prone to errors or lags, it can frustrate your users, leading to poor experiences and potential loss of business. By using an observability tool like New Relic AI monitoring, you can observe key metrics such as response time, token usage, and error rates to ensure your chatbot performs optimally and delivers a smooth and efficient experience for your users.

In this guide, you’ll learn how to do exactly that! Here are the steps you’ll take to monitor and optimize the performance of a chatbot using New Relic AI monitoring:

  • Set up a chatbot application using OpenAI
  • Integrate New Relic for real-time monitoring of your chatbot
  • Track key performance metrics
  • Identify and resolve performance issues using New Relic

Setting up the environment

For this tutorial, you’ll use a demo application called Relicstraurants, a customer service chatbot that’s specifically designed to interact with a provided dataset so it can provide tailored responses based on user queries. The chatbot is built using Flask and OpenAI’s GPT-4o model.

Requirements:

Before getting started, ensure you have the following:

  • Python 3 with pip
  • Git
  • An OpenAI API key
  • A New Relic account

With these prerequisites in place, you can now proceed to set up the Relicstraurants chatbot.

  1. Clone the Relicstraurants chatbot application to your local environment.
git clone https://github.com/mehreentahir16/relicstraurants_chatbot.git
cd relicstraurants_chatbot
  2. Set up a virtual environment and install the required packages.
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
  3. Before running the application, export your OpenAI API key as an environment variable.
export OPENAI_API_KEY=your-openai-api-key
  4. Start the chatbot by running the following command.
python3 app.py

Once the application is running, you should see output similar to the following in your terminal:

* Serving Flask app 'app'
* Debug mode: on
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on http://127.0.0.1:5000
Press CTRL+C to quit
  5. Open http://127.0.0.1:5000/ in your web browser and start chatting!
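
If the application fails at startup with an authentication error, a likely cause is that the key was not exported in the shell session that launched it. The following is a minimal preflight sketch, not part of the Relicstraurants repository; the helper name `describe_api_key` is hypothetical. It reports whether the variable is visible to Python without printing the secret itself.

```python
import os


def describe_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return a short status string for the API key without revealing it."""
    value = env.get(name, "").strip()
    if not value:
        return f"{name} is not set; export it in this shell before starting the app."
    # Report only the length so the secret never lands in logs or terminal history.
    return f"{name} is set ({len(value)} characters)."


if __name__ == "__main__":
    print(describe_api_key())
```

Run it from the same terminal in which you exported the key, because environment variables set with export are only visible to processes started from that shell.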

With Relicstraurants up and running, you can explore its functionality and see how it leverages custom data to deliver tailored responses to users.

Integrating New Relic for real-time monitoring

With your Relicstraurants chatbot up and running, the next step is to install a New Relic agent for real-time monitoring. This will enable you to observe key performance metrics, identify potential issues, and optimize your chatbot.

  1. Log in to your New Relic account and click on Integrations and Agents from the left-hand navigation menu.
  2. From the listed capabilities, select Python.
  3. On the next screen, choose On a host (package manager) as your instrumentation method.
  4. Next, enter your credentials.
  5. Now follow the on-screen instructions to install the agent.
  • If you haven't already, install the New Relic package by running the following command:
pip install newrelic
  • Ensure that the Analyze your AI application responses feature is toggled on. This will allow you to monitor your chatbot's performance in handling AI-related queries.
  • Download the newrelic.ini configuration file and add it to your project’s root directory. 
  6. Once your configuration file is in place, add the following lines to the top of your app.py file to initialize the New Relic agent:
import newrelic.agent
newrelic.agent.initialize('newrelic.ini')
  7. Now restart your application using the following command:
NEW_RELIC_CONFIG_FILE=newrelic.ini newrelic-admin run-program python3 app.py

Once your application is running, start chatting with your chatbot to generate some traffic. The next step is to understand the performance metrics that New Relic provides. These metrics will help you monitor the health of your application and identify potential performance issues.

Tracking key performance metrics

Since New Relic AI monitoring is embedded within the Python APM agent, you get the best of both worlds—detailed insights into both your application and the AI-driven responses. This dual-layered monitoring ensures that while you’re keeping an eye on general app health, you’re also tracking how well the AI itself is performing. Let’s take a look at the key metrics displayed in New Relic.

  1. Navigate to your New Relic account, and choose your application from the APM & Services section.
  2. On the next screen, you’ll see a dashboard that presents essential performance metrics related to your application, such as Web transaction time, Apdex score, and Throughput.
  3. As you scroll down the dashboard, you’ll find the AI Responses summary section. This is where you can find the AI-specific metrics, such as Response time, Token usage per response, and Error rate, that give you a bird's-eye view of how well your chatbot is performing.

These metrics are essential for identifying bottlenecks specific to your AI component. By monitoring these, you can quickly detect if your chatbot is taking too long to respond, consuming more tokens than expected, or facing errors in certain types of interactions.

Identifying and resolving performance issues

It’s time to dig deeper into specific issues affecting your chatbot’s performance. For instance, the above screenshot shows a significantly high response time and error rate. These are clear indicators that performance bottlenecks exist. Let’s take a closer look.

Response time: Identifying slow transactions

  1. On the application summary page, you see the 5 slowest transactions (by total time). Select the slowest transaction.

In the following screenshot, you can see the span waterfall of the transaction along with a pie chart that shows which components consumed the most time during the transaction.

The trace duration is 29.77s, which is quite high for a chatbot application. You can also see that the API call to OpenAI (api.openai.com - httpx/POST) took up 99.92% of the total response time, making it the primary reason for the overall slow response time.

You can also choose to Focus on Slow spans as shown in the following image:

In this case, since the latency is introduced by an external API call, one way to reduce it is to cache commonly requested responses. If your chatbot frequently answers similar questions (for example, restaurant suggestions), caching these responses prevents the system from making repeated calls to OpenAI for the same data and consuming additional tokens. Another possible solution, depending on your use case, is to access OpenAI through the provisioned throughput unit (PTU) option.
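
To make the caching idea concrete, here is a minimal sketch of an in-memory cache with a time-to-live. It is an illustration under assumptions, not code from the Relicstraurants repository: the `ResponseCache` class and the `fake_llm` stand-in are hypothetical, and a production setup would more likely use a shared store and a smarter notion of "similar question" than simple normalization.

```python
import time


class ResponseCache:
    """Tiny in-memory cache with a time-to-live, keyed by normalized question."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._store = {}  # normalized question -> (expiry time, answer)

    @staticmethod
    def _normalize(question):
        # Collapse case and whitespace so trivially different phrasings share a key.
        return " ".join(question.lower().split())

    def get_or_compute(self, question, compute):
        key = self._normalize(question)
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # fresh cached answer: no API call, no tokens spent
        answer = compute(question)  # e.g. the function that calls the LLM
        self._store[key] = (now + self.ttl_seconds, answer)
        return answer


if __name__ == "__main__":
    calls = []

    def fake_llm(question):
        # Hypothetical stand-in for the real OpenAI/LangChain call.
        calls.append(question)
        return f"answer to: {question}"

    cache = ResponseCache(ttl_seconds=60)
    cache.get_or_compute("Best Indian restaurants?", fake_llm)
    cache.get_or_compute("  best indian   restaurants? ", fake_llm)
    print(len(calls))  # number of times the stand-in was actually invoked
```

The time-to-live matters for a chatbot backed by changing data: a short TTL keeps answers reasonably fresh while still absorbing bursts of repeated questions.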

Error tracking and resolution

No application is error-free, and in AI-based systems, errors may occur for various reasons, such as invalid inputs, API failures, or system overloads. However, errors aren’t always easy to diagnose. For instance, you might have noticed that the chatbot is encountering an error when searching for restaurants when no cuisine is specified.

Let’s take a closer look at what’s happening.

Navigate to the AI Responses section from the entity menu in New Relic. Here, you’ll see detailed information about how your chatbot is interacting with users, its response efficiency, and any issues or errors that need to be addressed. Select any of the responses where the user is asking for restaurant suggestions.

This takes you to the Response details page along with the associated Trace, Logs, and Metadata.

In this view, you’ll see the Request Duration, Total Tokens, User input, and more. The Response Details panel is especially important as it provides context around the chatbot's actions. It specifies the tools, format, and user questions for the chatbot and asks for the response. You can also examine the Action and Action Input as a response to the user query.

The spans that are highlighted in red indicate calls where errors or performance bottlenecks occurred; in this case, we can see that the LangChain/invoke and LangChain/run calls encountered an issue. You can use these trace details to debug the error. Click on the LangChain/invoke span to open the details page.

Click on the Error Details to view the details of the error.

In the preceding screenshot, you see the exact error that occurred during the request. Specifically, the error details point to a builtins:TypeError, where a lambda function was passed an argument it wasn't expecting, with the message: "<lambda>() takes 0 positional arguments but 1 was given". This is a common issue when working with functions that don't account for incoming arguments.
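
The same class of error can be reproduced in isolation. In the sketch below, `call_tool` is a hypothetical stand-in for a framework that always passes one positional argument to a tool function; it is not LangChain's actual invocation code.

```python
def call_tool(func, tool_input=None):
    """Mimic a framework that always passes one positional argument to a tool."""
    return func(tool_input)


# A zero-argument lambda, similar in shape to the one behind the failing tool.
broken = lambda: "top restaurants"

# Accepting (and ignoring) the argument is enough to make the call succeed.
fixed = lambda _: "top restaurants"

if __name__ == "__main__":
    try:
        call_tool(broken)
    except TypeError as exc:
        print(f"TypeError: {exc}")
    print(call_tool(fixed))
```

The call fails before the lambda body ever runs, which is why the trace shows the error on the invoking span rather than inside the restaurant search itself.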

Now that you’ve identified the exact issue using the trace and error details, the next step is to go back into the codebase and adjust the lambda function that’s handling this request. In this case, LangChain calls the search_top_restaurants tool with an additional None input. Modify the search_top_restaurants function to account for this additional argument.

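def search_top_restaurants(flattened_data, limit=5, _=None):
    """Return up to `limit` formatted restaurants; `_` absorbs the unused tool input."""
    results = []
    # Entries in the flattened dataset are separated by blank lines.
    entries = flattened_data.split("\n\n")
    for i in range(min(limit, len(entries))):
        restaurant_info = format_restaurant_info(entries[i])
        results.append(restaurant_info)
    return "\n\n".join(results)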

You'll also need to modify the LangChain tools to account for this argument.

Tool(name="search_top_restaurants", func=lambda _: search_top_restaurants(flattened_data), description="Use this tool when user wants to find restaurants without specifying a particular cuisine."),

Once you've applied the fix, restart your application and check if your application is now working as expected.

Monitoring token usage and optimization

In the Summary section of your chatbot application, as shown in the following image, you can also see the token usage for the Relicstraurants chatbot application.

The p99 (99th percentile) for tokens is reported as 1822, while the average token usage per request is around 579 tokens for 28 requests. Notice that there’s significant variation in token consumption with the upper limits reaching over 1800 tokens, suggesting that some requests might be highly complex, consuming more tokens than necessary.
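
To see why the average and the p99 can sit so far apart, the sketch below computes both for a small set of token counts. The numbers are hypothetical, not the data behind the screenshot, and the nearest-rank method used here is an assumption; New Relic may calculate percentiles differently.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile for a whole-number p between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Rank of the smallest value with at least p% of the data at or below it.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical per-request token counts with two expensive outliers.
    tokens = [310, 295, 420, 380, 1822, 350, 405, 330, 1650, 360]
    print(f"mean={mean(tokens):.0f} "
          f"p50={percentile(tokens, 50)} p99={percentile(tokens, 99)}")
```

A few expensive requests pull the average well above the median while the p99 tracks the worst cases directly, so it helps to read the two figures together.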

To better understand the token usage metrics, navigate to AI Responses and choose any response with significant token usage. In the following image, the request is using around 1.74k tokens.

Notice the Token usage breakdown, which shows that most of the total token usage, 1386 tokens, is consumed during the search_restaurants_by_cuisine tool call. This indicates that most of the tokens are used for retrieving data (in this case, restaurant details for Indian cuisine), with an additional 284 tokens being used to generate the input and context leading to this action. The remaining 67 tokens are allocated to handling other elements of the conversation, such as processing the action input and formatting the final response. Together these add up to 1737 tokens, which matches the roughly 1.74k total. This suggests that the input and context that LangChain builds for its decision-making, rather than the far smaller final output, account for most of the token usage.

For a chatbot like Relicstraurants, this token usage might seem on the high side and could result in higher latency and operational costs. Here are some of the ways you can optimize token usage in your chatbot application.

  • Optimize LangChain input processing: In this case, LangChain seems to be over-complicating the input it generates before passing it to OpenAI. Streamlining the logic for building these inputs or reducing the verbosity of the internal queries could help bring token usage down.
  • Switch to a smaller model: Consider switching to a less expensive model like GPT-4o mini or GPT-3.5 Turbo. While GPT-4 models (like GPT-4o) are more capable, they’re also more expensive per token. GPT-4o mini or GPT-3.5 Turbo may perform sufficiently well for relatively straightforward tasks like restaurant recommendations.
  • Prompt tuning: Review the structure of the input prompts being generated. Is there redundant information? Are there excessive details being included that don't contribute meaningfully to the output? Reducing unnecessary details from the input prompt will reduce token consumption.
  • Monitor token usage: Continuously monitor token usage through the AI Responses section in New Relic. If the p99 token usage continues to increase, it could indicate issues in how complex or verbose the inputs are becoming. You can monitor individual transactions (as shown previously) to identify which queries consume the most tokens.
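
As one possible way to act on the prompt tuning suggestion, a tool can trim what it returns before that text is fed back to the model. The sketch below is an assumption-laden illustration: `compact_results`, the record layout, and the field names are all hypothetical and may not match the Relicstraurants dataset.

```python
def compact_results(restaurants, fields=("name", "cuisine", "rating"), max_items=3):
    """Keep only the first max_items records and the listed fields, one line each."""
    lines = []
    for record in restaurants[:max_items]:
        parts = [f"{field}: {record[field]}" for field in fields if field in record]
        lines.append("; ".join(parts))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical records; the field names are illustrative only.
    sample = [
        {"name": "Spice Route", "cuisine": "Indian", "rating": 4.6,
         "description": "A long marketing blurb the model rarely needs..."},
        {"name": "Tandoor House", "cuisine": "Indian", "rating": 4.4,
         "description": "Another long blurb..."},
    ]
    full = "\n".join(str(record) for record in sample)
    short = compact_results(sample)
    # Character count is only a rough proxy for token count.
    print(len(full), len(short))
    print(short)
```

After a change like this, compare the Token usage breakdown for the same query before and after to confirm that the tool call's share actually drops.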

By adopting these optimization techniques, you can potentially lower your token consumption and save on operational costs without compromising the quality of the chatbot responses.

Conclusion

Monitoring and optimizing your chatbot is crucial for a seamless and efficient user experience. By integrating New Relic with your AI chatbot, you gain valuable insights into its performance, allowing you to detect and resolve issues quickly, optimize resource usage, and ensure that your chatbot remains responsive and reliable. As you continue to develop and refine your chatbot, the data collected through New Relic will serve as a foundation for future optimizations, helping you maintain a high level of service quality as user demands grow.

To get started, clone the Relicstraurants chatbot GitHub repository and follow along with the steps outlined in this blog. Happy coding!