
Under the Hood of New Relic’s Lambda Extension


As I touched on in my previous blog post, Ingest AWS Lambda Logs Directly to Reduce Cloud Spend, New Relic’s integration with AWS Lambda Extensions gives you direct access to your Amazon Web Services (AWS) Lambda log stream, independent from Amazon CloudWatch. This integration benefits you as an engineer by enabling you to manage and optimize your cloud spend without compromising observability—you can simply ingest Lambda Function logs and telemetry data directly to New Relic One’s Telemetry Data Platform. Now that AWS Lambda Extensions is generally available, both New Relic and AWS Lambda users can gain deeper insight into their functions.

In this post, I explore how the New Relic Lambda extension works as a lightweight tool to help collect, enhance, and transport telemetry data from your AWS Lambda functions to New Relic One.

Starting with the basics

Let’s start with some basics about the Lambda environment and the challenges of getting telemetry data out of functions that are stateless and ephemeral. Lambda telemetry comprises a variety of events, including:

  • Invocation events (AwsLambdaInvocation)
  • Error events (AwsLambdaInvocationError)
  • Distributed traces (Span events)
  • Any custom events a developer creates

This data is initially gathered by New Relic One code integrated with the function you want to observe. For simplicity's sake, we'll refer to this as the language agent, though strictly that term applies only to Node.js, Python, and Go; Java and .NET use code based on the OpenTracing standard to gather telemetry.

The execution environment is where your Lambda function code runs. These containers are built on an Amazon Linux base image (usually Amazon Linux 2), and each contains an executable file named bootstrap that initializes the runtime and implements the handler lifecycle.

At startup, the Lambda service executes bootstrap and exposes a local HTTP server. A blocking, long-polling HTTP GET to this server's /next endpoint returns the next event payload; if no event is immediately available for the function to handle, the call to /next blocks until one arrives. (A long-poll request is a blocking request that synchronizes the HTTP client-server interaction, letting the server deliver an event to the client as soon as it occurs.)
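To make this concrete, here is a minimal sketch of that long-poll loop in Python (purely illustrative; the real bootstrap implementations are not Python). The date-versioned path and the request-ID header follow AWS's documented Runtime API; the function names are ours.

```python
import urllib.request

RUNTIME_API_VERSION = "2018-06-01"  # documented Runtime API version prefix


def build_next_url(runtime_api: str) -> str:
    # runtime_api is the host:port value that Lambda passes to the
    # process in the AWS_LAMBDA_RUNTIME_API environment variable.
    return f"http://{runtime_api}/{RUNTIME_API_VERSION}/runtime/invocation/next"


def next_invocation(runtime_api: str):
    # This GET long-polls: it blocks until the Lambda service has an
    # event for this execution environment, then returns its payload
    # along with the request ID that identifies the invocation.
    with urllib.request.urlopen(build_next_url(runtime_api)) as resp:
        request_id = resp.headers.get("Lambda-Runtime-Aws-Request-Id")
        return request_id, resp.read()
```

A runtime calls `next_invocation` in a loop, dispatching each payload to the handler and then immediately polling again.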

In addition to this basic lifecycle API, Lambda provides several runtimes that are pre-built implementations of the bootstrap executable that can host handlers in different languages. Bootstrap delegates event handling to these handlers.

After the function invocation is complete, the Lambda service freezes all processes in the execution environment. It maintains the environment for some time in anticipation of another function invocation. If the function does not receive any invocations for a period of time, the Lambda service spins down and removes the environment. Functions are therefore necessarily stateless.

Previously, only the runtime process could influence the lifecycle of the execution environment. With the Extensions API, extensions can also influence, control, and participate in that lifecycle. In addition to being notified of each invocation, an extension can register for a new lifecycle event, SHUTDOWN, which tells it that the Lambda service is about to spin down the environment. This lets tools perform operations outside the function lifecycle: an extension can maintain non-durable state and execute a finalizer (similar to a Java finalizer) before the environment disappears. Note that the timing of SHUTDOWN is unpredictable, so you can't use it to implement time-sensitive logic.
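A sketch of how an extension participates in this lifecycle, using the Extensions API's documented register and event/next endpoints; the callback names are illustrative, and error handling is omitted:

```python
import json
import urllib.request

EXTENSIONS_API_VERSION = "2020-01-01"  # documented Extensions API version prefix


def register_body(events=("INVOKE", "SHUTDOWN")) -> bytes:
    # The register call tells the Lambda service which lifecycle
    # events this extension wants to receive.
    return json.dumps({"events": list(events)}).encode()


def event_loop(runtime_api: str, extension_id: str, on_invoke, on_shutdown):
    # After registering, an extension long-polls /extension/event/next,
    # just as the runtime long-polls /runtime/invocation/next. SHUTDOWN
    # is its last chance to flush any buffered telemetry.
    url = f"http://{runtime_api}/{EXTENSIONS_API_VERSION}/extension/event/next"
    while True:
        req = urllib.request.Request(
            url, headers={"Lambda-Extension-Identifier": extension_id})
        with urllib.request.urlopen(req) as resp:  # blocks between events
            event = json.loads(resp.read())
        if event.get("eventType") == "SHUTDOWN":
            on_shutdown(event)  # flush buffers, then exit
            return
        on_invoke(event)
```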

Getting telemetry out of a container

Because the Lambda execution environment is stateless, we must collect telemetry and immediately get it out of the execution container. AWS has a rich and flexible permission system called AWS Identity and Access Management (IAM), and each function adopts an execution role that governs its interactions with other services. "Rich and flexible" is always a synonym for complex, however.

To get the telemetry from the container while keeping configurations simple, you're left with three options:

  • Make an HTTP request at the end of each invocation (the request must block the invocation response).
  • Emit the telemetry by printing it to standard out. By default, everything written to the stdout and stderr file descriptors is sent to CloudWatch Logs.
  • Use an extension to add state to the function. Telemetry will be collected, stored in a buffer, and sent in batches.
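The second option is the simplest to sketch. The wrapping marker below is made up for illustration, not the agent's actual wire format:

```python
import json


def emit_telemetry(payload: dict) -> str:
    # Anything a Lambda process writes to stdout is captured by the
    # runtime and forwarded to CloudWatch Logs, so a single JSON line
    # doubles as a telemetry channel. "TELEMETRY" is a made-up marker
    # that a downstream log consumer could filter on.
    line = json.dumps(["TELEMETRY", payload])
    print(line)
    return line
```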

Initially, New Relic used the stdout/CloudWatch option: a CloudWatch log group can be configured to send filtered log batches elsewhere, such as to another Lambda function. We provided a Lambda function that parsed the logs, extracted the telemetry, and sent it all to New Relic One. However, this method had some disadvantages:

  • It required using a Lambda function published in the AWS Serverless Application Repository.
  • Log subscription filters were limited.
  • CloudWatch ingest increased costs.
  • Latency could be an issue for customers needing a rapid time-to-glass.

Introducing an elegant solution

Instead, our New Relic Lambda extension overcomes these challenges in collecting, enhancing, and transporting telemetry data. Running inside the Lambda execution environment, the extension buffers telemetry and sends it periodically. The agent hands telemetry to the extension over a simple, language-agnostic IPC mechanism (a named pipe), which means one extension code base can serve all our language agents.
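The named-pipe handoff can be sketched like this (the pipe path and the one-JSON-object-per-line framing are assumptions for illustration, not the extension's actual protocol):

```python
import json
import os


def ensure_pipe(path: str) -> None:
    # A named pipe (FIFO) is language-agnostic: any agent that can
    # write to a file path can hand telemetry to the extension process,
    # with no shared library or network listener required.
    if not os.path.exists(path):
        os.mkfifo(path)


def agent_write(path: str, payload: dict) -> None:
    # Agent side: serialize one telemetry payload per line.
    with open(path, "w") as pipe:
        pipe.write(json.dumps(payload) + "\n")


def extension_read(path: str) -> dict:
    # Extension side: opening the FIFO for reading blocks until an
    # agent opens it for writing, so the pipe acts as a rendezvous
    # point between the two processes.
    with open(path) as pipe:
        return json.loads(pipe.readline())
```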

The main downside to this approach is the container lifecycle in the Lambda execution environment. It's possible to accumulate a buffer of several invocations' worth of telemetry, only to have the function stop receiving invocations. There isn't much opportunity to send the accumulated buffer until the function is finally invoked again or shut down.

In either case, the telemetry sent to New Relic One needs the New Relic license key to identify and authenticate itself on behalf of the customer. For the log ingest Lambda I described in my previous post, this is easy: there’s only one per region, so having the license key in an environment variable is straightforward. For the telemetry extension, it was clear we would need a better solution to manage the license key. The answer is AWS Secrets Manager. We’ve integrated the service with the New Relic Lambda extension and made it the default path for license key retrieval. As a result, during setup (per region), we create the license key secret. Each function’s execution role then needs to include permission to read that secret. The secret is only read on function cold start, which avoids concerns around AWS Secrets Manager’s API latency and minimizes AWS costs.
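The cold-start caching pattern looks roughly like this. The secret name and JSON shape are illustrative, and `fetch_secret` stands in for a real AWS Secrets Manager GetSecretValue call:

```python
import json

_cached_key = None  # lives for the life of the execution environment


def license_key(fetch_secret) -> str:
    # Read the secret once per execution environment, i.e., on cold
    # start. Warm invocations reuse the cached value, so Secrets
    # Manager latency and API cost are paid only when a new container
    # spins up.
    global _cached_key
    if _cached_key is None:
        # "NEW_RELIC_LICENSE_KEY" and the "LicenseKey" field are
        # illustrative names, not necessarily the real secret layout.
        secret = fetch_secret("NEW_RELIC_LICENSE_KEY")
        _cached_key = json.loads(secret)["LicenseKey"]
    return _cached_key
```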

Decorating telemetry for you

Now that we’ve solved the challenges of getting the telemetry data into the New Relic ingest pipeline, let’s talk about metadata decoration.

If you look carefully, you’ll see more fields on your AwsLambdaInvocation events when you query them than when the agent produces them. Why is this? It’s a result of metadata decoration. Here’s how it works.

New Relic One maintains an inventory of all the Lambda functions in your AWS account (which is why we need the account pairing step). Along with simple identity information, we collect configuration information from your Lambdas, such as runtime, tags, max memory, and timeout. This information is stored in New Relic One and available in New Relic Explorer's Entity Explorer, regardless of whether you've instrumented a given function.

Most of this metadata isn't available to the function itself, which is why we collect it separately and add it to invocation events at write time. This lets us maintain a record of the state of your function as it was when invoked, rather than as it is today.

Because part of the ingest pipeline’s job is to "decorate" the invocation events with the most recently gathered function metadata, you can slice and dice your event data by your function metadata in custom queries, dashboards, or NRQL alerts.
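Conceptually, the decoration step is a merge of the most recently gathered metadata into each event at write time. All field names here are illustrative:

```python
def decorate(event: dict, metadata_by_arn: dict) -> dict:
    # Look up the most recently gathered configuration for this
    # function and fold it into the invocation event. Event fields win
    # on collision, so agent-reported values are never clobbered by
    # the inventory metadata.
    meta = metadata_by_arn.get(event.get("functionArn"), {})
    return {**meta, **event}
```

Because the merged fields land on the stored event itself, queries can group and filter by them without joins.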

Understanding your serverless applications

The New Relic Lambda extension sends logs and telemetry from your Lambda functions directly to New Relic One. With this extension, you can observe and understand your serverless applications’ behavior and performance while minimizing latency and optimizing your cloud spend.

This post was updated from a previous version published February 17, 2021.