As I touched on in my previous blog post, Ingest AWS Lambda Logs Directly to Reduce Cloud Spend, New Relic’s integration with AWS Lambda Extensions gives you direct access to your AWS Lambda log stream, independent of Amazon CloudWatch. This integration lets you manage and optimize your cloud spend without compromising observability: you can simply ingest Lambda function logs and telemetry data directly into New Relic One’s Telemetry Data Platform.
In this post, I explore how the New Relic Lambda extension works as a lightweight tool to help collect, enhance, and transport telemetry data from your AWS Lambda functions to New Relic One.
Starting with the basics
Let’s start with some basics about the Lambda environment and the challenges of getting telemetry data out of functions that are stateless and ephemeral. Lambda telemetry comprises a variety of events, including:
- Invocation events (`AwsLambdaInvocation` events)
- Error events (`AwsLambdaInvocationError` events)
- Distributed traces (`Span` events)
- Any custom events a developer creates
This data is initially gathered by New Relic One code integrated with the function you want to observe. For simplicity’s sake, we'll refer to this as the language agent, though that only applies to Node, Python, and Go. Java and .NET use code based on the OpenTracing standard to gather telemetry.
Each function in the Lambda execution environment runs in a container based on an Amazon Linux image (usually version 2). Each container image contains an executable file named `bootstrap` that implements the handler lifecycle.
At startup, the Lambda service executes `bootstrap` and makes a local HTTP server available, which returns event payloads in response to blocking long-polling HTTP GET requests to an API called `/next`. If no event is immediately available for the function to handle, the call to `/next` blocks. (A long-poll request is a blocking request intended to synchronize HTTP client-server interaction, allowing the server to send an event to the client as soon as one arrives.)
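The long-poll loop above can be sketched in Python. The Runtime API paths and the `Lambda-Runtime-Aws-Request-Id` header are from AWS's documented 2018-06-01 Runtime API; the loop structure itself is a simplified sketch of what a `bootstrap` implementation does, not New Relic's or AWS's actual code:

```python
import json
import os
import urllib.request


def next_invocation_url(api_host):
    # The Runtime API host is provided in the AWS_LAMBDA_RUNTIME_API env var
    return f"http://{api_host}/2018-06-01/runtime/invocation/next"


def response_url(api_host, request_id):
    return f"http://{api_host}/2018-06-01/runtime/invocation/{request_id}/response"


def run_loop(handler):
    api_host = os.environ["AWS_LAMBDA_RUNTIME_API"]
    while True:
        # This GET long-polls: it blocks until an event arrives, and no
        # process in the container is scheduled while it waits.
        with urllib.request.urlopen(next_invocation_url(api_host)) as resp:
            request_id = resp.headers["Lambda-Runtime-Aws-Request-Id"]
            event = json.loads(resp.read())
        result = handler(event)
        post = urllib.request.Request(
            response_url(api_host, request_id),
            data=json.dumps(result).encode(),
            method="POST",
        )
        urllib.request.urlopen(post)
```

Only the URL helpers run outside a real Lambda container; `run_loop` needs the Lambda-provided environment to do anything.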
Now here is where it gets interesting: none of the processes running in the container will be scheduled until a new event arrives. The container may also be terminated at any time while awaiting an event. Therefore, functions are necessarily stateless.
This situation is somewhat different with the Lambda Extension API, which in addition to performing invocation event notifications, receives lifecycle events such as `SHUTDOWN` and can register for log events as well. This allows functions to maintain a non-durable state and execute a finalizer, similar to a Java finalizer. Note that the timing is unpredictable, so the API can’t be used to implement time-sensitive logic.
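The registration step works roughly like this: an extension POSTs to the Extensions API declaring which lifecycle events it wants, then long-polls its own `/event/next` endpoint. The 2020-01-01 paths below are AWS's documented Extensions API; the helper is a sketch, not the New Relic extension's actual code:

```python
import json

# Documented 2020-01-01 Extensions API paths
REGISTER_PATH = "/2020-01-01/extension/register"
EVENT_NEXT_PATH = "/2020-01-01/extension/event/next"


def register_body(events=("INVOKE", "SHUTDOWN")):
    # An extension declares which lifecycle events it wants to receive;
    # SHUTDOWN is what makes finalizer-style cleanup possible.
    return json.dumps({"events": list(events)})
```

After registering, the extension loops on `EVENT_NEXT_PATH` just as `bootstrap` loops on `/next`, which is how it gets a chance to run at shutdown.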
In addition to this basic lifecycle API, Lambda provides several runtimes: pre-built implementations of the `bootstrap` executable that can host handlers in different languages. `bootstrap` delegates event handling to these handlers.
Getting telemetry out of a container
Because the Lambda execution environment is stateless, we must collect telemetry and immediately get it out of the execution container. AWS has a rich and flexible permission system called AWS Identity and Access Management (IAM), and each function adopts an execution role that governs its interactions with other services. "Rich and flexible" is always a synonym for complex, however.
In order to get the telemetry off the container without requiring complex permissions, we’re left with three options:
- Make an HTTP request at the end of each invocation (the request must block the invocation response)
- Emit the telemetry by printing it to standard out. By default, everything written to the `stdout` and `stderr` file descriptors is sent to CloudWatch Logs.
- Use an extension to add state to the function. Telemetry is collected, stored in a buffer, and sent in batches.
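The second option amounts to emitting one self-describing line per payload and parsing it back out downstream. A minimal sketch, assuming a JSON-over-stdout format; the `NR_TELEMETRY` marker and field names here are illustrative, not New Relic's actual wire format:

```python
import json

MARKER = "NR_TELEMETRY"  # hypothetical marker a downstream log parser keys on


def format_log_line(telemetry):
    # One JSON payload per stdout line; CloudWatch Logs captures it verbatim
    return f"{MARKER} {json.dumps(telemetry)}"


def parse_log_line(line):
    # Downstream consumer: ignore ordinary log lines, decode marked ones
    marker, _, payload = line.partition(" ")
    if marker != MARKER:
        return None
    return json.loads(payload)
```

The format survives being interleaved with ordinary application logging, which is exactly the property a CloudWatch subscription filter relies on.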
Initially, New Relic used the `stdout`/CloudWatch option, setting up the CloudWatch log group to send filtered log batches elsewhere, such as to another Lambda function. We provided a Lambda function that would parse the logs, extract the telemetry, and send it all to New Relic One. However, this method had some disadvantages:
- It required using a Lambda function published in the AWS Serverless Application Repository
- Log subscription filters were limited
- CloudWatch ingest increased costs
- Latency could be an issue for customers needing a rapid time-to-glass
Introducing an elegant solution
Instead, our New Relic Lambda extension overcomes these challenges to collect, enhance, and transport telemetry data. Running inside the function’s container, the extension buffers telemetry and sends it periodically, amortizing the cost of blocking function execution for a send to New Relic One across several invocations. The agent uses a simple, language-agnostic IPC mechanism (a named pipe) to send telemetry to the extension, which means we only need one extension code base to serve all our language agents.
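The named-pipe handoff can be sketched in a few lines. This assumes a POSIX environment; the pipe path and payload shape are illustrative (the real extension is a separate process with its own wire format), but the blocking semantics are the point: the reader and writer rendezvous on the pipe without any network permissions:

```python
import json
import os
import tempfile
import threading

# Illustrative pipe path; the real agent and extension agree on a fixed one
PIPE = os.path.join(tempfile.mkdtemp(), "newrelic-telemetry")
os.mkfifo(PIPE)


def agent_send(payload):
    # Agent side: open the pipe and write one JSON line per telemetry payload.
    # Opening for write blocks until a reader (the extension) opens its end.
    with open(PIPE, "w") as pipe:
        pipe.write(json.dumps(payload) + "\n")


def extension_read():
    # Extension side: block until the agent writes, then buffer the batch
    with open(PIPE) as pipe:
        return [json.loads(line) for line in pipe]


writer = threading.Thread(
    target=agent_send,
    args=({"event": "AwsLambdaInvocation", "duration.ms": 42},),
)
writer.start()
batch = extension_read()
writer.join()
```

Because a FIFO is just a file descriptor, every language agent can write to it with ordinary file I/O, which is what makes the mechanism language-agnostic.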
The main downside to this approach is the container lifecycle in the Lambda execution environment. It’s possible to accumulate a buffer of several invocations’ worth of telemetry, only to have the function stop receiving invocations. No opportunity to send the accumulated buffer arises until the function is invoked again or finally shut down.
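The batching trade-off can be sketched as a small buffer with two flush triggers. The class, batch size, and age threshold are assumptions for illustration, not the extension's actual values:

```python
import time


class TelemetryBuffer:
    """Sketch of extension-side batching (illustrative thresholds)."""

    def __init__(self, max_batch=10, max_age_seconds=5.0):
        self.max_batch = max_batch
        self.max_age = max_age_seconds
        self.items = []
        self.oldest = None

    def add(self, payload):
        if self.oldest is None:
            self.oldest = time.monotonic()
        self.items.append(payload)

    def should_flush(self):
        # Flush when the batch is full or its oldest entry has aged out.
        # A real extension must also flush on the SHUTDOWN lifecycle event,
        # since no further invocation may arrive to trigger a send.
        if not self.items:
            return False
        return (len(self.items) >= self.max_batch
                or time.monotonic() - self.oldest >= self.max_age)

    def drain(self):
        batch, self.items, self.oldest = self.items, [], None
        return batch
```

The age check only runs while the process is scheduled, which is precisely why the `SHUTDOWN` event matters: between invocations, nothing in the container executes at all.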
In either case, the telemetry sent to New Relic One needs the New Relic license key to identify and authenticate itself on behalf of the customer. For the log ingest Lambda I described in my previous post, this is easy: there’s only one per region, so having the license key in an environment variable is straightforward. For the telemetry extension, it was clear we would need a better solution to manage the license key. The answer is AWS Secrets Manager, which we’ve integrated with the New Relic Lambda extension and made the default path for license key retrieval. As a result, during setup (per region), we create the license key secret. Each function’s execution role then needs to include permission to read that secret. The secret is only read on function cold start, which avoids concerns around AWS Secrets Manager’s API latency and minimizes AWS costs.
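Cold-start-only retrieval can be sketched as a cached lookup. `get_secret_value` is the real Secrets Manager API call (via boto3), but the fake client and the secret name below are illustrative stand-ins so the sketch runs without AWS credentials:

```python
import functools


class FakeSecretsClient:
    """Stand-in for a boto3 Secrets Manager client (illustrative only)."""

    def __init__(self):
        self.calls = 0

    def get_secret_value(self, SecretId):
        self.calls += 1
        return {"SecretString": "NRAK-EXAMPLE"}  # fake license key


@functools.lru_cache(maxsize=1)
def fetch_license_key(client, secret_id="NEW_RELIC_LICENSE_KEY"):
    # Runs once per cold start; warm invocations hit the cache,
    # avoiding Secrets Manager API latency and per-call cost.
    return client.get_secret_value(SecretId=secret_id)["SecretString"]
```

Module-level caching maps naturally onto Lambda’s lifecycle: the cache lives exactly as long as the container does, so a warm invocation never pays for the lookup again.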
Decorating telemetry for you
Now that we’ve solved the challenges of getting the telemetry data into the New Relic ingest pipeline, let’s talk about metadata decoration.
If you look carefully, you’ll see more fields on your `AwsLambdaInvocation` events when you query them than when the agent produces them. Why is this? It’s a result of metadata decoration. Here’s how it works.
New Relic One maintains an inventory of all the Lambda functions in your AWS account (which is why we need the account pairing step). Along with simple identity information, we collect configuration information from your Lambdas, such as runtime, tags, max memory, and timeout. This information is stored in New Relic One and available in the New Relic Explorer (the Entity Explorer), regardless of whether you’ve instrumented a given function.
Most of this metadata isn’t available to the function itself, which is why we collect it separately. Adding it to invocation events at write time lets us maintain a record of the state of your function as it was invoked, rather than as it is today.
Because part of the ingest pipeline’s job is to "decorate" the invocation events with the most recently gathered function metadata, you can slice and dice your event data by your function metadata in custom queries, dashboards, or NRQL alerts.
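Decoration at write time amounts to a merge in which the event’s own fields take precedence over the separately gathered function metadata. A minimal sketch; the field names are illustrative:

```python
def decorate(event, function_metadata):
    # Metadata fills in what the function couldn't know about itself;
    # fields the agent recorded on the event itself are never overwritten.
    return {**function_metadata, **event}
```

Once merged, the metadata fields are ordinary event attributes, which is why you can facet or filter by them in NRQL like any field the agent wrote directly.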
Understanding your serverless applications
By sending both logs and telemetry from Lambda directly to New Relic One, you can observe and understand your serverless applications’ behavior and performance. Using the New Relic Lambda extension to bypass CloudWatch lets you do this while minimizing latency and optimizing your cloud spend.
Learn more about the New Relic Lambda extension (and contribute as well) in our GitHub repo.