We recently completed a major rewrite of the New Relic Lambda Extension, migrating the entire codebase from Go to Rust. This initiative was not merely a language change; it represented a fundamental re-architecture of how we monitor AWS Lambda functions.
The results have been significant. We reduced the billed duration by approximately 40% (across both cold and warm starts) and decreased memory usage by about 13%. Furthermore, the extension now demonstrates enhanced reliability.
In this blog, I’ll share why we made this decision, the architectural changes we implemented, and how we achieved these numbers.
Why We Needed a Change
We noticed that many observability extensions focus heavily on reducing "init duration" (the startup time). The problem is that they often achieve this by cutting features or deferring work, which pushes that work into the invocation phase and actually increases the user's overall billed duration. We didn't want to make that trade-off.
We wanted to optimize the entire lifecycle. Our Go-based extension was working fine, but we kept running into specific issues that needed fixing:
- Reliability: We needed to be 100% sure that if our extension had an issue, it would never impact the customer's Lambda function.
- Memory Overhead: Go is good, but its Garbage Collector (GC) adds significant memory overhead. In the serverless world where you pay for every MB, this matters.
- Data Loss: In the old version, if sending telemetry failed, that data was simply lost. We had no retry mechanism to handle failures gracefully.
Why We Chose Rust
After evaluating a few options, Rust stood out. The headline reason is zero-cost abstractions: Rust provides high-level language features without adding runtime overhead. And because there is no garbage collector, memory usage is low and predictable.
Rust is also strict. Its ownership system prevents entire classes of bugs, such as use-after-free errors and data races, at compile time, so whole categories of mistakes are caught before the code ever runs. For an extension that runs alongside customer code, every millisecond counts, and Rust delivers C-level performance with far stronger safety guarantees.
The Results: Real Performance Improvements
Lambda billing is based on GB-seconds (memory × duration), so every millisecond we save actually saves money. When we compared the new Rust version against the old Go version, the improvements were clear.
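To make the billing model concrete, here is a small sketch of the GB-second arithmetic. The memory size, durations, and `gb_seconds` helper are illustrative, not our measured figures:

```rust
// Rough illustration of Lambda's GB-second billing model.
// The memory size and durations below are hypothetical examples.

/// Billed GB-seconds for one invocation: memory (MB) x duration (ms),
/// converted to GB and seconds.
fn gb_seconds(memory_mb: u64, billed_ms: f64) -> f64 {
    (memory_mb as f64 / 1024.0) * (billed_ms / 1000.0)
}

fn main() {
    let memory_mb = 512;
    let before_ms = 100.0;
    let after_ms = before_ms * 0.6; // ~40% lower billed duration

    let saved = gb_seconds(memory_mb, before_ms) - gb_seconds(memory_mb, after_ms);
    // Per-invocation savings like this compound quickly across
    // millions of invocations per month.
    println!("GB-seconds saved per invocation: {saved:.4}");
}
```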
We reduced billed duration by approximately 40% for both cold and warm starts. This happened because of our new async runtime and connection pooling. For a customer, this translates directly to lower AWS costs and faster response times.
Our memory optimization was equally impressive. We reduced memory usage by 13% on cold starts and 9% on warm starts. These savings come from Rust's lack of a garbage collector and our efficient, zero-copy data handling. This means customers can either allocate more memory to their own code or potentially downgrade to a lower (cheaper) memory tier.
Most importantly, we achieved Rock-Solid Reliability. We now have zero customer Lambda failures caused by the extension. If something goes wrong in our code, the extension simply enters a "no-op" mode. Plus, with our new smart batching and retry system, we reduced data loss by 95%.
A Fundamental Redesign: Go vs. Rust Architecture
We didn't just translate Go code to Rust line-by-line. We fundamentally rethought the architecture.
1. The Event Loop
In Go, we used a synchronous blocking approach. The main loop would block and wait for the next invocation. In Rust, we redesigned this around the Tokio async runtime. The extension now uses non-blocking I/O, allowing it to handle telemetry, network requests, and multiple invocations in parallel without getting stuck.
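The core of the redesign can be sketched as follows: the loop that polls for lifecycle events hands telemetry off through a channel and immediately returns to polling, instead of blocking until the data is flushed. The real extension does this with the Tokio async runtime and non-blocking I/O; this minimal sketch uses plain threads and a channel (and an invented `Event`/`run_event_loop` shape) only to keep the example self-contained:

```rust
use std::sync::mpsc;
use std::thread;

// Simplified stand-in for Lambda lifecycle events.
enum Event {
    Invoke(String), // telemetry captured for one invocation
    Shutdown,
}

fn run_event_loop(events: Vec<Event>) -> usize {
    let (tx, rx) = mpsc::channel::<String>();

    // Background worker drains telemetry without stalling the poll loop.
    let worker = thread::spawn(move || rx.iter().count());

    for event in events {
        match event {
            // Hand off and keep polling; no blocking network call here.
            Event::Invoke(telemetry) => tx.send(telemetry).unwrap(),
            Event::Shutdown => break,
        }
    }
    drop(tx); // closing the channel lets the worker finish draining
    worker.join().unwrap() // number of telemetry payloads processed
}

fn main() {
    let processed = run_event_loop(vec![
        Event::Invoke("inv-1".into()),
        Event::Invoke("inv-2".into()),
        Event::Shutdown,
    ]);
    println!("processed {processed} payloads");
}
```

In the actual extension the same decoupling is expressed as Tokio tasks rather than OS threads, so thousands of pending operations cost almost nothing.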
2. Managing State
In the Go architecture, we relied on global mutable state with mutex locks. This created bottlenecks. In Rust, we moved to a layered system. We now have per-request state for each invocation and a lock-free global state using concurrent structures like DashMap. And because Rust makes data immutable by default, accidental modification becomes a compile-time error rather than a runtime surprise.
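A dependency-free sketch of the layered-state idea: global counters use lock-free atomics (the real extension also uses concurrent maps like DashMap for keyed state, omitted here), while each invocation gets its own short-lived state struct that nothing else can touch. The `RequestState`/`handle_invocation` names are illustrative:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Lock-free global state: safe to update from any task without a mutex.
static INVOCATIONS: AtomicU64 = AtomicU64::new(0);

/// Per-request state: owned by one invocation, dropped when it ends.
struct RequestState {
    request_id: String,
    spans_recorded: u64,
}

fn handle_invocation(request_id: &str, spans: u64) -> u64 {
    let state = RequestState {
        request_id: request_id.to_string(),
        spans_recorded: spans,
    };
    // fetch_add is atomic: no lock, so no contention bottleneck.
    INVOCATIONS.fetch_add(1, Ordering::Relaxed);
    println!("handled {}", state.request_id);
    state.spans_recorded // `state` is dropped here, by ownership rules
}

fn main() {
    handle_invocation("req-1", 3);
    handle_invocation("req-2", 5);
    println!("total invocations: {}", INVOCATIONS.load(Ordering::Relaxed));
}
```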
3. Telemetry Pipeline
Previously, our pipeline was simple: receive data, try to send it immediately, and if it failed, it was gone. Now, we have a sophisticated pipeline with buffering and batching logic. If a payload fails to send, we use a retry mechanism with exponential backoff.
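The retry pattern looks roughly like this. `send_with_retry` and `backoff_delay` are illustrative names, and the closure stands in for the real network call:

```rust
use std::time::Duration;

fn backoff_delay(attempt: u32, base_ms: u64) -> Duration {
    // 1st retry waits base, 2nd waits 2x base, 3rd waits 4x base, ...
    Duration::from_millis(base_ms * 2u64.pow(attempt))
}

/// Retries `send` up to `max_attempts` times, sleeping between failures.
/// Returns how many attempts were made, or None if all of them failed.
fn send_with_retry(
    mut send: impl FnMut() -> bool,
    max_attempts: u32,
    base_ms: u64,
) -> Option<u32> {
    for attempt in 0..max_attempts {
        if send() {
            return Some(attempt + 1);
        }
        std::thread::sleep(backoff_delay(attempt, base_ms));
    }
    None // caller can keep the batch buffered and try again later
}

fn main() {
    // Simulated send that fails twice, then succeeds on the third try.
    let mut calls = 0;
    let attempts = send_with_retry(
        || {
            calls += 1;
            calls >= 3
        },
        5,
        1, // 1 ms base delay keeps the example fast
    );
    println!("succeeded after {attempts:?} attempts");
}
```

Exponential backoff matters here because a struggling ingest endpoint recovers faster when clients space out their retries instead of hammering it at a fixed interval.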
4. Memory Management
Go’s garbage collector simplified development but added unpredictable overhead (up to 15MB) and random GC pauses. With Rust, we took explicit control. We use a Zero-Copy Design where data is shared via reference counting rather than being cloned. We also keep small, short-lived data on the stack. This eliminated the random pauses we used to see.
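The zero-copy idea can be shown in a few lines with `std::sync::Arc`: a payload is allocated once, and each consumer gets a cheap reference-counted handle rather than a duplicate of the bytes. The `share` helper is illustrative:

```rust
use std::sync::Arc;

/// Hand the same payload to `consumers` receivers. `Arc::clone` copies
/// only a pointer and bumps a counter; the bytes are never duplicated.
fn share(payload: &Arc<Vec<u8>>, consumers: usize) -> Vec<Arc<Vec<u8>>> {
    (0..consumers).map(|_| Arc::clone(payload)).collect()
}

fn main() {
    let payload: Arc<Vec<u8>> = Arc::new(vec![0u8; 64 * 1024]); // one 64 KB buffer

    // "Send" the same payload to two consumers (e.g. batcher and logger).
    let handles = share(&payload, 2);

    // All three handles point at the exact same allocation.
    assert!(Arc::ptr_eq(&payload, &handles[0]));
    assert!(Arc::ptr_eq(&payload, &handles[1]));
    println!("reference count: {}", Arc::strong_count(&payload));
}
```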
Challenges We Overcame
It wasn't all easy. Rust has a steeper learning curve than Go. Concepts like ownership, borrowing, and lifetimes take time to understand. We had to invest time in learning Rust's ownership model properly before starting.
Also, the ecosystem is smaller. Some Go libraries didn't have direct Rust equivalents, so we relied on high-quality crates like tokio, reqwest, and dashmap, and built custom logic where needed. We also had to adapt to the AWS SDK for Rust, which is newer and async-first.
Conclusion
Migrating from Go to Rust was a significant effort, but the results justify the work:
- Up to 40% faster (lower costs, better user experience).
- Up to 13% less memory (more room for customer code).
- Zero customer Lambda failures (rock-solid reliability).
The key insight for us is that Rust isn't just faster—it's fundamentally more reliable. The language forced us to think about error cases and memory safety upfront. If you are building infrastructure where efficiency and safety matter, Rust is definitely worth considering.
The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve or endorse the information, views or products available on such sites.