Serverless is a dream. To be truly “serverless,” there must be a perfect boundary between the technical concerns that matter to your business and the arbitrary, fiddly, un-fun hosting worries that you can leave to a vendor. The real-world products labeled “serverless” only approach this dream: they still leave you manually controlling details that, in a perfect world, would be handled automatically by the vendor and work for every user. The problems that users face with serverless are often inherent to that design: security configuration, cold starts, and observability. They’re the problems you’d expect when most, but not all, of the details of running code are left to the vendor. This article focuses on one specific observability problem: serverless tracing.

When you divide microservices into discrete serverless steps, the architecture naturally becomes event-based. I don’t have space here to go into the advantages of event-based architecture, but suffice it to say that most serverless developers will use it as a matter of course. The problem with an architecture where real-world actions emit one or more events handled by multiple services is one of observability. Observability, the goal of producing information about your technology that is easy to interpret and act on, becomes difficult when multiple services are handling an event, possibly very quickly and asynchronously.

Let’s start with an idealized example. An e-commerce site has a report of a problem: sometimes, when a customer enters a coupon code, the code is rejected as invalid when it should be valid. The problem isn’t consistent, so direct end-to-end testing can’t easily replicate it. That means you need to look at real-world behavior. When you look at the dashboards for the various components in your stack, there are thousands of logged events, but tracing a single coupon request through them is far harder.

How does this event connect to the other events on our service?

Trying to connect events by timestamp or by event attributes (like customer ID) can be overwhelming: with asynchronous processing, timestamps will generally not line up, and some systems, like queueing services or API gateways, don’t offer straightforward logging methods. The question of which database event triggered a specific compute event can drive operations teams to distraction, especially during an active incident.
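One common way around timestamp matching is to stamp every event with an identifier inherited from the event that caused it. The sketch below is a minimal illustration of that idea, not a description of any particular vendor’s mechanism; the `new_event` and `group_by_correlation` helpers, the `correlation_id` field, and the event names are all assumptions made up for this example.

```python
import uuid
from collections import defaultdict


def new_event(name, payload, parent=None):
    """Build an event that inherits its correlation ID from the event that caused it."""
    # A root event (no parent) gets a fresh ID; every descendant reuses it.
    correlation_id = parent["correlation_id"] if parent else str(uuid.uuid4())
    return {"name": name, "correlation_id": correlation_id, "payload": payload}


def group_by_correlation(events):
    """Collect event names per correlation ID, in the order they were seen."""
    chains = defaultdict(list)
    for event in events:
        chains[event["correlation_id"]].append(event["name"])
    return dict(chains)


# Two customers act at about the same time, so downstream events interleave.
a1 = new_event("coupon.submitted", {"customer_id": "c-1", "code": "SPRING10"})
b1 = new_event("coupon.submitted", {"customer_id": "c-2", "code": "WELCOME5"})
b2 = new_event("coupon.validated", {"valid": True}, parent=b1)
a2 = new_event("coupon.rejected", {"reason": "expired"}, parent=a1)

# Arrival order is mixed, but the shared ID recovers each customer's chain.
chains = group_by_correlation([a1, b1, b2, a2])
for correlation_id, names in chains.items():
    print(correlation_id, names)
```

The grouping works regardless of arrival order, which is the property timestamps can’t give you. It only helps, though, for components where you control the event payload.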

In general, the even bigger question that looms is: “How does this event connect to other events on our service?” The standard cloud introspection tools will answer some apparently related questions, such as:

  • Which service has permission to access other services?
  • Which services logged errors?
  • What’s the throughput rate on each service?

From these questions—and whatever parameters are available on the service components, such as the aforementioned customer ID—you can often see indicative trends, e.g., that when one service breaks, it drops the traffic on another to zero. But when investigating a limited pattern or a single event, these tools often fail us.

Plan for observability: Structured logs

When dealing with large, high-velocity data, people often say they’re “drinking from the firehose.” But when dealing with log data from multiple cloud services, it can be more like drinking from the ocean. Attempting to parse your logs for observability can be a nightmare. And if your logs are unstructured, you’ll end up spending time writing regular expressions to filter them, and you’ll still get false positives.

AWS strongly recommends structured logs, generally JSON formatted so that it’s easy to query on multiple parameters. This greatly simplifies the search for meaningful data.
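As a rough sketch of what that can look like, the example below uses only Python’s standard `logging` and `json` modules to emit one JSON object per log line and then filter on two fields at once. The `JsonFormatter` class, the `fields` key passed through `extra`, and the coupon data are assumptions for illustration, not a prescribed AWS format.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "timestamp": self.formatTime(record),
        }
        # Fields passed as extra={"fields": {...}} are merged into the entry.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("coupon-service")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("coupon checked", extra={"fields": {"customer_id": "c-1", "code": "SPRING10", "valid": False}})
logger.info("coupon checked", extra={"fields": {"customer_id": "c-2", "code": "SPRING10", "valid": True}})

# Querying on several parameters is a plain filter, not a regular expression.
records = [json.loads(line) for line in stream.getvalue().splitlines()]
rejected = [r for r in records if r["code"] == "SPRING10" and not r["valid"]]
print(rejected)
```

Because every record shares the same keys, the same filter can be expressed in whatever query language your log store provides.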

But even with structured logs, you’ll still face problems with services that don’t allow you to do logging at all. Systems like API gateways or queueing services won’t necessarily let you configure logs into a tidy structure.

Build for observability: AWS X-Ray integration

At some point, only your cloud provider will have real insight into what’s going on between the components of your stack. Since the cloud provider manages and runs an internal routing layer, they should have a fairly complete understanding of the whole system. AWS X-Ray provides an end-to-end view of requests as they travel through your application, and shows a map of your application’s underlying components.
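X-Ray ties the hops of a request together with a tracing header, `X-Amzn-Trace-Id`, that carries `Root`, `Parent`, and `Sampled` fields. The sketch below parses a header of that shape and decodes the timestamp embedded in the root trace ID. The two helper functions are hypothetical names written for this example and the header value is only a sample; in real code the X-Ray SDK handles this for you.

```python
from datetime import datetime, timezone


def parse_trace_header(header):
    """Split 'Root=...;Parent=...;Sampled=...' into a dict of its fields."""
    fields = {}
    for part in header.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:  # skip fragments that have no '='
            fields[key] = value
    return fields


def trace_start_time(root_id):
    """Decode the epoch seconds embedded in a trace ID shaped 1-<8 hex>-<24 hex>."""
    _version, epoch_hex, _unique = root_id.split("-")
    return datetime.fromtimestamp(int(epoch_hex, 16), tz=timezone.utc)


# Sample header value, used here only for illustration.
header = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
fields = parse_trace_header(header)
print(fields["Root"], trace_start_time(fields["Root"]).isoformat())
```

The useful point is that the root ID stays constant across services, so it can double as the join key between your own logs and the trace view.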

Tools like AWS X-Ray give you just that level of insight, but they have limitations. While X-Ray can offer deep insight into a single transaction’s path, it needs to be integrated with overall observability tools that highlight the things you’re most worried about.

After all, X-Ray provides detail on a single request and the services it touches, but it does not show how requests perform as a whole. Whether you’re using New Relic, which now integrates with AWS X-Ray, or rolling your own solution, you’ll need some way to look at overall performance and then “zoom in” to the X-Ray level for insight.
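The “overview first, then zoom in” workflow can be sketched in a few lines. The example below is illustrative only: the records are made up, and `error_rates` and `failing_traces` are hypothetical helpers rather than part of New Relic or X-Ray. It computes an error rate per service, then lists the trace IDs worth opening for the worst one.

```python
from collections import Counter

# Made-up request outcomes; in practice these would come from logs or metrics.
outcomes = [
    ("checkout", True), ("coupons", True), ("coupons", False), ("checkout", True),
    ("coupons", False), ("coupons", True), ("checkout", True), ("checkout", True),
]
request_log = [
    {"service": svc, "ok": ok, "trace_id": f"1-5759e988-{n:024x}"}
    for n, (svc, ok) in enumerate(outcomes, start=1)
]


def error_rates(records):
    """Overview: fraction of failed requests per service."""
    totals, errors = Counter(), Counter()
    for r in records:
        totals[r["service"]] += 1
        if not r["ok"]:
            errors[r["service"]] += 1
    return {svc: errors[svc] / totals[svc] for svc in totals}


def failing_traces(records, service):
    """Zoom in: trace IDs of the failed requests for one service."""
    return [r["trace_id"] for r in records if r["service"] == service and not r["ok"]]


rates = error_rates(request_log)
worst = max(rates, key=rates.get)
print(worst, rates[worst], failing_traces(request_log, worst))
```

The aggregate tells you where to look, and the trace IDs tell you which individual requests to inspect.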

The specter’s unfinished business: Observability begins in the code

The promise of serverless is that the hard, unnecessary work will be left to the vendor. And it’s tempting to try tools that promise perfect insight into our stack without doing any extra work.

The reality, though, is that while observability is hard, it is not unnecessary work. Only the developers know exactly what’s going on inside their code. So while tools like the New Relic agent running on your serverless code can offer great automatic insights, there will always be details whose importance is clear only to your team. That means the people writing the code must plan for observability as they write it.

To keep talking about how New Relic can give you deep insight into your applications, join our community Slack and share tips with others!