New Relic has been running important, high-throughput services written in Elixir for a few years. The stability and performance of Erlang’s virtual machine (VM), known as BEAM, have impressed us on the New Relic Unified API team. We’ve gained some expertise along the way, but expertise requires more than knowing how your tools can perform well. You must also understand the ways your tools can fail.

Two years ago, our team's desire to understand the failure scenarios for BEAM-based services drove us to do thorough load testing on an API gateway that uses GraphQL. We wanted to force the gateway into a failure state so we could better prepare for the future. What exactly was happening to the BEAM VM under stress? How could we mitigate risks or recover from a severe incident rapidly?

Back then, we concluded:

  • When CPU-bound, we feel confident that throughput will hold, but latency will increase. Since that increase is expected, latency should be monitored.
  • When memory-bound, the gateway will eventually be killed because it's out of memory (OOM), like any other service. While we didn't trigger this in our load tests, we have seen it in production and know it can happen.

Two years later, this assessment has proved correct. We've experienced traffic spikes of more than 10 times our normal load that increased CPU usage, with little impact other than a small increase in latency. Sporadic OOM kills did occur, though, often under puzzling circumstances. They were never frequent or sustained enough to cause a disruption, and they were difficult to reproduce. A recent increase in our memory usage warnings prompted us to investigate once again.

This blog post covers what we learned, how we learned it, and how we ultimately addressed the underlying problem. My hope is there’s a little something for everyone here. If you’re unfamiliar with Elixir and Erlang, I’ll talk a bit about the features of the VM they share and how those features factored into our investigation. If you’re a novice Elixir user, I’ll introduce some handy tools that may be new to you. If you’re a pro, I think you’ll still find the outcome of our investigation interesting!

In the end, we reduced the API gateway’s memory usage during expensive queries by 60-70%. We learned a lot about diagnosing memory usage issues in the BEAM VM, and we finally got to the bottom of a long-running issue while continuing to build our expertise in the Elixir/Erlang ecosystem.

The problem: memory spikes during requests to downstream services

After our capacity analysis two years ago, we created low-priority New Relic alerts that trigger when memory usage strays above our baseline. These memory spikes aren’t terribly common, but they do occur, and recently we started seeing more of them. This was fortunate timing! We weren’t reaching the point of an actual OOM kill, but we did have more data on what was happening as well as examples of queries that seemed to be triggering anomalous behavior.

With some investigation, our team was able to confirm two interesting things: 

  • Memory spikes happened when a single request to the service caused us to make a large number of HTTP requests to downstream services. Since one of our service’s roles is to act as a gateway, making many HTTP calls per request is common. At the same time, not every query that triggered a large number of requests caused a spike.
  • We found that the data returned from the request didn’t actually affect whether or not the memory spike occurred. So while memory usage would spike with a large number of downstream requests, the size of the payloads returned by downstream services didn’t appear to be the issue.

Ultimately, we were able to produce a query that we could run locally or against staging to consistently reproduce the problem. This was a big deal, because a reliable reproduction had eluded us in the past. With a consistent reproduction, we were able to bring in a powerful tool from the Erlang ecosystem: the Erlang Observer.

Using the Erlang Observer for troubleshooting

Elixir is built on Erlang’s BEAM VM and features seamless interoperability with existing Erlang tools and libraries. While Elixir’s development began in 2011, Erlang has been in development since the 1980s. In 2013, Joe Armstrong pointed out that Erlang nodes were handling 50% of cell phone data traffic worldwide. It’s safe to say it's battle-tested.

With that kind of pedigree comes a deep investment in tooling. The Erlang Observer is one of those tools. It’s a GUI application that can connect to a local or remote BEAM node (meaning you can run it against staging and prod) and provide extensive detail about the runtime status of the VM. 

This includes:

  • CPU usage
  • Memory usage, allocation, and garbage collection
  • Scheduler behavior
  • Per-Erlang-process details and their message queues
  • Visualization of the supervision tree

And there's much more. The power of this tool could be a series of blog posts in itself.

It provides an easy way to learn a lot about what’s happening inside your Elixir app in a very short amount of time. If you have an Elixir app, you can try it out by running :observer.start() in IEx.
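
For instance, here’s a minimal sketch of launching it (this assumes the :observer and :wx applications that ship with Erlang/OTP are available on your machine; the remote node name below is just a placeholder):

```elixir
# From an IEx session attached to your app (for example, `iex -S mix`),
# launch the Observer GUI. Some minimal Erlang installs omit the wx
# bindings that the GUI needs.
iex> :observer.start()
:ok

# To inspect a remote node such as staging, start your local node with a
# name and the right cookie, connect to the remote node, then pick it
# from Observer's Nodes menu. The node name here is a placeholder.
iex> Node.connect(:"my_app@staging-host")
true
```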

The New Relic Elixir Agent collects most of this information, too. The agent reports things like BEAM VM memory usage split by category and top-consuming processes, and it’s helpful for a high-level metric view of your Elixir application.

Since we were curious about memory usage, we turned first to the Observer’s charts of memory usage over time. The next image shows a view in Observer of what happens when we run our problematic GraphQL query. The Carrier Size chart shows the VM’s overall memory usage.

As you can see, memory jumps from a few hundred megabytes to nearly 1.5 gigabytes right after the query is received and stays high for a few minutes. No wonder we were seeing memory spikes!

Next, we visited the System Overview tab in Observer. This showed us that most of the memory was being allocated to Erlang processes, which helped us narrow down our search!
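
You can get the same breakdown without the GUI; here’s a quick sketch from IEx (the categories shown are the standard ones the VM reports, not our actual numbers):

```elixir
# :erlang.memory/0 reports bytes currently allocated, broken down by
# category (:total, :processes, :system, :binary, :ets, :code, :atom, ...).
# In our case, the :processes category accounted for most of the spike.
iex> :erlang.memory() |> Enum.into(%{}) |> Map.take([:total, :processes, :binary, :ets])
```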

Investigating process memory usage

What Erlang and Elixir call a process is not an operating system (OS) process. An Erlang process is an independent execution context that runs code sequentially. These processes are managed by the VM’s schedulers, which can juggle many thousands of Erlang processes concurrently. If you squint, they bear some resemblance to goroutines, green threads, or fibers in other languages. However, Erlang processes come with a few more guarantees. They communicate with each other only via message passing within a node or cluster of nodes, they do not share memory, and their failures are isolated. One process crashing will not affect the others, which is why large parts of an Elixir or Erlang application can fail and restart themselves without taking the entire system down.
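
If you haven’t worked with BEAM processes before, here’s a minimal sketch of the model: each process owns its own heap, and the only way to share data with it is to copy that data into a message.

```elixir
# A minimal sketch of BEAM processes: spawn/1 creates a lightweight
# process with its own isolated heap, and data is shared only by
# copying it into messages.
parent = self()

pid =
  spawn(fn ->
    receive do
      {:ping, from} -> send(from, {:pong, self()})
    end
  end)

send(pid, {:ping, parent})

receive do
  {:pong, ^pid} -> IO.puts("got a reply from #{inspect(pid)}")
after
  1_000 -> IO.puts("no reply within a second")
end
```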

Elixir and Erlang make it simple and safe to spin up one of these internal Erlang processes, and our team uses this capability extensively. In our service, for example, every HTTP request is handled in its own process. We have many other processes that manage background work and in-memory caches, isolate risky code, and more.

In our case, the Observer showed us that, overall, the memory usage of all VM processes balloons when one of our problematic GraphQL queries hits. The Observer provides various ways to view VM processes, but the simplest is a flat table of every process currently running.

Here’s the process list right after the bad query comes in, sorted by memory usage:

In this case, the code that accesses our feature flags is expected to use substantial memory, so we ruled it out and focused on the many processes referring to SSL.

In the course of reproducing the problem, we noticed the processes referring to SSL and TLS were created as soon as the service received a problematic query. A few were using a bit of memory, but there were also a great number of tls_connection processes that used 3 MB each. This isn’t a lot by itself, but multiplied by hundreds it could cause issues with memory usage. 
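
If you’d rather not click through the GUI, you can approximate the same tally from IEx. This is a rough sketch, not the exact commands we ran; it relies on the :"$initial_call" entry that proc_lib records in each process dictionary:

```elixir
# Roughly reproduce the Observer's process table from IEx: find processes
# whose initial call is in the :tls_connection module and sum their memory.
tls_memory =
  Process.list()
  |> Enum.flat_map(fn pid ->
    case Process.info(pid, [:memory, :dictionary]) do
      nil ->
        []

      info ->
        case List.keyfind(info[:dictionary], :"$initial_call", 0) do
          {_, {:tls_connection, _fun, _arity}} -> [info[:memory]]
          _ -> []
        end
    end
  end)

IO.puts("#{length(tls_memory)} tls_connection processes, #{Enum.sum(tls_memory)} bytes total")
```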

Because these processes stayed alive for minutes at a time, we also knew this wasn’t the garbage collector racing to keep up. Both the Observer and the New Relic agent show garbage collection stats, too. Garbage collection would only reclaim that memory after the processes died.

The longer lifetime of these processes made sense. The HTTP client library we use keeps a pool of connection processes alive after a request. This is a performance optimization that fits our needs because it reduces overhead when we make many requests to the same host.

At this point, we were pretty sure that requests from our service were triggering the creation of a large number of connection processes. Each process had a bit more memory allocated than we would have expected, and taken together they consumed far more memory than anticipated. Memory usage ran high for exactly the lifetime of these connection processes, then returned to baseline when they terminated.

Eventually, we found an Erlang forum post with the same question we had: Why was memory usage high among our pool of SSL connection processes?

Ingela Andin, a member of the Erlang core team, offered this answer:

How do you provide your certificates to ssl? If you provide your CA-certs as binary DER blobs to the connection they will be stored by the connection and be local to the connection. If you provide them through a file they will be stored in a ETS table and maybe referenced by other connections as there is a way to refer to them.

In other words, there are two ways to let Erlang know where to find your certificate authority (CA) root certificates for SSL connections. One method provides certs to each individual process that needs an SSL connection. Remember that VM processes do not share memory, so a redundant copy must exist for every VM process that needs certs. In our case, this was hundreds of processes given how many downstream requests we make.

The other method of supplying certs stores the certs in a central ETS table. ETS tables are extremely fast key-value stores built into Erlang. An ETS table can serve as a repository for data that needs to be accessed by many discrete VM processes, and you can optimize a table for various concurrency situations. That sounded like precisely what we needed.
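
As a tiny illustration (the table name and contents here are made up), creating and reading a shared ETS table only takes a few lines:

```elixir
# A named, public table that any process on the node can read, tuned for
# concurrent reads. The name and stored value are placeholders.
:ets.new(:demo_cert_cache, [:set, :named_table, :public, read_concurrency: true])
:ets.insert(:demo_cert_cache, {:ca_certs, ["<DER-encoded certs would go here>"]})

# Any other process can now look the entry up by key; only the returned
# value is copied into the caller's heap, not the whole table.
[{:ca_certs, _certs}] = :ets.lookup(:demo_cert_cache, :ca_certs)
```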

We discovered that our HTTP client library was passing the certs to each process. With a little work, we figured out how to override this default behavior and force the HTTP client to use an ETS table to store the CA certs in one place. 
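
To make the difference concrete, here’s a hedged sketch of the two styles of ssl options. The file path is a placeholder, and how these options get threaded through depends entirely on your HTTP client; this isn’t the exact change we made.

```elixir
# Two ways to hand CA certs to Erlang's :ssl application.
ca_path = "/etc/ssl/certs/ca-certificates.crt"  # placeholder path

# Option 1: decode the PEM bundle yourself and pass DER binaries via
# :cacerts. Every connection process keeps its own copy of these binaries.
der_certs =
  for {:Certificate, der, :not_encrypted} <- :public_key.pem_decode(File.read!(ca_path)),
      do: der

per_connection_opts = [verify: :verify_peer, cacerts: der_certs]

# Option 2: pass the file path via :cacertfile. The ssl application loads
# the certs once and shares them between connections through an ETS table.
shared_opts = [verify: :verify_peer, cacertfile: ca_path]
```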

This is the result:

Memory usage decreased by 60-70% in the repro query. If you look at the process list again, you can see why:

Those tls_connection VM processes were now using 192 KB apiece instead of the 3,000 KB they were using before.

Our conclusion was fairly straightforward:

  • In the past, queries that generated high numbers of downstream requests to other APIs meant large pools of processes were created, and each process held 3 MB of redundant data.
  • Taken as a whole, this VM process pool ended up using a lot of RAM for the 2-minute lifetime of the pool.
  • We could reduce the memory footprint of each VM process by pulling CA certs from an ETS table.
  • This approach reduced memory usage dramatically in our tests.

We finally found our culprit!

Conclusion

While these memory spikes weren’t really affecting our service’s day-to-day operation or stability, finding the root cause allowed us to practice diagnosing and resolving lower-level issues in Elixir and Erlang applications. Gaining a better understanding of the causes and resolutions of failure states is always valuable in a culture like ours at New Relic, where performance and resiliency are paramount. In the end, we reduced NerdGraph’s baseline memory usage by 5-10% in addition to the 60-70% reduction we saw for extreme queries. Not bad for a configuration change!