In this Q&A, Wendy Shepperd, GVP of Engineering at New Relic, discusses New Relic's transition to AWS and to cell architecture. To learn more about New Relic's transition, see Andrew Hartnett's AWS re:Invent talk: Unlocking Scalability with Cells: New Relic’s Journey to AWS. If needed, you can sign up for a virtual pass. The talk covers all of these topics in much greater depth.

Why did New Relic switch to Amazon Web Services (AWS)?

Tens of thousands of engineers use the New Relic One observability platform every day to build and operate their most critical applications and infrastructure. Our data platform, also known as NRDB, currently ingests over 3 billion data points per minute and 150 petabytes of data per month, while serving 160 billion web requests per day. That scale roughly doubles every 12 months.

In 2019, we were experiencing scalability challenges with our existing architecture because we had a massive single Kafka cluster in our data center with thousands of nodes processing data. We couldn't continue to scale horizontally due to architectural constraints. We wanted to move to a cell-based architecture so we could create multiple instances of NRDB in multiple regions.

We already had experience working with AWS, and we were interested in their services, their global footprint, and their ability to scale to our needs. We make extensive use of Kafka, microservices, and containers, so we were particularly interested in combining cell architecture with AWS-managed services such as Amazon Elastic Kubernetes Service (EKS) and Amazon Managed Streaming for Apache Kafka (MSK).

How did New Relic use its platform to help transition to AWS?

We use New Relic every day, all day. It's built by engineers for engineers, and we instrument every part of our stack. As we migrated to the new architecture and infrastructure, we were able to compare the previous and new architectures and observe differences, for example in performance. We instrumented the new cell architecture using New Relic so we could analyze the overall health and performance of the platform. Observability enables us to detect and remediate failures in minutes and observe ongoing trends.

We are one of the largest users of our own platform in terms of data ingest. Because we can support our own needs, we have a lot of confidence that we're able to support the needs of our biggest customers.

Visualizing change is critical during large-scale migrations. With New Relic One, you get a high-level view of your system with high cardinality and high dimensionality.

When did you realize that making the transition was the right decision?

We had built our first two cells and had data flowing into them. During a critical holiday period, we had an unexpected traffic spike that caused us to reach capacity in one cell. It was the first time that we shifted traffic out of an unhealthy cell to a healthy cell in production. We were then able to scale up both cells concurrently with no impact on customers. That’s when we knew cell architecture was really going to work for us, and we proceeded to accelerate our migration.
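To make the idea of shifting traffic between cells concrete, here is a minimal sketch in Python. It is an illustration only, not New Relic's actual routing logic: the `Cell` type, the `routing_weights` function, the capacity units, and the 90% utilization threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical model of a cell: a name plus capacity and load in the same units."""
    name: str
    capacity: float
    load: float

    @property
    def utilization(self) -> float:
        return self.load / self.capacity


def routing_weights(cells: list[Cell], unhealthy_above: float = 0.9) -> dict[str, float]:
    """Return the share of new traffic each cell should receive.

    Cells at or above the utilization threshold receive no new traffic.
    The remaining cells share it in proportion to their spare capacity.
    """
    headroom = {
        c.name: (c.capacity - c.load) if c.utilization < unhealthy_above else 0.0
        for c in cells
    }
    total = sum(headroom.values())
    if total == 0:
        # No cell has spare room: split evenly and rely on scaling up.
        return {c.name: 1 / len(cells) for c in cells}
    return {name: room / total for name, room in headroom.items()}


# One cell near capacity, one with plenty of room.
cells = [Cell("cell-a", capacity=100, load=97), Cell("cell-b", capacity=100, load=40)]
weights = routing_weights(cells)
```

In this example all new traffic goes to `cell-b` until `cell-a` drops back under the threshold. A production router would also need to account for data locality, in-flight requests, and how quickly each cell can scale.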

What advice do you have for other practitioners looking to transition to the cloud?

Use New Relic to instrument your applications and infrastructure so you can monitor, debug, and improve your entire stack. Define your SLOs (service level objectives), validate them with your customers to align on expectations, and configure your alerts to notify you according to your needs.
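One common way to turn an SLO into an alerting signal is an error budget: the number of failures the SLO tolerates over a window, compared with the failures actually observed. The sketch below is a generic Python illustration under that assumption. The function name and the example numbers are hypothetical and do not come from New Relic's product or API.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget for one window.

    1.0 means no budget has been used, 0.0 means it is exhausted,
    and a negative value means the SLO has been missed.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# A 99.9% availability SLO over one million requests tolerates about 1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

With 250 failures, roughly three quarters of the budget is still available. An alert could fire when the remaining budget falls below a threshold agreed with customers.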

Establish a FinOps team early on to focus on cloud cost optimization. Tag your cloud assets upfront, define clear governance and processes for capacity management, and leverage autoscaling where possible.
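Tag governance is easier to enforce when it is checked automatically. The following Python sketch flags resources that are missing required tags. The required tag keys and the resource IDs are hypothetical, and the input is a plain dictionary rather than the output of any real cloud API.

```python
# Hypothetical tagging policy: every resource must carry these keys.
REQUIRED_TAGS = {"team", "service", "environment"}


def missing_tags(resources: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Map each non-compliant resource ID to the set of tag keys it lacks."""
    report = {}
    for resource_id, tags in resources.items():
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            report[resource_id] = missing
    return report


inventory = {
    "resource-1": {"team": "ingest", "service": "router", "environment": "prod"},
    "resource-2": {"service": "router", "environment": "prod"},
}
report = missing_tags(inventory)
```

Here `report` contains only `resource-2`, which lacks the `team` tag. A check like this could run in a deployment pipeline so that untagged assets never reach production.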

When you move from data centers to the cloud, it changes your business model. You're moving from Capex (capital expenditures) to Opex (operating expenditures). If you do not understand how to manage capacity and costs, you can blow up your budget very quickly.

Our engineers are making decisions about costs on a daily basis. They're analyzing things like “What's the cost of my service? How can I implement autoscaling? How can I reduce resource usage? Where am I not as efficient as I could be in capacity management?”

What best practices do you recommend for cell architecture?

We want our cells to have a lifespan of 90 days or less. It's a challenge to build and decommission cells that frequently, but otherwise they get stale. With a stale cell, you get drift in your configuration and deployed services, you're not picking up security patches and OS upgrades, and you’re not getting new functionality from your cloud provider. It's really critical to keep those cells fresh.
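A lifespan rule like this lends itself to a simple automated check. The Python sketch below lists cells that are approaching or past a 90-day limit. The cell names, build dates, and the 14-day warning window are assumptions for illustration, not details of New Relic's tooling.

```python
from datetime import date, timedelta

MAX_CELL_AGE = timedelta(days=90)  # lifespan target stated above


def cells_due_for_rebuild(built_on: dict[str, date], today: date, warn_days: int = 14) -> list[str]:
    """Return the names of cells whose age is within warn_days of the limit, or beyond it."""
    cutoff = today - (MAX_CELL_AGE - timedelta(days=warn_days))
    return sorted(name for name, built in built_on.items() if built <= cutoff)


# Hypothetical inventory of cells and their build dates.
inventory = {
    "cell-a": date(2024, 3, 1),
    "cell-b": date(2024, 5, 20),
    "cell-c": date(2024, 3, 15),
}
due = cells_due_for_rebuild(inventory, today=date(2024, 6, 1))
```

The warning window gives the team time to build a replacement cell and move traffic before the old cell reaches its limit.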

We have a team dedicated to tooling and automation of cell builds and decommissions and have made tremendous improvements in the efficiency, consistency, and quality of cell builds. We are now building new types of cells and cells in different regions. We even have a random cell name generator and have seen some really interesting and funny cell names.

What is one lesson learned from the transition to AWS?

Don't plan for the happy path. Most people tend to be overly optimistic instead of planning for new technology, discoveries, unexpected work, or things going wrong. We planned for the data platform migration to take one year. By taking an iterative approach and building in time for unknowns, we were able to overcome challenges, make adjustments, and meet our goals.